For computation to happen, input data must be brought from memory into registers and, sooner or later, output data must be written back to memory.
For now let us make the following assumptions:
\begin{itemize}
\item Our processor has only one core.
\item That core has only two levels of memory: registers and main memory.
\item Moving data between main memory and registers takes time \(\beta \) per double.
\item The registers can hold 64 doubles.
\item Performing a flop with data in registers takes time \(\gamma_R \text{.}\)
\item Data movement and computation cannot overlap.
\end{itemize}
Later, we will change some of these assumptions.
Given that the registers can hold 64 doubles, let's first consider blocking \(C \text{,}\) \(A \text{,}\) and \(B \) into \(4 \times 4 \) submatrices. In other words, consider the example from Unit 2.2.1 with \(m_R = n_R = k_R = 4 \text{.}\) This means \(3 \times 4 \times 4 = 48 \) registers are being used for storing the submatrices. (Yes, we are not using all registers. We'll get to that later.) Let us call the routine that computes with these three blocks "the kernel". Thus, the blocked algorithm is a triple-nested loop around a call to this kernel:
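Under the assumptions above, this blocked algorithm might be sketched in plain C as follows (the routine names are hypothetical, and the macros assume column-major storage):

```c
#define alpha( i,j ) A[ (j)*ldA + (i) ]   /* map matrix indices to column-major storage */
#define beta( i,j )  B[ (j)*ldB + (i) ]
#define gamma( i,j ) C[ (j)*ldC + (i) ]

/* The kernel: C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} for m_R x n_R, m_R x k_R,
   and k_R x n_R blocks. */
void Gemm_Kernel( int mR, int nR, int kR,
                  const double *A, int ldA, const double *B, int ldB,
                  double *C, int ldC )
{
  for ( int j = 0; j < nR; j++ )
    for ( int i = 0; i < mR; i++ )
      for ( int p = 0; p < kR; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

/* The blocked algorithm: a triple-nested loop over the M x N x K grid of
   blocks, with a call to the kernel in the body.  Assumes m, n, and k are
   multiples of m_R, n_R, and k_R. */
void Gemm_Blocked( int m, int n, int k, int mR, int nR, int kR,
                   const double *A, int ldA, const double *B, int ldB,
                   double *C, int ldC )
{
  for ( int j = 0; j < n; j += nR )       /* loop indexed with j */
    for ( int i = 0; i < m; i += mR )     /* loop indexed with i */
      for ( int p = 0; p < k; p += kR )   /* loop indexed with p */
        Gemm_Kernel( mR, nR, kR,
                     &alpha( i,p ), ldA, &beta( p,j ), ldB,
                     &gamma( i,j ), ldC );
}
```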
The time required to complete this algorithm can be estimated as
\begin{equation*}
\begin{array}{l}
M N K ( 2 m_R n_R k_R ) \gamma_R+ M N K ( 2 m_R n_R + m_R k_R + k_R n_R ) \beta\\
~~~~= 2 m n k \gamma_R+ m n k ( \frac{2}{k_R} +
\frac{1}{n_R} + \frac{1}{m_R} ) \beta,
\end{array}
\end{equation*}
since \(M = m/m_R \text{,}\) \(N = n/n_R \text{,}\) and \(K = k/k_R \text{.}\)
We next recognize that, since the loop indexed with \(p\) is the inner-most loop, the loading and storing of \(C_{i,j} \) need not happen before every call to the kernel, yielding an algorithm whose time can be estimated as
\begin{equation*}
\begin{array}{l}
M N K ( 2 m_R n_R k_R ) \gamma_R
+ \left[ M N ( 2 m_R n_R )
+ M N K (m_R k_R + k_R n_R )\right] \beta\\
~~~= 2 m n k \gamma_R+ \left[ 2 m n + m n k (
\frac{1}{n_R} + \frac{1}{m_R} ) \right] \beta.
\end{array}
\end{equation*}
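For example, with \(m_R = n_R = k_R = 4 \text{,}\) the two estimates become
\begin{equation*}
2 m n k \gamma_R + m n k \beta
\quad \mbox{and} \quad
2 m n k \gamma_R + \left[ 2 m n + \frac{m n k}{2} \right] \beta,
\end{equation*}
respectively: for large \(k \text{,}\) keeping \(C_{i,j} \) in registers roughly halves the time spent moving data between main memory and registers.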
Now, if the submatrix \(C_{i,j} \) were larger (i.e., \(m_R \) and \(n_R \) were larger), then this overhead could be reduced. Obviously, we could use more registers (we are only storing 48 of a possible 64 doubles in registers when \(m_R = n_R = k_R = 4 \)). However, let us first discuss how to require fewer than 48 doubles to be stored while keeping \(m_R = n_R = k_R = 4 \text{.}\) Recall from Week 1 that if the loop indexed with \(p \) of matrix-matrix multiplication is the outer-most loop, then the inner two loops implement a rank-1 update. If we do this for the kernel, then the result is illustrated by
If we now consider the computation \(C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} \text{,}\) we recognize that after each rank-1 update, the column of \(A_{i,p} \) and the row of \(B_{p,j} \) that participate in that rank-1 update are not needed again in the computation. Thus, room for only one such column of \(A_{i,p}\) and one such row of \(B_{p,j} \) needs to be reserved in the registers, reducing the total use of registers to \(m_R \times n_R + m_R + n_R \text{.}\) If \(m_R = n_R = 4 \text{,}\) this equals \(16 + 4 + 4 = 24 \) doubles. That in turn means that, later, \(m_R \) and \(n_R \) can be chosen to be larger than if the entire blocks of \(A \) and \(B \) were stored. If we now view the entire computation performed by the loop indexed with \(p \text{,}\) this is implemented, in terms of a call to the BLIS rank-1 update routine, in Figure 2.3.4.
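To see where this saving comes from, note that the kernel computation decomposes into \(k_R \) rank-1 updates (here \(a_{p'} \) denotes a column of \(A_{i,p} \) and \(\widetilde b_{p'}^T \) a row of \(B_{p,j} \text{;}\) this notation is assumed for the illustration):
\begin{equation*}
C_{i,j} := a_0 \widetilde b_0^T + a_1 \widetilde b_1^T + \cdots + a_{k_R-1} \widetilde b_{k_R-1}^T + C_{i,j} \text{,}
\end{equation*}
so that only the current column and row need to reside in registers at any given time.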
What we now recognize is that matrices \(A \) and \(B\) need not be partitioned into \(m_R \times k_R \) and \(k_R \times n_R \) submatrices (respectively). Instead, \(A\) can be partitioned into \(m_R \times k\) row panels and \(B\) into \(k \times n_R \) column panels:
Another way of looking at this is that the inner loop indexed with \(p \) in the blocked algorithm can be merged with the outer loop of the kernel, so that the entire update of the \(m_R \times n_R \) submatrix of \(C \) becomes a single loop of rank-1 updates. This is the computation that we will call the microkernel from here on.
Bringing \(A_{i,p} \) into registers one column at a time and \(B_{p,j} \) one row at a time does not change the cost of reading that data, so the cost of the resulting approach is still estimated as
\begin{equation*}
\begin{array}{l}
M N K ( 2 m_R n_R k_R ) \gamma_R
+ \left[ M N ( 2 m_R n_R )
+ M N K (m_R k_R + k_R n_R )\right] \beta \\
~~~= 2 m n k \gamma_R+ \left[ 2 m n + m n k (
\frac{1}{n_R} + \frac{1}{m_R} ) \right] \beta.
\end{array}
\end{equation*}
We can arrive at the same algorithm by partitioning
\begin{equation*}
C = \left( \begin{array}{c | c | c}
C_{0,0} \amp \cdots \amp C_{0,N-1} \\ \hline
\vdots \amp \amp \vdots \\ \hline
C_{M-1,0} \amp \cdots \amp C_{M-1,N-1}
\end{array} \right),
\quad
A = \left( \begin{array}{c}
A_{0} \\ \hline
\vdots \\ \hline
A_{M-1}
\end{array} \right),
\quad \mbox{and} \quad
B = \left( \begin{array}{c | c | c | c}
B_{0} \amp B_{1} \amp \cdots \amp B_{N-1}
\end{array} \right),
\end{equation*}
where \(C_{i,j} \) is \(m_R \times n_R \text{,}\) \(A_i \) is \(m_R \times k \text{,}\) and \(B_j \) is \(k \times n_R \text{.}\) Then computing \(C := A B + C \) means updating \(C_{i,j} := A_i B_j + C_{i,j} \) for all \(i, j \text{:}\)
Obviously, the order of the two loops can be switched. Again, the computation \(C_{i,j} := A_i B_j + C_{i,j} \) where \(C_{i,j} \) fits in registers now becomes what we will call the microkernel.
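A plain-C sketch of the resulting algorithm, with two loops around an (unvectorized) microkernel, might look like this (names are hypothetical; column-major storage is assumed, with m and n multiples of m_R and n_R):

```c
#define alpha( i,j ) A[ (j)*ldA + (i) ]   /* column-major index macros */
#define beta( i,j )  B[ (j)*ldB + (i) ]
#define gamma( i,j ) C[ (j)*ldC + (i) ]

/* Microkernel: C_{i,j} := A_i B_j + C_{i,j}, computed as k rank-1 updates,
   where C_{i,j} is m_R x n_R, A_i is m_R x k, and B_j is k x n_R. */
void MicroKernel( int mR, int nR, int k,
                  const double *A, int ldA, const double *B, int ldB,
                  double *C, int ldC )
{
  for ( int p = 0; p < k; p++ )        /* one rank-1 update per iteration */
    for ( int j = 0; j < nR; j++ )
      for ( int i = 0; i < mR; i++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

/* Two loops over the m_R x n_R blocks of C, each updated by the microkernel. */
void Gemm_TwoLoops( int m, int n, int k, int mR, int nR,
                    const double *A, int ldA, const double *B, int ldB,
                    double *C, int ldC )
{
  for ( int j = 0; j < n; j += nR )    /* loop over column panels B_j */
    for ( int i = 0; i < m; i += mR )  /* loop over row panels A_i */
      MicroKernel( mR, nR, k,
                   &alpha( i,0 ), ldA, &beta( 0,j ), ldB,
                   &gamma( i,j ), ldC );
}
```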
As the reader should have noticed by now, in matrix-matrix multiplication every floating point multiplication is paired with a corresponding floating point addition that accumulates the result. For this reason, such floating point computations are usually cast in terms of fused multiply-add (FMA) operations, performed by a floating point unit (FPU) of the core.
What is faster than computing one FMA at a time? Computing multiple FMAs at a time! To accelerate computation, modern cores are equipped with vector registers, vector instructions, and vector FPUs that can simultaneously perform multiple FMA operations. This is called instruction-level parallelism.
In this section, we are going to use Intel's vector intrinsic instructions for Intel's AVX instruction set to illustrate the ideas. Vector intrinsics are supported in the C programming language for performing these vector instructions. We illustrate how the intrinsic operations are used by incorporating them into an implementation of the microkernel for the case where \(m_R
\times n_R = 4 \times 4 \text{.}\) Thus, for the remainder of this unit, \(C \) is \(m_R \times n_R = 4 \times 4
\text{,}\) \(A \) is \(m_R \times k = 4 \times k \text{,}\) and \(B \) is \(k \times n_R = k \times 4 \text{.}\)
Now, \(C +:= A B \) when \(C \) is \(4 \times 4 \) translates to the computation
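With \(a_p \) denoting the columns of \(A \) and \(\widetilde b_p^T \) the rows of \(B \) (notation assumed here), that computation is
\begin{equation*}
C +:= \left( \begin{array}{c | c | c | c}
a_0 \amp a_1 \amp \cdots \amp a_{k-1}
\end{array} \right)
\left( \begin{array}{c}
\widetilde b_0^T \\ \hline
\widetilde b_1^T \\ \hline
\vdots \\ \hline
\widetilde b_{k-1}^T
\end{array} \right)
= a_0 \widetilde b_0^T + a_1 \widetilde b_1^T + \cdots + a_{k-1} \widetilde b_{k-1}^T \text{.}
\end{equation*}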
This, once again, shows how matrix-matrix multiplication can be cast in terms of a sequence of rank-1 updates, and that a rank-1 update can be implemented in terms of axpy operations with columns of \(C \text{.}\)
Now we translate this into code that employs vector instructions, given in Figure 2.3.7 and illustrated in Figure 2.3.8. To use the intrinsics, we start by including the header file immintrin.h.
The statement that loads the first column of \(C \) creates gamma_0123_0 as a variable that references a vector register with four double precision numbers and loads it with the four numbers that are stored starting at address &gamma( 0,0 ). In other words, it loads the vector register with the original values
\begin{equation*}
\left(
\begin{array}{c}
\gamma_{0,0} \\
\gamma_{1,0} \\
\gamma_{2,0} \\
\gamma_{3,0}
\end{array}
\right).
\end{equation*}
This is repeated for the other three columns of \(C \text{:}\)
leaving the result in the vector registers. Each iteration of the loop indexed with \(p \) starts by declaring vector register variable alpha_0123_p and loading it with the contents of
\begin{equation*}
\left(
\begin{array}{c}
\alpha_{0,p} \\
\alpha_{1,p} \\
\alpha_{2,p} \\
\alpha_{3,p}
\end{array}
\right).
\end{equation*}