Subsection 12.2.3 Basics of optimizing matrix-matrix multiplication
Let us again consider the computation
\[
C := A B + C,
\]
where \(A \text{,}\) \(B \text{,}\) and \(C \) are \(m \times m \) matrices. If \(m \) is small enough, then we can read the three matrices into the L1 cache, perform the operation, and write the updated matrix \(C \) back to memory. In this case,
- during the computation, the matrices are in a fast memory (the L1 cache), which can keep up with the speed of floating point computation, and
- the cost of moving each floating point number from main memory into the L1 cache is amortized over \(m/2 \) floating point computations: the update performs \(2 m^3 \) flops, while only \(4 m^2 \) numbers move between main memory and the cache (\(A \text{,}\) \(B \text{,}\) and \(C \) are read, and \(C \) is written back).
If \(m \) is large enough, then the cost of moving the data becomes insignificant. (If carefully orchestrated, some of the movement of data can even be overlapped with computation, but that is beyond our discussion.)
We immediately notice a tension: \(m \) must be small enough that all three matrices fit in the L1 cache, so this approach only works for relatively small matrices. However, for small matrices, the ratio \(m / 2 \) may not be favorable enough to offset the cost of accessing the very slow main memory.
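To make the discussion concrete, here is a minimal sketch in C of the unblocked update \(C := A B + C \text{.}\) It is our illustration rather than code from this text; the routine name gemm_unblocked and the column-major storage convention are assumptions.

```c
#include <stddef.h>

/* Unblocked update C := A B + C for m x m matrices stored in
   column-major order with leading dimension m.  When m is small
   enough, A, B, and C together fit in the L1 cache. */
void gemm_unblocked(size_t m, const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < m; j++)          /* columns of C and B */
        for (size_t p = 0; p < m; p++)      /* the summation index */
            for (size_t i = 0; i < m; i++)  /* rows of C and A */
                C[i + j * m] += A[i + p * m] * B[p + j * m];
}
```

The j-p-i loop ordering keeps the innermost accesses to \(C \) and \(A \) at unit stride, which matters once the data no longer fits in cache.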
Fortunately, matrix-matrix multiplication can be orchestrated by partitioning the matrices that are involved into submatrices, and computing with these submatrices instead. We recall that if we partition
\[
C =
\left( \begin{array}{c c c}
C_{0,0} & \cdots & C_{0,N-1} \\
\vdots & & \vdots \\
C_{M-1,0} & \cdots & C_{M-1,N-1}
\end{array} \right),
\quad
A =
\left( \begin{array}{c c c}
A_{0,0} & \cdots & A_{0,K-1} \\
\vdots & & \vdots \\
A_{M-1,0} & \cdots & A_{M-1,K-1}
\end{array} \right),
\]
and
\[
B =
\left( \begin{array}{c c c}
B_{0,0} & \cdots & B_{0,N-1} \\
\vdots & & \vdots \\
B_{K-1,0} & \cdots & B_{K-1,N-1}
\end{array} \right),
\]
where \(C_{i,j} \) is \(m_i \times n_j \text{,}\) \(A_{i,p} \) is \(m_i \times k_p \text{,}\) and \(B_{p,j} \) is \(k_p \times n_j \text{,}\) with \(\sum_{i=0}^{M-1} m_i = m \text{,}\) \(\sum_{j=0}^{N-1} n_j = n \text{,}\) and \(\sum_{p=0}^{K-1} k_p = k \text{,}\) then
\[
C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j} .
\]
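As an illustration of how this partitioned computation maps to code, here is a hedged sketch in C of a single level of blocking. The routine name gemm_blocked and the block sizes MB, NB, and KB are our choices, picked so that one block each of \(A \text{,}\) \(B \text{,}\) and \(C \) fits comfortably in a typical 32 KB L1 data cache (\(3 \times 32^2 \times 8 \) bytes \(= 24 \) KB).

```c
#include <stddef.h>

#define MB 32   /* block sizes chosen so that an MB x KB block of A,  */
#define NB 32   /* a KB x NB block of B, and an MB x NB block of C    */
#define KB 32   /* together fit in the L1 cache (values illustrative) */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked update C := A B + C, where C is m x n, A is m x k, and
   B is k x n, all stored in column-major order with leading
   dimensions ldC, ldA, and ldB.  The loops over (i, j, p) visit the
   submatrices C_{i,j}, A_{i,p}, and B_{p,j}; the innermost three
   loops perform the unblocked update on one set of blocks. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, size_t ldA,
                  const double *B, size_t ldB,
                  double *C, size_t ldC)
{
    for (size_t j = 0; j < n; j += NB) {
        size_t nb = min_size(NB, n - j);
        for (size_t p = 0; p < k; p += KB) {
            size_t kb = min_size(KB, k - p);
            for (size_t i = 0; i < m; i += MB) {
                size_t mb = min_size(MB, m - i);
                /* C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} on blocks that
                   fit in the L1 cache; minima handle fringe blocks. */
                for (size_t jj = 0; jj < nb; jj++)
                    for (size_t pp = 0; pp < kb; pp++)
                        for (size_t ii = 0; ii < mb; ii++)
                            C[(i + ii) + (j + jj) * ldC] +=
                                A[(i + ii) + (p + pp) * ldA] *
                                B[(p + pp) + (j + jj) * ldB];
            }
        }
    }
}
```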
If we choose each \(m_i \text{,}\) \(n_j \text{,}\) and \(k_p \) small enough, then the submatrices fit in the L1 cache. This still leaves us with a problem: the block sizes must be small enough to fit in the L1 cache, yet large enough for the ratio of flops to memops to be favorable. The answer to that is to block for multiple levels of caches.
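For instance, a second level of blocking might partition the problem for the L2 cache and then let the L1-level blocking operate on each subproblem. The following sketch reuses gemm_blocked and min_size from above; the size L2B = 128 is illustrative (\(3 \times 128^2 \times 8 \) bytes \(= 384 \) KB, targeting an L2 cache of several hundred KB), not a tuned choice from this text.

```c
#define L2B 128   /* outer block size, targeting the L2 cache (illustrative) */

/* Two-level blocked update C := A B + C: the outer loops create
   subproblems whose operands fit in the L2 cache; gemm_blocked
   (above) then blocks each subproblem for the L1 cache. */
void gemm_two_level(size_t m, size_t n, size_t k,
                    const double *A, size_t ldA,
                    const double *B, size_t ldB,
                    double *C, size_t ldC)
{
    for (size_t j = 0; j < n; j += L2B)
        for (size_t p = 0; p < k; p += L2B)
            for (size_t i = 0; i < m; i += L2B)
                gemm_blocked(min_size(L2B, m - i),
                             min_size(L2B, n - j),
                             min_size(L2B, k - p),
                             &A[i + p * ldA], ldA,
                             &B[p + j * ldB], ldB,
                             &C[i + j * ldC], ldC);
}
```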