Subsection 12.2.3 Basics of optimizing matrix-matrix multiplication
Let us again consider the computation

C := AB + C,
where A, B, and C are m×m matrices. If m is small enough, then we can read the three matrices into the L1 cache, perform the operation, and write the updated matrix C back to memory. In this case,

- during the computation, the matrices reside in a fast memory (the L1 cache), which can keep up with the speed of floating point computation, and
- the cost of moving each floating point number from main memory into the L1 cache is amortized over roughly m/2 floating point computations, since the operation performs 2m³ flops while only about 4m² numbers move between main memory and cache.
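As a concrete (hypothetical) sketch of the computation being discussed, the update C := AB + C can be written as a triple loop. The routine name `gemm_ref` and the row-major storage convention are assumptions for illustration, not from the text:

```c
#include <stddef.h>

/* Sketch: C := A B + C for m x m matrices stored in row-major order.
 * The triple loop performs 2 m^3 flops while the three matrices hold
 * 3 m^2 numbers, so when all three fit in the L1 cache, each number
 * moved from main memory is reused on the order of m/2 times. */
void gemm_ref(size_t m, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < m; j++)
            for (size_t p = 0; p < m; p++)
                C[i * m + j] += A[i * m + p] * B[p * m + j];
}
```

The loop ordering shown is arbitrary; all six orderings of the i, j, p loops compute the same result, which is what later blocking arguments exploit.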
What if the matrices are too large to fit in the L1 cache? The solution is to partition them into submatrices, each small enough that its part of the computation can stay in cache. More generally, let C be m×n, A be m×k, and B be k×n, and partition

\[
C =
\begin{pmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\
\vdots & \vdots & & \vdots \\
C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1}
\end{pmatrix},
\quad
A =
\begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\
\vdots & \vdots & & \vdots \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1}
\end{pmatrix},
\]
and
\[
B =
\begin{pmatrix}
B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\
B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\
\vdots & \vdots & & \vdots \\
B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1}
\end{pmatrix},
\]
where C_{i,j} is m_i × n_j, A_{i,p} is m_i × k_p, and B_{p,j} is k_p × n_j, with \(\sum_{i=0}^{M-1} m_i = m\), \(\sum_{j=0}^{N-1} n_j = n\), and \(\sum_{p=0}^{K-1} k_p = k\). Then
\[
C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j}.
\]
If we choose each m_i, n_j, and k_p small enough, then the submatrices fit in the L1 cache. This still leaves us with the problem that these sizes must be reasonably small if the ratio of flops to memops is to be sufficient. The answer to that is to block for multiple levels of caches.
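One level of this blocking can be sketched in C as follows. The routine name `gemm_blocked`, the single square block size `MB`, and the row-major storage convention are illustrative assumptions; in practice each cache level gets its own block sizes, chosen so the working set fits in that cache:

```c
#include <stddef.h>

#define MB 64 /* hypothetical block size; chosen in practice so that
                 the submatrices being worked on fit in the L1 cache */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Sketch of one level of blocking for C := A B + C, where C is m x n,
 * A is m x k, and B is k x n, all row-major.  The three outer loops
 * visit the blocks; the three inner loops perform the block update
 * C_{i,j} := A_{i,p} B_{p,j} + C_{i,j}. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i += MB)
        for (size_t j = 0; j < n; j += MB)
            for (size_t p = 0; p < k; p += MB) {
                /* Edge blocks may be smaller than MB. */
                size_t ib = min_size(MB, m - i);
                size_t jb = min_size(MB, n - j);
                size_t pb = min_size(MB, k - p);
                for (size_t ii = i; ii < i + ib; ii++)
                    for (size_t jj = j; jj < j + jb; jj++)
                        for (size_t pp = p; pp < p + pb; pp++)
                            C[ii * n + jj] += A[ii * k + pp] * B[pp * n + jj];
            }
}
```

Blocking for multiple cache levels repeats this idea: the loops over blocks are themselves blocked, with a larger block size tuned to the L2 (and then L3) cache.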