Unit 2.3.3 Streaming \(A_{i,p} \) and \(B_{p,j} \)
¶Now, if the submatrix \(C_{i,j} \) were larger (i.e., \(m_R \) and \(n_R \) were larger), then the time spent moving data around (overhead),
is reduced:
decreases as \(m_R \) and \(n_R \) increase. In other words, the larger the submatrix of \(C \) that is kept in registers, the lower the overhead (under our model). Before we increase \(m_R \) and \(n_R \text{,}\) we first show how for fixed \(m_R \) and \(n_R \) the number of registers that are needed for storing elements of \(A \) and \(B \) can be reduced, which then allows \(m_R \) and \(n_R \) to (later this week) be increased in size.
Recall from Week 1 that if the \(p \) loop of matrix-matrix multiplication is the outer-most loop, then the inner two loops implement a rank-1 update. If we do this for the kernel, then the result is illustrated by
If we now consider the computation we recognize that after each rank-1 update, the column of \(A_{i,p} \) and row of \(B_{p,j} \) that participate in that rank-1 update are not needed again in the computation of \(C_{i,j} := A_{i,p} B_{p,j} + C_{i,j} \text{.}\) Thus, only room for one such column of \(A_{i,p}\) and one such row of \(B_{p,j} \) needs to be reserved in the registers, reducing the total number of doubles that need to be stored in registers toIf \(m_R = n_R = 4 \) this means that \(16 + 4 + 4 = 24 \) doubles in registers. If we now view the entire computation performed by the loop indexed with \(p \text{,}\) this can be illustrated by
and implemented asHomework 2.3.3.1.
In directory Assignments/Week2/C
examine the file Assignments/Week2/C/Gemm_JIP_P_Ger.c
Execute it with
make JIP_P_Gerand view its performance with
Assignments/Week2/C/data/Plot_register_blocking.mlx
.
This is the performance on Robert's laptop, with MB=NB=KB=4
:
It is tempting to start playing with the parameters MB
, NB
, and KB
. However, the point of this exercise is to illustrate the discussion about casting the multiplication with the blocks in terms of a loop around rank-1 updates.
Remark 2.3.6.
The purpose of this implementation is to emphasize, again, that the matrix-matrix multiplication with blocks can be orchestratred as a loop around rank-1 updates, as you already learned in Unit 1.4.2.