Section 3.3 Contiguous Memory Access is Important
Unit 3.3.1
One would think that once one introduces blocking, the higher performance observed for small matrix sizes will be maintained as the problem size increases. This is not what you observed: there is still a steady decrease in performance. To investigate what is going on, copy the driver routine in driver.c into driver_ldim.c and in that new file change the line

ldA = ldB = ldC = size;

to

ldA = ldB = ldC = last;

This sets the leading dimension of matrices \(A\text{,}\) \(B \text{,}\) and \(C \) to the largest problem size to be tested, for all experiments. Collect performance data by executing

make Five_Loops_ldim

in that directory. The resulting performance can be viewed with Live Script
Assignments/Week3/C/data/Plot_Five_Loops.mlx.

What is going on here? Modern architectures like the Haswell processor have a feature called "hardware prefetching" that speculatively preloads data into caches based on observed access patterns, so that this loading is (hopefully) overlapped with computation. Processors also organize memory in terms of pages, for reasons that go beyond the scope of this course. For our purposes, a page is a contiguous block of memory of 4 Kbytes (4096 bytes, or 512 doubles). Prefetching only occurs within a page. Since the leading dimension is now large, data is not prefetched when the computation moves from one column of a matrix to the next. You may want to learn more about memory pages by visiting https://en.wikipedia.org/wiki/Page_(computer_memory)
on Wikipedia.
The bottom line: early in the course we discussed that marching contiguously through memory is a good thing. While we claim that we block for the caches, in practice the data is not contiguous, and hence we cannot really control how the data is brought into the caches or where it resides at a particular point in the computation.
Unit 3.3.2
Homework 3.3.2
In the setting of the five loops, the microkernel updates \(C := A B + C \) where \(C \) is \(m_R \times n_R \text{,}\) \(A \) is \(m_R \times k_C \text{,}\) and \(B\) is \(k_C \times n_R \text{.}\) How many matrix elements of \(C \) have to be brought into registers to perform this computation? How many flops are performed with these elements? How many flops are performed with each element of \(C \text{?}\) We seem to have settled on an \(m_R \times n_R = 12 \times 4 \) microtile. Let's use \(k_C = 128 \text{.}\) What is the ratio for these parameters?
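As a check on the counting (a sketch of the arithmetic, not an official solution), the number of flops performed per element of \(C \) brought into registers is \(2 k_C \text{:}\)

```latex
\begin{align*}
\mbox{elements of } C \mbox{ in registers} &: m_R n_R = 12 \times 4 = 48, \\
\mbox{flops with these elements}           &: 2\, m_R n_R k_C = 2 \times 12 \times 4 \times 128 = 12{,}288, \\
\mbox{flops per element of } C             &: 2 k_C = 2 \times 128 = 256.
\end{align*}
```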
The next step towards optimizing the microkernel recognizes that computing with contiguous data (accessing data with a "stride one" access pattern) improves performance. The fact that the \(m_R \times n_R \) microtile of \(C\) is not in contiguous memory is not particularly important: the cost of bringing it into the vector registers from some layer in the memory hierarchy is mostly inconsequential, because a lot of computation is performed before it is written back out. It is the repeated accessing of the elements of \(A \) and \(B \) that can benefit from stride one access.
Two successive rank-1 updates of the microtile can be given by
\begin{equation*}
C := \alpha_{[0:3],p}\, \beta_{p,[0:3]} + \alpha_{[0:3],p+1}\, \beta_{p+1,[0:3]} + C\text{.}
\end{equation*}
Since \(A \) and \(B \) are stored in column-major order, the four elements of \(\alpha_{[0:3],p}\) are contiguous in memory, but they are (generally) not stored contiguously with \(\alpha_{[0:3],p+1}\text{.}\) Elements \(\beta_{p,[0:3]} \) are (generally) also not contiguous. The access pattern during the computation by the microkernel would be much more favorable if the \(A \) involved in the microkernel was packed in column-major order with leading dimension \(m_R \text{,}\)
and \(B \) was packed in row-major order with leading dimension \(n_R \text{.}\)

If this packing were performed at a strategic point in the computation, so that its cost is amortized over many computations, then a benefit might result. These observations are captured in Figure 3.3.3 and translated into an implementation in Figure 3.3.6. Reference implementations of the packing routines can be found in Figure 3.3.4 and Figure 3.3.5. While these implementations can be optimized, the cost of packing lies in the data movement between main memory and faster memory; as a result, optimizing the packing itself has relatively little effect.

How to modify the five loops to incorporate packing is illustrated in Figure 3.3.6. A microkernel that computes with the packed data when \(m_R \times n_R = 4 \times 4 \) is illustrated in Figure 3.3.7.
Homework 3.3.8
Examine the file Gemm_Five_Loops_Pack_4x4Kernel.c. Collect performance data with

make Five_Loops_Pack_4x4Kernel

and view the resulting performance with Live Script Plot_Five_Loops.mlx.
Homework 3.3.9
Copy the file Gemm_Five_Loops_Pack_4x4Kernel.c into file Gemm_Five_Loops_Pack_12x4Kernel.c. Modify that file so that it uses \(m_R \times n_R = 12 \times 4 \text{.}\) Test the result with

make Five_Loops_Pack_12x4Kernel

and view the resulting performance with Live Script Plot_Five_Loops.mlx.