Unit 3.4.6 Prefetching (tricky; seems to confuse the compiler...)
Elements of \(\widetilde A \) in theory remain in the L2 cache. If you implemented the \(6 \times 8 \) kernel, you are using 15 out of 16 vector registers, and the hardware almost surely prefetches the next element of \(\widetilde A \) into the L1 cache while computing with the current one. One can go one step further by using the last unused vector register as well, so that two vector registers are used to load/broadcast two elements of \(\widetilde A \text{.}\) Most modern CPU cores execute "out of order," which means that at the hardware level instructions are rescheduled so that a "stall" (a hiccup in the execution because, for example, data is not available) is covered by other instructions. It is highly likely that this will result in the current computation overlapping with the next element of \(\widetilde A\) being prefetched (loaded/broadcast) into a vector register.
A second reason to prefetch is to overcome the latency of bringing the next micro-tile of \(C \) in from main memory. The Intel instruction set (and vector intrinsic library) includes instructions to prefetch data into an indicated level of cache. This is accomplished by calling the intrinsic
void _mm_prefetch(char const *p, int i)
The argument p gives the address of the byte (and corresponding cache line) to be prefetched.
The value i is the hint that indicates which level of cache to prefetch into:
_MM_HINT_T0: prefetch data into all levels of the cache hierarchy.
_MM_HINT_T1: prefetch data into level 2 cache and higher.
_MM_HINT_T2: prefetch data into level 3 cache and higher, or an implementation-specific choice.
These are only hints, which a compiler can choose to ignore. In theory, they allow the next micro-tile of \(C \) to be moved into a more favorable memory layer while the previous micro-tile is being updated. In theory, the same mechanism can be used to bring the next micro-panel of \(\widetilde B \text{,}\) which itself is meant to reside in the L3 cache, into the L1 cache during computation with the previous micro-panel. In practice, we did not manage to make this work. It may work better to issue the equivalent instructions through inline assembly code.
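To make the intended use concrete, here is a minimal sketch of how the next micro-tile of \(C \) might be requested before updating the current one. It is not the kernel from this course: the names MR, NR, PREFETCH_T0, prefetch_microtile, and microkernel_with_prefetch are hypothetical, the update itself is a plain triple loop rather than a vectorized kernel, column-major storage with leading dimension ldC is assumed, and on non-x86 targets the sketch assumes the GCC/Clang builtin __builtin_prefetch as a stand-in. Whether this yields any speedup has to be measured.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
#else
/* Assumption: GCC/Clang builtin as a fallback on non-x86 targets. */
#define PREFETCH_T0(addr) __builtin_prefetch((addr), 0, 3)
#endif

#define MR 6  /* rows in a micro-tile of C (hypothetical name) */
#define NR 8  /* columns in a micro-tile of C (hypothetical name) */

/* Hint that the MR x NR micro-tile starting at C (column-major,
   leading dimension ldC) will be needed soon. A column of 6 doubles
   is 48 bytes and may straddle two cache lines, so the first and the
   last element of each column are both touched. */
static void prefetch_microtile(const double *C, size_t ldC)
{
    assert(ldC >= MR);
    for (size_t j = 0; j < NR; j++) {
        const double *col = &C[j * ldC];
        PREFETCH_T0(col);
        PREFETCH_T0(col + MR - 1);
    }
}

/* Reference update C := A B + C for one micro-tile.
   A: MR x k micro-panel, packed column by column.
   B: k x NR micro-panel, packed row by row.
   C_next: address of the micro-tile to be updated next, or NULL. */
static void microkernel_with_prefetch(size_t k, const double *A,
                                      const double *B, double *C,
                                      size_t ldC, const double *C_next)
{
    /* Issue the hints first so the memory traffic can overlap
       with the computation that follows. */
    if (C_next != NULL)
        prefetch_microtile(C_next, ldC);

    for (size_t p = 0; p < k; p++)
        for (size_t j = 0; j < NR; j++)
            for (size_t i = 0; i < MR; i++)
                C[j * ldC + i] += A[p * MR + i] * B[p * NR + j];
}
```

A caller looping over micro-tiles would pass the address of the following micro-tile as C_next and NULL for the last one. Because a prefetch is only a hint, the numerical result is the same with or without it.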
For additional information, you may want to consult Intel's Intrinsics Guide.