Unit 3.4.1 Alignment (easy and worthwhile)
The vector intrinsic routines that load and/or broadcast vector registers are faster when the data from which one loads is aligned.
Conveniently, loads of elements of \(A\) and \(B\) are from buffers into which the data was packed. By creating those buffers to be aligned, we can ensure that all these loads are aligned. Intel's intrinsic library has special memory allocation and deallocation routines specifically for this purpose: _mm_malloc and _mm_free. The routine _mm_malloc takes two arguments, the size of the buffer in bytes and the alignment align; align should be chosen to equal the length of a cache line, in bytes: 64.
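As a minimal sketch of how such a buffer might be created, the C fragment below allocates room for a given number of doubles on a 64-byte boundary, checks the alignment, and releases the buffer. The helper names (alloc_packed_buffer, is_cache_line_aligned, free_packed_buffer, demo_aligned_buffer) and the macro CACHE_LINE_BYTES are hypothetical and introduced only for illustration; only _mm_malloc and _mm_free come from the intrinsic library. It assumes an x86 compiler whose immintrin.h also declares these two routines, which is the case for GCC, Clang, and the Intel compilers.

```c
#include <assert.h>
#include <immintrin.h>  /* assumed to declare _mm_malloc and _mm_free (GCC, Clang, Intel) */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64  /* length of a cache line, in bytes */

/* Hypothetical helper: allocate room for n doubles, starting on a
   cache-line boundary. Returns NULL if the allocation fails. */
static double *alloc_packed_buffer(size_t n)
{
    return (double *) _mm_malloc(n * sizeof(double), CACHE_LINE_BYTES);
}

/* Hypothetical helper: nonzero if p starts on a cache-line boundary. */
static int is_cache_line_aligned(const void *p)
{
    return ((uintptr_t) p % CACHE_LINE_BYTES) == 0;
}

/* Memory obtained from _mm_malloc must be released with _mm_free,
   not with free. */
static void free_packed_buffer(double *buf)
{
    _mm_free(buf);
}

/* Hypothetical demo: returns 0 on success, 1 if allocation failed. */
static int demo_aligned_buffer(size_t n)
{
    double *buf = alloc_packed_buffer(n);
    if (buf == NULL)
        return 1;

    assert(is_cache_line_aligned(buf));

    for (size_t i = 0; i < n; i++)  /* stand-in for packing data into the buffer */
        buf[i] = 0.0;

    free_packed_buffer(buf);
    return 0;
}
```

Note that aligning the start of the buffer only guarantees that a later load is aligned if its offset into the buffer is itself a multiple of the vector length in bytes.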