PfHP Details

Unit 2.4.3 Details

In this unit, we give details regarding the partial code in Figure 2.4.3 and illustrated in Figure 2.4.4.

To use the intrinsic functions, we start by including the header file immintrin.h.

#include <immintrin.h>

The declaration

__m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );

creates gamma_0123_0 as a variable that references a vector register with four double precision numbers and loads it with the four numbers that are stored starting at address

&gamma( 0,0 )

In other words, it loads that vector register with the original values

\begin{equation*} \left( \begin{array}{c c c c} \gamma_{0,0} \\ \gamma_{1,0} \\ \gamma_{2,0} \\ \gamma_{3,0} \end{array} \right). \end{equation*}

This is repeated for the other three columns of \(C \text{:}\)

__m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
__m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
__m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

The loop in Figure 2.4.2 implements

\begin{equation*} \begin{array}{l} {\bf for~} p = 0, \ldots, k-1 \\ ~~~ \left( \begin{array}{c | c | c | c} \begin{array}{c c c c} \gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} \\ \gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} \\ \gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} \\ \gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} \end{array} \amp \begin{array}{c c c c} \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} \\ \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} \\ \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} \\ \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} \end{array} \amp \begin{array}{c c c c} \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} \\ \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} \\ \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} \\ \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} \end{array} \amp \begin{array}{c c c c} \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\ \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\ \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\ \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3} \end{array} \end{array}\right) \\ {\bf end} \end{array} \end{equation*}

leaving the result in the vector registers. Each iteration starts by declaring vector register variable alpha_0123_p and loading it with the contents of

\begin{equation*} \left( \begin{array}{c c c c} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{array} \right). \end{equation*}

__m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) );

Next, \(\beta_{p,0} \) is loaded into a vector register, broadcasting (duplicating) that value to each entry in that register:

beta_p_j = _mm256_broadcast_sd( &beta( p, 0) );

This variable is declared earlier in the routine as beta_p_j because it is reused for \(j =0,1,\ldots, \text{.}\)

The command

gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

then performs the computation

\begin{equation*} \left( \begin{array}{c c c c} \gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} \\ \gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} \\ \gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} \\ \gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} \end{array} \right) \end{equation*}

illustrated by

in Figure 2.4.3. Notice that we use beta_p_j for \(\beta_{p,0}\) because that same vector register will be used for \(\beta_{p,j} \) with \(j = 0,1,2,3\text{.}\)

We leave it to the reader to add the commands that compute

\begin{equation*} \left( \begin{array}{c c c c} \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} \\ \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} \\ \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} \\ \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} \end{array} \right), \left( \begin{array}{c c c c} \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} \\ \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} \\ \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} \\ \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} \end{array} \right), \mbox{and} \left( \begin{array}{c c c c} \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\ \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\ \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\ \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3} \end{array} \right). \end{equation*}

in Assignments/Week2/C/Gemm_4x4Kernel.c. Upon completion of the loop, the results are stored back into the original arrays with the commands

_mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
_mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
_mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
_mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );

Homework 2.4.3.1.

Complete the code in Assignments/Week2/C/Gemm_4x4Kernel.c and execute it with

make JI_4x4Kernel

View its performance with Assignments/Week2/C/data/Plot_register_blocking.mlx

Solution

Assignments/Week2/Answers/Gemm_4x4Kernel.c

This is the performance on Robert's laptop, with MB=NB=KB=4:

Remark 2.4.5.

The form of parallelism illustrated here is often referred to as Single Instruction, Multiple Data (SIMD) parallelism since the same operation (an FMA) is executed with multiple data (four values of \(C \) and \(A \) and a duplicated value from \(B\)).

We are starting to see some progress towards higher performance