PfHP Implementation: packing block \(A_{i,p} \)


Unit 3.3.4 Implementation: packing block \(A_{i,p} \)

We next discuss the packing of the block \(A_{i,p} \) into \(\widetilde A_{i,p} \text{.}\)

We break the implementation, in Assignments/Week3/C/PackA.c, down into two routines. The first loops over all the rows that need to be packed, MR rows (one micro-panel) at a time, as illustrated in Figure 3.3.5.

void PackBlockA_MCxKC( int m, int k, double *A, int ldA, double *Atilde )
/* Pack an m x k block of A into an MC x KC buffer.  MC is assumed to
   be a multiple of MR.  The block is packed into Atilde a micro-panel
   at a time.  If necessary, the last micro-panel is padded with rows
   of zeroes. */
{
  for ( int i=0; i<m; i+= MR ){
    int ib = min( MR, m-i );   /* number of rows in this micro-panel */

    PackMicroPanelA_MRxKC( ib, k, &alpha( i, 0 ), ldA, Atilde );
    Atilde += ib * k;          /* advance past the micro-panel just packed */
  }
}

Figure 3.3.5. A reference implementation for packing \(A_{i,p} \text{.}\)

That routine then calls a routine that packs a single micro-panel, given in Figure 3.3.6.

void PackMicroPanelA_MRxKC( int m, int k, double *A, int ldA, double *Atilde ) 
/* Pack a micro-panel of A into buffer pointed to by Atilde. 
   This is an unoptimized implementation for general MR and KC. */
{
  /* March through A in column-major order, packing into Atilde as we go. */

  if ( m == MR ) {
    /* Full row size micro-panel.*/
    for ( int p=0; p<k; p++ ) 
      for ( int i=0; i<MR; i++ ) 
        *Atilde++ = alpha( i, p );
  }
  else {
    /* Not a full row size micro-panel.  We pad with zeroes.  To be added. */
  }
}

Figure 3.3.6. A reference implementation for packing a micro-panel of \(A_{i,p} \text{.}\)
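
To make the resulting layout concrete, here is a minimal sketch that gathers the two routines above into one compilable unit. The value MR = 2, the definitions of the alpha and min macros, and the helper packed_offset are assumptions made for illustration only; the course defines MR and the macros elsewhere, and packed_offset is a hypothetical function that is valid only when m is a multiple of MR.

```c
#include <assert.h>
#include <stddef.h>

#define MR 2                                   /* assumed value, for illustration */
#define alpha(i, j) A[(j) * ldA + (i)]         /* assumed column-major indexing   */
#define min(x, y) ((x) < (y) ? (x) : (y))      /* assumed helper macro            */

/* Pack one micro-panel column by column, MR entries per column.
   As in Figure 3.3.6, a partial micro-panel (m < MR) is not handled here. */
void PackMicroPanelA_MRxKC(int m, int k, const double *A, int ldA, double *Atilde)
{
  if (m == MR) {
    for (int p = 0; p < k; p++)
      for (int i = 0; i < MR; i++)
        *Atilde++ = alpha(i, p);
  }
}

/* Pack an m x k block, one micro-panel of MR rows at a time. */
void PackBlockA_MCxKC(int m, int k, const double *A, int ldA, double *Atilde)
{
  for (int i = 0; i < m; i += MR) {
    int ib = min(MR, m - i);
    PackMicroPanelA_MRxKC(ib, k, &alpha(i, 0), ldA, Atilde);
    Atilde += ib * k;
  }
}

/* Hypothetical helper: offset in Atilde at which alpha(i,p) lands,
   assuming m is a multiple of MR so that every micro-panel is full. */
size_t packed_offset(int i, int p, int k)
{
  assert(i >= 0 && p >= 0 && p < k);
  return (size_t)(i / MR) * MR * k   /* skip the preceding micro-panels   */
       + (size_t)p * MR              /* skip the preceding columns        */
       + (size_t)(i % MR);           /* row within the current micro-panel */
}
```

For a 4 x 3 block with MR = 2, the buffer then holds alpha(0,0), alpha(1,0), alpha(0,1), alpha(1,1), alpha(0,2), alpha(1,2), followed by the same pattern for rows 2 and 3. Each micro-panel is contiguous in memory, and within it each column of MR entries is contiguous, which is the order in which the micro-kernel reads them.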

Remark 3.3.7.

Again, these routines only work when the sizes are "nice": here, when the number of rows m of the block is a multiple of MR, so that every micro-panel is full. We leave it as a challenge to generalize all implementations so that matrix-matrix multiplication with arbitrary problem sizes works. To manage the complexity of this, we recommend "padding" the matrices with zeroes as they are being packed. This then keeps the micro-kernel simple.
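
One possible way to approach the padding suggested in the remark is sketched below. This is not the course's solution: the function name PackMicroPanelA_MRxKC_padded, the value MR = 4, and the alpha macro definition are assumptions for illustration. The idea is to copy the m available rows of each column and then fill the remaining MR - m entries of that column with zeroes.

```c
#include <assert.h>

#define MR 4                                   /* assumed value, for illustration */
#define alpha(i, j) A[(j) * ldA + (i)]         /* assumed column-major indexing   */

/* Hypothetical variant: pack an m x k micro-panel with 0 < m <= MR.
   Every packed column occupies MR entries; rows m..MR-1 are set to zero. */
void PackMicroPanelA_MRxKC_padded(int m, int k, const double *A, int ldA,
                                  double *Atilde)
{
  assert(0 < m && m <= MR);
  for (int p = 0; p < k; p++) {
    for (int i = 0; i < m; i++)      /* copy the rows that exist */
      *Atilde++ = alpha(i, p);
    for (int i = m; i < MR; i++)     /* pad the rest with zeroes */
      *Atilde++ = 0.0;
  }
}
```

With this variant a padded micro-panel occupies MR * k entries of the buffer rather than m * k, so the buffer must have room for the padded rows. Advancing Atilde by ib * k in the calling loop remains harmless only because the partial micro-panel is the last one packed.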