Subsection 5.4.6 Implementation with the classical BLAS
¶The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations. In this section, we illustrate how the unblocked and blocked Cholesky factorization algorithms can be implemented in terms of the BLAS. The explanation draws from the entry we wrote for the BLAS in the Encyclopedia of Parallel Computing [39].
Subsubsection 5.4.6.1 What are the BLAS?
¶The BLAS interface [25] [15] [14] was proposed to support portable high-performance implementation of applications that are matrix and/or vector computation intensive. The idea is that one casts computation in terms of the BLAS interface, leaving the architecture-specific optimization of that interface to an expert.
Subsubsection 5.4.6.2 Implementation in Fortran
¶We start with a simple implementation in Fortran. A simple algorithm that exposes three loops and the corresponding code in Fortran are given by
\(\begin{array}[t]{l} {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ {\bf for~} i := j+1, \ldots, n-1 \\ ~~~ ~~~ \alpha_{i,j} \becomes \alpha_{i,j} / \alpha_{j,j} \\ ~~~ {\bf end} \\ ~~~ {\bf for~} k := j+1, \ldots, n-1 \\ ~~~ ~~~ {\bf for~} i := k, \ldots, n-1 \\ ~~~ ~~~ ~~~ \alpha_{i,k} \becomes \alpha_{i,k} - \alpha_{i,j} \alpha_{k,j} \\ ~~~ ~~~{\bf endfor}\\ ~~~{\bf endfor}\\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   do i=j+1,n
      A(i,j) = A(i,j) / A(j,j)
   enddo
   do k=j+1,n
      do i=k,n
         A(i,k) = A(i,k) - A(i,j) * A(k,j)
      enddo
   enddo
enddo
Next, we exploit the fact that the BLAS interface supports a number of "vector-vector" operations known as the Level-1 BLAS. Of these, we will use
dscal( n, alpha, x, incx )
which updates the vector \(x \) of size n, stored in memory starting at address x with increment incx between entries, with \(\alpha x \), where \(\alpha \) is stored in alpha, and
daxpy( n, alpha, x, incx, y, incy )
which updates the vector \(y \) of size n, stored in memory starting at address y with increment incy between entries, with \(\alpha x + y \), where \(x \) is stored at address x with increment incx and \(\alpha \) is stored in alpha. With these, the implementation becomes
\(\begin{array}[t]{l} ~~\\ {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ \alpha_{j+1:n-1,j} := \alpha_{j+1:n-1,j} / \alpha_{j,j} \\ ~~~ {\bf for~} k := j+1, \ldots, n-1 \\ ~~~ ~~~ \alpha_{k:n-1,k} := \alpha_{k:n-1,k} - \alpha_{k,j} \alpha_{k:n-1,j} \\ ~~~{\bf endfor}\\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
   do k=j+1,n
      call daxpy( n-k+1, -A(k,j), A(k,j), 1, A(k,k), 1 )
   enddo
enddo
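To make the semantics of these two calls concrete, here are plain reference versions in C (the _ref names are ours; for simplicity these ignore the negative-increment conventions of the real BLAS; an optimized BLAS hides architecture-specific tuning behind this same interface):

```c
/* Reference semantics of dscal: x := alpha * x for a vector of
   size n with stride incx between entries (incx > 0 assumed). */
void dscal_ref(int n, double alpha, double *x, int incx) {
    for (int i = 0; i < n; i++)
        x[i * incx] *= alpha;
}

/* Reference semantics of daxpy: y := alpha * x + y. */
void daxpy_ref(int n, double alpha, const double *x, int incx,
               double *y, int incy) {
    for (int i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

In the Fortran above, dscal scales the part of the column below the diagonal by \(1/\alpha_{j,j}\), and each daxpy subtracts a multiple of a column segment, exactly as these loops do.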
The entire update \(A_{22} := A_{22} - a_{21} a_{21}^T\) can be cast in terms of a matrix-vector operation (level-2 BLAS call) to
dsyr( uplo, n, alpha, x, incx, A, ldA )
which updates the matrix \(A \) of size n by n, stored in memory starting at address A with leading dimension ldA, with \(\alpha x x^T + A \), where \(x \) is stored at address x with increment incx and \(\alpha\) is stored in alpha. Since both \(A\) and \(\alpha x x^T + A \) are symmetric, only the triangular part indicated by uplo is updated. This is captured by the algorithm and implementation below.
\(\begin{array}[t]{l} ~~\\ {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ \alpha_{j+1:n-1,j} := \alpha_{j+1:n-1,j} / \alpha_{j,j} \\ ~~~ \alpha_{j+1:n-1,j+1:n-1} := \\ ~~~ ~~~~~\alpha_{j+1:n-1,j+1:n-1} \\ ~~~~~~~~~~- \mbox{tril}( \alpha_{j+1:n-1,j} \alpha_{j+1:n-1,j}^T ) \\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
   call dsyr( "Lower triangular", n-j, -1.0d00, A(j+1,j), 1,
              A(j+1,j+1), ldA )
enddo
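The semantics of dsyr with uplo equal to "Lower triangular" can likewise be sketched in C (the _ref_lower name is ours, for illustration only):

```c
/* Reference semantics of dsyr (lower triangle, column-major):
   A := alpha * x * x^T + A, touching only entries on or below
   the diagonal.  The strictly upper triangle is left unchanged. */
void dsyr_ref_lower(int n, double alpha, const double *x, int incx,
                    double *a, int lda) {
    for (int j = 0; j < n; j++)
        for (int i = j; i < n; i++)              /* i >= j: lower part */
            a[j * lda + i] += alpha * x[i * incx] * x[j * incx];
}
```

With alpha equal to -1, this is exactly the update \(A_{22} := A_{22} - a_{21} a_{21}^T\) performed by the dsyr call in the loop above.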
Finally, a blocked right-looking Cholesky factorization algorithm, which casts most computation in terms of a matrix-matrix multiplication operation referred to as a "symmetric rank-k update" is given in Figure 5.4.6.1. There, we use FLAME notation to present the algorithm. It translates into Fortran code that exploits the BLAS given below.
do j=1, n, nb
   jb = min( nb, n-j+1 )
   call Chol( jb, A( j, j ), LDA )
   call dtrsm( "Right", "Lower triangular", "Transpose", "Nonunit diag",
               n-j-jb+1, jb, 1.0d00, A( j, j ), LDA,
               A( j+jb, j ), LDA )
   call dsyrk( "Lower triangular", "No transpose", n-j-jb+1, jb,
               -1.0d00, A( j+jb, j ), LDA,
               1.0d00, A( j+jb, j+jb ), LDA )
enddo
The call to dtrsm implements \(A_{21} \becomes L_{21} \) where \(L_{21} L_{11}^T = A_{21} \text{.}\)
The call to dsyrk implements \(A_{22} \becomes - A_{21} A_{21}^T + A_{22} \text{,}\) updating only the lower triangular part of the matrix.
The bulk of the computation is now cast in terms of matrix-matrix operations which can achieve high performance.