Subsection 5.4.6 Implementation with the classical BLAS
¶The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations. In this section, we illustrate how the unblocked and blocked Cholesky factorization algorithms can be implemented in terms of the BLAS. The explanation draws from the entry we wrote for the BLAS in the Encyclopedia of Parallel Computing [39].
Subsubsection 5.4.6.1 What are the BLAS?
¶The BLAS interface [25] [15] [14] was proposed to support portable high-performance implementation of applications that are matrix and/or vector computation intensive. The idea is that one casts computation in terms of the BLAS interface, leaving the architecture-specific optimization of that interface to an expert.
Subsubsection 5.4.6.2 Implementation in Fortran
¶We start with a simple implementation in Fortran. A simple algorithm that exposes three loops and the corresponding code in Fortran are given by
\(\begin{array}[t]{l} {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ {\bf for~} i := j+1, \ldots, n-1 \\ ~~~ ~~~ \alpha_{i,j} \becomes \alpha_{i,j} / \alpha_{j,j} \\ ~~~ {\bf end} \\ ~~~ {\bf for~} k := j+1, \ldots, n-1 \\ ~~~ ~~~ {\bf for~} i := k, \ldots, n-1 \\ ~~~ ~~~ ~~~ \alpha_{i,k} \becomes \alpha_{i,k} - \alpha_{i,j} \alpha_{k,j} \\ ~~~ ~~~{\bf endfor}\\ ~~~{\bf endfor}\\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   do i=j+1,n
      A(i,j) = A(i,j) / A(j,j)
   enddo
   do k=j+1,n
      do i=k,n
         A(i,k) = A(i,k) - A(i,j) * A(k,j)
      enddo
   enddo
enddo
Next, we exploit the fact that the BLAS interface supports a number of "vector-vector" operations known as the Level-1 BLAS. Of these, we will use
dscal( n, alpha, x, incx )
which updates the vector \(x \) of size n, stored in memory starting at address x with increment incx between entries, with \(\alpha x \), where \(\alpha \) is stored in alpha, and
daxpy( n, alpha, x, incx, y, incy )
which updates the vector \(y \) of size n, stored in memory starting at address y with increment incy between entries, with \(\alpha x + y \), where \(x \) is stored at address x with increment incx and \(\alpha \) is stored in alpha. With these, the implementation becomes
\(\begin{array}[t]{l} ~~\\ {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ \alpha_{j+1:n-1,j} := \alpha_{j+1:n-1,j} / \alpha_{j,j} \\ ~~~ {\bf for~} k := j+1, \ldots, n-1 \\ ~~~ ~~~ \alpha_{k:n-1,k} := \alpha_{k:n-1,k} - \alpha_{k,j} \alpha_{k:n-1,j} \\ ~~~{\bf endfor}\\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
   do k=j+1,n
      call daxpy( n-k+1, -A(k,j), A(k,j), 1, A(k,k), 1 )
   enddo
enddo
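To make the semantics of these two calls concrete, here are plain reference versions in C (the _ref names are ours; for simplicity these ignore the negative-increment conventions of the real BLAS; an optimized BLAS hides architecture-specific tuning behind this same interface):

```c
/* Reference semantics of dscal: x := alpha * x for a vector of
   size n with stride incx between entries (incx > 0 assumed). */
void dscal_ref(int n, double alpha, double *x, int incx) {
    for (int i = 0; i < n; i++)
        x[i * incx] *= alpha;
}

/* Reference semantics of daxpy: y := alpha * x + y. */
void daxpy_ref(int n, double alpha, const double *x, int incx,
               double *y, int incy) {
    for (int i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

In the Fortran above, dscal scales the part of the column below the diagonal by \(1/\alpha_{j,j}\), and each daxpy subtracts a multiple of a column segment, exactly as these loops do.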
The entire update \(A_{22} := A_{22} - a_{21} a_{21}^T\) can be cast in terms of a matrix-vector operation (level-2 BLAS call) to
dsyr( uplo, n, alpha, x, incx, A, ldA )
which updates the matrix \(A \) of size n by n, stored in memory starting at address A with leading dimension ldA, with \(\alpha x x^T + A \), where \(x \) is stored at address x with increment incx and \(\alpha\) is stored in alpha. Since both \(A\) and \(\alpha x x^T + A \) are symmetric, only the triangular part indicated by uplo is updated. This is captured by the algorithm and implementation below.
\(\begin{array}[t]{l} ~~\\ {\bf for~} j := 0, \ldots, n-1 \\ ~~~ \alpha_{j,j} \becomes \sqrt{ \alpha_{j,j} } \\ ~~~ \alpha_{j+1:n-1,j} := \alpha_{j+1:n-1,j} / \alpha_{j,j} \\ ~~~ \alpha_{j+1:n-1,j+1:n-1} := \\ ~~~ ~~~~~\alpha_{j+1:n-1,j+1:n-1} \\ ~~~~~~~~~~- \mbox{tril}( \alpha_{j+1:n-1,j} \alpha_{j+1:n-1,j}^T ) \\ {\bf endfor}\\ \end{array}\)
do j=1, n
   A(j,j) = sqrt( A(j,j) )
   call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
   call dsyr( "Lower triangular", n-j, -1.0d00, A(j+1,j), 1,
              A(j+1,j+1), ldA )
enddo
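The semantics of dsyr with uplo equal to "Lower triangular" can likewise be sketched in C (the _ref_lower name is ours, for illustration only):

```c
/* Reference semantics of dsyr (lower triangle, column-major):
   A := alpha * x * x^T + A, touching only entries on or below
   the diagonal.  The strictly upper triangle is left unchanged. */
void dsyr_ref_lower(int n, double alpha, const double *x, int incx,
                    double *a, int lda) {
    for (int j = 0; j < n; j++)
        for (int i = j; i < n; i++)              /* i >= j: lower part */
            a[j * lda + i] += alpha * x[i * incx] * x[j * incx];
}
```

With alpha equal to -1, this is exactly the update \(A_{22} := A_{22} - a_{21} a_{21}^T\) performed by the dsyr call in the loop above.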
Finally, a blocked right-looking Cholesky factorization algorithm, which casts most computation in terms of a matrix-matrix multiplication operation referred to as a "symmetric rank-k update" is given in Figure 5.4.6.1. There, we use FLAME notation to present the algorithm. It translates into Fortran code that exploits the BLAS given below.
do j=1, n, nb
   jb = min( nb, n-j+1 )
   call Chol( jb, A( j, j ), LDA )
   call dtrsm( "Right", "Lower triangular", "Transpose", "Nonunit diag",
               n-j-jb+1, jb, 1.0d00, A( j, j ), LDA,
               A( j+jb, j ), LDA )
   call dsyrk( "Lower triangular", "No transpose", n-j-jb+1, jb,
               -1.0d00, A( j+jb, j ), LDA,
               1.0d00, A( j+jb, j+jb ), LDA )
enddo
The call to dtrsm implements \(A_{21} \becomes L_{21} \) where \(L_{21} L_{11}^T = A_{21} \text{.}\)
The call to dsyrk implements \(A_{22} \becomes - A_{21} A_{21}^T + A_{22} \text{,}\) updating only the lower triangular part of the matrix.
The bulk of the computation is now cast in terms of matrix-matrix operations which can achieve high performance.