Subsection 12.3.3 Other high-performance dense linear algebra algorithms
Throughout this course, we have pointed to papers that discuss the high-performance implementation of various operations. Here we review some of these.

Subsubsection 12.3.3.1 High-performance QR factorization
We saw in Week 3 that the algorithm of choice for computing the QR factorization is based on Householder transformations. The reason is that this casts the computation in terms of the application of unitary transformations, which do not amplify error. In Subsection 3.4.1, we discussed how the Householder QR factorization algorithm can be cast in terms of matrix-matrix operations. An important point to note is that in order to cast the computation in terms of matrix-matrix multiplication, one has to form the "block Householder transformation" [24].

[24] Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Ortí, Robert van de Geijn, Field G. Van Zee, Accumulating Householder transformations, revisited, ACM Transactions on Mathematical Software, Vol. 32, No. 2, 2006.
[26] Per-Gunnar Martinsson, Gregorio Quintana-Ortí, Nathan Heavner, Robert van de Geijn, Householder QR Factorization With Randomization for Column Pivoting (HQRRP), SIAM Journal on Scientific Computing, Vol. 39, Issue 2, 2017.
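To make the idea of the block Householder transformation concrete, here is a minimal NumPy sketch (the function names house and block_householder are ours) of the "UT transform" discussed in [24]: k Householder transformations are accumulated into the form I - U inv(T) U^T, after which applying them to the rest of the matrix becomes a sequence of matrix-matrix operations.

```python
import numpy as np

def house(x):
    # Householder vector u (normalized so u[0] = 1) and tau = (u^T u)/2,
    # chosen so that (I - u u^T / tau) x is a multiple of e_0.
    normx = np.linalg.norm(x)
    sign = 1.0 if x[0] >= 0 else -1.0
    u = x.copy()
    u[0] += sign * normx
    u /= u[0]
    return u, (u @ u) / 2.0

def block_householder(A):
    # Factor an m x k panel with Householder transformations and accumulate
    # them as H_0 H_1 ... H_{k-1} = I - U inv(T) U^T (the "UT transform"):
    # U is unit lower trapezoidal and
    # T = strictly_upper(U^T U) + diag(tau_0, ..., tau_{k-1}).
    m, k = A.shape
    A = A.copy()
    U = np.zeros((m, k))
    taus = np.zeros(k)
    for j in range(k):
        u, tau = house(A[j:, j].copy())
        U[j:, j] = u
        taus[j] = tau
        # Apply H_j to the remainder of the panel; the payoff comes later,
        # when I - U inv(T) U^T is applied to the rest of the matrix.
        A[j:, j:] -= np.outer(u, (u @ A[j:, j:]) / tau)
    T = np.triu(U.T @ U, 1) + np.diag(taus)
    return U, T
```

Applying the accumulated transformation to a trailing matrix B then costs a triangular solve and two matrix-matrix products: B := B - U (inv(T) (U^T B)).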
Subsubsection 12.3.3.2 Optimizing reduction to condensed form
In Subsection 10.3.1 and Subsection 11.2.3, we discussed algorithms for reducing a matrix to tridiagonal and bidiagonal form, respectively. These are special cases of the reduction of a matrix to condensed form. The algorithms in those sections cast most of the computation in terms of matrix-vector multiplication and rank-1 or rank-2 updates, which are matrix-vector operations that do not attain high performance. In enrichments in those chapters, we point to papers that cast some of the computation in terms of matrix-matrix multiplication. Here, we discuss the basic issues.

When computing the LU, Cholesky, or QR factorization, it is possible to factor a panel of columns before applying the accumulated transformations to the rest of the matrix. It is this that allows the computation to be mostly cast in terms of matrix-matrix operations. When computing the reduction to tridiagonal or bidiagonal form, this is not possible. The reason is that if we have just computed a unitary transformation, this transformation must be applied to the rest of the matrix both from the left and from the right. What this means is that the next transformation to be computed depends on an update that involves the rest of the matrix. This in turn means that inherently a matrix-vector operation (involving the "rest of the matrix") must be performed at every step, at a cost of O(m^2) computation per iteration (if the matrix is m×m). The consequence is that O(m^3) computation is cast in terms of matrix-vector operations, which is of the same order as the total computation. While this is bad news, there is still a way to cast about half the computation in terms of matrix-matrix multiplication for the reduction to tridiagonal form.
Notice that this means the computation is sped up by at most a factor of two: even if the part that is cast in terms of matrix-matrix multiplication takes no time at all relative to the rest of the computation, the time to completion is only cut in half. The reduction to bidiagonal form is trickier yet. It requires the fusing of a matrix-vector multiplication with a rank-1 update in order to reuse data that is already in cache. Details can be found in, for example, [48].

[48] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph Elizondo, Families of Algorithms for Reducing a Matrix to Condensed Form, ACM Transactions on Mathematical Software (TOMS), Vol. 39, No. 1, 2012.
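The dependence described above can be seen in a minimal, unblocked NumPy sketch of the reduction to tridiagonal form (function names ours): the vector y, a matrix-vector product with the entire trailing matrix, must be formed before the next Householder vector can be computed.

```python
import numpy as np

def house(x):
    # Householder vector u (u[0] = 1) and tau = (u^T u)/2 such that
    # (I - u u^T / tau) x is a multiple of e_0.
    normx = np.linalg.norm(x)
    sign = 1.0 if x[0] >= 0 else -1.0
    u = x.copy()
    u[0] += sign * normx
    u /= u[0]
    return u, (u @ u) / 2.0

def tridiagonalize(A):
    # Unblocked reduction of a symmetric matrix A to tridiagonal form.
    # Each iteration must form y = A22 @ u with the trailing matrix A22
    # before the next Householder vector can be computed, so O(m^2) flops
    # per iteration -- O(m^3) in total -- are matrix-vector operations.
    A = A.copy().astype(float)
    m = A.shape[0]
    for j in range(m - 2):
        x = A[j + 1:, j].copy()
        u, tau = house(x)
        # Annihilate entries below the subdiagonal in column j (and,
        # by symmetry, to the right of the superdiagonal in row j).
        A[j + 1:, j] = x - u * ((u @ x) / tau)
        A[j, j + 1:] = A[j + 1:, j]
        # The unavoidable matrix-vector product with the trailing matrix:
        y = (A[j + 1:, j + 1:] @ u) / tau
        w = y - ((u @ y) / (2.0 * tau)) * u
        # Symmetric rank-2 update: A22 := A22 - u w^T - w u^T.
        A[j + 1:, j + 1:] -= np.outer(u, w) + np.outer(w, u)
    return A
```

Since the two-sided update H A22 H is a similarity transformation, the eigenvalues of the tridiagonal result equal those of the original matrix.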
Subsubsection 12.3.3.3 Optimizing the implicitly shifted QR algorithm
Optimizing the QR algorithm for computing the Spectral Decomposition or Singular Value Decomposition gets even trickier. Part of the cost is in the reduction to condensed form, which we already have noted exhibits limited opportunity for casting computation in terms of matrix-matrix multiplication. Once the algorithm proceeds to the implicitly shifted QR algorithm, most of the computation is in the accumulation of the eigenvectors or singular vectors. In other words, it is in the application of the Givens' rotations from the right to the columns of a matrix Q in which the eigenvectors are being computed [47]. Let us look at one such application to two columns, q_i and q_j:

    ( q_i  q_j ) := ( q_i  q_j ) (  γ  −σ  )
                                 (  σ   γ  )

[47] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions on Mathematical Software (TOMS), Vol. 40, No. 3, 2014.
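A minimal NumPy sketch of one such application from the right (function name ours; γ and σ denote the cosine and sine of the rotation angle): each call reads and writes 2m elements of Q while performing only about 6m flops, which is why applying rotations one at a time is memory-bound and why restructuring to apply many rotations together, as in [47], pays off.

```python
import numpy as np

def apply_givens_right(Q, i, j, gamma, sigma):
    # Apply G = [[gamma, -sigma], [sigma, gamma]] from the right to
    # columns i and j of Q, in place: (q_i, q_j) := (q_i, q_j) G.
    qi = Q[:, i].copy()
    qj = Q[:, j]
    Q[:, i] = gamma * qi + sigma * qj
    Q[:, j] = -sigma * qi + gamma * qj
```

Because G is unitary, a sequence of such updates preserves the orthonormality of the columns of Q (up to roundoff).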