Next: 4 Householder QR Factorization Up: PLAPACK: High Performance through Previous: 2 Cholesky Factorization

Subsections

3 LU Factorization (with pivoting)

Next, we discuss the computation of the LU factorization $\begin{displaymath} P A = L U\end{displaymath}$ where A is an $m \times n$ matrix, L is an $m \times n$ lower trapezoidal matrix with unit diagonal, and U is an $n \times n$ uppertriangular matrix. P is an $m \times m$ permutation matrix which has the property that $P = P_{n-1} \cdots P_1$ where $P_{j}$ is the permutation matrix that swaps the j th row with the pivot row during the j th iteration of the outerloop of the algorithm.

3.1 Basic algorithm

Let us assume that we have already computed permutations $P_1, \ldots , P_j$ such that $\begin{displaymath} P_j \cdots P_1 A = \left( \begin{array} {c \vert\vert c} A_... ... \\ \hline \hline 0 & A_{BR} - L_{BR} U_{TL}\end{array}\right)\end{displaymath}$ where $A_{TL}$ , $L_{TL}$ , and $U_{TL}$ are $j \times j$ ,and the matrix has been overwritten by $\begin{displaymath} \left( \begin{array} {c \vert\vert c} L \backslash U_{TL} & ... ...\hline \hline L_{BL} & A_{BR} - L_{BL} U_{TR}\end{array}\right)\end{displaymath}$

The outer-product based variant, implemented for example in LAPACK, for computing the LU factorization with partial pivoting proceeds as follows:

Partition $A_{BR}$ (which has been overwritten as described above) like $\begin{displaymath} A_{BR} = \left( \begin{array} { c \vert c} a_{B1} & A_{B2} \end{array} \right)\end{displaymath}$ where $a_{B1}$ equals the first column of $A_{BR}$ .
Compute permutation $\hat{P}_{j+1}$ which swaps the first and (j+1) st rows of $A_{BR}$ , where the (j+1) st element in $a_{B1}$ equals the element with maximal absolute value in that vector.
Permute $\left( \begin{array} {c \vert c} A_{BL} & A_{BR} \end{array}\right) \leftarrow... ...{P}_{k+1} \left( \begin{array} {c \vert c} A_{BL} & A_{BR} \end{array}\right)$ and repartition updated $A_{BR}$ like $\begin{displaymath} A_{BR} = \left( \begin{array} { c \vert c} \alpha_{11} & a_{12}^T \\ \hline a_{21} & A_{22} \end{array} \right)\end{displaymath}$
Update $a_{21} \leftarrow l_{21} / \alpha_{11}$ .
Update $A_{22} \leftarrow A_{22} - l_{21} a_{12}^T$ .
Repartition $\begin{displaymath} P_{k+1} P_k \cdots P_1 A = \left( \begin{array} {c \vert c} A_{TL} & A_{TR} \\ \hline A_{BL} & A_{BR} \end{array}\right)\end{displaymath}$ where $A_{TL}$ is now $(j+1) \times (j+1)$ ,and $\begin{displaymath} P_{j+1} = \left( \begin{array} {c \vert c} I_{j} & 0 \\ \hline 0 & \hat{P}_{j+1} \end{array}\right)\end{displaymath}$
Continue with this new partitioning.

It can be easily shown that after the above described steps $\begin{displaymath} P_{j+1} P_j \cdots P_1 A = \left( \begin{array} {c \vert c}... ...} 0 & 0 \\ \hline 0 & A_{BR} - L_{BR} U_{TL}\end{array}\right)\end{displaymath}$ where $A_{TL}$ , $L_{TL}$ , and $U_{TL}$ are now $(j+1) \times (j+1)$ ,and the matrix has been overwritten by $\begin{displaymath} \left( \begin{array} {c \vert c} L \backslash U_{TL} & U_{TR} \\ \hline L_{BL} & A_{BR} - L_{BL} U_{TR}\end{array}\right)\end{displaymath}$ Notice that when j = 0 , the matrix contains the original matrix, while when j = n , it has been overwritten with the L and U factors.

3.2 Blocked algorithm

Given the basic algorithm described above, it is not difficult to derive a blocked algorithm as we show next. The result is given in Fig. 3.

3.3 PLAPACK Implementation

A new parallel algorithm

The implementation of the LU factorization in parallel using PLAPACK closely follows the blocked algorithm discussed in the last section. There are, however, two important differences.

First, pivoting is more complex in parallel. The pivot is determined by a straightforward call to PLA_Iamax(). However, the permutation of rows may require communication. Second, on a parallel machine with cartesian deistribution, the matrix-vector-based factoring of the left panel of columns of the matrix may incur load imbalance (because the panel only exists within a single column of processors.) We circumvent this load imbalance by redistributing the panel as a PLAPACK multivector during the panel factorization, as discussed in the section on Cholesky factorization.

With these points in mind, we present the algorithm for the parallel LU factorization side by side with the PLAPACK code implementation. The code for the column-by-column factorization of the panel as multivector is not presented. It is a straightforward translation of the algorithm presented in Section 3.3.

Figure 3: The blocked LU factorization algorithm in PLAPACK.

$\begin{figure} { \normalsize \begin{tabular} { p{3in} \vert p{3in}} \begin{minip... ...ticolumn{1}{c}{\normalsize (b) PLAPACK implementation}\end{tabular}}\end{figure}$

Next: 4 Householder QR Factorization Up: PLAPACK: High Performance through Previous: 2 Cholesky Factorization

plapack@cs.utexas.edu