Subsection 8.2.5 Preconditioning
For a general (appropriately differentiable) nonlinear function \(f(x) \text{,}\) using the direction of steepest descent as the search direction is often a reasonable choice. For our problem, especially if \(A \) is relatively ill-conditioned, we can do better.
Here is the idea: Let \(A = Q \Sigma Q^T \) be the SVD of the SPD matrix \(A \) (or, equivalently for SPD matrices, its spectral decomposition, which we will discuss in Subsection 9.2.4). Then
\[
f( x ) = \frac{1}{2} x^T A x - x^T b = \frac{1}{2} x^T Q \Sigma Q^T x - x^T b .
\]
Using the change of basis \(y = Q^T x \) and \(\widehat b = Q^T b \text{,}\) this becomes
\[
f( y ) = \frac{1}{2} y^T \Sigma y - y^T \widehat b .
\]
How this relates to the convergence of the Method of Steepest Descent is discussed (informally) in the video. The key insight is that if \(\kappa( A ) = \sigma_0 / \sigma_{n-1} \) (the ratio between the largest and smallest eigenvalues or, equivalently, the ratio between the largest and smallest singular values) is large, then convergence can take many iterations.
What would happen if instead \(\sigma_0 = \cdots = \sigma_{n-1} \text{?}\) Then \(A = Q \Sigma Q^T \) is the SVD/spectral decomposition of \(A \) and \(A = Q ( \sigma_0 I )Q^T = \sigma_0 Q Q^T = \sigma_0 I \text{.}\) If we then perform the Method of Steepest Descent with \(y \) (the transformed vector \(x \)) and \(\widehat b \) (the transformed right-hand side), then
\[
y^{(1)} = y^{(0)} + \frac{ r^{(0)\,T} r^{(0)} }{ r^{(0)\,T} ( \sigma_0 I ) r^{(0)} } r^{(0)}
= y^{(0)} + \frac{1}{\sigma_0} \left( \widehat b - \sigma_0 y^{(0)} \right)
= \frac{1}{\sigma_0} \widehat b ,
\]
which is the solution to \(\sigma_0 I y = \widehat b \text{.}\) Thus, the iteration converges in one step. The point we are trying to (informally) make is that if \(A \) is well-conditioned, then the Method of Steepest Descent converges faster.
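To see this play out numerically, here is a minimal sketch (in Python with NumPy; the matrix size, random seed, and tolerance are arbitrary choices for illustration): steepest descent on an ill-conditioned SPD matrix takes thousands of iterations, while for \(A = \sigma_0 I \) it stops after a single step.
```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, maxit=100000):
    """Method of Steepest Descent for SPD A: search direction p^(k) = r^(k)."""
    x, r, k = x0.copy(), b - A @ x0, 0
    nb = np.linalg.norm(b)
    while np.linalg.norm(r) > tol * nb and k < maxit:
        q = A @ r                      # q^(k) = A p^(k), with p^(k) = r^(k)
        alpha = (r @ r) / (r @ q)      # exact line search along r^(k)
        x += alpha * r
        r -= alpha * q
        k += 1
    return x, k

rng = np.random.default_rng(0)
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal Q
A_ill = Q @ np.diag(np.linspace(1.0, 1e3, n)) @ Q.T  # kappa(A) = 1000
A_one = 3.0 * np.eye(n)                              # sigma_0 = ... = sigma_{n-1} = 3
b = rng.standard_normal(n)

for A in (A_ill, A_one):
    _, k = steepest_descent(A, b, np.zeros(n))
    print(f"kappa(A) = {np.linalg.cond(A):10.1f}   iterations = {k}")
```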
Now, \(A x = b \) is equivalent to \(M^{-1} A x = M^{-1} b \text{.}\) Hence, one can define a new problem with the same solution and, hopefully, a better condition number by letting \(\tilde A = M^{-1} A \) and \(\tilde b = M^{-1} b \text{.}\) A better condition number results if \(M \approx A \text{,}\) since then \(M^{-1} A \approx A^{-1} A = I \text{.}\) A constraint is that \(M \) should be chosen so that solving with it is easy and cheap. The matrix \(M \) is called a preconditioner.
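As a concrete illustration, here is a sketch (our own construction, not from the text) that compares \(\kappa( A ) \) with \(\kappa( M^{-1} A ) \) for the simple choice \(M = \mbox{diag}( A ) \) (the Jacobi preconditioner), for which solving with \(M \) is trivially cheap:
```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# SPD test matrix whose diagonal entries vary over several orders of magnitude.
B = rng.standard_normal((n, n))
A = B @ B.T + np.diag(np.logspace(0, 4, n))

M = np.diag(np.diag(A))    # Jacobi preconditioner: solving M p = r costs O(n)
print("kappa(A)      =", np.linalg.cond(A))
print("kappa(M^-1 A) =", np.linalg.cond(np.linalg.solve(M, A)))
```
For matrices like this one, whose diagonal entries vary widely, even this simple choice can reduce the condition number dramatically; for other matrices the improvement may be modest, and better preconditioners are the subject of much research.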
A problem is that, in our discussion of descent methods, we restrict ourselves to the case where the matrix is SPD. Generally speaking, \(M^{-1} A \) will not be SPD. To fix this, choose \(M \approx A \) to be SPD and let \(M = L_M L_M^T \) be its Cholesky factorization. If \(A = L L^T \) is the Cholesky factorization of \(A \text{,}\) then \(L_M^{-1} A L_M^{-T } \approx L_M^{-1} L L^T L_M^{-T} \approx I.\) With this, we can transform our linear system \(A x = b \) into one that has the same solution:
\[
\underbrace{ L_M^{-1} A L_M^{-T} }_{\tilde A} \, \underbrace{ L_M^T x }_{\tilde x} = \underbrace{ L_M^{-1} b }_{\tilde b} .
\]
We note that \(\tilde A \) is SPD and hence one can apply the Method of Steepest Descent to \(\tilde A \tilde x = \tilde b \text{,}\) where \(\tilde A = L_M^{-1} A L_M^{-T} \text{,}\) \(\tilde x = L_M^T x \text{,}\) and \(\tilde b = L_M^{-1} b \text{.}\) Once the method converges to the solution \(\tilde x \text{,}\) one can transform it back to the solution of the original problem by solving \(L_M^T x = \tilde x \text{.}\) If \(M \) is chosen carefully, \(\kappa( L_M^{-1} A L_M^{-T} ) \) can be greatly improved. The best choice would be \(M = A \text{,}\) of course, but that is not realistic. The point is that, since in our case \(A \) is SPD, ideally the preconditioner should be SPD as well.
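A sketch that checks these claims numerically (again with the stand-in choice \(M = \mbox{diag}( A ) \text{;}\) any SPD \(M \approx A \) would do):
```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
B = rng.standard_normal((n, n))
A = B @ B.T + np.diag(np.logspace(0, 4, n))   # SPD test matrix
b = rng.standard_normal(n)

M = np.diag(np.diag(A))                       # SPD preconditioner M ~ A
L_M = np.linalg.cholesky(M)                   # M = L_M L_M^T

# Transformed problem: A_tilde x_tilde = b_tilde.
A_tilde = np.linalg.solve(L_M, np.linalg.solve(L_M, A.T).T)  # L_M^{-1} A L_M^{-T}
b_tilde = np.linalg.solve(L_M, b)

print("A_tilde is symmetric:", np.allclose(A_tilde, A_tilde.T))
print("kappa(A)       =", np.linalg.cond(A))
print("kappa(A_tilde) =", np.linalg.cond(A_tilde))

# Solve the transformed system, then recover x by solving L_M^T x = x_tilde.
x_tilde = np.linalg.solve(A_tilde, b_tilde)
x = np.linalg.solve(L_M.T, x_tilde)
print("solves A x = b:", np.allclose(A @ x, b))
```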
Some careful rearrangement takes the method of steepest descent on the transformed problem to the much simpler preconditioned algorithm on the right in Figure 8.2.5.1.
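Figure 8.2.5.1 itself is not reproduced here, but the preconditioned algorithm on its right can be sketched in Python from the relations that appear in the homework solution below (solve \(M p^{(k)} = r^{(k)} \) and \(\alpha_k = p^{(k)\,T} r^{(k)} / ( p^{(k)\,T} A p^{(k)} ) \)); the function name and stopping criterion are our own choices:
```python
import numpy as np

def preconditioned_steepest_descent(A, b, M, x0, tol=1e-8, maxit=100000):
    """Preconditioned steepest descent for SPD A with SPD preconditioner M.

    For clarity this solves M p = r anew each iteration; in practice one
    factors M once (e.g., M = L_M L_M^T) and reuses the factorization.
    """
    x, r, k = x0.copy(), b - A @ x0, 0
    nb = np.linalg.norm(b)
    while np.linalg.norm(r) > tol * nb and k < maxit:
        p = np.linalg.solve(M, r)     # solve M p^(k) = r^(k)
        q = A @ p                     # q^(k) = A p^(k)
        alpha = (p @ r) / (p @ q)     # alpha_k = p^T r / (p^T A p)
        x += alpha * p                # x^(k+1) = x^(k) + alpha_k p^(k)
        r -= alpha * q                # r^(k+1) = r^(k) - alpha_k q^(k)
        k += 1
    return x, k
```
With \(M = I \) this reduces to the plain Method of Steepest Descent, and with \(M = A \) it converges in one step, consistent with the discussion above.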
Homework 8.2.5.1.
Show that the algorithm in Figure 8.2.5.1 (Middle) computes the same values for \(x^{(k)} \) as does the algorithm to its right.
You will want to do a proof by induction. To start, conjecture a relationship between \(\tilde r^{(k)} \) and \(r^{(k)} \text{,}\) and then prove that that relationship and the relationship \(\tilde x^{(k)} = L^{T} x^{(k)} \) hold for all \(k \text{,}\) where \(r^{(k)} \) and \(x^{(k)} \) are as computed by the algorithm on the right.
Notice that \(\tilde A = L^{-1} A L^{-T} \) implies that \(\tilde A L^T = L^{-1} A \text{.}\) We will show that for all \(k \geq 0 \)
\(\tilde x^{(k)} = L^{T} x^{(k)} \text{,}\)
\(\tilde r^{(k)} = L^{-1} r^{(k)} \text{,}\)
\(\tilde p^{(k)} = L^{T} p^{(k)} \text{,}\) and
\(\tilde \alpha_k = \alpha_k \)
via a proof by induction.
Base case: \(k = 0 \text{.}\)
\(\tilde x^{(0)} \) is initialized as \(\tilde x^{(0)} := L^T x^{(0)} \text{.}\)
\(\begin{array}[t]{l} \tilde r^{(0)} \\ ~~~=~~~~ \lt \mbox{ algorithm on left } \gt \\ \tilde b - \tilde A \tilde x^{(0)} \\ ~~~ = ~~~~ \lt \mbox{ initialization of } \tilde b \mbox{ and } \tilde x^{(0)} \gt \\ L^{-1} b - \tilde A L^T x^{(0)} \\ ~~~ = ~~~~ \lt \mbox{ initialization of } \tilde A \gt \\ L^{-1} b - L^{-1} A x^{(0)} \\ ~~~ = ~~~~ \lt \mbox{ factor out and initialization of } r^{(0)} \gt \\ L^{-1} r^{(0)} \end{array}\)
\(\begin{array}[t]{l} \tilde p^{(0)} \\ ~~~=~~~~ \lt \mbox{ initialization in algorithm} \gt \\ \tilde r^{(0)} \\ ~~~ = ~~~~ \lt \tilde r^{(0)} = L^{-1} r^{(0)} \gt \\ L^{-1} r^{(0)} \\ ~~~ = ~~~~ \lt \mbox{ from right algorithm: } r^{(0)} = M p^{(0)} \mbox{ and } M = L L^T \gt \\ L^{-1} L L^T p^{(0)} \\ ~~~ = ~~~~ \lt L^{-1} L = I \gt \\ L^{T} p^{(0)}. \end{array}\)
\(\begin{array}[t]{l} \tilde \alpha_0 \\ ~~~=~~~~ \lt \mbox{ middle algorithm } \gt \\ \frac{ \tilde p^{(0)\,T} \tilde r^{(0)} } { \tilde p^{(0)\,T} \tilde A \tilde p^{(0)} } \\ ~~~ = ~~~~ \lt \tilde p^{(0)} = L^T p^{(0)} \mbox{ etc.} \gt \\ \frac{ ( L^T p^{(0)})^T L^{-1} r^{(0)} } { (L^T p^{(0)})^T L^{-1} A L^{-T} L^T p^{(0)} } \\ ~~~ = ~~~~ \lt \mbox{ transpose and cancel } \gt \\ \frac{ p^{(0)\,T} r^{(0)} } { p^{(0)\,T} A p^{(0)} } \\ ~~~ = ~~~~ \lt \mbox{ right algorithm } \gt \\ \alpha_0. \end{array}\)
Inductive Step: Assume that \(\tilde x^{(k)} = L^{T} x^{(k)} \text{,}\) \(\tilde r^{(k)} = L^{-1} r^{(k)} \text{,}\) \(\tilde p^{(k)} = L^T p^{(k)} \text{,}\) and \(\tilde \alpha_k = \alpha_k \text{.}\) Show that \(\tilde x^{(k+1)} = L^{T} x^{(k+1)} \text{,}\) \(\tilde r^{(k+1)} = L^{-1} r^{(k+1)} \text{,}\) \(\tilde p^{(k+1)} = L^T p^{(k+1)} \text{,}\) and \(\tilde \alpha_{k+1} = \alpha_{k+1} \text{.}\)
\(\begin{array}[t]{l} \tilde x^{(k+1)} \\ ~~~ = ~~~~ \lt \mbox{ middle algorithm } \gt \\ \tilde x^{(k)} + \tilde \alpha_k \tilde p^{(k)} \\ ~~~ = ~~~~ \lt \mbox{ I.H. } \gt \\ L^T x^{(k)} + \alpha_k L^T p^{(k)} \\ ~~~ = ~~~~ \lt \mbox{ factor out; right algorithm } \gt \\ L^T x^{(k+1)} \end{array}\)
\(\begin{array}[t]{l} \tilde r^{(k+1)} \\ ~~~=~~~~ \lt \mbox{ middle algorithm } \gt \\ \tilde r^{(k)} - \tilde \alpha_k \tilde A \tilde p^{(k)} \\ ~~~ = ~~~~ \lt \mbox{ I.H. } \gt \\ L^{-1} r^{(k)} - \alpha_k L^{-1} A L^{-T} L^T p^{(k)} \\ ~~~ = ~~~~ \lt L^{-T} L^T = I; \mbox{ factor out; right algorithm } \gt \\ L^{-1} r^{(k+1)} \end{array}\)
\(\begin{array}[t]{l} \tilde p^{(k+1)} \\ ~~~=~~~~ \lt \mbox{ middle algorithm} \gt \\ \tilde r^{(k+1)} \\ ~~~ = ~~~~ \lt \tilde r^{(k+1)} = L^{-1} r^{(k+1)} \gt \\ L^{-1} r^{(k+1)} \\ ~~~ = ~~~~ \lt \mbox{ from right algorithm: } r^{(k+1)} = M p^{(k+1)} \mbox{ and } M = L L^T \gt \\ L^{-1} L L^T p^{(k+1)} \\ ~~~ = ~~~~ \lt L^{-1} L = I \gt \\ L^{T} p^{(k+1)}. \end{array}\)
\(\begin{array}[t]{l} \tilde \alpha_{k+1} \\ ~~~=~~~~ \lt \mbox{ middle algorithm } \gt \\ \frac{ \tilde p^{(k+1)\,T} \tilde r^{(k+1)} } { \tilde p^{(k+1)\,T} \tilde A \tilde p^{(k+1)} } \\ ~~~ = ~~~~ \lt \tilde p^{(k+1)} = L^T p^{(k+1)} \mbox{ etc.} \gt \\ \frac{ ( L^T p^{(k+1)})^T L^{-1} r^{(k+1)} } { (L^T p^{(k+1)})^T L^{-1} A L^{-T} L^T p^{(k+1)} } \\ ~~~ = ~~~~ \lt \mbox{ transpose and cancel } \gt \\ \frac{ p^{(k+1)\,T} r^{(k+1)} } { p^{(k+1)\,T} A p^{(k+1)} } \\ ~~~ = ~~~~ \lt \mbox{ right algorithm } \gt \\ \alpha_{k+1}. \end{array}\)
By the Principle of Mathematical Induction the result holds.
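These relationships are also easy to check numerically. The following sketch (our own construction, with \(M = \mbox{diag}( A ) \) as a stand-in SPD preconditioner) runs the two algorithms side by side and verifies that \(\tilde \alpha_k = \alpha_k \text{,}\) \(\tilde x^{(k)} = L^T x^{(k)} \text{,}\) and \(\tilde r^{(k)} = L^{-1} r^{(k)} \) at every iteration:
```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)            # SPD test matrix
b = rng.standard_normal(n)
M = np.diag(np.diag(A))                # SPD preconditioner
L = np.linalg.cholesky(M)              # M = L L^T

# Middle algorithm: steepest descent on A_tilde x_tilde = b_tilde.
A_t = np.linalg.solve(L, np.linalg.solve(L, A.T).T)   # L^{-1} A L^{-T}
x_t, r_t = np.zeros(n), np.linalg.solve(L, b)         # x_t^(0) = 0, r_t^(0) = b_tilde

# Right algorithm: preconditioned steepest descent on A x = b.
x, r = np.zeros(n), b.copy()

for k in range(10):
    # Middle algorithm step (all quantities carry tildes).
    p_t = r_t
    alpha_t = (p_t @ r_t) / (p_t @ (A_t @ p_t))
    x_t = x_t + alpha_t * p_t
    r_t = r_t - alpha_t * (A_t @ p_t)
    # Right algorithm step.
    p = np.linalg.solve(M, r)
    alpha = (p @ r) / (p @ (A @ p))
    x = x + alpha * p
    r = r - alpha * (A @ p)
    # The relationships established in the induction proof.
    assert np.isclose(alpha_t, alpha)
    assert np.allclose(x_t, L.T @ x)
    assert np.allclose(r_t, np.linalg.solve(L, r))
```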
Let's start with the algorithm in the middle, which applies the Method of Steepest Descent to \(\tilde A \tilde x = \tilde b \text{:}\)
\[
\begin{array}{l}
\tilde r^{(0)} := \tilde b - \tilde A \tilde x^{(0)} \\
\mbox{for } k = 0, 1, \ldots \\
~~~ \tilde p^{(k)} := \tilde r^{(k)} \\
~~~ \tilde q^{(k)} := \tilde A \tilde p^{(k)} \\
~~~ \tilde \alpha_k := \frac{ \tilde p^{(k)\,T} \tilde r^{(k)} }{ \tilde p^{(k)\,T} \tilde q^{(k)} } \\
~~~ \tilde x^{(k+1)} := \tilde x^{(k)} + \tilde \alpha_k \tilde p^{(k)} \\
~~~ \tilde r^{(k+1)} := \tilde r^{(k)} - \tilde \alpha_k \tilde q^{(k)} \\
\mbox{endfor}
\end{array}
\]
We now notice that \(\tilde A = L^{-1} A L^{-T} \) and we can substitute this into the algorithm. Next, we notice that \(x^{(k+1)} = L^{-T} \tilde x^{(k+1)} \) or, equivalently, \(\tilde x^{(k+1)} = L^{T} x^{(k+1)} \text{,}\) and substitute that as well. Now, we exploit that \(\tilde b = L^{-1} b \) and that \(\tilde r^{(k)} \) equals the residual \(\tilde b - \tilde A \tilde x^{(k)} = L^{-1} b - L^{-1} A L^{-T} L^T x^{(k)} = L^{-1}( b - A x^{(k)}) = L^{-1} r^{(k)} \text{,}\) and substitute these insights in as well. Next, we choose \(\tilde p^{(k)} = L^{T} p^{(k)} \) so that \(A L^{-T} \tilde p^{(k)} \) becomes \(A p^{(k)} \text{.}\) Finally, if we choose \(L \tilde q^{(k)} = q^{(k)} \) and \(\tilde \alpha_k = \alpha_k \text{,}\) we end up with the preconditioned algorithm on the right in Figure 8.2.5.1.
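Written out (a sketch assembled from the relations above; the authoritative formulation is the one in Figure 8.2.5.1), that final algorithm is
\[
\begin{array}{l}
r^{(0)} := b - A x^{(0)} \\
\mbox{for } k = 0, 1, \ldots \\
~~~ \mbox{solve } M p^{(k)} = r^{(k)} \\
~~~ q^{(k)} := A p^{(k)} \\
~~~ \alpha_k := \frac{ p^{(k)\,T} r^{(k)} }{ p^{(k)\,T} q^{(k)} } \\
~~~ x^{(k+1)} := x^{(k)} + \alpha_k p^{(k)} \\
~~~ r^{(k+1)} := r^{(k)} - \alpha_k q^{(k)} \\
\mbox{endfor}
\end{array}
\]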