Unit 1.4.1 Rank-1 update (rank-1)
An operation that will become very important in future discussions of how to optimize matrix-matrix multiplication is the rank-1 update:
\begin{equation*}
A := x y^T + A .
\end{equation*}
Homework 1.4.1.1.
Fill in the blanks:
More generally,
\begin{equation*}
A := x y^T + A
\end{equation*}
is computed as
\begin{equation*}
\begin{array}{l}
\left(\begin{array}{c | c | c | c}
\alpha_{0,0} \amp \alpha_{0,1} \amp \cdots \amp \alpha_{0,n-1} \\ \hline
\alpha_{1,0} \amp \alpha_{1,1} \amp \cdots \amp \alpha_{1,n-1} \\ \hline
\vdots \amp \vdots \amp \amp \vdots \\ \hline
\alpha_{m-1,0} \amp \alpha_{m-1,1} \amp \cdots \amp \alpha_{m-1,n-1}
\end{array}
\right):= \\
\left( \begin{array}{c}
\chi_0 \\ \hline
\chi_1 \\ \hline
\vdots \\ \hline
\chi_{m-1}
\end{array}
\right)
\left( \begin{array}{c | c | c | c}
\psi_0 \amp \psi_1 \amp \cdots \amp \psi_{n-1}
\end{array} \right)
+
\left(\begin{array}{c | c | c | c}
\alpha_{0,0} \amp \alpha_{0,1} \amp \cdots \amp \alpha_{0,n-1} \\ \hline
\alpha_{1,0} \amp \alpha_{1,1} \amp \cdots \amp \alpha_{1,n-1} \\ \hline
\vdots \amp \vdots \amp \amp \vdots \\ \hline
\alpha_{m-1,0} \amp \alpha_{m-1,1} \amp \cdots \amp \alpha_{m-1,n-1}
\end{array}
\right)
\\
=
\left(\begin{array}{c | c | c | c}
\chi_0 \psi_0 + \alpha_{0,0} \amp \chi_0 \psi_1 +
\alpha_{0,1} \amp \cdots \amp \chi_0 \psi_{n-1} + \alpha_{0,n-1} \\ \hline
\chi_1 \psi_0 + \alpha_{1,0} \amp \chi_1 \psi_1 +
\alpha_{1,1} \amp \cdots \amp \chi_1 \psi_{n-1} + \alpha_{1,n-1} \\ \hline
\vdots \amp \vdots \amp \amp \vdots \\ \hline
\chi_{m-1} \psi_0 + \alpha_{m-1,0} \amp
\chi_{m-1} \psi_1 + \alpha_{m-1,1} \amp \cdots \amp \chi_{m-1} \psi_{n-1} + \alpha_{m-1,n-1}
\end{array}
\right),
\end{array}
\end{equation*}
so that each entry \(\alpha_{i,j} \) is updated by
\begin{equation*}
\alpha_{i,j} := \chi_i \psi_j + \alpha_{i,j}.
\end{equation*}
If we now focus on the columns in this last matrix, we find that
\begin{equation*}
\begin{array}{l}
\left(\begin{array}{c | c | c | c}
\chi_0 \psi_0 + \alpha_{0,0} \amp \chi_0 \psi_1 +
\alpha_{0,1} \amp \cdots \amp \chi_0 \psi_{n-1} + \alpha_{0,n-1} \\ \hline
\chi_1 \psi_0 + \alpha_{1,0} \amp \chi_1 \psi_1 +
\alpha_{1,1} \amp \cdots \amp \chi_1 \psi_{n-1} + \alpha_{1,n-1} \\ \hline
\vdots \amp \vdots \amp \amp \vdots \\ \hline
\chi_{m-1} \psi_0 + \alpha_{m-1,0} \amp
\chi_{m-1} \psi_1 + \alpha_{m-1,1} \amp \cdots \amp \chi_{m-1} \psi_{n-1} + \alpha_{m-1,n-1}
\end{array}
\right) \\
=
\left(\begin{array}{c | c | c | c}
\left(\begin{array}{c}
\chi_0 \\
\chi_1 \\
\vdots \\
\chi_{m-1}
\end{array}\right)
\psi_0 +
\left(\begin{array}{c}
\alpha_{0,0} \\
\alpha_{1,0} \\
\vdots \\
\alpha_{m-1,0}
\end{array}\right)
\amp
\cdots
\amp
\left(\begin{array}{c}
\chi_0 \\
\chi_1 \\
\vdots \\
\chi_{m-1}
\end{array}\right)
\psi_{n-1} +
\left(\begin{array}{c}
\alpha_{0,n-1} \\
\alpha_{1,n-1} \\
\vdots \\
\alpha_{m-1,n-1}
\end{array}\right)
\end{array}
\right) \\
=
\left(\begin{array}{c | c | c | c}
\psi_0
\left(\begin{array}{c}
\chi_0 \\
\chi_1 \\
\vdots \\
\chi_{m-1}
\end{array}\right)
+
\left(\begin{array}{c}
\alpha_{0,0} \\
\alpha_{1,0} \\
\vdots \\
\alpha_{m-1,0}
\end{array}\right)
\amp
\cdots
\amp
\psi_{n-1}
\left(\begin{array}{c}
\chi_0 \\
\chi_1 \\
\vdots \\
\chi_{m-1}
\end{array}\right)
+
\left(\begin{array}{c}
\alpha_{0,n-1} \\
\alpha_{1,n-1} \\
\vdots \\
\alpha_{m-1,n-1}
\end{array}\right)
\end{array}
\right)
.
\end{array}
\end{equation*}
What this illustrates is that we could have partitioned \(A\) by columns and \(y \) by elements to find that
\begin{equation*}
\begin{array}{l}
\left(\begin{array}{c | c | c | c}
a_0 \amp a_1 \amp \cdots \amp a_{n-1}
\end{array}
\right) \\
~~~:=
x
\left( \begin{array}{c | c | c | c}
\psi_0 \amp \psi_1 \amp \cdots \amp \psi_{n-1}
\end{array} \right)
+
\left(\begin{array}{c | c | c | c}
a_0 \amp a_1 \amp \cdots \amp a_{n-1}
\end{array}
\right)
\\
~~~=
\left(\begin{array}{c | c | c | c}
x \psi_0 + a_0 \amp x \psi_1 + a_1 \amp \cdots \amp x \psi_{n-1} + a_{n-1}
\end{array}
\right)
\\
~~~=
\left(\begin{array}{c | c | c | c}
\psi_0 x + a_0 \amp \psi_1 x + a_1 \amp \cdots \amp \psi_{n-1} x + a_{n-1}
\end{array}
\right).
\end{array}
\end{equation*}
This discussion explains the JI loop ordering for computing \(A := x y^T + A \text{:}\)
\begin{equation*}
\begin{array}{l}
{\bf for~} j := 0, \ldots , n-1 \\[0.05in]
~~~
\left. \begin{array}{l}
{\bf for~} i := 0, \ldots , m-1 \\
~~~ ~~~ \alpha_{i,j} := \chi_{i} \psi_{j} + \alpha_{i,j} \\
{\bf end}
\end{array} \right\} ~~~ a_j := \psi_j x + a_j \\[0.1in]
{\bf end}
\end{array}
\end{equation*}
This demonstrates how the rank-1 update can be implemented as a sequence of axpy operations.
Homework 1.4.1.2.
In Assignments/Week1/C/Ger_J_Axpy.c complete the implementation of rank-1 in terms of axpy operations. You can test it by executing
make Ger_J_Axpy
Homework 1.4.1.3.
Fill in the blanks:
The last homework suggests that there is also an IJ loop ordering that can be explained by partitioning \(A \) by rows and \(x \) by elements:
\begin{equation*}
\left( \begin{array}{c}
\widetilde a_0^T \\ \hline
\widetilde a_1^T \\ \hline
\vdots \\ \hline
\widetilde a_{m-1}^T
\end{array}
\right) :=
\left( \begin{array}{c}
\chi_0 \\ \hline
\chi_1 \\ \hline
\vdots \\ \hline
\chi_{m-1}
\end{array}
\right) y^T +
\left( \begin{array}{c}
\widetilde a_0^T \\ \hline
\widetilde a_1^T \\ \hline
\vdots \\ \hline
\widetilde a_{m-1}^T
\end{array}
\right)
=
\left( \begin{array}{c}
\chi_0 y^T + \widetilde a_0^T \\ \hline
\chi_1 y^T + \widetilde a_1^T \\ \hline
\vdots \\ \hline
\chi_{m-1} y^T + \widetilde a_{m-1}^T
\end{array}
\right)
\end{equation*}
leading to the algorithm
\begin{equation*}
\begin{array}{l}
{\bf for~} i := 0, \ldots , m-1 \\[0.05in]
~~~
\left. \begin{array}{l}
{\bf for~} j := 0, \ldots , n-1 \\
~~~ ~~~ \alpha_{i,j} := \chi_{i} \psi_{j} + \alpha_{i,j} \\
{\bf end}
\end{array} \right\} ~~~ \widetilde a_i^T := \chi_i y^T + \widetilde a_i^T \\[0.1in]
{\bf end}
\end{array}
\end{equation*}
and corresponding implementation.
Homework 1.4.1.4.
In Assignments/Week1/C/Ger_I_Axpy.c complete the implementation of rank-1 in terms of axpy operations (by rows). You can test it by executing
make Ger_I_Axpy