PfHP Strassen's algorithm

Unit 2.5.2 Strassen's algorithm

So far, we have discussed algorithms for computing \(C := A B + C\) via a triple-nested loop that then perform \(2 m n k \) flops. A question is whether this is the best we can do.

A classic result regarding this is Strassen's algorithm [26]. It shows that, for problems of size \(m = n = k = 2^r \) for some integer \(r \text{,}\) matrix-matrix multiplication can be computed in time \(O( n^{\log_2{7}} ) \approx O( n^{2.807}) \text{.}\) Since Strassen proposed this, a succession of results further improved upon the exponent. In this discussion, we stick to the original result.

How can this be? For simplicity, assume \(m = n = k \) are all even. Partition

\begin{equation*} C = \left( \begin{array}{c c} C_{00} \amp C_{01} \\ C_{10} \amp C_{11} \end{array} \right) , A = \left( \begin{array}{c c} A_{00} \amp A_{01} \\ A_{10} \amp A_{11} \end{array} \right) , B = \left( \begin{array}{c c} B_{00} \amp B_{01} \\ B_{10} \amp B_{11} \end{array} \right), \end{equation*}

where \(X_{ij} \) are \(n/2 \times n/2 \) for \(X \in \{ C, A, B \} \) and \(ij \in \{ 00, 01, 10, 11 \} \text{.}\) Now that you understand how partitioned matrix-matrix multiplication works, you know that the following computations compute \(C := A B + C \text{:}\)

\begin{equation*} \begin{array}{l c l} C_{00} \amp=\amp \alpha (A_{00} B_{00} + A_{01} B_{10}) + C_{00} \\ C_{01} \amp=\amp \alpha (A_{00} B_{01} + A_{01} B_{11}) + C_{01} \\ C_{10} \amp=\amp \alpha (A_{10} B_{00} + A_{11} B_{10}) + C_{10} \\ C_{11} \amp=\amp \alpha (A_{10} B_{01} + A_{11} B_{11}) + C_{11} . \end{array} \end{equation*}

Each of the eight matrix-matrix multiplications requires \(2 (n/2)^3 = 1/4 n^3 \) flops and hence the total cost is our usual \(2 n^3 \) flops.

Surprisingly, the following computations also compute \(C := A B + C \text{:}\)

\begin{equation*} \begin{array}{l @{\hspace{1pt}} c @{\hspace{1pt}} l l r} M_0 \amp =\amp ( A_{00} + A_{11} ) ( B_{00} + B_{11} ); % \\ \amp \amp C_{00} +\!\!= M_0; C_{11} +\!\!= M_0; \\ M_1 \amp=\amp ( A_{10} + A_{11} ) B_{00}; % \\ \amp \amp C_{10} +\!\!= M_1 ; C_{11} -\!\!= M_1 ; \\ M_2 \amp=\amp A_{00} ( B_{01} - B_{11} ); % \\ \amp \amp C_{01} +\!\!= M_2 ; C_{11} +\!\!= M_2 ; \\ M_3 \amp=\amp A_{11}( B_{10} - B_{00} ); % \\ \amp \amp C_{00} +\!\!= M_3 ; C_{10} +\!\!= M_3 ; \\ M_4 \amp=\amp ( A_{00} + A_{01}) B_{11}; % \\ \amp \amp C_{01} +\!\!= M_4 ; C_{00} -\!\!= M_4; \\ M_5\amp=\amp (A_{10} - A_{00} )( B_{00} + B_{01} ); % \\ \amp \amp C_{11} +\!\!= M_5; \\ M_6\amp=\amp (A_{01} - A_{11} )( B_{10} + B_{11} ); % \\ \amp \amp C_{00} +\!\!= M_6 . \end{array} \end{equation*}

If you count carefully, this requires 22 additions of \(n/2 \times n/2 \) matrices and 7 multiplications with \(n/2 \times n/2 \) matrices. Adding matrices together requires \(O( n^2) \) flops, which are insignificant if \(n \) is large. Ignoring this, the cost now becomes

\begin{equation*} 7 \times 2 ( n/2 )^3 = 2 \frac{7}{8} n^3 \end{equation*}

flops. The cost of the matrix-matrix multiplication is now \(7/8\) of what it was before!

But it gets better! Each of the matrix-matrix multiplications can themselves be computed via this scheme. If you do that, applying the idea at two levels, the cost is reduced to

\begin{equation*} 7 \times ( 7 \times 2 (n/4)^3 ) = 2 \left( \frac{7}{8} \right)^2 n^3 \end{equation*}

flops. How many times can we do that? If \(n = 2^r \) we can half the size of the matrices \(r = \log_2( n ) \) times. If you do that, the cost becomes

\begin{equation*} 2 \left( \frac{7}{8} \right)^r n^3 {\rm ~flops}. \end{equation*}

Now,

\begin{equation*} \left( {7}/{8} \right)^{\log_2(n)} 2 n^3 = n^{\log_2(7/8)} 2 n^3 = 2 n^{\log_2 7} \approx 2 n^{2.807} . \end{equation*}

Here we ignored the cost of the additions. However, it can be analyzed that this recursive approach requires \(O( n^{2.807} )\) flops.

Remark 2.5.1.

Learn more about the Strassen's algorithm entry on Wikipedia. In Unit 3.5.4 we will discuss how our insights support a practical implementation of Strassen's algorithm.