We now give more details on how using our approach and library leads to elegant, efficient, and scalable implementations of matrix-matrix multiplication on distributed memory architectures.
We will consider the formation of matrix products and will use the techniques discussed in Section 1.6, as well as the parallel implementations of matrix-vector multiplication and rank-1 update discussed previously.
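To make the connection to matrix-vector multiplication and rank-1 update concrete, the following is a minimal sequential sketch, not the library's distributed implementation. It shows two standard ways of forming C = AB: as a sum of rank-1 updates (one per column of A and matching row of B) and as a sequence of matrix-vector multiplications (one per column of B). The function names `matmul_rank1` and `matmul_matvec` are hypothetical and chosen only for this illustration.

```python
def matmul_rank1(A, B):
    """Form C = A B as a sum of rank-1 updates C += a_p * b_p^T,
    where a_p is column p of A and b_p^T is row p of B."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):
        # One rank-1 update per index p of the inner dimension.
        for i in range(m):
            a_ip = A[i][p]
            for j in range(n):
                C[i][j] += a_ip * B[p][j]
    return C


def matmul_matvec(A, B):
    """Form C = A B one column at a time: c_j = A b_j,
    where b_j is column j of B (a matrix-vector multiplication)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # One matrix-vector multiplication per column of B.
        for i in range(m):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C


# Small example: A is 3 x 2, B is 2 x 3, so C is 3 x 3.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
C_rank1 = matmul_rank1(A, B)
C_matvec = matmul_matvec(A, B)
```

Both functions compute the same product; they differ only in loop ordering, which is why parallel kernels for rank-1 update and matrix-vector multiplication can serve as building blocks for matrix-matrix multiplication.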