Subsection 12.4.4 Parallel high-performance algorithms
Modern processors achieve high performance by extracting parallelism at the instruction level and by incorporating multiple cores that can collaborate on a computation. Beyond that, parallel supercomputers consist of computational nodes, each with multiple processing cores and its own local memory, that communicate through a network.
Many of the issues encountered when mapping (dense) linear algebra algorithms to such distributed memory computers can be illustrated by studying matrix-matrix multiplication. A classic paper is
[45] Robert van de Geijn and Jerrell Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, Concurrency: Practice and Experience, Volume 9, Number 4, 1997.
The techniques described in that paper are generalized in the more recent paper
[35] Martin D. Schatz, Robert A. van de Geijn, and Jack Poulson, Parallel Matrix Multiplication: A Systematic Journey, SIAM Journal on Scientific Computing, Volume 38, Issue 6, 2016.
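To make the communication pattern concrete, the following is a minimal sketch of the SUMMA idea in C with MPI, under simplifying assumptions that are not taken from the papers above: a square q x q process grid (a 2D Cartesian communicator), matrices of order n with n divisible by q, and exactly one nb x nb row-major block of A, B, and C per process. The papers treat general process grids, blockings, and pipelined communication; the names summa and local_gemm here are purely illustrative.

/* Minimal SUMMA sketch: q x q process grid, one nb x nb block of A, B, C
   per process, row-major storage.  Illustrative only. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Local update C += A * B for nb x nb row-major blocks. */
static void local_gemm(int nb, const double *A, const double *B, double *C) {
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += A[i * nb + k] * B[k * nb + j];
}

void summa(int n, const double *Aloc, const double *Bloc, double *Cloc,
           MPI_Comm grid_comm /* q x q Cartesian communicator */) {
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);
    int q = dims[0], nb = n / q;

    /* Split the process grid into row and column communicators. */
    MPI_Comm row_comm, col_comm;
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(grid_comm, keep_row, &row_comm);  /* ranks vary with column index */
    MPI_Cart_sub(grid_comm, keep_col, &col_comm);  /* ranks vary with row index */

    double *Atmp = malloc((size_t)nb * nb * sizeof(double));
    double *Btmp = malloc((size_t)nb * nb * sizeof(double));

    for (int k = 0; k < q; k++) {
        /* The owner of the k-th block column of A broadcasts within its process row. */
        if (coords[1] == k) memcpy(Atmp, Aloc, (size_t)nb * nb * sizeof(double));
        MPI_Bcast(Atmp, nb * nb, MPI_DOUBLE, k, row_comm);

        /* The owner of the k-th block row of B broadcasts within its process column. */
        if (coords[0] == k) memcpy(Btmp, Bloc, (size_t)nb * nb * sizeof(double));
        MPI_Bcast(Btmp, nb * nb, MPI_DOUBLE, k, col_comm);

        /* Update the local block of C with the received panels. */
        local_gemm(nb, Atmp, Btmp, Cloc);
    }

    free(Atmp);
    free(Btmp);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}

Even in this simplified form the essential structure is visible: at each step a panel of A is broadcast within process rows and a panel of B within process columns, after which every process performs a local matrix-matrix multiplication. It is this reliance on broadcasts within rows and columns of the process grid that underlies the scalability analyzed in the papers above.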