This section serves a number of purposes: First, it illustrates the issues that were raised while discussing the parallel implementation of matrix-matrix multiply and its variants. Second, it shows that indeed high performance can be attained using the techniques that underlie PLAPACK. Finally, it demonstrates that high performance can be attained on different architectures without any changes to the infrastructure. However, one should not read too much into the performance graphs: we collected and graphed performance numbers for the part of the infrastructure that was optimized as of the moment that the guide was completed. To see the current performance numbers for a range of algorithms, consult the PLAPACK web page.
All performance numbers center around the implementation of the most basic case of matrix-matrix multiplication: C = A B + C. For this case, we will show the performance of the three variants: panel-panel, matrix-panel, and panel-matrix.
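To make the distinction concrete, the following serial sketch (not PLAPACK code) expresses C = A B + C as blocked loops around a local matrix-matrix multiply kernel. The routine names, the column-major storage convention, and the naive local_gemm standing in for an optimized sequential dgemm are our illustrative assumptions: the panel-panel variant partitions the k dimension (a sequence of rank-nb updates), the matrix-panel variant partitions n (computing nb columns of C at a time), and the panel-matrix variant partitions m (computing nb rows of C at a time).

```c
/* Serial sketch of the three blocked variants of C = A*B + C.
 * Column-major storage; A is m x k, B is k x n, C is m x n.
 * local_gemm stands in for an optimized sequential dgemm call. */
static void local_gemm(int m, int n, int k,
                       const double *A, int lda,
                       const double *B, int ldb,
                       double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int p = 0; p < k; p++)
            for (int i = 0; i < m; i++)
                C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
}

/* Panel-panel: C is built up by rank-nb updates C += A_p * B_p,
 * where A_p is m x nb and B_p is nb x n.                          */
void gemm_panel_panel(int m, int n, int k, int nb,
                      const double *A, int lda, const double *B, int ldb,
                      double *C, int ldc)
{
    for (int p = 0; p < k; p += nb) {
        int kb = (k - p < nb) ? k - p : nb;
        local_gemm(m, n, kb, &A[p*lda], lda, &B[p], ldb, C, ldc);
    }
}

/* Matrix-panel: nb columns of C are computed at a time, C_j += A * B_j. */
void gemm_matrix_panel(int m, int n, int k, int nb,
                       const double *A, int lda, const double *B, int ldb,
                       double *C, int ldc)
{
    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? n - j : nb;
        local_gemm(m, jb, k, A, lda, &B[j*ldb], ldb, &C[j*ldc], ldc);
    }
}

/* Panel-matrix: nb rows of C are computed at a time, C_i += A_i * B. */
void gemm_panel_matrix(int m, int n, int k, int nb,
                       const double *A, int lda, const double *B, int ldb,
                       double *C, int ldc)
{
    for (int i = 0; i < m; i += nb) {
        int ib = (m - i < nb) ? m - i : nb;
        local_gemm(ib, n, k, &A[i], lda, B, ldb, &C[i], ldc);
    }
}
```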
Performance is reported in MFLOPS/sec per node, or millions of floating point operations per second per node. Given that a problem with dimensions m, n, and k requires time t(m,n,k) (in seconds) to complete, the MFLOPS/sec per node for matrix-matrix multiplication is computed using the formula 2 m n k / ( 10^6 p t(m,n,k) ), where p equals the number of nodes.
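As an illustration only, the following sketch shows how such a timing might be converted to MFLOPS/sec per node using MPI wall-clock timing; the routine names and the barrier-based harness are our assumptions, not the code actually used to collect the numbers reported below.

```c
#include <mpi.h>
#include <stdio.h>

/* Flop count for C = A*B + C is 2*m*n*k; p is the number of nodes. */
double mflops_per_node(int m, int n, int k, double t, int p)
{
    return (2.0 * m * n * k) / (t * p * 1.0e6);
}

/* Sketch of a timing harness: all nodes synchronize, the multiply runs,
 * and the elapsed wall-clock time is converted to MFLOPS/sec per node. */
void time_and_report(int m, int n, int k, int p)
{
    double start, elapsed;

    MPI_Barrier(MPI_COMM_WORLD);          /* start all nodes together      */
    start = MPI_Wtime();

    /* ... perform the parallel matrix-matrix multiply here ...            */

    MPI_Barrier(MPI_COMM_WORLD);          /* wait until every node is done */
    elapsed = MPI_Wtime() - start;

    printf("m=%d n=%d k=%d: %.1f MFLOPS/sec per node\n",
           m, n, k, mflops_per_node(m, n, k, elapsed, p));
}
```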
We discuss performance on three current generation
distributed memory parallel computers:
the Intel Paragon with GP nodes (one compute processor
per node),
the Cray T3D, and the IBM SP-2.
In addition, we show performance on the Convex Exemplar S-class system,
in a shared-memory configuration.
In all cases we will report performance per node,
where we use the term node for a compute processor.
All performance is for 64-bit precision (i.e. double precision
real on the Exemplar, the Paragon, and the SP-2, and single precision real on
the T3D).
The peak performance of an individual node limits the
peak performance per node that can be achieved for our
parallel implementations. As of this writing, the single-processor peak performances for 64-bit arithmetic matrix-matrix multiply, using assembly-coded sequential BLAS, are roughly as follows:
The speed and topology of the interconnection network affect how quickly peak performance can be approached. Since our infrastructure is far from optimized with regard to communication as of this writing, there is no real need to discuss network speeds beyond noting that both the Intel Paragon and the Cray T3D have very fast interconnection networks, while the IBM SP-2 has a noticeably slower network relative to individual processor speeds. On the Exemplar, we use MPI to communicate between processors, so that it is programmed as if it were a distributed memory architecture.
In Figure , we show the performance
on 64 node configurations of the different architectures.
In that figure, the performance of the panel-panel
variant is reported, with the matrix dimensions chosen so that
all three matrices are square. The algorithmic and distribution blocking sizes were taken to be equal to each other (both equal to nb), but on each architecture nb was set to the value that appears to give good performance: nb = 20 on the Paragon and nb = 64 on the T3D and SP-2.
The different curves approach the
peak performance for the given architecture, eventually leveling
off.
In Figure , we again report the performance on
the different architectures for the panel-panel variant, with
the same choices for the blocking parameters.
However, this time we fix dimensions m and n to be large, and we vary dimension k . For the panel-panel based implementation, on all architectures high performance is achieved even for small values of k , which verifies that the panel-panel variant is a good choice when k is small. This was predicted when we discussed implementation of the different variants.
In Figure , we illustrate the performance
for the different variants on the Intel Paragon, by fixing
dimensions m and n to be large ( m = n = 5000 )
and varying k along the x-axis.
Our previous discussion
predicts that the panel-panel implementation is most natural
for this shape of the matrices, and indeed, that variant outperforms
the other two. It is worth noting that the matrix-panel and panel-matrix variants level off at a considerably lower level of performance. This is because the sequential dgemm implementation performs better for a panel-panel update than for a matrix-panel or panel-matrix multiply when the panel width is 20.
(Notice that this reflects a limitation of the implementation of
the sequential dgemm call. Better performance can be attained
by more careful implementation of that kernel.)
Similar behavior is reported in Figure
for the
IBM SP-2.
In Figure , we again illustrate the performance
for the different variants on the Intel Paragon, but this time
dimensions m and k are fixed to be large ( m = k = 5000 )
and n is varied.
Our previous discussion
predicts that the matrix-panel variant is the most natural
for this shape of the matrices, and indeed, that variant initially outperforms
the other two, when n is small.
Since the matrix-panel
and panel-matrix variants level off at a considerably lower level
of performance, due to the sequential dgemm implementation,
eventually the panel-panel variant becomes fastest.
Similar behavior is reported in Figure
for the
IBM SP-2.
Notice that showing performance numbers for a single configuration
of an architecture gives no insight into the scalability of the
approach.
As nodes are added, the problem size can grow so that memory use per node remains constant (after all, adding nodes also adds memory). For the matrix-matrix multiply example, this means that if m = n = k, then the dimensions can grow with the square root of the number of nodes.
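The square-root growth follows from a simple memory count; as a sketch, assume m = n = k and 8 bytes per matrix element, so that the three matrices together occupy

```latex
\mbox{memory per node} \;=\; \frac{3 \cdot 8\, n^2}{p}\ \mbox{bytes}
\;=\; \mbox{constant}
\quad\Longrightarrow\quad
n^2 \propto p
\quad\Longrightarrow\quad
n \propto \sqrt{p}.
```

Thus, for example, quadrupling the number of nodes allows n (and with it m and k) to double while memory use per node stays fixed.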
Generally, dense linear algebra algorithms can be
parallelized so that they are scalable in the following sense:
if the problem size is increased
so that memory use per node is kept constant, then the
efficiency (e.g., MFLOPS/sec per node) will remain
(approximately) constant.
To verify this trend for matrix-matrix multiply,
we report in Figure
the MFLOPS/sec per node on the Intel Paragon when
the number of nodes is varied, and the memory per node
is kept constant. Notice that the performance is only
slightly affected as nodes are added.
In Figure , we show that by carefully picking
parameters and matrix dimensions, one can further improve performance.
In that figure, we contrast the performance of the panel-panel
variant on a Cray T3D for two slightly different choices:
The first curve shows the exact same data as used in Figure : m = n = 8000 and nb = 64. For that curve, no particular care was taken when picking the values of k at which the timings were collected.
For the second curve, we increased the problem size slightly, to m = n = 8192 (a power of two), the distribution blocking size was kept at nb = 64, and the algorithmic blocking size was increased.
Also, k was sampled at powers of 2 . Notice that performance improves
slightly.
In Figure , we examine the performance of PLAPACK
on the Convex Exemplar S-series. While all data presented was collected on a single 16-processor shared-memory cluster, communication between processors uses MPI, as on a distributed memory architecture. The graph shows performance for square matrices, with the different curves representing different numbers of nodes. Given the deadline for this guide, we did not have time to fully tune PLAPACK for the Exemplar; nonetheless, very good performance is achieved. It should be noted that for some matrix sizes, performance on the Exemplar was much lower than expected. To keep the graph readable, those performance numbers are not reported. We expect that further experimentation will allow us to achieve high performance more consistently.