Performance Comparison

libflame (SuperMatrix) v. PLASMA


Below we illustrate the performance difference between SuperMatrix (as implemented in libflame) and PLASMA for a few operations supported by both libraries.

Software tested


                                  libflame                PLASMA
  version                         3.0 (r2991)             1.0.0
  storage (outer) block size      192                     192
  algorithmic (inner) block size  48                      48
  BLAS implementation             MKL 8.1                 MKL 8.1
  floating-point arithmetic       double precision real   double precision real
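The two block sizes enter the libraries differently: in libflame the storage (outer) block size is the blocking of the hierarchical FLASH matrix object (see the Cholesky sketch further below), while PLASMA exposes tile and inner block sizes through its tuning interface. The sketch below is an assumption-laden illustration only: PLASMA_Set, PLASMA_TILE_SIZE, and PLASMA_INNER_BLOCK_SIZE are taken from later PLASMA releases and may not match the 1.0.0 interface exactly.

    #include <plasma.h>   /* header name may differ by release */

    /* Hedged sketch: configure PLASMA to match the settings tabulated above.
       PLASMA_Set() and these constants are assumed from PLASMA's documented
       tuning interface; verify against the installed headers.               */
    void configure_plasma( int n_cores )
    {
        PLASMA_Init( n_cores );                      /* one thread per core   */
        PLASMA_Set( PLASMA_TILE_SIZE, 192 );         /* storage (outer) block */
        PLASMA_Set( PLASMA_INNER_BLOCK_SIZE, 48 );   /* algorithmic (inner)   */
    }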

Note: MKL 10.1 was available at the time of these experiments but was not installed on the machine used. MKL 10.1's performance for dpotrf() (Cholesky factorization) is considerably better than the performance shown below, while dgetrf() (LU factorization) and dgeqrf() (QR factorization) improve only modestly.
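For reference, the three routines named above are the standard LAPACK interfaces listed below (Fortran calling convention, as exported by MKL); the small wrapper is only a sketch of what a single timed dpotrf() call looks like, not the harness used for these measurements.

    /* Standard LAPACK prototypes as exported by MKL (Fortran convention). */
    extern void dpotrf_( const char* uplo, const int* n, double* a,
                         const int* lda, int* info );
    extern void dgetrf_( const int* m, const int* n, double* a,
                         const int* lda, int* ipiv, int* info );
    extern void dgeqrf_( const int* m, const int* n, double* a,
                         const int* lda, double* tau, double* work,
                         const int* lwork, int* info );

    /* Factor an n x n SPD matrix (column-major, leading dimension n)
       in place; returns LAPACK's info code (0 on success).            */
    int cholesky_inplace( double* a, int n )
    {
        int info;
        dpotrf_( "L", &n, a, &n, &info );
        return info;
    }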

Hardware tested

Performance results were gathered on a 16-processor (8-socket) ccNUMA system with 1.5 GHz Itanium2 processors. Each processor has a theoretical peak of 6 GFLOPS, for a combined system peak of 96 GFLOPS. The system contains 32 GB of main memory.
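The quoted peaks follow from the clock rate and the processor's floating-point issue width; the snippet below simply reproduces that arithmetic (the 4 flops/cycle figure is inferred from the stated 6 GFLOPS per-processor peak).

    #include <stdio.h>

    int main( void )
    {
        const int    n_procs         = 16;    /* processors in the system          */
        const double clock_ghz       = 1.5;   /* Itanium2 clock rate               */
        const double flops_per_cycle = 4.0;   /* inferred from 6 GFLOPS/processor  */

        double peak_proc   = clock_ghz * flops_per_cycle;  /*  6 GFLOPS */
        double peak_system = n_procs * peak_proc;          /* 96 GFLOPS */

        printf( "per-processor peak: %.0f GFLOPS\n", peak_proc );
        printf( "system peak:        %.0f GFLOPS\n", peak_system );
        return 0;
    }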

Cholesky factorization

[Figure: Cholesky factorization performance, libflame (SuperMatrix) vs. PLASMA]
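For concreteness, here is a minimal sketch of how the factorization measured above might be invoked through each library. The FLASH_* and PLASMA_* names are written from memory of the two interfaces and should be treated as assumptions to be checked against the installed headers; matrix initialization, timing, and error handling are omitted.

    #include "FLAME.h"     /* libflame                        */
    #include <plasma.h>    /* PLASMA (header name may differ) */

    /* libflame / SuperMatrix: create a hierarchical (blocked) object and
       let the FLASH front end schedule the factorization as a DAG of tasks. */
    void chol_libflame( int n, int n_threads )
    {
        FLA_Obj A;
        dim_t   b = 192;   /* storage (outer) block size used in the experiments */

        FLA_Init();
        FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &A );   /* depth-1 hierarchy */
        /* ... fill A with an SPD matrix (omitted) ...                          */
        FLASH_Queue_set_num_threads( n_threads );
        FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
        FLASH_Obj_free( &A );
        FLA_Finalize();
    }

    /* PLASMA: LAPACK-style entry point over a tiled runtime. */
    void chol_plasma( int n, double* a, int n_cores )
    {
        PLASMA_Init( n_cores );
        /* ... a must hold an SPD matrix, column-major, lda = n ... */
        PLASMA_dpotrf( PlasmaLower, n, a, n );
        PLASMA_Finalize();
    }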


LU factorization (with pivoting)

[Figure: LU factorization (with pivoting) performance, libflame (SuperMatrix) vs. PLASMA]

Note: libflame implements LU factorization via a new incremental pivoting algorithm. This algorithm exhibits stability properties similar to those of dgetrf() while allowing greater parallelism.
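As a point of reference for the stability claim, the conventional scheme behind dgetrf() is LU with partial pivoting, sketched below in unblocked form. Note that each pivot search scans an entire column, which couples every block in that column and limits how much of the update can proceed independently; incremental pivoting, by contrast, restricts each pivot search to a pair of blocks (the current diagonal block and one block beneath it), which is what exposes the extra parallelism. This sketch is the textbook baseline, not libflame's algorithm.

    #include <math.h>

    /* Unblocked LU with partial pivoting on an n x n column-major matrix
       (leading dimension lda).  This is the conventional scheme behind
       dgetrf(): each pivot search scans a full column, serializing work
       across every block in that column.                                 */
    void lu_partial_pivot( int n, double* A, int lda, int* ipiv )
    {
        for ( int k = 0; k < n; k++ )
        {
            /* Pivot: largest-magnitude entry on or below the diagonal.   */
            int p = k;
            for ( int i = k + 1; i < n; i++ )
                if ( fabs( A[ i + k*lda ] ) > fabs( A[ p + k*lda ] ) )
                    p = i;
            ipiv[ k ] = p;

            /* Swap rows k and p across the whole matrix.                 */
            if ( p != k )
                for ( int j = 0; j < n; j++ )
                {
                    double t       = A[ k + j*lda ];
                    A[ k + j*lda ] = A[ p + j*lda ];
                    A[ p + j*lda ] = t;
                }

            /* Scale the column and update the trailing submatrix.        */
            for ( int i = k + 1; i < n; i++ )
            {
                A[ i + k*lda ] /= A[ k + k*lda ];
                for ( int j = k + 1; j < n; j++ )
                    A[ i + j*lda ] -= A[ i + k*lda ] * A[ k + j*lda ];
            }
        }
    }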


QR factorization

[Figure: QR factorization performance, libflame (SuperMatrix) vs. PLASMA]

Note: libflame implements an incremental QR factorization algorithm based on the UT transform. The algorithm is structured similarly to LU with incremental pivoting, and because it relies on Householder transformations it is quite stable while exposing more opportunities for parallelism.
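The stability of both dgeqrf() and the incremental algorithm comes from the same primitive: a Householder reflector that annihilates a column below its first entry. A minimal sketch of that primitive is given below (LAPACK-style conventions with v[0] = 1 are assumed); the UT transform is, roughly, a way of accumulating a block of such reflectors so they can be applied together, block by block.

    #include <math.h>

    /* Compute tau and v (with v[0] = 1) defining H = I - tau*v*v^T such
       that H*x = (beta, 0, ..., 0)^T.  Reflectors like this are the
       numerically stable building block of Householder QR, whether it is
       organized as in dgeqrf() or accumulated via the UT transform.       */
    double householder_reflector( int n, const double* x, double* v, double* beta )
    {
        double normx = 0.0;
        for ( int i = 0; i < n; i++ )
            normx += x[ i ] * x[ i ];
        normx = sqrt( normx );

        if ( normx == 0.0 )           /* x is already zero: H = I */
        {
            *beta = 0.0;
            return 0.0;
        }

        double b   = ( x[ 0 ] >= 0.0 ? -normx : normx );  /* beta = -sign(x0)*||x|| */
        double tau = ( b - x[ 0 ] ) / b;

        v[ 0 ] = 1.0;
        for ( int i = 1; i < n; i++ )
            v[ i ] = x[ i ] / ( x[ 0 ] - b );

        *beta = b;
        return tau;
    }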