Performance Comparison

libflame (SuperMatrix) v. PLASMA


Below we illustrate the performance difference between SuperMatrix (as implemented in libflame) and PLASMA for a few operations supported by both libraries.

Software tested


                                  libflame                PLASMA
  version                         3.0 (r2991)             1.0.0
  storage (outer) block size      192                     192
  algorithmic (inner) block size  48                      48
  BLAS implementation             MKL 8.1                 MKL 8.1
  floating-point arithmetic       double precision real   double precision real
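The two block sizes enter the libraries differently: in libflame the storage (outer) block size is the blocking of the hierarchical FLASH matrix object (see the Cholesky sketch further below), while PLASMA exposes tile and inner block sizes through its tuning interface. The sketch below is an assumption-laden illustration only: PLASMA_Set, PLASMA_TILE_SIZE, and PLASMA_INNER_BLOCK_SIZE are taken from later PLASMA releases and may not match the 1.0.0 interface exactly.

    #include <plasma.h>   /* header name may differ by release */

    /* Hedged sketch: configure PLASMA to match the settings tabulated above.
       PLASMA_Set() and these constants are assumed from PLASMA's documented
       tuning interface; verify against the installed headers.               */
    void configure_plasma( int n_cores )
    {
        PLASMA_Init( n_cores );                      /* one thread per core   */
        PLASMA_Set( PLASMA_TILE_SIZE, 192 );         /* storage (outer) block */
        PLASMA_Set( PLASMA_INNER_BLOCK_SIZE, 48 );   /* algorithmic (inner)   */
    }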

Note: MKL 10.1 was available at the time of these experiments but was not installed on the machine used. MKL 10.1's performance for dpotrf() (Cholesky factorization) is considerably better than the performance shown below, while dgetrf() (LU factorization) and dgeqrf() (QR factorization) improve only modestly.
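For reference, the three routines named above are the standard LAPACK interfaces listed below (Fortran calling convention, as exported by MKL); the small wrapper is only a sketch of what a single timed dpotrf() call looks like, not the harness used for these measurements.

    /* Standard LAPACK prototypes as exported by MKL (Fortran convention). */
    extern void dpotrf_( const char* uplo, const int* n, double* a,
                         const int* lda, int* info );
    extern void dgetrf_( const int* m, const int* n, double* a,
                         const int* lda, int* ipiv, int* info );
    extern void dgeqrf_( const int* m, const int* n, double* a,
                         const int* lda, double* tau, double* work,
                         const int* lwork, int* info );

    /* Factor an n x n SPD matrix (column-major, leading dimension n)
       in place; returns LAPACK's info code (0 on success).            */
    int cholesky_inplace( double* a, int n )
    {
        int info;
        dpotrf_( "L", &n, a, &n, &info );
        return info;
    }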

Hardware tested

Performance results were gathered on a 16-processor (8-socket) ccNUMA system with 1.5 GHz Itanium2 processors. Each processor has a theoretical peak of 6 GFLOPS, for a combined system peak of 96 GFLOPS. The system contains 32 GB of main memory.
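The quoted peaks follow from the clock rate and the processor's floating-point issue width; the snippet below simply reproduces that arithmetic (the 4 flops/cycle figure is inferred from the stated 6 GFLOPS per-processor peak).

    #include <stdio.h>

    int main( void )
    {
        const int    n_procs         = 16;    /* processors in the system          */
        const double clock_ghz       = 1.5;   /* Itanium2 clock rate               */
        const double flops_per_cycle = 4.0;   /* inferred from 6 GFLOPS/processor  */

        double peak_proc   = clock_ghz * flops_per_cycle;  /*  6 GFLOPS */
        double peak_system = n_procs * peak_proc;          /* 96 GFLOPS */

        printf( "per-processor peak: %.0f GFLOPS\n", peak_proc );
        printf( "system peak:        %.0f GFLOPS\n", peak_system );
        return 0;
    }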

Cholesky factorization

[Figure: Cholesky factorization performance, libflame (SuperMatrix) vs. PLASMA]
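For concreteness, here is a minimal sketch of how the factorization measured above might be invoked through each library. The FLASH_* and PLASMA_* names are written from memory of the two interfaces and should be treated as assumptions to be checked against the installed headers; matrix initialization, timing, and error handling are omitted.

    #include "FLAME.h"     /* libflame                        */
    #include <plasma.h>    /* PLASMA (header name may differ) */

    /* libflame / SuperMatrix: create a hierarchical (blocked) object and
       let the FLASH front end schedule the factorization as a DAG of tasks. */
    void chol_libflame( int n, int n_threads )
    {
        FLA_Obj A;
        dim_t   b = 192;   /* storage (outer) block size used in the experiments */

        FLA_Init();
        FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &A );   /* depth-1 hierarchy */
        /* ... fill A with an SPD matrix (omitted) ...                          */
        FLASH_Queue_set_num_threads( n_threads );
        FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
        FLASH_Obj_free( &A );
        FLA_Finalize();
    }

    /* PLASMA: LAPACK-style entry point over a tiled runtime. */
    void chol_plasma( int n, double* a, int n_cores )
    {
        PLASMA_Init( n_cores );
        /* ... a must hold an SPD matrix, column-major, lda = n ... */
        PLASMA_dpotrf( PlasmaLower, n, a, n );
        PLASMA_Finalize();
    }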


LU factorization (with pivoting)

[Figure: LU factorization (with pivoting) performance, libflame (SuperMatrix) vs. PLASMA]

Note: libflame implements LU factorization via a new incremental pivoting algorithm. This algorithm exhibits stability properties similar to those of dgetrf() while allowing greater parallelism.
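As a point of reference for the stability claim, the conventional scheme behind dgetrf() is LU with partial pivoting, sketched below in unblocked form. Note that each pivot search scans an entire column, which couples every block in that column and limits how much of the update can proceed independently; incremental pivoting, by contrast, restricts each pivot search to a pair of blocks (the current diagonal block and one block beneath it), which is what exposes the extra parallelism. This sketch is the textbook baseline, not libflame's algorithm.

    #include <math.h>

    /* Unblocked LU with partial pivoting on an n x n column-major matrix
       (leading dimension lda).  This is the conventional scheme behind
       dgetrf(): each pivot search scans a full column, serializing work
       across every block in that column.                                 */
    void lu_partial_pivot( int n, double* A, int lda, int* ipiv )
    {
        for ( int k = 0; k < n; k++ )
        {
            /* Pivot: largest-magnitude entry on or below the diagonal.   */
            int p = k;
            for ( int i = k + 1; i < n; i++ )
                if ( fabs( A[ i + k*lda ] ) > fabs( A[ p + k*lda ] ) )
                    p = i;
            ipiv[ k ] = p;

            /* Swap rows k and p across the whole matrix.                 */
            if ( p != k )
                for ( int j = 0; j < n; j++ )
                {
                    double t       = A[ k + j*lda ];
                    A[ k + j*lda ] = A[ p + j*lda ];
                    A[ p + j*lda ] = t;
                }

            /* Scale the column and update the trailing submatrix.        */
            for ( int i = k + 1; i < n; i++ )
            {
                A[ i + k*lda ] /= A[ k + k*lda ];
                for ( int j = k + 1; j < n; j++ )
                    A[ i + j*lda ] -= A[ i + k*lda ] * A[ k + j*lda ];
            }
        }
    }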


QR factorization

[Figure: QR factorization performance, libflame (SuperMatrix) vs. PLASMA]

Note: libflame implements an incremental QR factorization algorithm based on the UT transform. The algorithm is structured similarly to LU with incremental pivoting, and because it relies on Householder transformations it is quite stable while exposing more opportunities for parallelism.
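The stability of both dgeqrf() and the incremental algorithm comes from the same primitive: a Householder reflector that annihilates a column below its first entry. A minimal sketch of that primitive is given below (LAPACK-style conventions with v[0] = 1 are assumed); the UT transform is, roughly, a way of accumulating a block of such reflectors so they can be applied together, block by block.

    #include <math.h>

    /* Compute tau and v (with v[0] = 1) defining H = I - tau*v*v^T such
       that H*x = (beta, 0, ..., 0)^T.  Reflectors like this are the
       numerically stable building block of Householder QR, whether it is
       organized as in dgeqrf() or accumulated via the UT transform.       */
    double householder_reflector( int n, const double* x, double* v, double* beta )
    {
        double normx = 0.0;
        for ( int i = 0; i < n; i++ )
            normx += x[ i ] * x[ i ];
        normx = sqrt( normx );

        if ( normx == 0.0 )           /* x is already zero: H = I */
        {
            *beta = 0.0;
            return 0.0;
        }

        double b   = ( x[ 0 ] >= 0.0 ? -normx : normx );  /* beta = -sign(x0)*||x|| */
        double tau = ( b - x[ 0 ] ) / b;

        v[ 0 ] = 1.0;
        for ( int i = 1; i < n; i++ )
            v[ i ] = x[ i ] / ( x[ 0 ] - b );

        *beta = b;
        return tau;
    }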