In this paper, we show that by creating an infrastructure for the high-level specification of parallel dense linear algebra algorithms, not only is the code greatly simplified, but higher performance is achieved. Because the specification of the code is so simple, more complex implementations can be attempted, which in turn yield higher performance. We demonstrate the validity of this observation through the use of the Parallel Linear Algebra Package (PLAPACK) infrastructure to code high performance parallel implementations of matrix decomposition algorithms like those for the Cholesky, LU, and QR factorizations. What is somewhat surprising is that despite the fact that the literature on the implementation of these algorithms is vast (indeed too large to cite properly in this paper), we derive new algorithms for these factorizations, which are shown to yield superior performance.
High performance libraries for matrix operations like the various factorizations first became available with the arrival of the LAPACK library [2] in the early 1990s. That library recognized the necessity of coding in terms of the Basic Linear Algebra Subprograms (BLAS) [12,7,6] for modularity and portable performance. In particular, it exploited the benefits of recoding such algorithms in terms of matrix-matrix kernels like matrix-matrix multiplication in order to improve utilization of the cache of a microprocessor.
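To make the cache argument concrete, the following self-contained sketch (our own illustration, not code from LAPACK or the BLAS; the problem size N and block size NB are assumptions chosen for the example) contrasts an unblocked triple loop with the same computation organized as NB-by-NB matrix-matrix kernels, so that blocks remain resident in cache while they are reused.

/* Illustration only: the same O(n^3) work, once as a plain triple loop
   and once cast as small matrix-matrix kernels that reuse cached blocks. */
#include <stdio.h>
#include <stdlib.h>

#define N  512          /* problem size (assumed for the example)            */
#define NB 64           /* block size; chosen so three NB x NB blocks fit in cache */

/* C := C + A*B, unblocked: for large N the accesses to B give little
   cache reuse before the data is evicted.                                   */
static void gemm_unblocked(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* Same update, organized as NB x NB matrix-matrix products: each block of
   A and B is loaded once and reused NB times while it is in cache.         */
static void gemm_blocked(const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += NB)
        for (int kk = 0; kk < N; kk += NB)
            for (int jj = 0; jj < N; jj += NB)
                for (int i = ii; i < ii + NB; i++)
                    for (int k = kk; k < kk + NB; k++)
                        for (int j = jj; j < jj + NB; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;

    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    gemm_blocked(A, B, C);          /* or gemm_unblocked(A, B, C)            */
    printf("C[0] = %f\n", C[0]);    /* both variants give 2*N = 1024.0       */

    free(A); free(B); free(C);
    return 0;
}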
Extensions of LAPACK to distributed memory architectures like the Cray T3E, IBM SP-2, and Intel Paragon are explored by the ScaLAPACK project [4]. While this package does manage to provide a subset of the functionality of LAPACK, it is our belief that it has also clearly demonstrated that mimicking the coding styles that were effective on sequential and shared memory computers does not yield maintainable and flexible code for distributed memory architectures. The PLAPACK infrastructure attempts to show that by adopting an object-based coding style, already popularized by the Message-Passing Interface (MPI) [10,14], the coding of parallel linear algebra algorithms is simplified compared to the more traditional sequential coding approaches. We contrast the coding styles in [1].
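As an illustration of what we mean by an object-based style, the fragment below (standard MPI, not PLAPACK code) manipulates communicators as opaque objects that encapsulate a group of processes and a communication context, rather than working with raw process indices; the split into rows of two processes is an arbitrary assumption made only for the example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A communicator is an opaque object; derived communicators are created
       and queried through the library rather than by index arithmetic.      */
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank / 2, rank, &row_comm);

    int row_rank;
    MPI_Comm_rank(row_comm, &row_rank);
    printf("global rank %d of %d -> row rank %d\n", rank, size, row_rank);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}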
It is generally believed that coding at a higher level of abstraction comes at the price of higher overhead and thus reduced performance. While it is certainly true that the PLAPACK infrastructure imposes such an overhead, we demonstrate that it can be overcome by implementing more complex algorithms. One can argue that these more complex algorithms could also be implemented using more traditional methods. However, it is the higher level of abstraction that makes the task manageable, especially when a large number of algorithms are to be implemented.
This paper is structured as follows: In Section 2, we present matrix-vector and matrix-matrix operation based variants of an algorithm for computing the Cholesky factorization of a symmetric positive definite matrix. We show how this algorithm is parallelized and coded using PLAPACK, progressively incorporating more sophistication into the implementation. The final improvement yields a previously unpublished algorithm, which is shown to outperform previously published implementations. In Sections 3 and 4, we similarly discuss parallel implementations of the LU and QR factorizations. In Section 5, we present initial performance results on the Cray T3E, including a comparison with ScaLAPACK. Concluding remarks are given in the final section.
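As a preview of the kind of reformulation discussed in Section 2, a standard way to obtain a matrix-matrix operation based (blocked) variant of the Cholesky factorization $A = LL^T$ is to partition (the partitioning and notation below are generic and not necessarily those used later):
\[
A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix},
\qquad
L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix},
\]
so that $A = LL^T$ implies
\[
A_{11} = L_{11} L_{11}^T, \qquad
L_{21} = A_{21} L_{11}^{-T}, \qquad
A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T .
\]
Here the small block $A_{11}$ can be factored with a matrix-vector based variant, $L_{21}$ follows from a triangular solve, and the bulk of the computation is the symmetric rank-k update of $A_{22}$, a matrix-matrix operation that maps well onto the BLAS.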