libflame download page

Source code

libflame is provided as free software, licensed under the GNU Lesser General Public License (LGPL) in two forms:

Nightly snapshots. We provide nightly snapshots of the libflame source tree, identified by their subversion revision numbers. We strongly encourage interested users to download the latest nightly snapshot instead of the most recent milestone release. These snapshots provide the latest set of functionality and bug fixes, but may be slightly more prone to newer, more short-lived bugs when compared to the most recent stable release. This is simply due to the fact that the snapshot may capture recently-introduced bugs or other forms of breakage before a developer can identify and correct the problem. However, we make every effort to keep interim revisions functional and working as much as possible. That said, if you think you've found a bug, please send us feedback!
Previous milestone releases. The most recent milestone release of libflame is version 3.0. This and other milestone releases may be found here. Note: Even the latest milestone release lacks some of our most recent bug fixes and will be SIGNIFICANTLY MORE BUG-PRONE than the nightly snapshots. Please use a nightly snapshot!

Reference guide

We strongly encourage our users to refer to the latest copy of the libflame user's guide for installation instructions and API reference.

FLAME is a methodology for developing dense linear algebra libraries that is radically different from the LINPACK/LAPACK approach that dates back to the 1970s. By libflame we denote the library that has resulted from this project. For additional information, visit the FLAME home page.

What's provided by libflame?

The following libflame features benefit both basic and advanced users, as well as library developers:

A solution based on fundamental computer science. The FLAME project advocates a new approach to developing linear algebra libraries. Algorithms are obtained systematically according to rigorous principles of formal derivation. These methods are based on fundamental theorems of computer science to guarantee that the resulting algorithm is also correct. In addition, the FLAME methodology uses a new, more stylized notation for expressing loop-based linear algebra algorithms. This notation closely resembles how algorithms are naturally illustrated with pictures. (See Figure 1 and Figure 2 (left).)

Object-based abstractions and API. The BLAS, LAPACK, and ScaLAPACK projects place backward compatibility as a high priority, which hinders progress towards adopting modern software engineering principles such as object abstraction. libflame is built around opaque structures that hide implementation details of matrices, such as leading dimensions, and exports object-based programming interfaces to operate upon these structures. Likewise, FLAME algorithms are expressed (and coded) in terms of smaller operations on sub-partitions of the matrix operands. This abstraction facilitates programming without array or loop indices, which allows the user to avoid painful index-related programming errors altogether. Figure 2 compares the coding styles of libflame and LAPACK, highlighting the inherent elegance of FLAME code and its striking resemblance to the corresponding FLAME algorithm shown in Figure 1. This similarity is quite intentional, as it preserves the clarity of the original algorithm as it would be illustrated on a white-board or in a publication.

Educational value. Aside from the potential to introduce students to formal algorithm derivation, FLAME serves as an excellent vehicle for teaching linear algebra algorithms in a classroom setting. The clean abstractions afforded by the API also make FLAME ideally suited for instruction of high-performance linear algebra courses at the undergraduate and graduate level. Robert van de Geijn routinely uses FLAME in his linear algebra and numerical analysis courses. Some colleagues of the FLAME project are even beginning to use the notation to teach classes elsewhere around the country, including Timothy Mattson of Intel Corporation. Historically, the BLAS/LAPACK style of coding has been used in these settings. However, coding in this manner tends to obscure the algorithms; students often get bogged down debugging the frustrating errors that often result from indexing directly into arrays that represent the matrices. (See Figure 2.)

A complete dense linear algebra framework. Like LAPACK, libflame provides ready-made implementations of common linear algebra operations. The implementations found in libflame mirror many of those found in the BLAS and LAPACK packages. However, unlike LAPACK, libflame provides a framework for building complete custom linear algebra codes. We believe such an environment is more useful as it allows users to quickly prototype a linear algebra solution to fit the needs of their application. We are currently writing a complete user's guide for libflame. In the meantime, users may browse the full list of routines available in libflame through our online doxygen documentation.

High performance. In our publications and performance graphs, we do our best to dispel the myth that user- and programmer-friendly linear algebra codes cannot yield high performance. Our FLAME implementations of operations such as Cholesky factorization and Triangular Inversion often outperform the corresponding implementations available in the LAPACK library. Figure 3 shows an example of the performance increase possible by using libflame compared to LAPACK. Many instances of the libflame performance advantage result from the fact that LAPACK provides only one variant (algorithm) of every operation, while libflame provides all known variants. This allows the user and/or library developer to choose which algorithmic variant is most appropriate for a given situation. libflame relies only on the presence of a core set of highly optimized unblocked routines to perform the small sub-problems found in FLAME algorithm codes. Additional performance results may be found here, at our linear algebra wiki.

Dependency-aware multithreaded parallelism. Until recently, the authors of the BLAS and LAPACK advocated getting shared-memory parallelism from LAPACK routines by simply linking to multithreaded BLAS. This low-level solution requires no changes to LAPACK code but also suffers from sharp limitations in terms of efficiency and scalability for small- and medium-sized matrix problems. The fundamental bottleneck to introducing parallelism directly within many algorithms is the web of data dependencies that inevitably exists between sub-problems. The libflame project has developed a runtime system, SuperMatrix, to detect and analyze dependencies found within FLAME algorithms-by-blocks (algorithms whose sub-problems operate only on block operands). Once dependencies are known, the system schedules sub-operations to independent threads of execution. This system is completely abstracted from the algorithm that is being parallelized and requires virtually no change to the algorithm code, but at the same time exposes abundant high-level parallelism. We have observed that this method provides increased performance for a range of small- and medium-sized problems, as shown in Figure 4. The most recent version of LAPACK does not offer any similar mechanism.
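The dependency analysis described above can be illustrated with a small, self-contained Python sketch. It is not the SuperMatrix implementation: the task list, the function names (cholesky_tasks, build_dependencies, schedule_waves), and the "wave" scheduling policy are all assumptions made for illustration. Each task of a Cholesky algorithm-by-blocks names the blocks it reads and the block it updates; dependencies follow from tracking the last writer and the subsequent readers of every block.

```python
def cholesky_tasks(nblocks):
    """List the tasks of a right-looking Cholesky by blocks in program order.

    Each task is (name, blocks_read, block_updated); the updated block is
    also read, since every task here is a read-modify-write of its output.
    """
    tasks = []
    for k in range(nblocks):
        tasks.append(("CHOL", [], (k, k)))
        for i in range(k + 1, nblocks):
            tasks.append(("TRSM", [(k, k)], (i, k)))
        for i in range(k + 1, nblocks):
            for j in range(k + 1, i + 1):
                if i == j:
                    tasks.append(("SYRK", [(i, k)], (i, i)))
                else:
                    tasks.append(("GEMM", [(i, k), (j, k)], (i, j)))
    return tasks


def build_dependencies(tasks):
    """Return, for each task, the set of earlier tasks it must wait for."""
    last_writer = {}   # block -> index of the task that last updated it
    readers = {}       # block -> tasks that read it since its last update
    deps = [set() for _ in tasks]
    for t, (_, reads, write) in enumerate(tasks):
        # Flow and output dependencies: wait for the last writer of each
        # block this task reads or updates.
        for blk in list(reads) + [write]:
            if blk in last_writer:
                deps[t].add(last_writer[blk])
        # Anti dependencies: wait for tasks still reading the old value of
        # the block this task is about to overwrite.
        for r in readers.get(write, ()):
            deps[t].add(r)
        for blk in reads:
            readers.setdefault(blk, []).append(t)
        last_writer[write] = t
        readers[write] = []
    return deps


def schedule_waves(deps):
    """Group tasks into waves; tasks in one wave are mutually independent."""
    level = [0] * len(deps)
    for t, d in enumerate(deps):  # program order: dependencies come first
        level[t] = 1 + max((level[p] for p in d), default=-1)
    waves = {}
    for t, lvl in enumerate(level):
        waves.setdefault(lvl, []).append(t)
    return [waves[lvl] for lvl in sorted(waves)]


if __name__ == "__main__":
    tasks = cholesky_tasks(3)
    for wave in schedule_waves(build_dependencies(tasks)):
        print([f"{tasks[t][0]}{tasks[t][2]}" for t in wave])
```

A real runtime would hand ready tasks to idle threads as soon as their dependencies complete rather than waiting for a whole wave to finish; the waves merely make the available parallelism visible.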

Support for hierarchical storage-by-blocks. Storing matrices by blocks, a concept advocated years ago by Fred Gustavson of IBM, often yields performance gains through improved spatial locality. Instead of representing matrices as a single linear array of data with a prescribed leading dimension as legacy libraries require (for column- or row-major order), the storage scheme is encoded into the matrix object. Here, internal elements refer recursively to child objects that represent sub-matrices. Currently, libflame provides a subset of the conventional API that supports hierarchical matrices, allowing users to create and manage such matrix objects as well as convert between storage-by-blocks and conventional "flat" storage schemes.
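To make the idea of storage-by-blocks concrete, here is a minimal Python sketch that converts between a "flat" matrix (a list of rows) and a one-level hierarchy in which each element of the outer matrix is itself a small matrix. The function names flat_to_blocks and blocks_to_flat are hypothetical and do not correspond to libflame routines; libflame's own objects may also nest more than one level deep.

```python
def flat_to_blocks(A, nb):
    """Split a list-of-rows matrix into a matrix of nb x nb blocks.

    Blocks on the bottom and right edges may be smaller when the matrix
    dimensions are not multiples of nb.
    """
    m, n = len(A), len(A[0])
    return [[[row[j:j + nb] for row in A[i:i + nb]]
             for j in range(0, n, nb)]
            for i in range(0, m, nb)]


def blocks_to_flat(H):
    """Reassemble a matrix of blocks into a flat list-of-rows matrix."""
    flat = []
    for block_row in H:
        for r in range(len(block_row[0])):   # rows within this block row
            row = []
            for blk in block_row:
                row.extend(blk[r])
            flat.append(row)
    return flat


if __name__ == "__main__":
    A = [[10 * i + j for j in range(7)] for i in range(5)]
    H = flat_to_blocks(A, 2)
    print(len(H), "x", len(H[0]), "blocks; block (1, 2) =", H[1][2])
    assert blocks_to_flat(H) == A
```

Because each block is stored as its own small object, an operation on one block touches memory that is contiguous for that block, which is the locality benefit the paragraph above refers to.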

Advanced build system. From its early revisions, libflame distributions have been bundled with a robust build system, featuring automatic makefile creation and a configuration script conforming to GNU standards (allowing the user to run the ./configure; make; make install sequence common to many open source software projects). Without any user input, the configure script searches for and chooses compilers based on a pre-defined preference order for each architecture. The user may request specific compilers via the configure interface, or enable other non-default features of libflame such as custom memory alignment, multithreading (via POSIX threads or OpenMP), compiler options (debugging symbols, warnings, optimizations), and memory leak detection. The reference BLAS and LAPACK libraries provide no configuration support and require the user to manually modify a makefile with appropriate references to compilers and compiler options depending on the host architecture.

Windows support. While libflame was originally developed for GNU/Linux and UNIX environments, we have in the course of its development had the opportunity to port the library to Microsoft Windows. The Windows port features a separate build system implemented with Python and nmake, the Microsoft analogue to the make utility found in UNIX-like environments. As of this writing, the port is still very new and therefore should be considered experimental. However, we feel libflame for Windows is very close to useable for many in our audience, particularly those who consider themselves experts. We invite interested users to try the software and, of course, we welcome feedback to help improve our Windows support, and libflame in general.

Independence from Fortran and LAPACK. The libflame development team is pleased to offer a high-performance linear algebra solution that is 100% Fortran-free. libflame is a C-only implementation and does not depend on any external Fortran libraries, such as LAPACK. That said, we happily provide an optional backward compatibility layer, lapack2flame, that maps legacy LAPACK routine invocations to their corresponding native C implementations in libflame. This allows legacy applications to start taking advantage of libflame with virtually no changes to their source code. Furthermore, we understand that some users wish to leverage highly-optimized implementations that conform to the LAPACK interface, such as Intel's Math Kernel Library (MKL). As such, we allow those users to configure libflame such that their external LAPACK implementation is called for the small, performance-sensitive unblocked subproblems that arise within libflame's blocked algorithms and algorithms-by-blocks.

Figure 1: Blocked Cholesky Factorization (variant 2) expressed as a FLAME algorithm.

      SUBROUTINE DPOTRF( UPLO, N, A, LDA, INFO )

      CHARACTER          UPLO
      INTEGER            INFO, LDA, N
      DOUBLE PRECISION   A( LDA, * )

      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
      LOGICAL            UPPER
      INTEGER            J, JB, NB
      LOGICAL            LSAME
      INTEGER            ILAENV
      EXTERNAL           LSAME, ILAENV
      EXTERNAL           DGEMM, DPOTF2, DSYRK, DTRSM, XERBLA
      INTRINSIC          MAX, MIN

      INFO = 0
      UPPER = LSAME( UPLO, 'U' )
      IF( .NOT.UPPER .AND. .NOT.LSAME( UPLO, 'L' ) ) THEN
         INFO = -1
      ELSE IF( N.LT.0 ) THEN
         INFO = -2
      ELSE IF( LDA.LT.MAX( 1, N ) ) THEN
         INFO = -4
      END IF
      IF( INFO.NE.0 ) THEN
         CALL XERBLA( 'DPOTRF', -INFO )
         RETURN
      END IF

      IF( N.EQ.0 )
     $   RETURN

      NB = ILAENV( 1, 'DPOTRF', UPLO, N, -1, -1, -1 )
      IF( NB.LE.1 .OR. NB.GE.N ) THEN
         CALL DPOTF2( UPLO, N, A, LDA, INFO )
      ELSE
         IF( UPPER ) THEN
*********** Upper triangular case omitted for purposes of fair comparison.
         ELSE
            DO 20 J = 1, N, NB
               JB = MIN( NB, N-J+1 )
               CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $                     A( J, 1 ), LDA, ONE, A( J, J ), LDA )
               CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
               IF( INFO.NE.0 )
     $            GO TO 30
               IF( J+JB.LE.N ) THEN
                  CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                        J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                        LDA, ONE, A( J+JB, J ), LDA )
                  CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                        N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                        A( J+JB, J ), LDA )
               END IF
   20       CONTINUE
         END IF
      END IF
      GO TO 40
   30 CONTINUE
      INFO = INFO + J - 1
   40 CONTINUE
      RETURN
      END

Figure 2: FLAME/C code for algorithm shown in Figure 1 (left) representing the style of coding found in libflame, and Fortran-77 LAPACK code (right) implementing the same algorithm.
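For readers who want to experiment with the algorithm itself, the lower-triangular loop of the Fortran code above can be transcribed into a short, self-contained Python sketch. This is neither libflame nor LAPACK code: the function names are hypothetical, the matrix is a plain list of rows with 0-based indices, and the INFO error code is replaced by an exception. Each step is labeled with the LAPACK/BLAS call it mirrors.

```python
import math


def chol_unblocked(A, j0, jb):
    """In-place lower Cholesky of the jb x jb diagonal block at (j0, j0).

    Plays the role of DPOTF2 in the Fortran code.
    """
    for j in range(j0, j0 + jb):
        for k in range(j0, j):
            A[j][j] -= A[j][k] * A[j][k]
        if A[j][j] <= 0.0:
            raise ValueError("matrix is not positive definite")
        A[j][j] = math.sqrt(A[j][j])
        for i in range(j + 1, j0 + jb):
            for k in range(j0, j):
                A[i][j] -= A[i][k] * A[j][k]
            A[i][j] /= A[j][j]


def chol_blocked_var2(A, nb):
    """Overwrite the lower triangle of A with its Cholesky factor L."""
    n = len(A)
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # DSYRK: A11 := A11 - A10 * A10^T (lower triangle only)
        for r in range(j, j + jb):
            for c in range(j, r + 1):
                A[r][c] -= sum(A[r][k] * A[c][k] for k in range(j))
        # DPOTF2: A11 := chol(A11)
        chol_unblocked(A, j, jb)
        # DGEMM: A21 := A21 - A20 * A10^T
        for r in range(j + jb, n):
            for c in range(j, j + jb):
                A[r][c] -= sum(A[r][k] * A[c][k] for k in range(j))
        # DTRSM: A21 := A21 * inv(L11^T), solved one row at a time
        for r in range(j + jb, n):
            for c in range(j, j + jb):
                A[r][c] -= sum(A[r][k] * A[c][k] for k in range(j, c))
                A[r][c] /= A[c][c]
    return A


if __name__ == "__main__":
    A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
    for row in chol_blocked_var2(A, 2):
        print(row)
```

The explicit index arithmetic in this transcription is exactly what the FLAME notation is designed to hide; the sketch is meant only to make the four sub-operations of each iteration easy to trace.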

Figure 3: Cholesky Factorization implementations compared on an 8-core Opteron system. Notes: For FLAME experiments, LAPACK was used only for the small unblocked Cholesky subproblem. GotoBLAS was configured to provide multithreaded parallelism for level-3 BLAS operations. Peak system performance is 38.4 GFLOPS.

Figure 4: Cholesky Factorization implementations compared on a 16-core Itanium2 system. Notes: libflame uses variant 3 while LAPACK uses variant 2. For non-SuperMatrix experiments, GotoBLAS was configured to provide multithreaded parallelism for level-3 BLAS operations. For SuperMatrix experiments, GotoBLAS parallelism was disabled. Theoretical peak system performance is 96 GFLOPS.

What's new in libflame?

We've added lots of functionality since libflame 3.0 was released on May 1, 2009. Here is a basic summary:

Library API and implementations

Added optimized unblocked variants for all LAPACK-level operations, allowing the user to build libflame such that it is completely independent of LAPACK implementations. libflame now relies only upon the presence of a BLAS library at link-time.
Added a new layer of abstraction, the BLAS-like Interface Subprograms (BLIS), allowing us to bury much of the complexity that arises from needing to define BLAS-like operations (such as gemv where both the matrix and vector can be conjugated) in terms of conventional BLAS operations.
The lapack2flame compatibility layer has been completely rewritten in C and is now bundled with libflame instead of being a separate library.
Implemented an algorithm-by-blocks for LU with partial pivoting, with full support in SuperMatrix for dependency-aware shared-memory parallelism.
Deprecated old Fortran wrappers to native libflame APIs.
Many minor interface and implementation improvements.
Many other bug fixes and cleanups.
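The BLIS item in the list above mentions gemv with optional conjugation of both the matrix and the vector, an operation the conventional BLAS interface does not offer directly. One way such an operation can be expressed in terms of a conventional gemv is sketched below in Python. This is an illustration only, not the BLIS implementation: the function names and the boolean flags are assumptions, and it relies on the identity conj(A) * x = conj(A * conj(x)).

```python
def gemv(alpha, A, x, beta, y):
    """Conventional no-transpose gemv: returns alpha * A * x + beta * y."""
    return [alpha * sum(a * xj for a, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def conj_vec(v):
    """Return the element-wise complex conjugate of a vector."""
    return [complex(e).conjugate() for e in v]


def gemv_ext(conj_a, conj_x, alpha, A, x, beta, y):
    """Return alpha * op(A) * op(x) + beta * y.

    op() conjugates its argument when the matching flag is True and
    leaves it unchanged otherwise.
    """
    if conj_x:
        x = conj_vec(x)
    if not conj_a:
        return gemv(alpha, A, x, beta, y)
    # conj(A) * x == conj(A * conj(x)): conjugate the vector, call the
    # conventional gemv, then conjugate the product.
    t = conj_vec(gemv(1.0, A, conj_vec(x), 0.0, [0.0] * len(A)))
    return [alpha * ti + beta * yi for ti, yi in zip(t, y)]


if __name__ == "__main__":
    A = [[1 + 2j, 3 - 1j], [1j, 2 + 0j]]
    x = [1 - 1j, 2 + 3j]
    y = [1j, 1 + 0j]
    print(gemv_ext(True, True, 2 - 1j, A, x, 0.5, y))
```

Wrapping the conjugation logic in one place is the kind of complexity hiding the item describes: callers see a single, uniform interface, and the workaround lives in one routine.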

Build system

Removed Fortran as a requirement to building libflame for both GNU/Linux and Microsoft Windows build systems.
Added the external-lapack-interfaces option, which allows the user to activate wrappers to LAPACK routines.
Added the external-lapack-for-subproblems option, which allows the user to utilize an external LAPACK implementation within libflame, but only for the smallest unblocked subproblems that arise within libflame's blocked algorithms and algorithms-by-blocks.

Status of operation support

libflame contains implementations of many operations that are provided by the BLAS and LAPACK libraries. However, not all FLAME implementations support every datatype. Also, in many cases, we use a different naming convention for our routine names. The following table summarizes which routines are supported within libflame and also provides their corresponding netlib name for reference.

Notes:

y These routines are provided by libflame.
? Expands to one of {sdcz}.
~ These routines are not provided by LAPACK.
+ The LAPACK routine ?potri() differs from FLA_SPDinv() and FLASH_SPDinv() in that ?potri() requires the user to invoke the Cholesky factorization manually and then pass in the result as input, whereas the FLAME implementations perform the Cholesky factorization internally and automatically.
^ LAPACK provides only an unblocked implementation of the triangular Sylvester equation solver. The lapack2flame compatibility interface maps invocations of ?trsyl() to the blocked implementation in libflame.
* Invocations of routines with the FLASH_ prefix call SuperMatrix by default. If SuperMatrix was not enabled at configure-time, or it was disabled at runtime with FLASH_Queue_disable(), then FLASH_ routines execute sequentially, though they will still use hierarchical storage.

operation name	netlib routine name	libflame routine name	FLAME/C	FLASH	SuperMatrix	type support	l2f support
libflame routine prefix			FLA_	FLASH_*	FLASH_
Level-3 BLAS
general matrix-matrix multiply	?gemm	Gemm	y	y	y	sdcz	N/A
hermitian matrix-matrix multiply	?hemm	Hemm	y	y	y	sdcz	N/A
hermitian rank-k update	?herk	Herk	y	y	y	sdcz	N/A
hermitian rank-2k update	?her2k	Her2k	y	y	y	sdcz	N/A
symmetric matrix-matrix multiply	?symm	Symm	y	y	y	sdcz	N/A
symmetric rank-k update	?syrk	Syrk	y	y	y	sdcz	N/A
symmetric rank-2k update	?syr2k	Syr2k	y	y	y	sdcz	N/A
triangular matrix-matrix multiply	?trmm	Trmm	y	y	y	sdcz	N/A
triangular solve with multiple right-hand sides	?trsm	Trsm	y	y	y	sdcz	N/A
LAPACK
Cholesky factorization	?potrf	Chol	y	y	y	sdcz	sdcz
LU factorization with no pivoting	~	LU_nopiv	y	y	y	sdcz	N/A
LU factorization with partial pivoting	?getrf	LU_piv	y	y	y	sdcz	sdcz
LU factorization with incremental pivoting	~	LU_incpiv		y	y	sdcz	N/A
QR factorization via the UT transform	~	QR_UT	y			sdcz	sdcz
QR factorization (incremental) via the UT transform	~	QR_UT_inc		y	y	sdcz	N/A
Apply Q from QR_UT or LQ_UT to a right-hand side matrix	~	Apply_Q_UT	y			sdcz	sdcz
Apply Q from QR_UT_inc to a right-hand side matrix	~	Apply_Q_UT_inc		y	y	sdcz	N/A
LQ factorization via the UT transform	~	LQ_UT	y			sdcz	sdcz
Triangular matrix inversion	?trtri	Trinv	y	y	y	sdcz	sdcz
triangular transpose matrix-matrix multiply	?lauum	Ttmm	y	y	y	sdcz	sdcz
SPD matrix inversion	?potri +	SPDinv	y	y	y	sdcz	sdcz
Triangular Sylvester equation solve	?trsyl ^	Sylv	y	y	y	sdcz	sdcz
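The "+" note above distinguishes an SPD inversion that takes the Cholesky factor as input from one that performs the factorization internally. The Python sketch below illustrates the difference, assuming the inversion is organized as three of the operations in the table: Cholesky factorization, triangular inversion, and triangular transpose matrix-matrix multiply. The function names are hypothetical stand-ins, not libflame or LAPACK calls, and the page itself does not state how FLA_SPDinv() is structured internally.

```python
import math


def chol(A):
    """Return the lower Cholesky factor L with A = L * L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def trinv(L):
    """Return the inverse of a lower triangular matrix."""
    n = len(L)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        M[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * M[k][j] for k in range(j, i))
            M[i][j] = -s / L[i][i]
    return M


def ttmm(M):
    """Return M^T * M for a lower triangular M."""
    n = len(M)
    return [[sum(M[k][i] * M[k][j] for k in range(max(i, j), n))
             for j in range(n)]
            for i in range(n)]


def potri_like(L):
    """Caller supplies the Cholesky factor, in the style of ?potri()."""
    return ttmm(trinv(L))


def spd_inv(A):
    """Factorization happens internally, in the style of FLA_SPDinv()."""
    return potri_like(chol(A))


if __name__ == "__main__":
    A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
    for row in spd_inv(A):
        print(row)
```

Since A = L * L^T, the inverse is inv(L)^T * inv(L), which is why the last step is a product of the inverted triangular factor with its own transpose.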

LAPACK compatibility support in libflame

We provide an interface, lapack2flame, which allows legacy codes that link to LAPACK to utilize libflame without any code changes. However, lapack2flame does not provide interfaces to all routines within LAPACK. The column labeled "l2f support" in the above table shows which datatypes are supported for each operation.

System and software requirements

Please see the libflame user's guide for the latest system requirements for GNU/Linux, UNIX, and Windows platforms.

Building and Installing libflame

Please see the libflame user's guide for the latest instructions on downloading, configuring, compiling, and installing libflame.

Building and installing GotoBLAS

The developers of libflame enthusiastically encourage users to use the GotoBLAS implementation of the Basic Linear Algebra Subprograms (BLAS). To obtain the source code for GotoBLAS, please visit the Texas Advanced Computing Center software site. After downloading, perform the following steps:

tar xzf GotoBLAS-1.22.tar.gz
cd GotoBLAS
Please read the documentation that accompanies the GotoBLAS source.
Most users may build the GotoBLAS library by running quickbuild.32bit or quickbuild.64bit. Alternatively, advanced users may instead view and edit Makefile.rule and then execute:

make lib
Copy the library archive to a more permanent directory. You should also symbolically link the libgoto library to an abbreviated name:
ln -s libgoto_ITANIUM2-r1.10.a libgoto.a
If multiple architecture builds of libgoto share the same directory, then you should include an architecture substring in the symbolic link name to differentiate the builds:
ln -s libgoto_ITANIUM2-r1.10.a libgoto_ia64.a

We highly recommend using libflame with GotoBLAS! However, libflame will work with any BLAS library. If you want to use libflame with a different BLAS, use the configure-time option --disable-goto-interfaces before building libflame. If you have further questions about interfacing libflame with your preferred BLAS library, contact flame@cs.utexas.edu.

Linking your LAPACK-dependent application to libflame

Please see the libflame user's guide for the latest instructions on linking your legacy, LAPACK-dependent application to libflame.

Running an Example

We offer a step-by-step walkthrough for running two example programs included in the libflame source distribution: the first executes a sequential Cholesky factorization with conventional ("flat") matrix storage; the second executes a multithreaded Cholesky factorization using SuperMatrix and hierarchical storage.

We also encourage potential users to browse the code examples provided at our linear algebra wiki.

Beyond LAPACK

We have functionality beyond LAPACK. For example, we have routines for updating an LU factorization with pivoting. Adding more operations is not our top priority at the moment. However, if you have an operation that you would like to see supported, it doesn't hurt to contact us with your request!

Thank us!

We are very insecure people. So, if you like the libraries and find them useful, send us a message! We even make it easy. In the top-level directory of the libflame distribution, execute:

make send-thanks

This will automatically e-mail us a message!

Questions?

Contact flame@cs.utexas.edu.

Last Updated on 29 September 2009 by Field G. Van Zee.