BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The ...
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-...
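To make the central abstraction concrete, below is a minimal sketch, in plain C, of the kind of gemm microkernel around which such a framework organizes its level-3 computation: one small routine that updates an MR x NR block of C from packed micropanels of A and B. The register block sizes, packing layout, and routine name are assumptions of this sketch, not BLIS's actual kernel interface.

    #define MR 4
    #define NR 4

    /* C (an MR x NR block, column-major, leading dimension ldc) +=
       A-micropanel (k contiguous columns of length MR) times
       B-micropanel (k contiguous rows of length NR).               */
    void gemm_ukernel( int k, const double *a, const double *b,
                       double *c, int ldc )
    {
        double ab[ MR * NR ] = { 0.0 };

        for ( int p = 0; p < k; ++p )      /* one rank-1 update per iteration */
            for ( int j = 0; j < NR; ++j )
                for ( int i = 0; i < MR; ++i )
                    ab[ j * MR + i ] += a[ p * MR + i ] * b[ p * NR + j ];

        for ( int j = 0; j < NR; ++j )     /* accumulate the result into C    */
            for ( int i = 0; i < MR; ++i )
                c[ j * ldc + i ] += ab[ j * MR + i ];
    }

In an optimized instantiation, only a loop body of this shape is specialized (with vector intrinsics or assembly) for each architecture; the blocking and packing layers above it are reused unchanged.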
How do experts navigate the huge space of implementations for a given specification to find an efficient choice with minimal searching? Answer: They use "heuristics" -- rules of thumb that are more street wisdom than scientific fact. We provide a ...
We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple ...
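For reference, the building block of both QR algorithms is the Givens rotation, which acts on two rows (or columns) at a time:

\[
G = \begin{pmatrix} \gamma & \sigma \\ -\sigma & \gamma \end{pmatrix},
\qquad \gamma^2 + \sigma^2 = 1 .
\]

Applying one rotation across two rows of length m costs 6m flops while touching only 2m matrix elements, so performance hinges on revisiting those rows while they still reside in cache; a cache-friendly scheme for applying many rotations targets exactly that reuse.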
Many dense linear algebra (DLA) operations are easy to understand at a high level and users get functional DLA code on new hardware relatively quickly. As a result, many people consider DLA to be a "solved domain." The truth is that DLA is not solved. ...
Parallelizing dense matrix computations to distributed-memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid-1990s, still enjoy regular ...
In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocessing step when computing the singular values of a dense matrix. The key is a reordering of the ...
Out-of-core implementations of algorithms for dense matrix computations have traditionally focused on optimal use of memory so as to minimize I/O, often trading programmability for performance. In this article we show how the current state of hardware ...
The efforts of an expert to parallelize and optimize a dense linear algebra algorithm for distributed-memory targets are largely mechanical and repetitive. We demonstrate that these efforts can be encoded and automatically applied to obviate the manual ...
We present high-performance algorithms for up-and-downdating a Cholesky factor or QR factorization. The method uses Householder-like transformations, sometimes called hyperbolic Householder transformations, that are accumulated so that most computation ...
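For reference, a hyperbolic Householder transformation with signature matrix \(\Sigma = \operatorname{diag}(\pm 1)\) takes the form

\[
H = I - \frac{2}{v^T \Sigma v}\, v\, v^T \Sigma ,
\qquad\text{so that}\qquad
H^T \Sigma H = \Sigma ,
\]

and reduces to an ordinary Householder reflector when \(\Sigma = I\). Accumulating a sequence of such transformations into a blocked form is what allows most of the computation to be cast in terms of matrix-matrix operations.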
We describe parallel implementations of LU factorization with pivoting for multicore architectures. Implementations that differ along two dimensions are discussed: (1) using classical partial pivoting versus recently proposed incremental pivoting ...
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and ...
In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further ...
A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. ...
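The flavor of the approach can be illustrated with the symmetric rank-k update C := C + A A^T (the level-3 BLAS operation syrk). The sketch below, which assumes a CBLAS interface and an arbitrary block size NB, performs all but a small diagonal remainder of the work through calls to dgemm, so the operation inherits the performance of the optimized matrix-matrix multiply:

    #include <cblas.h>

    enum { NB = 128 };   /* illustrative block size */

    /* Lower triangle of C (n x n) += A (n x k) * A^T, column-major. */
    void syrk_via_gemm( int n, int k, const double *A, int lda,
                        double *C, int ldc )
    {
        for ( int j = 0; j < n; j += NB ) {
            int jb = ( n - j < NB ) ? n - j : NB;

            /* Blocks strictly below the diagonal block: one big gemm. */
            if ( j + jb < n )
                cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,
                             n - j - jb, jb, k,
                             1.0, &A[ j + jb ], lda, &A[ j ], lda,
                             1.0, &C[ ( j + jb ) + j * ldc ], ldc );

            /* Small diagonal block: update only its lower triangle. */
            for ( int p = 0; p < k; ++p )
                for ( int jj = j; jj < j + jb; ++jj )
                    for ( int ii = jj; ii < j + jb; ++ii )
                        C[ ii + jj * ldc ] += A[ ii + p * lda ] * A[ jj + p * lda ];
        }
    }

For large n, the gemm calls account for essentially all of the floating-point operations.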
We study the high-performance implementation of the inversion of a Symmetric Positive Definite (SPD) matrix on architectures ranging from sequential processors to Symmetric MultiProcessors to distributed memory parallel computers. This inversion is ...
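The mathematics underlying the computation, for reference: with the Cholesky factorization of the SPD matrix in hand,

\[
A = L L^T
\qquad\Longrightarrow\qquad
A^{-1} = \left( L L^T \right)^{-1} = L^{-T} L^{-1} ,
\]

the inversion can be organized as three sweeps over the matrix: a Cholesky factorization, the inversion of the triangular factor L, and the triangular product L^{-T} L^{-1}.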
We show how to compute an LU factorization of a matrix when the factors of a leading principal submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the ...
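Ignoring the pivoting for a moment, the structure being exploited is the following: partition

\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\]

and suppose the factorization A_{11} = L_{11} U_{11} of the leading principal submatrix is already known. The remaining factors then satisfy

\[
U_{12} = L_{11}^{-1} A_{12}, \qquad
L_{21} = A_{21} U_{11}^{-1}, \qquad
L_{22} U_{22} = A_{22} - L_{21} U_{12},
\]

so completing the factorization requires only two triangular solves and an LU factorization of the updated trailing matrix; incremental pivoting addresses how to do this stably without redoing the already-factored part.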
We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with ...
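The layered structure can be sketched as follows. The block sizes, the packing routine, and the inner macro-kernel are deliberately naive stand-ins here; in an actual implementation each is chosen or hand-tuned to match a level of the memory hierarchy.

    #include <stdlib.h>

    enum { MC = 256, KC = 256, NC = 4096 };  /* illustrative block sizes */

    /* Pack an m x k submatrix of X (column-major, leading dim ldx)
       into a contiguous m x k buffer.                                */
    static void pack( int m, int k, const double *X, int ldx, double *buf )
    {
        for ( int j = 0; j < k; ++j )
            for ( int i = 0; i < m; ++i )
                buf[ i + j * m ] = X[ i + j * ldx ];
    }

    /* Naive stand-in for the highly tuned inner kernel:
       C (mc x nc) += Ap (mc x kc) * Bp (kc x nc).         */
    static void macro_kernel( int mc, int nc, int kc,
                              const double *Ap, const double *Bp,
                              double *C, int ldc )
    {
        for ( int j = 0; j < nc; ++j )
            for ( int p = 0; p < kc; ++p )
                for ( int i = 0; i < mc; ++i )
                    C[ i + j * ldc ] += Ap[ i + p * mc ] * Bp[ p + j * kc ];
    }

    /* C := C + A*B with A m x k, B k x n, all column-major. */
    void gemm( int m, int n, int k, const double *A, int lda,
               const double *B, int ldb, double *C, int ldc )
    {
        double *Ap = malloc( MC * KC * sizeof *Ap );  /* cache-resident block of A */
        double *Bp = malloc( KC * NC * sizeof *Bp );  /* reused panel of B         */

        for ( int jc = 0; jc < n; jc += NC ) {
            int nc = ( n - jc < NC ) ? n - jc : NC;
            for ( int pc = 0; pc < k; pc += KC ) {
                int kc = ( k - pc < KC ) ? k - pc : KC;
                pack( kc, nc, &B[ pc + jc * ldb ], ldb, Bp );
                for ( int ic = 0; ic < m; ic += MC ) {
                    int mc = ( m - ic < MC ) ? m - ic : MC;
                    pack( mc, kc, &A[ ic + pc * lda ], lda, Ap );
                    macro_kernel( mc, nc, kc, Ap, Bp,
                                  &C[ ic + jc * ldc ], ldc );
                }
            }
        }
        free( Ap );
        free( Bp );
    }

Two of the decisions the architecture model justifies are visible even in this sketch: the mc x kc block of A is packed into a buffer sized to stay resident in cache, and the packed panel of B is reused across the entire loop over ic.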
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and ...
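A minimal illustration of the kind of transformation involved (ordinary C rather than FLAME notation): once an algorithm is expressed as a loop over independent block computations, a single OpenMP directive exposes the parallelism.

    #include <omp.h>

    /* Illustrative trailing-matrix update C := C - A * B, swept through
       by block columns; the block columns are independent, so the loop
       parallelizes directly.                                            */
    void blocked_update( int m, int n, int k, int nb,
                         const double *A, int lda,
                         const double *B, int ldb,
                         double *C, int ldc )
    {
        #pragma omp parallel for schedule(dynamic)
        for ( int j = 0; j < n; j += nb ) {
            int jb = ( n - j < nb ) ? n - j : nb;
            for ( int jj = j; jj < j + jb; ++jj )    /* one block column */
                for ( int p = 0; p < k; ++p )
                    for ( int i = 0; i < m; ++i )
                        C[ i + jj * ldc ] -= A[ i + p * lda ] * B[ p + jj * ldb ];
        }
    }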
This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such ...
As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how ...
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those ...
In this article, a modification of the blocked algorithm for reduction to Hessenberg form is presented that improves performance by shifting more computation from less efficient matrix-vector operations to highly efficient matrix-matrix operations. ...
A theorem related to the accumulation of Householder transformations into a single orthogonal transformation known as the compact WY transform is presented. It provides a simple characterization of the computation of this transformation and suggests an ...
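For context, the standard compact WY representation and the accumulation it involves (this is the textbook form; the theorem in the article provides a simpler characterization of the triangular factor): a product of Householder transformations H_i = I - \tau_i u_i u_i^T is maintained as

\[
H_1 H_2 \cdots H_b = I - U\, T\, U^T ,
\]

with U the matrix of Householder vectors and T upper triangular. The representation is extended one transformation at a time via

\[
\left( I - U T U^T \right)\left( I - \tau\, u\, u^T \right)
= I -
\begin{pmatrix} U & u \end{pmatrix}
\begin{pmatrix} T & -\tau\, T\, U^T u \\ 0 & \tau \end{pmatrix}
\begin{pmatrix} U & u \end{pmatrix}^{T} .
\]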
Traditional collective communication algorithms are designed with the assumption that a node can communicate with only one other node at a time. On new parallel architectures such as the IBM Blue Gene/L, a node can communicate with multiple nodes ...
We show how to exploit high-level information, available as part of the derivation of provably correct algorithms, so that SMP parallelism can be systematically identified. Recent research has shown that loop-based dense linear algebra algorithms can be ...
This article discusses the high-performance parallel implementation of the computation and updating of QR factorizations of dense matrices, including problems large enough to require out-of-core computation, where the matrix is stored on disk. The ...
In this article, we present a number of Application Program Interfaces (APIs) for coding linear algebra algorithms. On the surface, these APIs for the MATLAB M-script and C programming languages appear to be simple, almost trivial, extensions of those ...
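The flavor of the FLAME/C API can be seen in the following fragment, which walks an unblocked Cholesky factorization (overwriting the lower triangle of A) through the matrix without a single loop or array index. The partitioning routines shown are those of the published API; the fragment is a sketch in the style of the papers and has not been checked against a specific libflame release.

    FLA_Obj ATL, ATR, A00,  a01,     A02,
            ABL, ABR, a10t, alpha11, a12t,
                      A20,  a21,     A22;

    FLA_Part_2x2( A,    &ATL, &ATR,
                        &ABL, &ABR,    0, 0, FLA_TL );

    while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) ) {

        FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,    &A00,  /**/ &a01,     &A02,
                            /* ************* */ /* ************************ */
                                                 &a10t, /**/ &alpha11, &a12t,
                               ABL, /**/ ABR,    &A20,  /**/ &a21,     &A22,
                               1, 1, FLA_BR );
        /*------------------------------------------------------------*/
        FLA_Sqrt( alpha11 );                  /* alpha11 := sqrt( alpha11 ) */
        FLA_Inv_scal( alpha11, a21 );         /* a21 := a21 / alpha11       */
        FLA_Syr( FLA_LOWER_TRIANGULAR,
                 FLA_MINUS_ONE, a21, A22 );   /* A22 := A22 - a21 * a21^T   */
        /*------------------------------------------------------------*/
        FLA_Cont_with_3x3_to_2x2( &ATL, /**/ &ATR,   A00,  a01,     /**/ A02,
                                                     a10t, alpha11, /**/ a12t,
                                /* ************** */ /* ********************** */
                                  &ABL, /**/ &ABR,   A20,  a21,     /**/ A22,
                                  FLA_TL );
    }

The Part/Repart/Continue calls mirror exactly how the algorithm is presented in the formal derivations, which is what makes the translation from derived algorithm to code nearly mechanical.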
In this article we present a systematic approach to the derivation of families of high-performance algorithms for a large set of frequently encountered dense linear algebra operations. As part of the derivation a constructive proof of the correctness of ...
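An example of the kind of object the derivation manipulates, using the Cholesky factorization A = L L^T as the running operation: partitioning the operands yields the recursive specification

\[
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & L_{BR} \end{pmatrix},
\qquad
\begin{aligned}
L_{TL} &= \operatorname{Chol}( A_{TL} ), \\
L_{BL} &= A_{BL}\, L_{TL}^{-T}, \\
L_{BR} &= \operatorname{Chol}( A_{BR} - L_{BL} L_{BL}^{T} ) .
\end{aligned}
\]

Each loop invariant is a partial form of this result (for instance: A_{TL} and A_{BL} have been overwritten by L_{TL} and L_{BL}, while A_{BR} is untouched), and each invariant that can be maintained through a loop iteration yields one provably correct member of the family of algorithms.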
In this paper we apply a formal approach for the derivation of dense linear algebra algorithms to the triangular Sylvester equation. The result is a large family of provably correct algorithms. By using a coding style that reflects the algorithms as ...
Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional ...
Over the past twenty years, dense linear algebra libraries have gone through three generations of public-domain, general-purpose packages. In the seventies, the first generation of packages consisted of EISPACK and LINPACK, which implemented a broad spectrum of ...