BLIS Retreat 2014

Contributed talks


The MPI+MPI programming model and why we need shared-memory MPI libraries

Jeff Hammond

The MPI-3 standard provides a portable interface to interprocess shared memory through its RMA functionality. This allows applications to leverage shared-memory programming within a strictly MPI paradigm, which mitigates some of the challenges of MPI+X programming with threads: shared-by-default behavior and race conditions, NUMA effects, and Amdahl's law. I will describe the MPI shared-memory capability and how it might be targeted by existing multithreaded libraries.
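
For readers unfamiliar with this part of MPI-3, the following minimal C sketch (an illustration added here, not code from the talk) shows the basic pattern: ranks on the same node are grouped with MPI_Comm_split_type, a shared window is created with MPI_Win_allocate_shared, and each rank's slot is then accessed through ordinary loads and stores.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Group the ranks that can actually share memory (those on one node). */
        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);

        int rank, size;
        MPI_Comm_rank(shmcomm, &rank);
        MPI_Comm_size(shmcomm, &size);

        /* Each rank contributes one double to a window shared across the node. */
        double *mine;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                                shmcomm, &mine, &win);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
        *mine = (double)rank;     /* plain store into the shared window */
        MPI_Win_sync(win);        /* make the store visible             */
        MPI_Barrier(shmcomm);     /* wait until every rank has written  */

        if (rank == 0) {
            /* Rank 0 reads every rank's slot through a direct pointer. */
            double sum = 0.0;
            for (int r = 0; r < size; r++) {
                MPI_Aint sz; int disp; double *ptr;
                MPI_Win_shared_query(win, r, &sz, &disp, &ptr);
                sum += *ptr;
            }
            printf("on-node sum = %f\n", sum);
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }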


A prospectus for Elemental

Jack Poulson

Elemental is a distributed-memory library for dense and sparse-direct linear algebra that is rapidly growing in functionality. The talk will begin with an overview of the library's current design, with a focus on the DistMatrix class, and a description of its current functionality (e.g., computing pseudospectra, dense and sparse-direct symmetric factorizations, low-rank updates to factorizations, dense ADMM routines, and a C interface). Afterwards, the library's experimental and planned features will be briefly discussed (e.g., accurate symmetric sparse-direct factorizations, simplex algorithms and interior point methods, support for computation over finite fields, etc.).


BLIS on the Web

Marat Dukhan

The recent trend of moving applications into web browsers has provoked interest in technologies that promise to improve the performance of code running in a web browser. This talk overviews the restrictions imposed on code by web browsers, summarizes the technologies that aim to deliver native performance for web applications, and shares the author's experience in porting BLIS to two web platforms: Portable Native Client (PNaCl) and Emscripten/Asm.js.


A Framework for Practical Fast Matrix Multiplication

Austin Benson

In this work, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm, as well as Strassen's fast algorithm, on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also on their shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, which allows us to rapidly benchmark over 20 fast algorithms. We will discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical.
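
To give a concrete sense of what these fast algorithms look like, the C sketch below (an illustration added here, not the authors' generated code) performs one level of Strassen's algorithm on square matrices of even order, replacing the classical algorithm's eight half-size products with seven; the naive base-case multiply and column-major storage are arbitrary choices for the example.

    #include <stdlib.h>

    /* Classical C += A*B for n x n column-major blocks with leading dimensions. */
    static void gemm_naive(int n, const double *A, int lda,
                           const double *B, int ldb, double *C, int ldc)
    {
        for (int j = 0; j < n; j++)
            for (int p = 0; p < n; p++)
                for (int i = 0; i < n; i++)
                    C[i + j*ldc] += A[i + p*lda] * B[p + j*ldb];
    }

    /* T = X + s*Y for n x n blocks (s = +1 or -1); T is contiguous. */
    static void add(int n, const double *X, int ldx, double s,
                    const double *Y, int ldy, double *T)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                T[i + j*n] = X[i + j*ldx] + s * Y[i + j*ldy];
    }

    /* One level of Strassen: C = A*B for n x n matrices (n even), column-major,
     * leading dimension n.  Seven half-size products replace the usual eight. */
    void strassen_one_level(int n, const double *A, const double *B, double *C)
    {
        int h = n / 2;
        const double *A11 = A,       *A21 = A + h,
                     *A12 = A + h*n, *A22 = A + h + h*n;
        const double *B11 = B,       *B21 = B + h,
                     *B12 = B + h*n, *B22 = B + h + h*n;
        double *C11 = C,             *C21 = C + h,
               *C12 = C + h*n,       *C22 = C + h + h*n;

        double *M = calloc((size_t)7*h*h, sizeof(double));  /* M1..M7 (zeroed) */
        double *S = malloc((size_t)h*h*sizeof(double));     /* left operand    */
        double *T = malloc((size_t)h*h*sizeof(double));     /* right operand   */
        double *Mi[7];
        for (int k = 0; k < 7; k++) Mi[k] = M + (size_t)k*h*h;

        add(h, A11, n, +1, A22, n, S); add(h, B11, n, +1, B22, n, T);
        gemm_naive(h, S, h, T, h, Mi[0], h);                 /* M1 */
        add(h, A21, n, +1, A22, n, S);
        gemm_naive(h, S, h, B11, n, Mi[1], h);               /* M2 */
        add(h, B12, n, -1, B22, n, T);
        gemm_naive(h, A11, n, T, h, Mi[2], h);               /* M3 */
        add(h, B21, n, -1, B11, n, T);
        gemm_naive(h, A22, n, T, h, Mi[3], h);               /* M4 */
        add(h, A11, n, +1, A12, n, S);
        gemm_naive(h, S, h, B22, n, Mi[4], h);               /* M5 */
        add(h, A21, n, -1, A11, n, S); add(h, B11, n, +1, B12, n, T);
        gemm_naive(h, S, h, T, h, Mi[5], h);                 /* M6 */
        add(h, A12, n, -1, A22, n, S); add(h, B21, n, +1, B22, n, T);
        gemm_naive(h, S, h, T, h, Mi[6], h);                 /* M7 */

        /* Combine the seven products into the four quadrants of C. */
        for (int j = 0; j < h; j++)
            for (int i = 0; i < h; i++) {
                size_t k = i + (size_t)j*h;
                C11[i + j*n] = Mi[0][k] + Mi[3][k] - Mi[4][k] + Mi[6][k];
                C12[i + j*n] = Mi[2][k] + Mi[4][k];
                C21[i + j*n] = Mi[1][k] + Mi[3][k];
                C22[i + j*n] = Mi[0][k] - Mi[1][k] + Mi[2][k] + Mi[5][k];
            }

        free(M); free(S); free(T);
    }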

Related paper:
Austin R. Benson and Grey Ballard.
A Framework for Practical Parallel Fast Matrix Multiplication.
arXiv.org.


An Analytical Model for BLIS

Tze Meng Low

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine optimal tuning parameters for high-end instantiations of matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
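
As a rough, first-order illustration of the kind of reasoning involved (a simplification added here; the model in the paper also accounts for details such as cache associativity, replacement, and the micro-kernel footprint), the familiar cache-capacity constraints on the BLIS blocking parameters m_c, n_c, k_c can be written as

    \[
      k_c \, n_r \, s \;\lesssim\; S_{L1}, \qquad
      m_c \, k_c \, s \;\lesssim\; S_{L2}, \qquad
      k_c \, n_c \, s \;\lesssim\; S_{L3},
    \]

where s is the size of a matrix element, n_r is the micro-kernel's register-block width, and S_{L1}, S_{L2}, S_{L3} are the cache capacities: a micro-panel of B should remain in the L1 cache, the packed block of A in the L2 cache, and the packed panel of B in the L3 cache. The analytical model turns such inequalities into precise, architecture-derived parameter values.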

Related paper:
Tze Meng Low, Francisco D. Igual, Tyler Smith, and Enrique Quintana-Ortí.
Analytical Modeling is Enough for High Performance BLIS.
Submitted to ACM TOMS.


BLIS matrix multiplication: from real to complex

Field Van Zee


Integrating DMA capabilities into BLIS for on-chip data movement

Devangi Parikh

Previous work has shown that BLIS (BLAS-like Library Instantiation Software) can be easily ported to the Texas Instruments C66x DSPs. However, the performance of that initial port can be improved by using the DMA capabilities of the DSP architecture to move data efficiently. In this talk, I will discuss the goals and strategy for integrating DMA capabilities into BLIS, and show preliminary results of this integration.
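
A common strategy for this kind of integration is to double-buffer the packed panels so that the DMA engine moves the next panel from DDR into on-chip memory while the cores compute with the current one. The C sketch below is purely illustrative: dma_copy_async(), dma_wait(), and compute_with_panel() are hypothetical placeholders, not BLIS or TI EDMA APIs.

    #include <stddef.h>

    /* Hypothetical handle for an in-flight DMA transfer. */
    typedef struct { int id; } dma_req;

    /* Placeholders for whatever DMA interface the port actually provides. */
    extern dma_req dma_copy_async(void *dst_onchip, const void *src_ddr, size_t bytes);
    extern void    dma_wait(dma_req req);
    extern void    compute_with_panel(const double *panel, size_t elems);

    /* Stream npanels panels of panel_elems doubles from DDR through two
     * on-chip buffers, overlapping each transfer with computation. */
    void pipelined_panels(const double *panels_in_ddr, size_t panel_elems,
                          int npanels, double *buf0, double *buf1)
    {
        double *bufs[2] = { buf0, buf1 };
        dma_req req = dma_copy_async(bufs[0], panels_in_ddr,
                                     panel_elems * sizeof(double));

        for (int p = 0; p < npanels; p++) {
            dma_wait(req);                        /* panel p is now on chip   */
            if (p + 1 < npanels)                  /* start fetching panel p+1 */
                req = dma_copy_async(bufs[(p + 1) % 2],
                                     panels_in_ddr + (size_t)(p + 1) * panel_elems,
                                     panel_elems * sizeof(double));
            compute_with_panel(bufs[p % 2], panel_elems);  /* overlaps the DMA */
        }
    }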


Making LAPACK and libflame live in harmony

Kyungjoo Kim


Managing many threads for multicore and many-core architectures: Level-3 BLAS

Tyler Smith

Related paper:
Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee.
Anatomy of High-Performance Many-Threaded Matrix Multiplication.
Presented at the International Parallel and Distributed Processing Symposium 2014.


Beyond GEMM: How Can We Make Quantum Chemistry Fast?

Devin Matthews


Code Generation with DxT: Improved Prototyping and Development

Bryan Marker


3D-Stacked Logic-in-Memory Hardware For Sparse Matrix Operations

Franz Franchetti

This talk introduces a 3D-stacked logic-in-memory (LiM) system to accelerate the processing of sparse matrix data that is held in a 3D DRAM system. We build a customized content addressable memory (CAM) hardware structure to exploit the inherent sparse data patterns, and we model LiM-based hardware accelerator layers that are stacked in between DRAM dies for efficient sparse matrix operations. Through-silicon vias (TSVs) are used to provide the required high inter-layer bandwidth. Furthermore, we adapt the algorithm and data structure to fully leverage the underlying hardware capabilities, and develop the necessary design framework to facilitate design space evaluation and LiM hardware synthesis. Our simulation demonstrates performance and energy-efficiency improvements of more than two orders of magnitude compared with a traditional multithreaded software implementation on modern processors.

Related paper:
Q. Zhu, H. E. Sumbul, F. Sadi, J. Hoe, L. Pileggi, F. Franchetti.
Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware.
IEEE High Performance Extreme Computing Conference (HPEC), 2013, pages 1-6. (Best Paper Award.)


A practical view on linear algebra tools

Evgeny Epifanovsky

This talk will provide a view of current linear algebra tools from the standpoint of maintaining and developing a large quantum chemistry software package. I will outline the capabilities demanded by modern computational methodologies and attempt to construct a list of requirements for linear algebra tools in terms of both functionality and software design. Quantum chemistry models are naturally formulated in terms of multi-dimensional linear algebra. For many of them a simple conversion to one and two dimensions exists, and the problem can be written as manipulations of vectors and matrices. For the most accurate methods, however, such a conversion is impractical, and tools that are capable of dealing with multi-dimensional objects are required. Using higher-dimensional objects increases the scaling of the computational cost, which contradicts one's physical intuition: the number of interactions should grow linearly with the size of the system. Lowering the scaling of computational methods is an area of active research in quantum chemistry, and there are two common approaches: reducing the amount of data based on a physical model, which introduces sparsity, or decomposing high-dimensional objects into low-dimensional ones. Finally, I will discuss some important software requirements which, if not properly taken into account, can render very promising new tools useless in practice.
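
To make the "conversion to one and two dimensions" concrete, consider an illustrative four-index contraction (an example added here, not one from the talk): grouping index pairs into compound row and column indices turns it into an ordinary matrix-matrix product,

    \[
      C_{ijab} = \sum_{cd} A_{ijcd} \, B_{cdab}
      \quad\Longleftrightarrow\quad
      C_{(ij),(ab)} = \sum_{(cd)} A_{(ij),(cd)} \, B_{(cd),(ab)},
    \]

i.e., C = A B once A, B, and C are viewed as matrices whose rows and columns are the grouped index pairs.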


A Framework for Distributed Tensor Computations

Martin Schatz

Recently, data models have become more complex, leading to the need for multi-dimensional representations that express the data in a more meaningful way. Commonly, tensors are used to represent such data, and multi-linear algebra, the math associated with tensors, has become essential for tackling problems in big data and scientific computing. Up to now, the main approach to problems of multi-linear algebra has been to map the multi-linear algebra to linear algebra and rely on highly efficient linear algebra libraries to perform the equivalent computation. Unfortunately, there are inherent inefficiencies associated with this approach. In this talk, we define a domain-specific language for tensor computations performed on distributed-memory architectures. Additionally, through a process akin to constraint propagation, we show how, using the language, algorithms can be systematically derived, the required collective communications identified, and approximate costs analyzed for a given tensor contraction operation.


Last modified: Sun Sept. 21 09:42:56 CDT 2014