BLIS Retreat 2017
Contributed talks
-
Marat Dukhan, Facebook
Title: Integer GEMM (under) performance
Abstract:
Integer matrix-matrix multiplication, and in particular its 8-bit version, has the potential to accelerate dense linear algebra computations by 4x relative to SGEMM-based versions. In this talk we present benchmarks and modelling results from low-end ARM microarchitectures, and analyse why 8-bit integer GEMM does not yet reveal its full potential on mobile phones.
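For illustration, a minimal reference formulation of 8-bit GEMM (not the talk's ARM kernels; the function name and row-major layout are assumptions): products of int8 operands are accumulated into int32, which is what lets four 8-bit multiply-accumulates occupy the register width of a single fp32 lane.

#include <stdint.h>
#include <stddef.h>

/* Reference 8-bit GEMM: C (int32, m x n) += A (int8, m x k) * B (int8, k x n),
   all row-major. Accumulating into 32-bit integers avoids overflow and is the
   form that packed int8 dot-product instructions accelerate. */
void gemm_s8s8s32(size_t m, size_t n, size_t k,
                  const int8_t *A, size_t lda,
                  const int8_t *B, size_t ldb,
                  int32_t *C, size_t ldc)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            int32_t acc = C[i*ldc + j];
            for (size_t p = 0; p < k; p++)
                acc += (int32_t)A[i*lda + p] * (int32_t)B[p*ldb + j];
            C[i*ldc + j] = acc;
        }
}
-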
Gianluca Frison, University of Freiburg
Title: BLASFEO: basic linear algebra subroutines for embedded optimization
Abstract:
BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization. A key difference with respect to existing high-performance implementations of BLAS is that the computational performance is optimized for small- to medium-scale matrices, i.e., for sizes up to a few hundred. BLASFEO comes with three different implementations: a high-performance implementation aiming at providing the highest performance for matrices fitting in cache, a reference implementation providing portability and embeddability and optimized for very small matrices, and a wrapper to standard BLAS and LAPACK providing high performance on large matrices. The three implementations of BLASFEO together provide high-performance dense linear algebra routines for matrices ranging from very small to large. Compared to both open-source and proprietary highly-tuned BLAS libraries, for matrices of size up to about one hundred the high-performance implementation of BLASFEO is about 20-30% faster than the corresponding level 3 BLAS routines and 2-3 times faster than the corresponding LAPACK routines.
Related paper: https://arxiv.org/abs/1704.02457
-
Koby Hayashi, Wake Forest University
Title: Shared-Memory Parallelization of MTTKRP for Dense Tensors
Abstract:
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4× over existing parallel software.
Related paper: https://arxiv.org/abs/1708.08976
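As one concrete way to see how MTTKRP can be cast onto BLAS without reordering tensor entries (a simplified sketch; the function name, slice-wise layout, and algorithm below are illustrative assumptions, not the talk's implementation): for a third-order I x J x K tensor stored as K contiguous column-major I x J slices, the mode-1 MTTKRP M = X_(1) (C khatri-rao B) accumulates one GEMM and one column scaling per slice.

#include <cblas.h>
#include <stddef.h>

/* Mode-1 MTTKRP for a dense I x J x K tensor X (slice k starts at X + k*I*J),
   with factor matrices B (J x R), C (K x R) and result M (I x R), all
   column-major. Each slice contributes one dgemm plus a rank-wise scaling,
   so the tensor is never permuted in memory. */
void mttkrp_mode1(int I, int J, int K, int R,
                  const double *X, const double *B, const double *C,
                  double *M, double *work /* I x R scratch */)
{
    for (int t = 0; t < I*R; t++) M[t] = 0.0;
    for (int k = 0; k < K; k++) {
        /* work = X(:,:,k) * B   (I x J times J x R) */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    I, R, J, 1.0, X + (size_t)k*I*J, I, B, J, 0.0, work, I);
        /* M(:,r) += C(k,r) * work(:,r) */
        for (int r = 0; r < R; r++)
            cblas_daxpy(I, C[k + (size_t)r*K],
                        work + (size_t)r*I, 1, M + (size_t)r*I, 1);
    }
}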
-
Greg Henry, Intel
Title: Open BLAS G2 Standardization Issues
Related document: https://docs.google.com/document/d/1DY4ImZT1coqri2382GusXgBTTTVdBDvtD5I14QHp9OE/edit?usp=sharing
-
Jianyu Huang, UT-Austin
Title: Strassen's Algorithm for Tensor Contraction
Abstract:
Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen's algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen's algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3x speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.
Related paper: https://arxiv.org/abs/1704.03092
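For reference, the classical one-level Strassen formulas that such implementations build on, applied to 2x2 blockings A = [A11 A12; A21 A22] and B = [B11 B12; B21 B22] of the (matricized) operands:

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

One level replaces eight half-size multiplications with seven, for a theoretical gain of 8/7 (about 1.14x) per level and roughly 1.31x for two levels, consistent with the up to 1.3x speedup quoted above.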
-
Edward Hutter, UIUC
Title: Parallel 3D Cholesky-QR2 for square and rectangular matrices
-
Ilya Kaliman, USC
Title: Leveraging modern supercomputing infrastructure for tensor contractions in large electronic structure calculations
-
Kyungjoo Kim, Sandia
Title: Performance portable line smoother for multiphysics problems using compact batched BLAS
Abstract:
We present a performance-portable implementation of a line smoother that requires solving many block tridiagonal systems. A challenge for performance comes from the small block sizes, e.g., 3, 5, 10, and 15. These block sizes are too small to be handled efficiently even by optimized vendor dense linear algebra libraries. We propose a compact data layout (also called the SIMD layout) that interleaves the elements of a small batch of matrices; the code is then vectorized across the batch. For the block sizes of interest, substantial performance improvements are obtained by vectorizing the code with the compact data layout. Numerical experiments are presented for the Intel Knights Landing and NVIDIA Pascal architectures.
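A minimal sketch of the interleaved idea (the layout and names here are illustrative assumptions, not Sandia's implementation): entry (i,j) of every matrix in a batch of VLEN matrices is stored contiguously, so the innermost loop over the batch is unit-stride and vectorizes even when the matrices themselves are tiny.

#include <stddef.h>

#define VLEN 8  /* matrices interleaved per batch, e.g. one SIMD width */

/* Compact ("SIMD") layout: element (i,j) of matrix v lives at index
   (i*n + j)*VLEN + v, so consecutive v are adjacent in memory.
   Batched C = A*B for VLEN small n x n matrices at once. */
void batched_gemm_compact(int n, const double *A, const double *B, double *C)
{
    for (int t = 0; t < n*n*VLEN; t++)
        C[t] = 0.0;

    for (int i = 0; i < n; i++)
        for (int p = 0; p < n; p++)
            for (int j = 0; j < n; j++)
                /* Unit-stride across the batch: vectorizable even for n = 3, 5, 10, 15. */
                for (int v = 0; v < VLEN; v++)
                    C[(i*n + j)*VLEN + v] +=
                        A[(i*n + p)*VLEN + v] * B[(p*n + j)*VLEN + v];
}
-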
Devin Matthews, UT-Austin
Title: I Don't Care About BLAS
Abstract:
For more than two decades, matrix multiplication--primarily through the BLAS interface--has been the workhorse computational kernel in quantum chemical calculations of electronic structure. However, the overheads of using BLAS for this purpose, not only in terms of the increasing cost of data movement, but also in terms of added complexity, lost optimization opportunities, and reduced algorithmic flexibility, have become onerous. Instead, I show how understanding optimal approaches to matrix multiplication a la the BLIS framework can be used to break through the BLAS barrier and implement efficient new algorithms tailored to quantum chemistry.
-
Maggie Myers, UT-Austin
Title: Overreaching with Outreach: Is HPC in Reach?
Abstract:
This year, we have developed and launched a new Massive Open Online Course (MOOC), LAFF-On Programming for Correctness, while continuing to offer LAFF. Here we share information about these courses, in addition to our future plans. We will also disclose results from our BLIS Survey 2017.
-
Minh Quan HO, Kalray and University of Grenoble Alps
Title: A portable DMA support for Level-3 BLIS on DMA-enabled embedded hardware
Abstract:
BLIS provides a nice way to instantiate BLAS for any platform. As numerical computing evolves, we observe more and more novel and exotic hardware, often DMA-based and with many cores. In this talk, we present a generic and portable DMA extension for level-3 BLIS. This DMA implementation exposes a backend API of six functions, inspired by the well-known asynchronous put/get one-sided communication model. The backend is intended to be implemented by each vendor to exploit its DMA hardware. Once plugged into BLIS, it automatically provides a highly optimized level-3 BLIS in which computation and communication are overlapped via asynchronous DMA transfers.
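The abstract does not spell out the six functions; purely as a hypothetical sketch of what a put/get-style DMA backend could look like (all names and signatures below are invented for illustration and are not the talk's API):

#include <stddef.h>

/* Hypothetical vendor-implemented DMA backend, in the spirit of one-sided
   put/get communication: the framework would start the transfer of the next
   packed panel, compute on the current one, then wait for completion. */
typedef struct dma_request dma_request_t;   /* opaque per-transfer handle */

int  dma_backend_init(void);
void dma_backend_finalize(void);

/* Begin an asynchronous copy between main memory and a local scratchpad;
   both calls return immediately and hand back a request handle. */
int  dma_get_async(void *local_dst, const void *remote_src, size_t bytes,
                   dma_request_t **req);
int  dma_put_async(void *remote_dst, const void *local_src, size_t bytes,
                   dma_request_t **req);

/* Non-blocking completion test and blocking wait for a transfer. */
int  dma_test(dma_request_t *req, int *done);
int  dma_wait(dma_request_t *req);
-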
Devangi Parikh, UT-Austin
Title: Mixing Flavors using BLIS
Abstract:
The BLIS framework was designed to enable mixed-domain and mixed-precision BLAS operations. In this talk, we describe the details of implementing mixed-domain and mixed-precision BLAS operations. In other words, we will show how to mix the various flavors of BLAS operators using the BLIS framework.
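A rough sketch of what mixing flavors can look like through the BLIS object API (the assumption here, not stated in the abstract, is that operands of different domains and precisions are simply created with different datatypes and passed to bli_gemm; rs = cs = 0 requests default storage):

#include "blis.h"

int main(void)
{
    dim_t m = 100, n = 80, k = 60;
    obj_t a, b, c;

    bli_init();

    /* Three different "flavors" in one gemm call. */
    bli_obj_create( BLIS_FLOAT,    m, k, 0, 0, &a );  /* real,    single precision */
    bli_obj_create( BLIS_SCOMPLEX, k, n, 0, 0, &b );  /* complex, single precision */
    bli_obj_create( BLIS_DCOMPLEX, m, n, 0, 0, &c );  /* complex, double precision */

    bli_randm( &a );
    bli_randm( &b );

    /* c := 1*a*b + 0*c, with domain/precision handling done inside the framework. */
    bli_gemm( &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );

    bli_obj_free( &a );
    bli_obj_free( &b );
    bli_obj_free( &c );

    bli_finalize();
    return 0;
}
-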
Tyler Smith, UT-Austin
Title: The MOMMS Family of Algorithms for Classical Matrix-Matrix Multiplication for Hierarchical Memory Architectures
-
Santanu Thangaraj, AMD India
Title: BLIS optimizations for AMD EPYC(TM) Processors
-
Field Van Zee, UT-Austin
Title: Induced methods for complex matrix multiplication
Abstract:
In this talk, we review work spanning two papers pertaining to the implementation of complex domain matrix multiplication. Specifically, this effort sought to uncover matrix multiplication algorithms that do not rely on matrix kernels that encode complex arithmetic, but rather recycle existing real domain kernels (or, more generally, "primitives"). The work has yielded a family of methods, each with its own variants, where each method exhibits different costs, benefits, and performance properties. We focus on the recently-developed 1m method, which circumvents virtually all of the shortcomings observed in a naive implementation of an earlier method (4m) based on the classic definition of complex multiplication and addition.
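To make the starting point concrete, the classic definition (a+bi)(c+di) = (ac-bd) + (ad+bc)i applied at the whole-matrix level turns one complex GEMM into four real GEMMs. The sketch below shows only this arithmetic identity behind 4m, not the blocked 4m or 1m implementations discussed in the talk; the function name is invented.

#include <cblas.h>

/* C = A*B for complex matrices stored as separate real/imaginary parts,
   column-major: A is m x k, B is k x n, C is m x n.
   Cr = Ar*Br - Ai*Bi,  Ci = Ar*Bi + Ai*Br  -- four real gemm calls. */
void zgemm_via_four_real_gemms(int m, int n, int k,
                               const double *Ar, const double *Ai,
                               const double *Br, const double *Bi,
                               double *Cr, double *Ci)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                 1.0, Ar, m, Br, k, 0.0, Cr, m);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                -1.0, Ai, m, Bi, k, 1.0, Cr, m);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                 1.0, Ar, m, Bi, k, 0.0, Ci, m);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                 1.0, Ai, m, Br, k, 1.0, Ci, m);
}
-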
Richard Veras, CMU
Title: Leveraging Dense Linear Algebra Performance for Graphs, Stencils and Sparse Linear Algebra
Abstract:
In this talk, we discuss the automatic generation of expert-level high-performance scientific codes for Dense Linear Algebra (DLA), Structured Mesh (Stencil), Sparse Linear Algebra and Graph Analytics. In particular, this talk seeks to address the issue of obtaining performance on many complex platforms for a certain class of matrix-like operations that span many scientific, engineering and social fields. We do this by automating a method used for obtaining high performance in DLA and extending it to structured, sparse and scale-free domains. We argue that it is the underlying structure found in the data from these domains that enables this process. Thus, obtaining performance for most operations does not occur in isolation from the data being operated on, but instead depends significantly on the structure of the data.
-
Chenhan Yu, UT-Austin
Title: High Performance Computing Kernels in N-body Problems
Last modified: Sat Aug 26 13:26:57 CDT 2017