Section 5.2 Talks
Subsection 5.2.1 Nikoli Dryden, "A Distributed Multilinear Algebra Library for Deep Learning"
Lawrence Livermore National Laboratory
Access to the recording is available upon request; send email to rvdg@cs.utexas.edu.
Subsection 5.2.2 Carl Kwan, "The Cholesky Factorization Theorem in ACL2"
UT Austin
Subsection 5.2.3 Joe Dobson, "Strategy Selection in the Arm Performance Libraries"
Arm
Subsection 5.2.4 Elliott Binder, "FAST Attention for Small Tensors"
Carnegie Mellon University
Abstract:
Matrix multiplication (MM) is a primary building block of the attention layers found in transformer language models. Because these MMs are typically small in one or two dimensions, the operations are considered memory bound on many of today's architectures. Yet MM libraries rarely achieve this bound, particularly when leveraging low-precision data types. By identifying inefficiencies in state-of-the-art approaches to matrix multiplication, we redesign our approach to these MMs to reduce unnecessary data movement, improve bandwidth efficiency, and provide greater memory-level parallelism. We show that this approach can achieve [some improvement] in memory performance compared to vendor libraries, translating to [some improvement] in end-to-end inference time.
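As a rough roofline-style illustration of why such shapes are memory bound (this is a standard estimate, not taken from the talk): for \(C := A B\) with \(A \in \mathbb{R}^{m \times k}\) and \(B \in \mathbb{R}^{k \times n}\), the multiplication performs about \(2 m n k\) flops while reading and writing at least \(m k + k n + m n\) matrix elements. If, say, \(n \ll m\) and \(n \ll k\), then
\begin{equation*}
\frac{2 m n k}{m k + k n + m n} \approx \frac{2 m n k}{m k} = 2 n ,
\end{equation*}
so the arithmetic intensity is capped at roughly twice the small dimension (in flops per element), and the achievable flop rate is limited by memory bandwidth rather than by compute throughput.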
Subsection 5.2.5 Upasana Sridhar, "Layer fusion with composable abstraction"
Carnegie Mellon University
Abstract:
Layer fusion, i.e., fusing multiple layers of a deep neural network (DNN) into a single layer, is often performed to reduce memory usage and improve performance. However, implementations of fused layers can be limited, particularly in frameworks that rely on libraries built on expert-written kernels. This work seeks to ease the burden of fusion by presenting a general template for fusing operations without requiring expert-written fused kernels. Using this template, many pre-existing types of fusion between two operations can be implemented systematically. Furthermore, we show, using the SMaLL framework, that fusion can yield performance and memory benefits for operations previously considered difficult to fuse.
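To make the idea concrete, here is a minimal, generic sketch in C of fusing two elementwise operations; the operations, names, and types are illustrative assumptions and are not the SMaLL API or the kernels discussed in the talk.

#include <stddef.h>

/* Unfused: two passes over memory and an intermediate buffer t. */
void affine(const float *x, float *t, float scale, float bias, size_t n) {
    for (size_t i = 0; i < n; i++)
        t[i] = scale * x[i] + bias;
}

void relu(const float *t, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = t[i] > 0.0f ? t[i] : 0.0f;
}

/* Fused: one pass over memory and no intermediate buffer; the second
   operation is applied while each element is still in a register. */
void affine_relu_fused(const float *x, float *y, float scale, float bias, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float t = scale * x[i] + bias;
        y[i] = t > 0.0f ? t : 0.0f;
    }
}

For a tensor of n elements, the fused version moves roughly 2n floats through memory instead of 4n, which is the kind of saving a fusion template aims to deliver systematically rather than kernel by kernel.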
Subsection 5.2.6 Cem Bassoy, "Fast and layout-oblivious tensor-matrix multiplication with BLAS"
Technical University of Hamburg
Subsection 5.2.7 Jim Demmel, "How to grade the accuracy of an implementation of the BLAS; Short update on Exception Handling"
University of California, Berkeley
Subsection 5.2.8 Grace Dinh, "Cost Estimation and Bounds for Sparse Kernels"
Cornell University
Subsection 5.2.9 Evarist Fomenko, "NVPL BLAS Architecture and Implementation Overview"
Nvidia
Subsection 5.2.10 Thijs Steel, "Communication efficient application of sequences of rotations to a matrix"
KU Leuven
Abstract:
Applying a sequence of rotations to a matrix is an important component of several linear algebra algorithms for eigenvalue problems. I will present a new algorithm that focuses on minimizing the cost of the memory operations involved and show that it achieves a flop rate close to the theoretical peak on modern hardware.
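For context, a minimal sketch in C (not the speaker's algorithm) of applying a single Givens rotation, with cosine c and sine s, to two adjacent rows of a row-major matrix. Applying a sequence of such rotations one at a time streams the affected rows from memory once per rotation; reducing that repeated traffic is the cost a communication-efficient scheme targets.

#include <stddef.h>

/* Apply the rotation [ c  s; -s  c ] to rows i and i+1 of an
   n-column row-major matrix A with leading dimension lda. */
void apply_givens_rows(double *A, size_t n, size_t lda, size_t i,
                       double c, double s) {
    double *top = A + i * lda;        /* row i   */
    double *bot = A + (i + 1) * lda;  /* row i+1 */
    for (size_t j = 0; j < n; j++) {
        double t = c * top[j] + s * bot[j];
        bot[j]   = c * bot[j] - s * top[j];
        top[j]   = t;
    }
}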
Subsection 5.2.11 Bhaskar Nallani, "LPGEMM Enhancements in AOCL BLAS"
AMD
Subsection 5.2.12 Arnav Sharma, "BLAS Extension APIs"
AMD
Subsection 5.2.13 Eleni Vlachopoulou, "CMake Build System in AOCL BLAS"
AMD
Subsection 5.2.14 Sridhar Govindaswamy, "Close coupling of AOCL BLAS in AOCL LAPACK"
AMD
Subsection 5.2.15 Stepan Nassyr, "Simulating Parameterized Kernels on Parameterized Architectures"
Juelich Supercomputing Center
Subsection 5.2.16 Devin Matthews, "The state of BLIS 1.0 and 2.0"
Southern Methodist University
Related SIAM Article (Sept. 2024):
Subsection 5.2.17 Devin Matthews and Robert van de Geijn, "Vertical integration of the linear and multilinear software stack"
Southern Methodist University and UT Austin
Subsection 5.2.18 Devangi Parikh and Greg Henry, "Accuracy study of Cascading GEMM"
UT Austin and Intel
Subsection 5.2.19 Chao Yin, "LTL^T Decomposition of a Skew-Symmetric Matrix - High Performance Implementation"
Southern Methodist University
Subsection 5.2.20 Ishna Satyarth, "LTL^T Decomposition of a Skew-Symmetric Matrix - Derivation"
Southern Methodist University