
Section 5.2 Talks

Subsection 5.2.1 Nikoli Dryden, "A Distributed Multilinear Algebra Library for Deep Learning"

Lawrence Livermore National Laboratory

Access to the recording is available upon request; send email to rvdg@cs.utexas.edu

Subsection 5.2.2 Carl Kwan, "The Cholesky Factorization Theorem in ACL2"

UT Austin

Subsection 5.2.3 Joe Dobson, "Strategy Selection in the Arm Performance Libraries"

Arm

Subsection 5.2.4 Elliott Binder, "FAST Attention for Small Tensors"

Carnegie Mellon University
Abstract:
Matrix multiplication (MM) is a primary building block of the attention layers found in transformer language models. Because these MMs are typically small in one or two dimensions, they are considered memory bound on many of today's architectures. Yet MM libraries rarely achieve this bound, particularly when leveraging low-precision data types. By identifying inefficiencies in state-of-the-art approaches to matrix multiplication, we redesign these MMs to reduce unnecessary data movement, improve bandwidth efficiency, and provide greater memory-level parallelism. We show that this approach can achieve [some improvement] in memory performance compared to vendor libraries, translating to [some improvement] in end-to-end inference time.
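To make the memory-bound claim concrete, here is a back-of-the-envelope roofline estimate in Python for a small GEMM. The machine numbers and shapes are illustrative assumptions, not figures from the talk.

# Roofline-style estimate for a small GEMM C (m x n) += A (m x k) @ B (k x n).
# Machine numbers below are illustrative assumptions, not from the talk.

def gemm_intensity(m, n, k, bytes_per_elem):
    """Arithmetic intensity (flops per byte), assuming each matrix moves
    through memory exactly once (the memory-bound ideal)."""
    flops = 2.0 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + 2 * m * n)  # read A, B; read+write C
    return flops / bytes_moved

# Hypothetical machine: 100 GB/s memory bandwidth, 1 TFLOP/s low-precision peak.
BW, PEAK = 100e9, 1e12
ridge = PEAK / BW  # intensity above which a kernel stops being memory bound

# Attention-like shape: one dimension small (m = 8), fp16 elements (2 bytes).
ai = gemm_intensity(m=8, n=4096, k=4096, bytes_per_elem=2)
print(f"intensity = {ai:.2f} flop/byte, ridge = {ridge:.1f} -> "
      f"{'memory' if ai < ridge else 'compute'} bound")

With these numbers the intensity comes out near 8 flop/byte, below the ridge of 10, which is why such shapes are bandwidth limited rather than compute limited.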

Subsection 5.2.5 Upasana Sridhar, "Layer fusion with composable abstraction"

Carnegie Mellon University
Abstract:
Layer fusion, i.e., fusing multiple layers of a deep neural network (DNN) into a single layer, is often performed to reduce memory usage and improve performance. However, implementations of fused layers can be limited, particularly in frameworks that rely on libraries built from expert-written kernels. This work seeks to ease the burden of fusion by presenting a general template for fusing operations without requiring expert-written fused kernels. Using this template, many pre-existing types of fusion between two operations can be implemented systematically. Furthermore, we show, using the SMaLL framework, that fusion can yield performance and memory benefits for operations previously considered difficult to fuse.
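As a minimal illustration of the idea in Python (not the SMaLL framework's actual template or API), the sketch below fuses two layers by computing each tile of the first layer's output and consuming it immediately in the second, so the full intermediate tensor is never materialized.

import numpy as np

def layer1(x_tile, w):          # first layer: dense matmul on a row tile
    return x_tile @ w

def layer2(y_tile):             # second layer: elementwise ReLU
    return np.maximum(y_tile, 0.0)

def fused(x, w, tile=32):
    """Fuse layer1 and layer2 over row tiles of x."""
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for i in range(0, x.shape[0], tile):
        t = layer1(x[i:i + tile], w)   # tile-sized intermediate only
        out[i:i + tile] = layer2(t)    # consumed while still in cache
    return out

x = np.random.rand(128, 64).astype(np.float32)
w = np.random.rand(64, 96).astype(np.float32)
assert np.allclose(fused(x, w), np.maximum(x @ w, 0.0))

The point of the template is that layer1 and layer2 remain unfused, unmodified kernels; only the tiled driver loop changes.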

Subsection 5.2.6 Cem Bassoy, "Fast and layout-oblivious tensor-matrix multiplication with BLAS"

Technical University of Hamburg

Subsection 5.2.7 Jim Demmel, "How to grade the accuracy of an implementation of the BLAS; Short update on Exception Handling"

University of California, Berkeley

Subsection 5.2.8 Grace Dinh, "Cost Estimation and Bounds for Sparse Kernels"

Cornell University

Subsection 5.2.9 Evarist Fomenko, "NVPL BLAS Architecture and Implementation Overview"

Nvidia

Subsection 5.2.10 Thijs Steel, "Communication efficient application of sequences of rotations to a matrix"

KU Leuven
Abstract:
Applying a sequence of rotations to a matrix is an important component of several linear algebra algorithms for eigenvalue problems. I will present a new algorithm that focuses on minimizing the cost of the memory operations involved and show that it achieves a flop rate close to the theoretical peak on modern hardware.
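For reference, the textbook baseline in Python applies Givens rotations one at a time, streaming two full rows of the matrix through memory per rotation; this is the memory traffic the talk's algorithm aims to reduce, not the new algorithm itself.

import numpy as np

def apply_rotations(A, rotations):
    """Apply Givens rotations in sequence; rotations is a list of
    (i, j, c, s) with c = cos(theta), s = sin(theta). Each rotation
    reads and writes rows i and j of A, so p rotations on an n-column
    matrix move O(p * n) data through memory."""
    for i, j, c, s in rotations:
        ri, rj = A[i].copy(), A[j]
        A[i] = c * ri + s * rj
        A[j] = -s * ri + c * rj
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
theta = 0.3
rots = [(k, k + 1, np.cos(theta), np.sin(theta)) for k in range(5)]
apply_rotations(A, rots)  # updates A in place, row pair by row pair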

Subsection 5.2.11 Bhaskar Nallani, "LPGEMM Enhancements in AOCL BLAS"

AMD

Subsection 5.2.12 Arnav Sharma, "BLAS Extension APIs"

AMD

Subsection 5.2.13 Eleni Vlachopoulou, "CMake Build System in AOCL BLAS"

AMD

Subsection 5.2.14 Sridhar Govindaswamy, "Close coupling of AOCL BLAS in AOCL LAPACK"

AMD

Subsection 5.2.15 Stepan Nassyr, "Simulating Parameterized Kernels on Parameterized Architectures"

Juelich Supercomputing Center

Subsection 5.2.16 Devin Matthews, "The state of BLIS 1.0 and 2.0"

Southern Methodist University
Related SIAM Article (Sept. 2024)

Subsection 5.2.17 Devin Matthews and Robert van de Geijn, "Vertical integration of the linear and multilinear software stack"

Southern Methodist University and UT Austin

Subsection 5.2.18 Devangi Parikh and Greg Henry, "Accuracy study of Cascading GEMM"

UT Austin and Intel

Subsection 5.2.19 Chao Yin, "LTL^T Decomposition of a Skew-Symmetric Matrix - High Performance Implementation"

Southern Methodist University

Subsection 5.2.20 Ishna Satyarth, "LTL^T Decomposition of a Skew-Symmetric Matrix - Derivation"

Southern Methodist University