Section 5.2 Talks
Subsection 5.2.1 Nikoli Dryden, "A Distributed Multilinear Algebra Library for Deep Learning"
Lawrence Livermore National Laboratory
Access to the recording is available upon request; send email to rvdg@cs.utexas.edu.
Subsection 5.2.2 Carl Kwan, "The Cholesky Factorization Theorem in ACL2"
UT Austin
Subsection 5.2.3 Joe Dobson, "Strategy Selection in the Arm Performance Libraries"
Arm
Subsection 5.2.4 Elliott Binder, "FAST Attention for Small Tensors"
Carnegie Mellon University
Abstract:
Matrix multiplication (MM) is a primary building block of the attention layers found in transformer language models. Because these MMs are typically small in one or two dimensions, the operations are considered memory bound on many of today's architectures. Yet MM libraries rarely achieve this bound, particularly when leveraging low-precision data types. By identifying inefficiencies in state-of-the-art approaches to matrix multiplication, we redesign our approach to these MMs to reduce unnecessary data movement, improve bandwidth efficiency, and provide greater memory-level parallelism. We show that this approach can achieve [some improvement] in memory performance compared to vendor libraries, translating to [some improvement] in end-to-end inference time.
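As a rough roofline-style illustration of why such shapes are memory bound (this is a standard estimate, not taken from the talk): for \(C := A B\) with \(A \in \mathbb{R}^{m \times k}\) and \(B \in \mathbb{R}^{k \times n}\), the multiplication performs about \(2 m n k\) flops while reading and writing at least \(m k + k n + m n\) matrix elements. If, say, \(n \ll m\) and \(n \ll k\), then
\begin{equation*}
\frac{2 m n k}{m k + k n + m n} \approx \frac{2 m n k}{m k} = 2 n ,
\end{equation*}
so the arithmetic intensity is capped at roughly twice the small dimension (in flops per element), and the achievable flop rate is limited by memory bandwidth rather than by compute throughput.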
Subsection 5.2.5 Upasana Sridhar, "Layer fusion with composable abstraction"
Carnegie Mellon University
Abstract:
Layer fusion, i.e., fusing multiple layers of a deep neural network (DNN) into a single layer, is often performed to reduce memory usage and improve performance. However, implementations of fused layers can be limited, particularly in frameworks that rely on libraries built on expert-written kernels. This work seeks to ease the burden of fusion by presenting a general template for fusing operations without requiring expert-written fused kernels. Using this template, many pre-existing types of fusion between two operations can be implemented systematically. Furthermore, we show, using the SMaLL framework, that fusion can yield performance and memory benefits for operations previously considered difficult to fuse.
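To make the idea concrete, here is a minimal, generic sketch in C of fusing two elementwise operations; the operations, names, and types are illustrative assumptions and are not the SMaLL API or the kernels discussed in the talk.

#include <stddef.h>

/* Unfused: two passes over memory and an intermediate buffer t. */
void affine(const float *x, float *t, float scale, float bias, size_t n) {
    for (size_t i = 0; i < n; i++)
        t[i] = scale * x[i] + bias;
}

void relu(const float *t, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = t[i] > 0.0f ? t[i] : 0.0f;
}

/* Fused: one pass over memory and no intermediate buffer; the second
   operation is applied while each element is still in a register. */
void affine_relu_fused(const float *x, float *y, float scale, float bias, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float t = scale * x[i] + bias;
        y[i] = t > 0.0f ? t : 0.0f;
    }
}

For a tensor of n elements, the fused version moves roughly 2n floats through memory instead of 4n, which is the kind of saving a fusion template aims to deliver systematically rather than kernel by kernel.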
Subsection 5.2.6 Cem Bassoy, "Fast and layout-oblivious tensor-matrix multiplication with BLAS"
Technical University of Hamburg
Subsection 5.2.7 Jim Demmel, "How to grade the accuracy of an implementation of the BLAS; Short update on Exception Handling"
University of California, Berkeley
Subsection 5.2.8 Grace Dinh, "Cost Estimation and Bounds for Sparse Kernels"
Cornell University
Subsection 5.2.9 Evarist Fomenko, "NVPL BLAS Architecture and Implementation Overview"
Nvidia
Subsection 5.2.10 Thijs Steel, "Communication efficient application of sequences of rotations to a matrix"
KU Leuven
Abstract:
Applying a sequence of rotations to a matrix is an important component of several linear algebra algorithms for eigenvalue problems. I will present a new algorithm that focuses on minimizing the cost of the memory operations involved and show that it achieves a flop rate close to the theoretical peak on modern hardware.
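For context, a minimal sketch in C (not the speaker's algorithm) of applying a single Givens rotation, with cosine c and sine s, to two adjacent rows of a row-major matrix. Applying a sequence of such rotations one at a time streams the affected rows from memory once per rotation; reducing that repeated traffic is the cost a communication-efficient scheme targets.

#include <stddef.h>

/* Apply the rotation [ c  s; -s  c ] to rows i and i+1 of an
   n-column row-major matrix A with leading dimension lda. */
void apply_givens_rows(double *A, size_t n, size_t lda, size_t i,
                       double c, double s) {
    double *top = A + i * lda;        /* row i   */
    double *bot = A + (i + 1) * lda;  /* row i+1 */
    for (size_t j = 0; j < n; j++) {
        double t = c * top[j] + s * bot[j];
        bot[j]   = c * bot[j] - s * top[j];
        top[j]   = t;
    }
}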
Subsection 5.2.11 Bhaskar Nallani, "LPGEMM Enhancements in AOCL BLAS"
AMD
Subsection 5.2.12 Arnav Sharma, "BLAS Extension APIs"
AMD
Subsection 5.2.13 Eleni Vlachopoulou, "CMake Build System in AOCL BLAS"
AMD
Subsection 5.2.14 Sridhar Govindaswamy, "Close coupling of AOCL BLAS in AOCL LAPACK"
AMD
Subsection 5.2.15 Stepan Nassyr, "Simulating Parameterized Kernels on Parameterized Architectures"
Juelich Supercomputing Center
Subsection 5.2.16 Devin Matthews, "The state of BLIS 1.0 and 2.0"
Southern Methodist University
Related SIAM Article (Sept. 2024):
Subsection 5.2.17 Devin Matthews and Robert van de Geijn, "Vertical integration of the linear and multilinear software stack"
Southern Methodist University and UT Austin
Subsection 5.2.18 Devangi Parikh and Greg Henry, "Accuracy study of Cascading GEMM"
UT Austin and Intel
Subsection 5.2.19 Chao Yin, "LTL^T Decomposition of a Skew-Symmetric Matrix - High Performance Implementation"
Southern Methodist University
Subsection 5.2.20 Ishna Satyarth, "LTL^T Decomposition of a Skew-Symmetric Matrix - Derivation"
Southern Methodist University