BLIS Retreat 2018
Contributed talks
-
Joe Dobson, ARM
Title: Investigations into GEMM on the Cavium Thunder X2 -
Marat Dukhan, Facebook
Title: CPU Information
Abstract: Information about processor characteristics, such as supported instruction sets, the types and number of cores, and the cache hierarchy, is crucial for high-performance dense linear algebra and other compute-intensive primitives. While several libraries for detecting CPU characteristics are available as open-source software, they offer very limited support on mobile platforms. In this talk we present the unique challenges of detecting processor characteristics on mobile platforms and give an overview of the cpuinfo library (https://github.com/pytorch/cpuinfo), which provides a cross-platform implementation for detecting these crucial characteristics.
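As an illustration of the kind of query the talk describes, the sketch below uses the cpuinfo C API roughly as shown in the project's README; the accessor names are recalled from that documentation and should be treated as assumptions to verify against the installed header.

    #include <cstdio>
    #include <cpuinfo.h>  // https://github.com/pytorch/cpuinfo

    int main() {
        // Detection must be initialized once before any query.
        if (!cpuinfo_initialize()) {
            std::fprintf(stderr, "cpuinfo initialization failed\n");
            return 1;
        }
        std::printf("Package: %s\n", cpuinfo_get_package(0)->name);
        std::printf("Cores:   %u\n", cpuinfo_get_cores_count());
        std::printf("AVX2:    %d\n", (int) cpuinfo_has_x86_avx2());  // false on non-x86
        std::printf("NEON:    %d\n", (int) cpuinfo_has_arm_neon());  // false on non-ARM
        cpuinfo_deinitialize();
        return 0;
    }
-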
Jane Herriman (Caltech and Julia Computing)
Title: Making Julia more inclusive and accessible
Abstract: In this talk, I'll discuss our efforts to diversify Julia's user base by working to make the language and ecosystem more inclusive of and accessible to potential new users. In particular, I'll share past and upcoming efforts to develop and distribute accessible materials to learn Julia, and why I'd love to see Julia used as a tool in the Massive Open Online Courses Linear Algebra: Foundations to Frontiers (LAFF) and LAFF-on Programming for High Performance. -
Jane Herriman (Caltech and Julia Computing) and Sacha Verweij (Stanford)
Title: Julia, BLIS, and Pedagogy -
Koby Hayashi
Title: Parallel Nonnegative CP Decomposition of Dense Tensors
Abstract: The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensor data that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the matricized-tensor times Khatri-Rao product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.
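To make the MTTKRP bottleneck concrete, the sketch below is a minimal sequential version for a dense 3-way tensor; the function name, data layout, and types are illustrative assumptions rather than the authors' distributed implementation.

    #include <cstddef>
    #include <vector>

    // MTTKRP for a dense 3-way tensor X of size I x J x K (k varies fastest)
    // with respect to mode 0.  Factor matrices B (J x R) and C (K x R) are
    // row-major; the result M is I x R with
    //   M(i,r) = sum over j,k of X(i,j,k) * B(j,r) * C(k,r).
    std::vector<double> mttkrp_mode0(const std::vector<double>& X,
                                     const std::vector<double>& B,
                                     const std::vector<double>& C,
                                     std::size_t I, std::size_t J,
                                     std::size_t K, std::size_t R) {
        std::vector<double> M(I * R, 0.0);
        for (std::size_t i = 0; i < I; ++i)
            for (std::size_t j = 0; j < J; ++j)
                for (std::size_t k = 0; k < K; ++k) {
                    const double x = X[(i * J + j) * K + k];
                    for (std::size_t r = 0; r < R; ++r)
                        M[i * R + r] += x * B[j * R + r] * C[k * R + r];
                }
        return M;
    }

The distributed algorithm in the talk parallelizes exactly this computation across a multidimensional processor grid and reuses partial results across modes via dimension trees.
-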
Daya Khudia (Facebook)
Title: Low-Precision Inference Friendly GEMM and Convolution Library: Interface and Implementation
Abstract:
At Facebook, inference using deep learning models is the biggest consumer of floating point operations in data centers. Therefore, optimizing inference performance directly results in higher throughput, power savings, and/or lower compute resource usage. In this talk, I will present a new interface that allows us to use pre-packed matrices, avoids internal memory allocations, and allows fusion of post-GEMM operations such as non-linearities, bias addition, and requantization. The weight matrices are constant during inference, so we can pre-pack them for an optimized GEMM implementation. The implementation dynamically generates efficient matrix-shape-specific vectorized code. The interface is specifically designed to support optimized quantized inference and fusion of post-GEMM operations. The flexible interface is implemented with the help of C++ templates, which allows the use of different packing methods and the construction of a pipeline of post-GEMM operations on the output matrix. Internal micro/macro GEMM kernels still use the BLIS approach to achieve high performance.
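The interface ideas above (pre-packing the constant weight matrix once and fusing post-GEMM operations through C++ templates) can be sketched roughly as follows; all names and types here are hypothetical illustrations, not the actual FBGEMM API.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical pre-packed weight matrix: in a real library this would hold a
    // blocked/reordered copy of the int8 weights laid out for the microkernel.
    struct PackedWeights {
        std::vector<std::int8_t> data;
        int k, n;
    };

    PackedWeights packWeights(const std::int8_t* W, int k, int n) {
        return PackedWeights{std::vector<std::int8_t>(W, W + std::size_t(k) * n), k, n};
    }

    // One post-GEMM stage: requantize the int32 accumulator, add bias, apply ReLU.
    struct RequantizeBiasRelu {
        float scale;
        const std::int32_t* bias;
        std::uint8_t operator()(std::int32_t acc, int col) const {
            float v = (float)(acc + bias[col]) * scale;
            if (v < 0.f)   v = 0.f;     // fused ReLU
            if (v > 255.f) v = 255.f;   // saturate to uint8
            return (std::uint8_t)(v + 0.5f);
        }
    };

    // GEMM with a fused output stage; the naive loops stand in for the
    // BLIS-style packed micro/macro kernels mentioned in the abstract.
    template <typename OutputOp>
    void gemmWithFusedOutput(const std::uint8_t* A, const PackedWeights& Wp,
                             std::uint8_t* out, int m, const OutputOp& op) {
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < Wp.n; ++j) {
                std::int32_t acc = 0;
                for (int p = 0; p < Wp.k; ++p)
                    acc += (std::int32_t)A[i * Wp.k + p]
                         * (std::int32_t)Wp.data[(std::size_t)p * Wp.n + j];
                out[i * Wp.n + j] = op(acc, j);  // post-GEMM pipeline fused here
            }
    }
-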
Tze Meng Low, CMU
Title: Extending the BLIS analytical model to GPUs
-
Devin Matthews, SMU
Title: Lots of things look like matrices -
Mesut Meterelliyoz, Intel
Title: Intel MKL Vectorized Compact Routines -
Arthur Araujo Mitrano, Intel
Title: Intel MKL small matrix multiplication optimizations using JIT compilation -
Maggie Myers, UT-Austin
Title: Sharing a BLISful State
Abstract:
Getting the word out. What? Why? Who? Where? How? You will be invited to give your input. -
Pradeep Rao, AMD
Title: Analysis of BLIS Multithreaded GEMM and HPC workloads
Topics:
- BLIS DGEMM Multi-threaded performance analysis on the AMD EPYC(TM) processor
- BLIS overheads in HPC workloads
-
Martin Schatz, Facebook
Title: Notation for distributed-memory parallel tensor computations -
Paul Springer, NVIDIA
Title: Multi-GPU GEMM: A cache-based approach -
Field Van Zee, UT-Austin
Title: Looking back on another year of progress
Title: Supporting Mixed Domain/Precision in BLIS -
Richard Veras, LSU
Title: Flyte-MM: A Software Based Sub-Floating Point Precision GEMM
Abstract:
In many emerging applications, such as machine learning and graph analytics, linear algebra provides a critical performance building block. These building blocks allow developers to offload computation to preexisting tuned BLAS libraries. However, these libraries are developed for full floating point operations, whereas these domains typically need only a small amount of precision but a high dynamic range.
Thus the surplus precision provided by the native floating point types is wasted space and performance. To this end, we design a GEMM implementation around a sub-precision derived type called Flytes. These data types maintain the full exponent of the IEEE floating point type, providing the necessary dynamic range, but truncate the mantissa according to the precision requirements of the application, resulting in a more compact format. This can reduce the storage requirement to between three quarters and one eighth of that of the base type.
Existing BLAS libraries do not support the implementation of custom types, and current-generation compilers fall short of achieving BLAS-level performance. However, the BLIS framework allows developers to peel away the layers of the BLAS and modify the core operations, namely the data packing routines and the computational kernels, while still maintaining the same high-performance algorithms. In this work we demonstrate the various ways in which Flytes can be integrated into the BLIS framework while still achieving the level of performance seen in expert implementations of the BLAS for native types.
This work is a collaboration between Louisiana State University and Trinity College Dublin.
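As a rough illustration of the truncated-mantissa idea, the sketch below packs an IEEE binary32 value into three bytes by keeping the sign, the full 8-bit exponent, and the top 15 mantissa bits; the type name and layout are assumptions, not the actual Flyte format.

    #include <cstdint>
    #include <cstring>

    // Hypothetical 3-byte "flyte": sign + 8-bit exponent + top 15 mantissa bits,
    // i.e. three quarters of the storage of a 4-byte float.
    struct Flyte24 { std::uint8_t bytes[3]; };

    Flyte24 to_flyte24(float x) {
        std::uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        bits >>= 8;                       // drop the 8 least-significant mantissa bits
        return Flyte24{{ std::uint8_t(bits),
                         std::uint8_t(bits >> 8),
                         std::uint8_t(bits >> 16) }};
    }

    float from_flyte24(Flyte24 f) {
        std::uint32_t bits = (std::uint32_t(f.bytes[2]) << 16)
                           | (std::uint32_t(f.bytes[1]) << 8)
                           |  std::uint32_t(f.bytes[0]);
        bits <<= 8;                       // restore bit positions; dropped bits become zero
        float x;
        std::memcpy(&x, &bits, sizeof(x));
        return x;
    }

In a BLIS-based implementation, the packing routines could expand such a format back to native floats while packing, so the framework's high-performance blocking structure is reused unchanged.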
-
Chenhan Yu, UT-Austin/NVIDIA
Title: Strassen's Algorithm Reloaded for GPUs
Collaborative work with Jianyu Huang and Robert van de Geijn.