References
[1]
Ed Anderson, Zhaojun Bai, James Demmel, Jack J. Dongarra, Jeremy Du Croz, Ann Greenbaum, Sven Hammarling, Alan E. McKenney, Susan Ostrouchov, and Danny Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[2]
Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel, Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology, International Conference on Supercomputing, July 1997.
[3]
BLAS-like Library Instantiation Software Framework, GitHub repository.
[4]
Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn, Collective communication: theory, practice, and experience, Concurrency and Computation: Practice and Experience, Volume 19, Number 13, 2007.
If you don't have access, you may want to read the advanced draft: Ernie Chan, Marcel Heimlich, Avijit Purkayastha, and Robert van de Geijn, Collective Communication: Theory, Practice, and Experience, FLAME Working Note #22, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-06-44, September 26, 2006.
[5]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 1-17, March 1990.
[6]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson, An Extended Set of FORTRAN Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 14, No. 1, pp. 1-17, March 1988.
[7]
Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, PA, 1991.
[8]
Victor Eijkhout, Introduction to High-Performance Scientific Computing, lulu.com.
[9]
Kazushige Goto and Robert van de Geijn, On reducing TLB misses in matrix multiplication, Technical Report TR02-55, Department of Computer Sciences, UT-Austin, Nov. 2002.
[10]
Kazushige Goto and Robert van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3: Article 12, May 2008.
[11]
Kazushige Goto and Robert van de Geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, Vol. 35, No. 1: Article 4, July 2008.
[12]
Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn, Generating Families of Practical Fast Matrix Multiplication Algorithms, in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.
[13]
Jianyu Huang, Chenhan D. Yu, and Robert A. van de Geijn, Strassen’s Algorithm Reloaded on GPUs, ACM Transactions on Mathematical Software, in review.
[14]
Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn, Strassen's Algorithm Reloaded, International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), 2016.
[15]
Bo Kågström, Per Ling, and Charles Van Loan, GEMM-based Level 3 BLAS: High Performance Model Implementations and Performance Evaluation Benchmark, ACM Transactions on Mathematical Software, Vol. 24, No. 3: pp. 268-302, 1998.
[16]
Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, CUDA Templates for Linear Algebra Subroutines (CUTLASS), NVIDIA Developer Blog, 2017.
https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/
[17]
C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, Vol. 5, No. 3, pp. 308-323, Sept. 1979.
[18]
Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Orti, Analytical Modeling Is Enough for High-Performance BLIS, ACM Transactions on Mathematical Software, Vol. 43, No. 2, Aug. 2016.
[19]
Margaret E. Myers and Robert A. van de Geijn, Advanced Linear Algebra: Foundations to Frontiers, ulaff.net, in preparation.
[20]
Margaret E. Myers and Robert A. van de Geijn, LAFF-On Programming for Correctness, ulaff.net, 2017.
[21]
Margaret E. Myers and Robert A. van de Geijn, Linear Algebra: Foundations to Frontiers - Notes to LAFF With, ulaff.net, 2014.
[22]
Martin D. Schatz, Robert A. van de Geijn, and Jack Poulson, Parallel Matrix Multiplication: A Systematic Journey, SIAM Journal on Scientific Computing, Volume 38, Issue 6, 2016.
[23]
Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn, A Tight I/O Lower Bound for Matrix Multiplication, arxiv.org:1702.02017v2, 2019. (Submitted to ACM Transactions on Mathematical Software.)
[24]
Tyler Smith and Robert van de Geijn, The MOMMS Family of Matrix Multiplication Algorithms, arxiv.org:1904.05717, 2019.
[25]
Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee, Anatomy of High-Performance Many-Threaded Matrix Multiplication, proceedings of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014.
[26]
Volker Strassen, Gaussian Elimination is not Optimal, Numer. Math. 13, pp. 354-356, 1969.
[27]
Field G. Van Zee and Tyler M. Smith, Implementing High-performance Complex Matrix Multiplication via the 3M and 4M Methods, ACM Transactions on Mathematical Software, Vol. 44, No. 1, pp. 7:1-7:36, July 2017.
[28]
Robert van de Geijn and Kazushige Goto, BLAS (Basic Linear Algebra Subprograms), Encyclopedia of Parallel Computing, Part 2, pp. 157-164, 2011. If you don't have access, you may want to read an advanced draft.
[29]
Robert van de Geijn and Jerrell Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, Concurrency: Practice and Experience, Volume 9, Number 4, 1997.
[30]
Field G. Van Zee, Implementing High-Performance Complex Matrix Multiplication via the 1m Method, ACM Transactions on Mathematical Software, in review.
[31]
Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn, The BLIS Framework: Experiments in Portability, ACM Transactions on Mathematical Software, Vol. 42, No. 2, June 2016. You can access this article for free by visiting the Science of High-Performance Computing group webpage and clicking on the title of Journal Article 39.
[32]
Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly Instantiating BLAS Functionality, ACM Transactions on Mathematical Software, Vol. 41, No. 3, June 2015. You can access this article for free by visiting the Science of High-Performance Computing group webpage and clicking on the title of Journal Article 39.
[33]
Richard C. Whaley, Antoine Petitet, and Jack J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Computing, 27 (1–2): 3–35, 2001.
[34]
Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros, Performance Optimization for the K-Nearest Neighbors Kernel on x86 Architectures, proceedings of SC'15, 2015.