In this course, we will learn about the architecture, design, and implementation of
software and hardware systems that play a central role in modern large-scale machine learning (ML) training
and inference.
Lectures will discuss various state-of-the-art frameworks for programming
distributed training and inference, and techniques for hardware acceleration,
compilation, and distributed execution that make large-scale training possible, easy to use, and performant. We will also cover the basic systems challenges brought on by the advent of large language models (LLMs) with respect to performance and resource efficiency, and the new techniques and systems building blocks that address these challenges.
Specifically, a tentative outline of the lecture topics we will cover is here. On the practice front, we will do 5 programming assignments; the planned assignments are here.

Administrative Details

Class time
Monday, Wednesday 9:30 am - 11:00 am

Class location
UTC 4.110

Pre-requisites (not strict, but useful)
CS429 and CS439 are useful but not required. A working knowledge of machine learning and computer systems will suffice to take this course, as background relevant to the lectures, programming assignments, and quizzes will be provided in class. Programming proficiency in Python is strongly encouraged.

Grading
The course will have 3 in-class quizzes (no midterms or finals) and 5 programming assignments that dovetail with in-class discussions of the topics above. Grading split is as follows:
Instructor: Aditya Akella
Email: akella@cs.utexas.edu
Office Hours: Monday 11:00am-1:00pm
Location: GDC 6.826
TA: Bodun Hu
Email: bodunhu@utexas.edu
Office Hours: Wednesday 11:00am-12:00pm
Location: GDC 1.302 Station Desk 3
TA: Brian Chang
Email: brianchang@utexas.edu
Office Hours: Tuesday 10:30am-11:30am
Location: GDC 1.302 Station Desk 2