Building the computer networks that train and run large AI models is becoming increasingly complicated. Traditional network designs cannot keep up with the speeds that AI workloads require, and they must be tuned to a variety of communication endpoints (such as CPUs, graphics processing units, and AI accelerators) that have widely different characteristics, including data generation rates. Moreover, AI workloads require advanced network monitoring capabilities so that performance bottlenecks can be diagnosed and resolved quickly.
To make life easier for network architects and developers, a research team led by UT Austin computer science faculty members Daehyeok Kim, Aditya Akella, and Venkat Arun is launching a project to develop Transcraft, a new framework that makes designing and implementing network stacks simpler and more adaptable. Their work promises to streamline how networks handle the growing complexity of training and serving large AI models, setting the stage for significant advances in both networking and AI.
“Designing and implementing network transport stacks is a slow, complex process, complicated by the vast design space, challenges in refining transport algorithms, and the intricacies of validating and optimizing performance,” said Daehyeok Kim, an assistant professor of computer science who is leading the project. “Our project aims to realize Transcraft by addressing key questions in computer networking, systems, and formal methods.”
The implications for both academic research and real-world AI applications are enormous, particularly as network infrastructures struggle to keep pace with the increasing demands of large-scale AI.
The research has been awarded a four-year, $1.08 million NSF Medium grant, recognizing the critical role it could play in advancing networking and AI systems on a national scale.