Exploiting Modern GPU Architectural Features for Distributed Multi-GPU Graph Analytics

Project contacts: Vishwesh Jatala (vishwesh.jatala@austin.utexas.edu)
Roshan Dathathri (roshan@cs.utexas.edu)

Project description: GPUs have become a popular platform for improving the performance of graph analytics systems. However, real-world graphs are very large and cannot be processed within the memory capacity of a single GPU. Hence, researchers have focused on developing distributed multi-host multi-GPU graph analytics frameworks. D-IrGL is one such system that supports multi-host multi-GPU architectures. It uses code generated by IrGL [1] to perform computation on each GPU and uses Gluon’s [2] communication optimizations for synchronization among GPUs.

Currently, D-IrGL does not use the features of modern GPU architectures to optimize the communication phase. In this project, your objective is to improve the performance of the D-IrGL framework by exploiting the following modern GPU architectural features:

[1] Asynchronous data transfers and streams within a single GPU.
[2] Virtual memory to support large graphs on a single GPU.
[3] Inter-GPU communication without CPU intervention, both among GPUs located in a single machine and among GPUs located across multiple machines. This removes the overhead of redundant data transfers through the CPU. You can achieve this with the NVLink and GPUDirect RDMA features.
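
As a starting point, features [1] and the single-machine part of [3] can be sketched with the CUDA runtime API. The sketch below is illustrative only and is not taken from D-IrGL or Gluon: the helper names run_stream_pipeline and run_peer_copy, the add_one kernel, and the chunking scheme are all assumptions made for this example. It assumes a CUDA-capable GPU (two GPUs for the peer copy). Communication across machines with GPUDirect RDMA is not shown; it is typically reached through a CUDA-aware MPI library rather than through these calls.

```cuda
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                        \
  do {                                                          \
    cudaError_t err_ = (call);                                  \
    if (err_ != cudaSuccess) {                                  \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                   cudaGetErrorString(err_));                   \
      std::exit(EXIT_FAILURE);                                  \
    }                                                           \
  } while (0)

// Stand-in for the per-GPU graph computation: increments every element.
__global__ void add_one(int* data, size_t n) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) data[i] += 1;
}

// Hypothetical helper for feature [1]: split a buffer into chunks and give
// each chunk its own stream, so that the copies and kernels of different
// chunks may overlap. Returns true if every element came back incremented.
bool run_stream_pipeline(size_t n, int num_streams) {
  if (n == 0 || num_streams <= 0) return false;
  int* host = nullptr;
  int* dev = nullptr;
  // Pinned host memory is needed for the copies to be truly asynchronous.
  CUDA_CHECK(cudaMallocHost(&host, n * sizeof(int)));
  CUDA_CHECK(cudaMalloc(&dev, n * sizeof(int)));
  for (size_t i = 0; i < n; ++i) host[i] = static_cast<int>(i);
  std::vector<cudaStream_t> streams(num_streams);
  for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

  const size_t chunk = (n + num_streams - 1) / num_streams;
  for (int s = 0; s < num_streams; ++s) {
    const size_t begin = s * chunk;
    if (begin >= n) break;
    const size_t len = std::min(chunk, n - begin);
    const size_t bytes = len * sizeof(int);
    CUDA_CHECK(cudaMemcpyAsync(dev + begin, host + begin, bytes,
                               cudaMemcpyHostToDevice, streams[s]));
    const unsigned blocks = static_cast<unsigned>((len + 255) / 256);
    add_one<<<blocks, 256, 0, streams[s]>>>(dev + begin, len);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(host + begin, dev + begin, bytes,
                               cudaMemcpyDeviceToHost, streams[s]));
  }
  for (auto& s : streams) CUDA_CHECK(cudaStreamSynchronize(s));

  bool ok = true;
  for (size_t i = 0; i < n; ++i) ok = ok && (host[i] == static_cast<int>(i) + 1);
  for (auto& s : streams) CUDA_CHECK(cudaStreamDestroy(s));
  CUDA_CHECK(cudaFree(dev));
  CUDA_CHECK(cudaFreeHost(host));
  return ok;
}

// Hypothetical helper for the single-machine part of feature [3]: copy a
// buffer from GPU 0 to GPU 1 with cudaMemcpyPeer. Returns false if the
// machine lacks two peer-capable GPUs or if the copied data does not match.
bool run_peer_copy(size_t n) {
  if (n == 0) return false;
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return false;
  int can_access = 0;
  CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access, 1, 0));
  if (!can_access) return false;

  const size_t bytes = n * sizeof(int);
  int* src = nullptr;
  int* dst = nullptr;
  CUDA_CHECK(cudaSetDevice(0));
  CUDA_CHECK(cudaMalloc(&src, bytes));
  CUDA_CHECK(cudaMemset(src, 1, bytes));  // every byte becomes 0x01
  CUDA_CHECK(cudaSetDevice(1));
  // Let GPU 1 address GPU 0's memory; tolerate an earlier enable.
  cudaError_t e = cudaDeviceEnablePeerAccess(0, 0);
  if (e == cudaErrorPeerAccessAlreadyEnabled) (void)cudaGetLastError();
  else CUDA_CHECK(e);
  CUDA_CHECK(cudaMalloc(&dst, bytes));
  CUDA_CHECK(cudaMemcpyPeer(dst, 1, src, 0, bytes));

  std::vector<int> check(n);
  CUDA_CHECK(cudaMemcpy(check.data(), dst, bytes, cudaMemcpyDeviceToHost));
  bool ok = true;
  for (int v : check) ok = ok && (v == 0x01010101);
  CUDA_CHECK(cudaFree(dst));
  CUDA_CHECK(cudaSetDevice(0));
  CUDA_CHECK(cudaFree(src));
  return ok;
}
```

Calling run_stream_pipeline(1 << 20, 4) from a host program exercises the stream pipeline on one GPU; run_peer_copy(1 << 20) returns false on machines without two peer-capable GPUs. Whether the copies actually overlap, and whether the peer copy travels over NVLink or PCIe, depends on the hardware, so a profiler is the way to confirm either.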

You will implement this project in D-IrGL (https://github.com/IntelligentSoftwareSystems/Galois). You will be provided with the following hardware resources.

Hardware:

Bridges cluster P100 nodes: 4 machines, each with 2 NVIDIA P100 GPUs.
Bridges cluster V100 nodes: 2 machines, each with 8 NVIDIA Volta V100 GPUs.
Tuxedo (internal machine located at UT): 1 machine with 4 K80 GPUs and 2 GTX 1080 GPUs.

Project deliverables and deadlines:

(Nov 1) Description of project proposal.
(Nov 8) An understanding of the project and of the source code in which you will implement your ideas, and a collection of baseline performance results of D-IrGL on the hardware you’re using.
(Dec 6) Extensions to D-IrGL that support the modern GPU architectural features.
(Dec 6) A project report that describes your contributions, in an ACM conference paper format.

Papers:

[1] Sreepathi Pai, Keshav Pingali: A compiler for throughput optimization of graph algorithms on GPUs. OOPSLA 2016: 1-19

[2] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, Keshav Pingali: Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics. PLDI 2018: 752-768

[3] Hao Wang, Sreeram Potluri, Miao Luo, Ashish Kumar Singh, Sayantan Sur, Dhabaleswar K. Panda: MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - R&D 26(3-4): 257-266 (2011)

[4] Sreeram Potluri, Khaled Hamidouche, Akshay Venkatesh, Devendar Bureddy, Dhabaleswar K. Panda: Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. ICPP 2013: 80-89

[5] https://devblogs.nvidia.com/introduction-cuda-aware-mpi/