CS378 Assignment 5

Due April 24 by 11:59 pm

Overview

This assignment is designed to build on your in-class understanding of how distributed training of machine learning algorithms is performed. You will get hands-on experience using PyTorch and TorchServe.

Learning Outcomes

After completing this programming assignment, you should be able to:

  1. Set up a CloudLab cluster with PyTorch and TorchServe installed.
  2. Implement a standard training loop for a deep neural network in PyTorch.
  3. Perform distributed data parallel training across multiple nodes.
  4. Serve a trained model with TorchServe.

Environment Setup

You will complete your assignment in CloudLab. You can refer to Assignment 0 to learn how to use CloudLab. We suggest you create experiments as a group and work together. An experiment lasts only 16 hours, which goes by quickly, so set a time frame in which all your group members can sit together and focus on the project, or make sure to extend the experiment when necessary.

Similar to Assignment 1, we will continue to use the “378-s22-assignment1” profile under the “UT-CS378-S22” project for you to start your experiment.

As the first step, you should run the following commands on every VM:

  1. sudo apt update
  2. Install Miniconda: wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh, then run the downloaded script and follow the installation instructions
  3. Close and restart the shell to enable conda
  4. Install numpy: conda install numpy
  5. Install PyTorch for CPU: conda install pytorch torchvision torchaudio cpuonly -c pytorch
  6. Install Java JDK: sudo apt install --no-install-recommends -y openjdk-11-jre-headless.
  7. Install TorchServe: pip install torchserve torch-model-archiver torch-workflow-archiver
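Once these steps finish, a quick sanity check from Python can confirm that the CPU build of PyTorch and its Gloo distributed backend are available. This is a minimal sketch (not part of the assignment); the version printed depends on what conda installed:

    # sanity_check.py -- hypothetical helper to verify the CPU-only PyTorch install
    import torch
    import torch.distributed as dist

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())         # expected False for the cpuonly build
    print("Gloo backend available:", dist.is_gloo_available())  # Gloo is the CPU backend used for distributed training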

Note: Ideally you would set things up inside a virtual environment, but since we have dedicated instances for this project, you are free to use the Miniconda base environment.

For this assignment the home directory is enough; you will not need to use any extra disk.

Part 1: Training VGG-11 on CIFAR-10

We have provided a base script for you to start with, which includes the model setup (model.py) and the training setup (main.py) for training a VGG-11 network on the CIFAR-10 dataset.

You can find the base training scripts to modify here.

Task 1: Fill in the standard training loop of forward pass, backward pass, loss computation, and optimizer step in main.py. Make sure to print the loss value after every 20 iterations. Run training for a total of 1 epoch (i.e., until every example has been seen once) with batch size 256.

There are several examples of training scripts available; good resources include the PyTorch examples repository and the PyTorch tutorials. This script is also the starting point for later parts of the assignment. Familiarize yourself with the script and run training for 1 epoch on a single machine.
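As a rough sketch of the Task 1 loop, the skeleton below shows the four required steps. The names model, train_loader, criterion, and optimizer are assumptions about what the provided main.py exposes, not the exact identifiers in the script:

    # Hypothetical sketch of the Task 1 training loop (names assumed from main.py)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()               # clear gradients from the previous iteration
        output = model(data)                # forward pass
        loss = criterion(output, target)    # loss computation (e.g., cross-entropy)
        loss.backward()                     # backward pass
        optimizer.step()                    # optimizer step
        if (batch_idx + 1) % 20 == 0:       # print the loss value every 20 iterations
            print(f"iteration {batch_idx + 1}: loss = {loss.item():.4f}")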

Part 2: Distributed Data Parallel Training Using the Built-in Module

Next, you will modify the script used in Part 1 to enable distributed data parallel training. Distributed training is performed in primarily two ways: (i) data parallel and (ii) model parallel. In data parallel training, each participating worker trains the same network but on different data points from the dataset. After each iteration (forward and backward pass), the workers average their local gradients to produce a single update. In model parallel training, the model is partitioned among a number of workers; each worker trains its part of the model and, during the forward pass, sends its output to the worker holding the next partition (and vice versa during the backward pass). Model parallelism is usually used when the network is too large to fit on a single worker. In this assignment we focus solely on data parallel training, for which you will need to partition the data among the participating nodes.
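To make the gradient-averaging idea concrete, the illustrative sketch below shows what a worker would do by hand with PyTorch's collective operations after backward(). It is not a required part of the assignment; the built-in module described next performs this for you:

    # Illustrative only: manually average local gradients across workers after backward()
    import torch.distributed as dist

    def average_gradients(model, world_size):
        for param in model.parameters():
            if param.grad is not None:
                # Sum this layer's gradient across all workers, then divide by the worker count
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size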

In this part, we use the distributed functionality provided by PyTorch. Register your model with DistributedDataParallel and perform distributed training. Unlike in Part 1, you will not need to handle the gradients for each layer yourself, as DistributedDataParallel performs these steps automatically. For more details read here.
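A minimal sketch of the registration step is shown below, using the Gloo backend since the nodes are CPU-only. The master address, port, rank, and world size are placeholders you would supply for your own cluster:

    # Hypothetical sketch: wrap the Part 1 model with DistributedDataParallel
    import os
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_model(model, rank, world_size, master_addr="10.10.1.1", master_port="29500"):
        os.environ["MASTER_ADDR"] = master_addr    # IP of the rank-0 node (placeholder)
        os.environ["MASTER_PORT"] = master_port    # any free port (placeholder)
        dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)  # Gloo is the CPU backend
        return DDP(model)  # backward() now averages gradients across all workers

The training loop from Part 1 can then be reused unchanged on the wrapped model.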

Data Partitioning: In data parallel training, the workers train on non-overlapping data partitions. You will use the distributed sampler to distribute the data among workers. For more details look at torch.utils.data.distributed.DistributedSampler.
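A sketch of how the sampler typically plugs into the data loading code is shown below; the dataset path and transform are placeholder assumptions, and the provided script may use different ones:

    # Hypothetical sketch: give each worker a non-overlapping shard of CIFAR-10
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets, transforms

    def make_loader(rank, world_size, batch_size=256):
        transform = transforms.ToTensor()   # the provided script likely adds normalization as well
        train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
        sampler = DistributedSampler(train_set, num_replicas=world_size, rank=rank)
        # The sampler handles per-epoch shuffling; call sampler.set_epoch(epoch) at the start of each epoch
        return DataLoader(train_set, batch_size=batch_size, sampler=sampler, shuffle=False)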

Part 3: Serving Your Model with TorchServe

TorchServe is a performant, flexible, and easy-to-use tool for serving PyTorch models. In this part, we will serve the model trained in Part 1 using only one machine. You will need to do the following steps:
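For orientation, a typical TorchServe workflow looks roughly like the commands below. The model name vgg11, the checkpoint file vgg11.pt, the sample image, and the built-in image_classifier handler are placeholder assumptions; your tasks may require different names or a custom handler:

    # create a directory to hold the model archive
    mkdir model_store
    # package the model definition and trained weights into a .mar archive
    torch-model-archiver --model-name vgg11 --version 1.0 --model-file model.py --serialized-file vgg11.pt --handler image_classifier --export-path model_store
    # start TorchServe and load the archived model
    torchserve --start --ncs --model-store model_store --models vgg11=vgg11.mar
    # send a sample inference request to the prediction API (default port 8080)
    curl http://127.0.0.1:8080/predictions/vgg11 -T sample.jpg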

After finishing task 3, stop TorchServe using: torchserve --stop

Deliverables

You should submit a tar.gz file to Canvas consisting of a brief report (filename: groupx.pdf) and the code for each task. In the report, include the following contents: