CS378 Assignment 5

Due April 24 by 11:59 pm

Overview

This assignment is designed to build on your in-class understanding of how distributed training of machine learning algorithms is performed. You will get hands-on experience using PyTorch and TorchServe.

Learning Outcomes

After completing this programming assignment, you should be able to:

  1. Set up a CloudLab cluster with PyTorch and TorchServe installed.
  2. Implement a standard training loop for a deep neural network in PyTorch.
  3. Perform distributed data parallel training across multiple nodes.
  4. Serve a trained model with TorchServe.

Environment Setup

You will complete your assignment in CloudLab. You can refer to Assignment 0 to learn how to use CloudLab. We suggest you create experiments as a group and work together. An experiment lasts only 16 hours, which goes by quickly, so set a time frame in which all your group members can sit together and focus on the project, or make sure to extend the experiment when necessary.

Similar to Assignment 1, we will continue to use the “378-s22-assignment1” profile under the “UT-CS378-S22” project for you to start your experiment.

As the first step, you should run the following commands on every VM:

  1. sudo apt update
  2. Install Miniconda: wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh, then run the downloaded script and follow the installation instructions
  3. Close and restart the shell to enable conda
  4. Install numpy: conda install numpy
  5. Install PyTorch for CPU: conda install pytorch torchvision torchaudio cpuonly -c pytorch
  6. Install Java JDK: sudo apt install --no-install-recommends -y openjdk-11-jre-headless.
  7. Install TorchServe: pip install torchserve torch-model-archiver torch-workflow-archiver
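Once these steps finish, a quick sanity check from Python can confirm that the CPU build of PyTorch and its Gloo distributed backend are available. This is a minimal sketch (not part of the assignment); the version printed depends on what conda installed:

    # sanity_check.py -- hypothetical helper to verify the CPU-only PyTorch install
    import torch
    import torch.distributed as dist

    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())         # expected False for the cpuonly build
    print("Gloo backend available:", dist.is_gloo_available())  # Gloo is the CPU backend used for distributed training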

Note: Ideally you would set things up inside a virtual environment, but since we have dedicated instances for this project, you are free to use the Miniconda base environment.

For this assignment the home directory is enough; you will not need to use any extra disk.

Part 1: Training VGG-11 on CIFAR-10

We have provided a base script for you to start with, which includes the model setup (model.py) and the training setup (main.py) for training a VGG-11 network on the CIFAR-10 dataset.

You can find the base training scripts to modify here.

Task 1: Fill in the standard training loop of forward pass, backward pass, loss computation, and optimizer step in main.py. Make sure to print the loss value after every 20 iterations. Run training for a total of 1 epoch (i.e., until every example has been seen once) with batch size 256.

There are several examples of training scripts available; good resources include the PyTorch examples repository and the PyTorch tutorials. This script is also the starting point for later parts of the assignment. Familiarize yourself with the script and run training for 1 epoch on a single machine.
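As a rough sketch of the Task 1 loop, the skeleton below shows the four required steps. The names model, train_loader, criterion, and optimizer are assumptions about what the provided main.py exposes, not the exact identifiers in the script:

    # Hypothetical sketch of the Task 1 training loop (names assumed from main.py)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()               # clear gradients from the previous iteration
        output = model(data)                # forward pass
        loss = criterion(output, target)    # loss computation (e.g., cross-entropy)
        loss.backward()                     # backward pass
        optimizer.step()                    # optimizer step
        if (batch_idx + 1) % 20 == 0:       # print the loss value every 20 iterations
            print(f"iteration {batch_idx + 1}: loss = {loss.item():.4f}")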

Part 2: Distributed Data Parallel Training Using the Built-in Module

Next, you will modify the script used in Part 1 to enable distributed data parallel training. Distributed training is performed in primarily two ways: (i) data parallel and (ii) model parallel. In data parallel training, each participating worker trains the same network but on different data points from the dataset. After each iteration (forward and backward pass), the workers average their local gradients to produce a single update. In model parallel training, the model is partitioned among a number of workers; each worker trains its part of the model and, during the forward pass, sends its output to the worker holding the next partition (and vice versa during the backward pass). Model parallelism is usually used when the network is too large to fit on a single worker. In this assignment we focus solely on data parallel training, for which you will need to partition the data among the participating nodes.
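To make the gradient-averaging idea concrete, the illustrative sketch below shows what a worker would do by hand with PyTorch's collective operations after backward(). It is not a required part of the assignment; the built-in module described next performs this for you:

    # Illustrative only: manually average local gradients across workers after backward()
    import torch.distributed as dist

    def average_gradients(model, world_size):
        for param in model.parameters():
            if param.grad is not None:
                # Sum this layer's gradient across all workers, then divide by the worker count
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size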

In this part, we use the distributed functionality provided by PyTorch. Register your model with DistributedDataParallel and perform distributed training. Unlike in Part 1, you will not need to handle the gradients for each layer yourself, as DistributedDataParallel performs these steps automatically. For more details read here.
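A minimal sketch of the registration step is shown below, using the Gloo backend since the nodes are CPU-only. The master address, port, rank, and world size are placeholders you would supply for your own cluster:

    # Hypothetical sketch: wrap the Part 1 model with DistributedDataParallel
    import os
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_model(model, rank, world_size, master_addr="10.10.1.1", master_port="29500"):
        os.environ["MASTER_ADDR"] = master_addr    # IP of the rank-0 node (placeholder)
        os.environ["MASTER_PORT"] = master_port    # any free port (placeholder)
        dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)  # Gloo is the CPU backend
        return DDP(model)  # backward() now averages gradients across all workers

The training loop from Part 1 can then be reused unchanged on the wrapped model.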

Data Partitioning: In data parallel training, the workers train on non-overlapping data partitions. You will use the distributed sampler to distribute the data among workers. For more details look at torch.utils.data.distributed.DistributedSampler.
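A sketch of how the sampler typically plugs into the data loading code is shown below; the dataset path and transform are placeholder assumptions, and the provided script may use different ones:

    # Hypothetical sketch: give each worker a non-overlapping shard of CIFAR-10
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from torchvision import datasets, transforms

    def make_loader(rank, world_size, batch_size=256):
        transform = transforms.ToTensor()   # the provided script likely adds normalization as well
        train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
        sampler = DistributedSampler(train_set, num_replicas=world_size, rank=rank)
        # The sampler handles per-epoch shuffling; call sampler.set_epoch(epoch) at the start of each epoch
        return DataLoader(train_set, batch_size=batch_size, sampler=sampler, shuffle=False)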

Part 3: Serving Your Model with TorchServe

TorchServe is a performant, flexible, and easy-to-use tool for serving PyTorch models. In this part, we will serve the model trained in Part 1 using only one machine. You will need to do the following steps:
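For orientation, a typical TorchServe workflow looks roughly like the commands below. The model name vgg11, the checkpoint file vgg11.pt, the sample image, and the built-in image_classifier handler are placeholder assumptions; your tasks may require different names or a custom handler:

    # create a directory to hold the model archive
    mkdir model_store
    # package the model definition and trained weights into a .mar archive
    torch-model-archiver --model-name vgg11 --version 1.0 --model-file model.py --serialized-file vgg11.pt --handler image_classifier --export-path model_store
    # start TorchServe and load the archived model
    torchserve --start --ncs --model-store model_store --models vgg11=vgg11.mar
    # send a sample inference request to the prediction API (default port 8080)
    curl http://127.0.0.1:8080/predictions/vgg11 -T sample.jpg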

After finishing task 3, stop TorchServe using: torchserve --stop

Deliverables

You should submit a tar.gz file to Canvas consisting of a brief report (filename: groupx.pdf) and the code for each task. In the report, include the following contents: