This assignment is a continuation of the previous assignment and is designed to support your in-class understanding of how data analytics stacks work and to give you hands-on experience using them. You will deploy Apache Hadoop as the underlying file system and Apache Spark as the execution engine, and then implement the PageRank algorithm on top of them. You will produce a short report detailing your observations and takeaways.
After completing this programming assignment, you should:
You will complete your assignment in CloudLab. You can refer to Assignment 0 to learn how to use CloudLab. We suggest you create experiments as a group and work together. An experiment lasts 16 hours, which passes quickly, so set a time frame in which all your group members can sit together and focus on the project, or make sure to extend the experiment when necessary.
In this assignment, we provide a CloudLab profile called “378-s22-assignment1” under the “UT-CS378-S22” project for you to start your experiment. The profile is a simple 3-node cluster of VMs with Ubuntu 18 installed on each. Be patient: starting an experiment with this profile can take 15 minutes or more, so plan ahead. You get full control of the machines once the experiment is created, so feel free to install any missing packages you need for the assignment.
As the first step, run the following commands on every VM:
sudo apt update
sudo apt install openjdk-8-jdk
You should designate one VM to act as both master and slave (say the one called node-0), while the others are assigned as slaves only. Generate an SSH key pair on the master node:
ssh-keygen -t rsa
Then, manually copy the public key of node-0 into the authorized_keys file under ~/.ssh/ on all the nodes (including node-0). To get the content of the public key, run:
cat ~/.ssh/id_rsa.pub
When you copy the content, make sure you do not append any newlines. Otherwise, it will not work.
Once you have done this, you can run commands on all the other nodes from the master node (i.e., node-0) using tools like parallel-ssh. To use parallel-ssh, you will need to create a file (followers) containing the hostnames of all the machines, one per line. You can test parallel-ssh with a command like
parallel-ssh -i -h followers -O StrictHostKeyChecking=no hostname
For Part 1, follow exactly the steps in Part 1 of Assignment 1. After setting up Hadoop and Spark, configure the memory and CPU properties used by Spark applications: set the Spark driver memory to 30GB, the executor memory to 30GB, the number of executor cores to 5, and the number of CPUs per task to 1. The documentation on setting these properties is here.
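For reference, below is a minimal sketch (assuming a PySpark application; the application name is a placeholder) of what these four properties look like when set programmatically. In practice you will typically set them once in conf/spark-defaults.conf or pass them to spark-submit with --conf, since spark.driver.memory only takes effect if it is set before the driver JVM starts.

from pyspark.sql import SparkSession

# Sketch only: "pagerank" is a placeholder application name.
# spark.driver.memory is usually set in conf/spark-defaults.conf or via
# spark-submit --conf, because the driver JVM is already running by the
# time this Python code executes.
spark = (SparkSession.builder
         .appName("pagerank")
         .config("spark.driver.memory", "30g")
         .config("spark.executor.memory", "30g")
         .config("spark.executor.cores", "5")
         .config("spark.task.cpus", "1")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.memory"))  # sanity check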
In this part, you will implement the PageRank algorithm, which is used by search engines such as Google to rank the importance of web pages based on the links pointing to them. The algorithm can be summarized as follows:
1. Set the initial rank of each page to 1.
2. On each iteration, each page p contributes rank(p) / (number of outgoing neighbors of p) to each of its outgoing neighbors.
3. Update each page's rank to 0.15 + 0.85 × (sum of the contributions it received).
4. Repeat until the desired number of iterations is reached.
In this assignment, we will run the algorithm on two data sets. The Berkeley-Stanford web graph is a smaller data set for testing your algorithm, and enwiki-20180601-pages-articles is a larger one to help you better understand the performance of Spark. We have already placed the enwiki dataset at /proj/ut-cs378-s22-PG0/assignment2/data-part3/enwiki-pages-articles/ (only on the Emulab CloudLab cluster, the path is /proj/UT-CS378-S22/assignment2/data-part3/enwiki-pages-articles/). Each line in the data set consists of a page and one of its neighbors. You need to copy the data sets to HDFS first. In this assignment, always run the algorithm for a total of 10 iterations.
Task 1. Write a Scala/Python/Java Spark application that implements the PageRank algorithm. You can use either RDDs or DataFrames to implement the algorithm.
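As a starting point, here is a minimal PySpark RDD sketch of the algorithm summarized above (the HDFS input/output paths are placeholders, and the structure closely follows the standard Spark PageRank example; your own implementation may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank").getOrCreate()
sc = spark.sparkContext

def contributions(neighbors, rank):
    # Each page sends rank / out-degree to every one of its outgoing neighbors.
    num_neighbors = len(neighbors)
    for dst in neighbors:
        yield (dst, rank / num_neighbors)

# Placeholder path: each input line is "page<whitespace>neighbor".
lines = sc.textFile("hdfs:///data/web-BerkStan.txt")
edges = lines.map(lambda line: tuple(line.split()[:2])).distinct()

links = edges.groupByKey()              # page -> iterable of its outgoing neighbors
ranks = links.mapValues(lambda _: 1.0)  # initial rank of every page is 1.0

for _ in range(10):  # the assignment asks for 10 iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: contributions(kv[1][0], kv[1][1]))
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

ranks.saveAsTextFile("hdfs:///out/pagerank")  # placeholder output path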
Task 2. Add appropriate RDD/DataFrame partitioning and see what changes. Spark splits data into smaller chunks called partitions, which are distributed across the nodes in the cluster; changing the number of partitions redistributes the data, which internally requires a shuffle. You can control the number of partitions using the repartition(…) command. Run the algorithm with different numbers of partitions and report the differences you observe in the running time on the two datasets.
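For example (a sketch continuing the code above; the partition count shown is arbitrary), you can repartition the RDD or DataFrame before the iterations and compare runs:

# RDD version: redistribute the link structure into N partitions (this shuffles the data).
links = links.repartition(64)

# DataFrame version, if you implement PageRank with DataFrames
# (the "src" column name here is hypothetical):
# edges_df = edges_df.repartition(64, "src")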
Task 3. Persist the appropriate RDD/DataFrame as in-memory objects and see what changes.
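For instance (a sketch continuing the Task 1 code; which RDD/DataFrame to persist and which storage level to use are for you to decide), the link structure that is reused in every iteration is a natural candidate:

from pyspark import StorageLevel

# Keep the links in executor memory only; partitions that do not fit are
# recomputed from lineage when needed. .cache() is shorthand for the default level.
links = links.persist(StorageLevel.MEMORY_ONLY)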
Task 4. Kill a Worker process and see the changes. You should trigger the failure on a worker VM of your choice when the application reaches 25% and 75% of its lifetime:
Clear the memory cache using sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches".
Kill the Worker process.
For Tasks 1-4, your report should include the application completion time, and present and reason about any differences in performance or other findings. Looking at the lineage graphs of your applications in the Spark UI, or investigating the logs to find the amount of network/storage reads and writes and the number of tasks for each execution, may help you better understand the performance.
You should submit a tar.gz file to Canvas consisting of a brief report (filename: groupx.pdf), the code for the tasks, and a run.sh script for each part of the assignment that can re-execute your code on a similar CloudLab cluster, assuming that Hadoop and Spark are present in the same locations. (You will be placed into your groups on Canvas, so only one person per group needs to submit.)

This assignment uses insights from Professor Mosharaf Chowdhury's assignment 1 of ECE598, Fall 2017.