CS378 Assignment 3

Deadline: Mar 13 (Sun) 11:59 pm

Overview

This assignment is designed to support your in-class understanding of how in-memory data analytics stacks and stream processing engines work. You will learn how to write streaming applications using Structured Streaming, Spark's latest library for developing end-to-end streaming applications. As in Assignment 2, you will produce a short report detailing your observations, scripts, and takeaways.

Learning Outcomes

After completing this programming assignment, you should:

  1. Understand how in-memory data analytics stacks and stream processing engines work.
  2. Be able to develop end-to-end streaming applications using Spark Structured Streaming.

Environment Setup

You will complete your assignment in CloudLab. You can refer to Assignment 0 to learn how to use CloudLab. We suggest you create experiments as a group and work together. An experiment lasts only 16 hours, which goes by quickly, so set a time frame during which all your group members can sit together and focus on the project, or make sure to extend the experiment when necessary.

As in Assignment 1, we will continue to use the “378-s22-assignment1” profile under the “UT-CS378-S22” project for starting your experiment.

Part 1: Software Deployment

Follow the instructions in Assignments 0 and 1 to set up HDFS and Spark. After setting up Hadoop and Spark, configure the memory and CPU properties used by Spark applications: set the Spark driver memory to 30 GB, the executor memory to 30 GB, the number of executor cores to 5, and the number of CPUs per task to 1. Documentation on setting these properties is here.
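For reference, one way to set these properties (a sketch, assuming the default Spark configuration directory; the same values can also be passed to spark-submit with --conf) is to add them to $SPARK_HOME/conf/spark-defaults.conf:

    # Spark resource settings for this assignment
    spark.driver.memory     30g
    spark.executor.memory   30g
    spark.executor.cores    5
    spark.task.cpus         1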

Part 2: Structured Streaming

This part of the assignment is aimed at familiarizing you with the process of developing simple streaming applications on big-data frameworks. You will use Structured Streaming (Spark's latest library for building continuous applications) to build your applications. It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.

Structured Streaming has a new processing model that aids in developing end-to-end streaming applications. Conceptually, Structured Streaming treats all the data arriving as an unbounded input table. Each new item in the stream is like a row appended to the input table. The framework doesn't actually retain all the input, but the results will be equivalent to having all of it and running a batch job. A developer using Structured Streaming defines a query on this input table, as if it were a static table, to compute a final result table that will be written to an output sink. Spark automatically converts this batch-like query to a streaming execution plan.
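As a rough illustration of this model, consider the following minimal PySpark sketch (the path, application name, and the line-count query are placeholders, not part of the assignment):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

    # Files arriving in the directory are treated as an unbounded input table.
    lines = spark.readStream.format("text").load("hdfs:///path/to/monitoring-dir")

    # Define the query exactly as you would on a static DataFrame.
    counts = lines.groupBy("value").count()

    # Spark converts this batch-like query into an incremental streaming plan.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()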

Before developing the streaming applications, you are encouraged to read about Structured Streaming here.

For this part of the assignment, you will be developing simple streaming applications in Python/Java/Scala that analyze the Higgs Twitter Dataset. The Higgs dataset was built by monitoring the spreading processes on Twitter before, during, and after the announcement of the discovery of a new particle with the features of the Higgs boson. Each row in this dataset is of the format <userA, userB, timestamp, interaction>, where the interaction can be a retweet (RT), a mention (MT), or a reply (RE).
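For illustration, the row format could be described with a Spark schema along the following lines (a sketch; the column names are our own, and the files are assumed to be space-separated text):

    from pyspark.sql.types import StructType, StructField, StringType, LongType

    higgs_schema = StructType([
        StructField("userA", LongType()),         # user performing the action
        StructField("userB", LongType()),         # user being retweeted/mentioned/replied to
        StructField("timestamp", LongType()),     # time of the interaction (Unix epoch, assumed)
        StructField("interaction", StringType()), # one of RT, MT, RE
    ])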

We have split the dataset into a number of small files so that we can use the dataset to emulate streaming data. Download the split dataset onto your master node; you will need to log in with your UTMail account to access this drive.

Task 1. One of the key features of Structured Streaming is support for window operations on event time (as opposed to arrival time). Leveraging this feature, write a simple application that emits the number of retweets (RT), mentions (MT), and replies (RE) for an hourly window that is updated every 30 minutes, based on the timestamps of the tweets. Write the output of your application to the standard console. Take care to choose the appropriate output mode when developing your application.
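A minimal PySpark sketch of the shape such an application could take is shown below; the column names, the space-separated input format, and the update output mode are assumptions, and justifying your own output-mode choice is still part of the task:

    import sys
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_unixtime, window

    spark = SparkSession.builder.appName("Task1HourlyWindowCounts").getOrCreate()

    # Stream the split files from the monitored HDFS directory (path from argv).
    tweets = (spark.readStream
                   .schema("userA LONG, userB LONG, timestamp LONG, interaction STRING")
                   .option("sep", " ")
                   .csv(sys.argv[1]))

    # Hourly windows sliding every 30 minutes, keyed on the interaction type.
    counts = (tweets
              .withColumn("event_time", from_unixtime(col("timestamp")).cast("timestamp"))
              .groupBy(window(col("event_time"), "1 hour", "30 minutes"), col("interaction"))
              .count())

    query = (counts.writeStream
                   .outputMode("update")            # one possible choice; pick and justify your own
                   .format("console")
                   .option("truncate", "false")
                   .start())
    query.awaitTermination()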

To emulate streaming data, you are required to write a simple script that periodically (say, every 5 seconds) copies one split file of the Higgs dataset into the HDFS directory your application is listening to (a rough sketch of such a script follows the steps below). More specifically, you should do the following:

  1. Copy the entire split dataset to HDFS. This is your staging directory, which holds all of your data.

  2. Create a second, monitoring directory on HDFS. This is the directory your streaming application listens to.

  3. Periodically move your files from the staging directory to the monitoring directory using the hadoop fs -mv command.
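A rough sketch of such a mover script, written in Python for consistency with the other examples (a short shell loop works equally well); the HDFS paths and the 5-second interval are placeholders:

    import subprocess
    import time

    STAGING = "/higgs/staging"        # placeholder: staging directory already populated on HDFS
    MONITORING = "/higgs/monitoring"  # placeholder: directory the streaming application listens to

    # List the split files currently sitting in the staging directory.
    ls = subprocess.run(["hadoop", "fs", "-ls", "-C", STAGING],
                        capture_output=True, text=True)

    for path in ls.stdout.split():
        subprocess.run(["hadoop", "fs", "-mv", path, MONITORING])
        time.sleep(5)                 # feed one split file roughly every 5 seconds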

Task 2. Structured Streaming offers developers the flexibility to decide how often the data should be processed. Write a simple application that, every 10 seconds, emits the Twitter IDs of users that have been mentioned by other users. Write the output of your application to HDFS. Take care to choose the appropriate output mode when developing your application. You will have to emulate streaming data as you did in the previous task.
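A minimal sketch of one way to structure this application; the paths, the append output mode, and the assumption that userB is the mentioned user are illustrative, not prescribed:

    import sys
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("Task2MentionedUsers").getOrCreate()

    tweets = (spark.readStream
                   .schema("userA LONG, userB LONG, timestamp LONG, interaction STRING")
                   .option("sep", " ")
                   .csv(sys.argv[1]))

    # Assumption: in an MT row, userB is the user being mentioned.
    mentioned = (tweets.filter(col("interaction") == "MT")
                       .select(col("userB").alias("mentioned_user_id")))

    query = (mentioned.writeStream
                      .outputMode("append")                              # one possible choice
                      .format("csv")
                      .option("path", "hdfs:///higgs/output/task2")      # placeholder output path
                      .option("checkpointLocation", "hdfs:///higgs/chk/task2")
                      .trigger(processingTime="10 seconds")              # process every 10 seconds
                      .start())
    query.awaitTermination()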

Task 3. Another key feature of Structured Streaming is that it allows developers to mix static data and streaming computations. Write a simple application that takes as input a list of Twitter user IDs and, every 5 seconds, emits the number of tweet actions of each user that is present in the input list. Write the output of your application to the console. You will have to emulate streaming data as you did in the previous task.
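A minimal sketch of mixing the static user list with the stream; the file format and path of the static list, the column names, and the update output mode are assumptions:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Task3StaticStreamJoin").getOrCreate()

    tweets = (spark.readStream
                   .schema("userA LONG, userB LONG, timestamp LONG, interaction STRING")
                   .option("sep", " ")
                   .csv(sys.argv[1]))

    # Static input: one Twitter user ID per line (hypothetical path and format).
    user_ids = spark.read.schema("userA LONG").csv("hdfs:///higgs/user_ids.txt")

    # Count tweet actions per user, restricted to users present in the static list.
    actions = (tweets.join(user_ids, on="userA", how="inner")
                     .groupBy("userA")
                     .count())

    query = (actions.writeStream
                    .outputMode("update")                   # one possible choice
                    .format("console")
                    .trigger(processingTime="5 seconds")    # emit every 5 seconds
                    .start())
    query.awaitTermination()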

Notes:

  1. Empty the buffer cache before each experiment run. Clear the memory cache using sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches".

  2. All your applications should take a command-line argument: the HDFS path to the directory that your application will listen to.

  3. All the applications should use the cluster resources appropriately.

Deliverables

You should tar all the following files/folders into a single archive named group-x.tar.gz.

  1. Include your responses for each task in the report. This should include details of what you found, the reasons behind your findings, and corresponding evidence such as screenshots or graphs.
  2. In the report, add a section detailing the specific contributions of each group member.
  3. Give the code for each task a meaningful name. Code should be well commented (this is worth a percentage of your grade for the assignment; the grader will be looking at your code). Also create a README file for each task with instructions on how to run your code.
  4. Include a run.sh script so we can re-execute your code on a similar CloudLab cluster assuming that Hadoop and Spark are present in the same location.