In this assignment, we will learn how to use Databricks' GraphFrames library for graph-parallel computation in the Spark ecosystem. GraphFrames is a package for Apache Spark that provides DataFrame-based graphs, with high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality that takes advantage of Spark DataFrames.
We strongly encourage you to go through the GraphFrames user guide before starting this assignment. This guide should give you a comprehensive overview of GraphFrames' programming model.
After completing this programming assignment, you should be able to:
As usual, you will complete your assignment in CloudLab. Please refer to Assignment 0 to learn how to use CloudLab.
As in Assignment 1, use the “378-s22-assignment1” profile under the “UT-CS378-S22” project to start your experiment.
Follow the instructions in Assignments 0 and 1 to set up HDFS and Spark. After setting up Hadoop and Spark, set the properties for the memory and CPU used by Spark applications: set the Spark driver memory to 30 GB, the executor memory to 30 GB, the number of executor cores to 5, and the number of CPUs per task to 1. Documentation on setting properties can be found here.
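The settings above correspond to four standard Spark properties: spark.driver.memory, spark.executor.memory, spark.executor.cores, and spark.task.cpus. The sketch below collects them in a dictionary and renders them in the "key value" format used by conf/spark-defaults.conf. The helper name render_spark_defaults is hypothetical, and writing to spark-defaults.conf is only one option; the same keys can also be passed with --conf on the command line.

```python
# Spark properties matching the values requested in the setup instructions.
SPARK_PROPERTIES = {
    "spark.driver.memory": "30g",
    "spark.executor.memory": "30g",
    "spark.executor.cores": "5",
    "spark.task.cpus": "1",
}


def render_spark_defaults(properties):
    """Render properties as 'key value' lines (hypothetical helper)."""
    return "\n".join(f"{key} {value}" for key, value in properties.items())


if __name__ == "__main__":
    # Paste the printed lines into conf/spark-defaults.conf on the cluster.
    print(render_spark_defaults(SPARK_PROPERTIES))
```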
GraphFrames is available as a Spark package. To use GraphFrames in your Spark program, you have to specify which packages to load when you create a SparkSession. Specifically, create a SparkSession as follows:
import pyspark
sparkSession = pyspark.sql.SparkSession.builder.appName("MyApp").config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12").getOrCreate()
This allows you to use the various GraphFrames constructs in your program.
It is also possible to import a Spark package via spark-submit. Please see this post for more details.
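As a rough sketch of the spark-submit route, the snippet below assembles a command line that passes the same GraphFrames coordinates through the --packages option. The script name pagerank_graphframes.py and the helper build_submit_command are hypothetical placeholders, not names required by the assignment.

```python
import shlex

# Same package coordinates as in the SparkSession example.
GRAPHFRAMES_PACKAGE = "graphframes:graphframes:0.8.2-spark3.2-s_2.12"


def build_submit_command(script, package=GRAPHFRAMES_PACKAGE):
    """Return the spark-submit argument list that loads `package` for `script`."""
    return ["spark-submit", "--packages", package, script]


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell or a run.sh script.
    print(shlex.join(build_submit_command("pagerank_graphframes.py")))
```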
For this part of this assignment, write an application that implements the PageRank algorithm using the various constructs that GraphFrames provides. Please use the summarized algorithm you implemented in Assignment 2.
Note: Your application cannot use the built-in GraphFrames PageRank object.
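When developing the GraphFrames version, it may help to have a small plain-Python reference to cross-check ranks on a toy graph. The sketch below is only that, not a GraphFrames implementation. It assumes that every rank starts at 1.0 and is updated as 0.15 + 0.85 * (sum of incoming contributions), where each page contributes its rank divided by its number of out-links. This update rule is an assumption, since the Assignment 2 algorithm is not reproduced here; adjust it if your summarized algorithm differs.

```python
from collections import defaultdict


def reference_pagerank(edges, iterations=10, damping=0.85):
    """Plain-Python PageRank over (src, dst) pairs, for checking small graphs.

    ASSUMPTION: ranks start at 1.0 and are updated as
    (1 - damping) + damping * sum(contributions).
    """
    out_links = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        out_links[src].add(dst)
        nodes.update((src, dst))

    ranks = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        contributions = defaultdict(float)
        for src, neighbors in out_links.items():
            # Each page splits its current rank evenly among its out-links.
            share = ranks[src] / len(neighbors)
            for dst in neighbors:
                contributions[dst] += share
        # Nodes with no incoming contributions fall back to the base rank.
        ranks = {node: (1 - damping) + damping * contributions[node] for node in nodes}
    return ranks
```

Comparing the output of this function with the GraphFrames result on a handful of edges is a cheap sanity check before running on the full dataset.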
As in Assignment 2, we will use the Berkeley-Stanford web graph dataset for this assignment. You are required to execute the algorithm for a total of 10 iterations. Each line in the dataset consists of a FromNodeId and one of its neighbors. You are required to copy this file to HDFS.
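To illustrate the input format, here is a minimal sketch that turns dataset lines into vertex and edge rows. GraphFrames expects a vertex DataFrame with an "id" column and an edge DataFrame with "src" and "dst" columns, so the rows are shaped accordingly. The sketch assumes whitespace-separated node IDs and header lines that start with "#"; the function names are hypothetical. It runs on the driver in plain Python and is meant for small samples only; for the full dataset, the same parsing would be done with DataFrame operations on the file in HDFS.

```python
def parse_edge_line(line):
    """Return (src, dst) for a data line, or None for blank and '#' comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    src, dst = stripped.split()[:2]
    return src, dst


def build_vertex_and_edge_rows(lines):
    """Return (vertex_rows, edge_rows) shaped for columns ["id"] and ["src", "dst"]."""
    edges = [edge for edge in map(parse_edge_line, lines) if edge is not None]
    vertex_ids = sorted({node for edge in edges for node in edge})
    return [(vertex_id,) for vertex_id in vertex_ids], edges


if __name__ == "__main__":
    sample = ["# FromNodeId ToNodeId", "1\t2", "1\t3", "2\t3"]
    vertex_rows, edge_rows = build_vertex_and_edge_rows(sample)
    print(vertex_rows)
    print(edge_rows)
```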
Task 1. Implement the PageRank application using the GraphFrames constructs.
Task 2. Compare your GraphFrames application's runtime to that of the PageRank program you wrote in Assignment 2.
Task 3. Based on the tasks above, does GraphFrames provide additional benefits when implementing the PageRank algorithm? Explain any difference in performance, and the reasons for it, in your report.
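For the runtime comparison in Task 2, one simple option is to wrap each run in a wall-clock timer. The sketch below uses a small context manager; the name timed and the placeholder workload are hypothetical. Because Spark evaluates transformations lazily, the timed block should include the action that actually triggers the computation (for example, writing the final ranks), otherwise only the job setup is measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    results = {}
    with timed("placeholder", results):
        # Replace with the PageRank run, including the action that writes the output.
        sum(range(1000))
    print(results)
```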
You should submit a tar.gz file to Canvas containing a brief report (filename: groupx.pdf) and the code of your PageRank implementation (you are in groups on Canvas, so only one person needs to submit). Also include a run.sh script for each part of the assignment that can re-execute your code on a similar CloudLab cluster, or a detailed README explaining how we should execute your code.