Syllabus - CS378 - Big Data Programming
Spring 2015
MW 9:30 - 11:00 WAG 214
Unique: 52022

Description

The map-reduce programming paradigm is a fundamental tool for processing large data sets and is supported by current tools such as Hadoop and MongoDB. Apache Spark offers another programming paradigm for processing large data sets. In this course you will gain an understanding of the concepts embodied in map-reduce and investigate how map-reduce is used to address various problems in processing and analyzing large data sets. The course explores map-reduce as implemented in Hadoop, along with the associated distributed file system (HDFS). You will also gain an understanding of the concepts supported in Spark and investigate how to apply them to various problems, including those you addressed using map-reduce.
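To make the paradigm concrete, here is a minimal sketch of the three map-reduce phases (map, shuffle, reduce) applied to word counting. It uses plain Java collections in a single process and deliberately avoids the Hadoop API, so the class name WordCountSketch and the methods map, shuffle, reduce, and wordCount are illustrative assumptions, not course-provided code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of map-reduce; not Hadoop code.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine all values for one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data big clusters", "data at scale");
        // Prints {at=1, big=2, clusters=1, data=2, scale=1}
        System.out.println(wordCount(lines));
    }
}
```

In a real framework the map and reduce calls run in parallel on many machines and the shuffle moves data across the network; the sketch only shows how data flows between the phases.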

 

Objectives

Upon completing this course, the student will be able to design and implement map-reduce programs for various large data set processing tasks, and will be able to design and implement programs using Apache Spark.

 

Prerequisites

Data structures, Java programming experience.

 

Textbooks

  • Required: MapReduce Design Patterns, by Donald Miner and Adam Shook
    • O'Reilly Media
    • Print ISBN: 978-1-4493-2717-0 | ISBN 10: 1-4493-2717-6
    • Ebook ISBN: 978-1-4493-4197-8 | ISBN 10: 1-4493-4197-7
  • Required: Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
    • O'Reilly Media
    • Print ISBN: 978-1-4493-5862-4 | ISBN 10: 1-4493-5862-4
    • Ebook ISBN: 978-1-4493-5860-0 | ISBN 10: 1-4493-5860-8
  • Recommended: Hadoop: The Definitive Guide, 3rd Edition, by Tom White
    • O'Reilly Media/Yahoo Press
    • Print ISBN: 978-1-4493-1152-0 | ISBN 10: 1-4493-1152-0
    • Ebook ISBN: 978-1-4493-1151-3 | ISBN 10: 1-4493-1151-2
 

Instructor

David Franke
Email: dfranke@cs.utexas.edu
Office: GDC 4.706
Office Hours:

  • M 11:00 AM - 12:00 PM
  • T 12:00 PM - 1:00 PM
  • By appointment

 

TA

Swadhin Pradhan
Email: swadhin@utexas.edu
Office Hours: ThF: 3:30 - 5:00 PM
Office: GDC 1.302 (TA Station)

 

Class Schedule

          Assignment #
Date      Given  Due    Points  Reading
Jan. 21                         Dean, Ghemawat paper
Jan. 26   1                     Design Patterns, Ch. 1
Jan. 28
Feb. 2    2      1      10      Design Patterns, Ch. 2
Feb. 4
Feb. 9    3      2      15
Feb. 11
Feb. 16   4      3      15
Feb. 18
Feb. 23   5      4      20      Design Patterns, Ch. 4
Feb. 25
Mar. 2    6      5      10
Mar. 4
Mar. 9
Mar. 11   7      6      20      Design Patterns, Ch. 5
Mar. 16                         Spring Break
Mar. 18
Mar. 23                         Design Patterns, Ch. 5
Mar. 25   8      7      20      Design Patterns, Ch. 3
Mar. 30
Apr. 1    9      8      15      Design Patterns, Ch. 6
Apr. 6
Apr. 8
Apr. 13   10     9      25      Learning Spark, Ch. 3
Apr. 15
Apr. 20   11     10     10      Learning Spark, Ch. 4
Apr. 22
Apr. 27          11     15
Apr. 29   12                    Learning Spark, Ch. 5, 6
May 4
May 6
May 11           12     20      No class, last assignment due

Programming Assignments

This course will focus on writing code to solve various problems, so assignments will be programming assignments. These programs will be cumulative in that subsequent assignments will build on previous programs you have written, so it is important to complete assignments on time so you can move on to the next assignment.

Small datasets will be provided for each assignment so that you do not consume too many computing resources (time and space) while developing your solution. Some assignments will also offer a large dataset so that you can measure how your map-reduce solution scales with the dataset size and the computing resources available.

You are free to discuss approaches to solving the assigned problems with your classmates, but each student is expected to write their own code. Source code must be submitted for each assignment, in addition to the results you obtained when running your program against the datasets provided. If duplicate work is detected, all parties involved will be penalized. All students should read and be familiar with the UTCS Rules to Live By.

Grading Rubric:

  • Does the program run?
  • Does the program produce the correct output?
  • Does the program contain the required element(s)?
  • Code quality: structure and documentation
Percentages for each element may be different for each assignment.

Late Policy

Required artifacts for each programming assignment are due at the start of class (9:30 AM) on the due date, as we will be discussing the solution during that class period. Penalty for late submission is 25%.

Special Notes on Assignment Submission and Grading:

  • If you submit your assignment before the due date/time, your last submission before the due date/time is what will be graded. Additional late submissions will not be considered. You must decide between submitting a partially working program before the deadline and a more complete program after the deadline.
  • Late submissions will be accepted for points only during the week following the due date.
  • Submissions made after this time (one week after the due date/time) will get some consideration when final grades are determined, so I encourage you to turn in your work even after the extended late deadline.

Final Grades

Final grades will be determined by the cumulative percentage score over all assignments. Letter grades will be assigned as follows:

  • A: 92% to 100%
  • A-: 90% to 92%
  • B+: 88% to 90%
  • B: 82% to 88%
  • B-: 80% to 82%
  • C+: 78% to 80%
  • C: 72% to 78%
  • C-: 70% to 72%
  • D+: 68% to 70%
  • D: 62% to 68%
  • D-: 60% to 62%
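As a rough illustration of the arithmetic, the sketch below computes a cumulative percentage from per-assignment points and maps it to a letter using the ranges above. It rests on several assumptions that the syllabus does not state: the class name GradeSketch and its methods are hypothetical, each range is treated as including its lower bound (so exactly 92% is an A), anything below 60% is labeled "F", the maximum points are the values read from the class schedule, and the earned scores are invented for the example.

```java
import java.util.Locale;

// Hypothetical illustration of the grade arithmetic; not an official calculator.
public class GradeSketch {

    // Cumulative percentage: total points earned over total points possible.
    static double cumulativePercent(double[] earned, int[] maxPoints) {
        double totalEarned = 0;
        double totalMax = 0;
        for (int i = 0; i < maxPoints.length; i++) {
            totalEarned += earned[i];
            totalMax += maxPoints[i];
        }
        return 100.0 * totalEarned / totalMax;
    }

    // Map a percentage to a letter. Assumption: lower bounds are inclusive,
    // and scores below 60% receive "F" (the syllabus lists no grade there).
    static String letterGrade(double percent) {
        double[] cutoffs = {92, 90, 88, 82, 80, 78, 72, 70, 68, 62, 60};
        String[] letters = {"A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-"};
        for (int i = 0; i < cutoffs.length; i++) {
            if (percent >= cutoffs[i]) {
                return letters[i];
            }
        }
        return "F";
    }

    public static void main(String[] args) {
        // Maximum points for assignments 1 to 12, as read from the class schedule.
        int[] maxPoints = {10, 15, 15, 20, 10, 20, 20, 15, 25, 10, 15, 20};
        // Invented scores for one hypothetical student.
        double[] earned = {10, 15, 12, 18, 10, 16, 20, 12, 22, 10, 15, 18};

        double percent = cumulativePercent(earned, maxPoints);
        // 178 of 195 points: prints "91.3% -> A-"
        System.out.println(String.format(Locale.ROOT, "%.1f%% -> %s", percent, letterGrade(percent)));
    }
}
```

Late penalties are left out of the sketch because the syllabus does not say how the 25% is applied.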

     

    Lecture Notes

    Further Reading

    Here are references to further reading that we will discuss during the course.

    • MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, can be downloaded here.
    • A Comparison of Join Algorithms for Log Processing in MapReduce, by Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian, can be downloaded here.
    • The Family of MapReduce and Large-Scale Data Processing Systems, by Sherif Sakr, Anna Liu, and Ayman G. Fayoumi, can be downloaded here.
    • Spark: Cluster Computing with Working Sets, by Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, can be downloaded here.
    • Additional papers on various aspects of Spark can be found here.

    Important Dates

    Important dates for the Spring 2015 semester can be found on the Academic Calendar.