The map-reduce programming paradigm is a fundamental tool for processing large data sets and is supported by current tools such as Hadoop and MongoDB. Apache Spark offers another programming paradigm for processing large data sets. In this course you will gain an understanding of the concepts embodied in map-reduce and investigate how map-reduce is used to address various problems in processing and analyzing large data sets. The course explores map-reduce as implemented in Hadoop, along with the associated distributed file system (HDFS). You will also gain an understanding of the concepts offered and supported in Spark and investigate how to apply them to various problems, including those you addressed using map-reduce.

Upon completing this course, the student will be able to design and implement map-reduce programs for various large data set processing tasks, and will be able to design and implement programs using Apache Spark.

Prerequisites: data structures and Java programming experience.

David Franke

Swadhin Pradhan

Date | Assignment Given | Assignment Due | Points | Reading |
---|---|---|---|---|
Jan. 21 | | | | Dean, Ghemawat paper |
Jan. 26 | 1 | | | Design Patterns, Ch. 1 |
Jan. 28 | | | | |
Feb. 2 | 2 | 1 | 10 | Design Patterns, Ch. 2 |
Feb. 4 | | | | |
Feb. 9 | 3 | 2 | 15 | |
Feb. 11 | | | | |
Feb. 16 | 4 | 3 | 15 | |
Feb. 18 | | | | |
Feb. 23 | 5 | 4 | 20 | Design Patterns, Ch. 4 |
Feb. 25 | | | | |
Mar. 2 | 6 | 5 | 10 | |
Mar. 4 | | | | |
Mar. 9 | | | | |
Mar. 11 | 7 | 6 | 20 | Design Patterns, Ch. 5 |
Mar. 16 | | | | Spring Break |
Mar. 18 | | | | |
Mar. 23 | | | | Design Patterns, Ch. 5 |
Mar. 25 | 8 | 7 | 20 | Design Patterns, Ch. 3 |
Mar. 30 | | | | |
Apr. 1 | 9 | 8 | 15 | Design Patterns, Ch. 6 |
Apr. 6 | | | | |
Apr. 8 | | | | |
Apr. 13 | 10 | 9 | 25 | Learning Spark, Ch. 3 |
Apr. 15 | | | | |
Apr. 20 | 11 | 10 | 10 | Learning Spark, Ch. 4 |
Apr. 22 | | | | |
Apr. 27 | | 11 | 15 | |
Apr. 29 | 12 | | | Learning Spark, Ch. 5, 6 |
May 4 | | | | |
May 6 | | | | |
May 11 | | 12 | 20 | No class, last assignment due |

This course focuses on writing code to solve various problems, so all assignments are programming assignments. The programs are cumulative: subsequent assignments build on programs you have already written, so it is important to complete each assignment on time so that you can move on to the next one.

Small datasets will be provided for each assignment so that you do not consume too many computing resources (time and space) while developing your solution. Some assignments will also offer a large dataset so that you can measure how your map-reduce solution scales with the dataset size and the available computing resources.

You are free to discuss approaches to solving the assigned problems with your classmates, but each student is expected to write their own code. Source code must be submitted for each assignment, along with the results you obtained when running your program against the datasets provided. If duplicate work is detected, all parties involved will be penalized. All students should read and be familiar with the UTCS Rules to Live By.

Grading Rubric:
Late Policy
Required artifacts for each programming assignment are due at the start of class (9:30 AM) on the due date, because the solution will be discussed during that class period. The penalty for late submission is 25%.
Special Notes on Assignment Submission and Grading:
Final Grades
The final grade will be determined by the cumulative percentage score over all assignments. Letter grades will be assigned as follows:
Further Reading