CS 315 Assignment 11: MapReduce

Due: December 2, 2011.

Files: Cons.java MapReduce.java file1.txt decind.txt movies.txt

This assignment should be done in Java. The purpose of this assignment is to gain experience with MapReduce programming. MapReduce is used by Google for much of their processing of large data sets (terabytes of data). Hadoop is an open-source version of MapReduce written in Java; it is available to commercial users through Amazon Web Services. For this assignment, we will use a simplified version of MapReduce; the way in which it is programmed is similar to Hadoop.

A MapReduce program has two basic components provided by the user, a Mapper and a Reducer; these in turn have methods map and reduce. The MapReduce class calls the methods and manages the data.

map has two arguments, a String key and a String value, plus the MapReduce mr. For our purposes, the key will be a line number (sometimes used, sometimes not used), and the value will be a line of text. map will emit zero or more items of data by calling mr.collect_map with arguments of a key and a value that is a list of one String (potentially more). Note that the key in the call to mr.collect_map is often different from the key parameter to the map method. As an example, suppose that the mapper program detects some conditions in the lines of input text and wants to count those conditions; for condition foo, the call to emit this output would be mr.collect_map("foo", list("1")); In this case, "foo" is the key, and the value is a list of one String value, in this case denoting one occurrence of foo.

The reduce method gets a key that is a String, and a value that is a list of lists of (usually one) String. The value contains all the values emitted for that key by the map method, collected into a single list. The task of reduce is to produce an answer from the list of values and output that answer by calling mr.collect_reduce . The answer is of type Cons, a list of the key and combined value. As an example, if there are 3 occurrences of foo, the value would be (("1") ("1") ("1")) ; the quotation marks are shown here to emphasize that the values are String and must be parsed to get integer values, e.g. by Integer.decode(). The final value might be a list ("foo" 3), where the 3 is the Integer combined value.

An example pair of methods to perform a line count of a file is provided. This example simply emits mr.collect_map("line", list("1")); for each line. The reduce method parses and adds the 1 counts, to emphasize this technique, although it could simply find the length of the list.
Write map and reduce methods to count the number of occurrences of each alphabetic character in a file. The count for each letter should be case-insensitive (i.e., include both upper-case and lower-case versions of the letter). Ignore non-alphabetic characters. Character frequency counts can be used in breaking substitution ciphers.
```
((a 484) (b 95) (c 187)  ...)
```
Write map and reduce methods to count the number of occurrences of each word in a file. For the purposes of this assignment a word will be defined as any string of alphabetic characters appearing between non-alphabetic characters. nature's is two words. The count should be case-insensitive. If a word occurs multiple times in a line, all should be counted. A StringTokenizer is a convenient way to parse the words from the input line. There is documentation of StringTokenizer online, and there is an example of its use in the reader functions. (justice 3)
Write map and reduce methods to perform a grep to find occurrences of a given string (case-sensitive) in a file. The desired string is obtained as mr.parameter() ; it may occur more than once in a given line. The output should use the desired string as a key and give a list of all the line numbers (as Integer) on which the string occurs. If a word occurs multiple times in a line, the line number should be output multiple times.
```
((ther (7 46 52 63 72 73 85 94 116 168 172 182 184)))
```
Write map and reduce methods to construct an index of words in a file. For each word, the index should give a list of the line numbers (as Integer) on which the word occurs in the file. The index should be case-insensitive. If a word occurs multiple times in a line, all should be included in the index. (without (67 89 107))
Write map and reduce methods to determine the average ratings of movies. The input consists of a series of lines, each containing a movie number, user number, rating, and date: 3980,294028,5,2005-11-15
map should emit movie number and list of rating, and reduce should return for each movie number a list of average rating as Double, and number of ratings as Integer. This data is similar to the Netflix Prize data.

References: