CS 315 Assignment 11: MapReduce
Due: December 2, 2011.
Files: Cons.java   MapReduce.java  
file1.txt   decind.txt   movies.txt
This assignment should be done in Java.
The purpose of this assignment is to gain experience with MapReduce
programming. MapReduce is used by Google for much of their
processing of large data sets (terabytes of data). Hadoop is an
open-source version of MapReduce written in Java; it is available
to commercial users through Amazon Web Services.
For this assignment, we will use a simplified version of MapReduce;
the way in which it is programmed is similar to Hadoop.
A MapReduce program has two basic components provided by the user, a
Mapper and a Reducer; these in turn have methods
map and reduce. The MapReduce class
calls the methods and manages the data.
map has two arguments,
a String key and a String value, plus the
MapReduce mr. For our purposes, the key will be
a line number (sometimes used, sometimes not used), and the
value will be a line of text. map will emit
zero or more items of data by calling mr.collect_map with
arguments of a key and a value that is a list of one
String (potentially more).
Note that the key in the call to mr.collect_map
is often different from the key parameter to the
map method. As an example, suppose that the mapper program
detects some conditions in the lines of input text and wants to
count those conditions; for condition foo, the call to emit this
output would be mr.collect_map("foo", list("1")); In this case,
"foo" is the key, and the value is a list of one String
value, in this case denoting one occurrence of foo.
The reduce method gets a key that is a String,
and a value that is a list of lists of (usually one) String.
The value contains all the values emitted for that key
by the map method, collected into a single list. The task
of reduce is to produce an answer from the list of values
and output that answer by calling mr.collect_reduce . The answer
is of type Cons, a list of the key and combined value. As
an example, if there are 3 occurrences of foo, the value
would be (("1") ("1") ("1")) ; the quotation marks are shown
here to emphasize that the values are String and must be parsed
to get integer values, e.g. by Integer.decode(). The final value might
be a list ("foo" 3), where the 3 is the
Integer combined value.
- An example pair of methods to perform a line count of a file is
provided. This example simply emits
mr.collect_map("line", list("1")); for each line. The reduce
method parses and adds the 1 counts, to emphasize this technique,
although it could simply find the length of the list.
- Write map and reduce methods to count the number
of occurrences of each alphabetic character in a file. The count for
each letter should be case-insensitive (i.e., include both upper-case
and lower-case versions of the letter). Ignore non-alphabetic
characters. Character frequency counts can be used in breaking
substitution ciphers.
((a 484) (b 95) (c 187) ...)
- Write map and reduce methods to count the number
of occurrences of each word in a file.
For the purposes of this assignment a word will be defined as any
string of alphabetic characters appearing between non-alphabetic
characters. nature's is two words.
The count should be case-insensitive.
If a word occurs multiple times in a line, all should be counted.
A StringTokenizer is a convenient way to parse the words from
the input line. There is documentation of StringTokenizer
online, and there is an example of its use in the reader
functions.
     
(justice 3)
- Write map and reduce methods to perform a
grep to find occurrences of a given string (case-sensitive)
in a file.
The desired string is obtained as mr.parameter() ; it may
occur more than once in a given line. The output should use the
desired string as a key and give a list of all the line numbers
(as Integer) on which the string occurs. If a word
occurs multiple times in a line, the line number should be output
multiple times.
((ther (7 46 52 63 72 73 85 94 116 168 172 182 184)))
- Write map and reduce methods to construct an
index of words in a file. For each word, the index should give a list
of the line numbers (as Integer) on which the word occurs in
the file. The index should be case-insensitive. If a word
occurs multiple times in a line, all should be included in the index.
     
(without (67 89 107))
- Write map and reduce methods to determine
the average ratings of movies. The input consists of a series
of lines, each containing a movie number, user number, rating, and
date:
     
3980,294028,5,2005-11-15
map should emit movie number and list of rating, and
reduce should return for each movie number a list of
average rating as Double, and number of ratings as
Integer. This data is similar to the Netflix Prize data.
References:
-
http://research.google.com/archive/mapreduce.html
  MapReduce Paper
-
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html  NYT Article on Hadoop
-
http://code.google.com/edu/parallel/index.html
  Google Code University
-
http://hadoop.apache.org/core/   Hadoop
- Lecture slides on PageRank