This assignment should be done in Java. The purpose of this assignment is to gain experience with MapReduce programming. MapReduce is used by Google for much of their processing of large data sets (terabytes of data). Hadoop is an open-source version of MapReduce written in Java; it is available to commercial users through Amazon Web Services. For this assignment, we will use a simplified version of MapReduce; the way in which it is programmed is similar to Hadoop.
A MapReduce program has two basic components provided by the user, a Mapper and a Reducer; these in turn have methods map and reduce. The MapReduce class calls the methods and manages the data.
map has two arguments, a String key and a String value, plus the MapReduce mr. For our purposes, the key will be a line number (sometimes used, sometimes not used), and the value will be a line of text. map will emit zero or more items of data by calling mr.collect_map with arguments of a key and a value that is a list of one String (potentially more). Note that the key in the call to mr.collect_map is often different from the key parameter to the map method. As an example, suppose that the mapper program detects some conditions in the lines of input text and wants to count those conditions; for condition foo, the call to emit this output would be mr.collect_map("foo", list("1")); In this case, "foo" is the key, and the value is a list of one String value, in this case denoting one occurrence of foo.
The reduce method gets a key that is a String, and a value that is a list of lists of (usually one) String. The value contains all the values emitted for that key by the map method, collected into a single list. The task of reduce is to produce an answer from the list of values and output that answer by calling mr.collect_reduce . The answer is of type Cons, a list of the key and combined value. As an example, if there are 3 occurrences of foo, the value would be (("1") ("1") ("1")) ; the quotation marks are shown here to emphasize that the values are String and must be parsed to get integer values, e.g. by Integer.decode(). The final value might be a list ("foo" 3), where the 3 is the Integer combined value.
((a 484) (b 95) (c 187) ...)
((ther (7 46 52 63 72 73 85 94 116 168 172 182 184)))
map should emit movie number and list of rating, and reduce should return for each movie number a list of average rating as Double, and number of ratings as Integer. This data is similar to the Netflix Prize data.
References: