CS 378 Concurrency
Lab 6
Programming Assignment:
Map Reduce Search Engine
Here is an example of the output of the search engine for the query "hadoop export":

$> bin/hadoop jar search.jar Search hadoop export

Search Complete
3 relevant files found
Terms: hadoop, export

hadoop-env.sh (20)
...export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"...
...export HADOOP_LOG_DIR=${HADOOP_HOME}/logs...
...export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"...
...export HADOOP_TASKTRACKER_OPTS=...

hadoop-default.xml (5)
...Indicates how many times hadoop should attempt to contact the...
...The filesystem for Hadoop archives...
...tasks and applications using Hadoop Pipes, Hadoop Streaming etc....
...Should native hadoop libraries, if present, be used...

log4j.properties (1)
...# Sends counts of logging messages at different severity levels to Hadoop Metrics....
"operating system" http://en.wikipedia.org/wiki/Operating_system (offset 2355) http://www.webopedia.com/TERM/O/operating_system.html (offset 12) ... "semaphore" http://en.wikipedia.org/wiki/Semaphore_(programming) http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Semaphore.html ... ...
Hence, when we want to search for the term "operating system", rather than reading all possible files looking for the term, we simply go to the index, see that it occurs in a number of indexed files at certain offsets from the beginning of the file, and we can return a result for the search very quickly. Of course, searching using this technique relies heavily on the ability to maintain an accurate index, so search engines have (at least) 2 main components: a component which is always indexing the web (or other searchable body) in the background, and a component which consults that index in response to search queries from users.
Another difficult problem for search engines is returning results that are actually relevant to the user's query. For example, if we search for "space pigs walking tall", we may be searching for the exact phrase, or we may be searching for documents that contain some subset of these terms. In fact, there may be no single document that contains all of them, so deciding whether documents about "space walking" are more relevant than documents about "tall pigs" is a difficult challenge. Perhaps we are interested in documents that contain "pig" and "walk". Consequently, search engines use elaborate heuristics (for example, Google's PageRank algorithm) to attempt to maximize the relevance of search results. These heuristics can directly influence the design of the search index.
In this lab you will develop 5 Map/Reduce algorithms that address the problems of indexing, relevance, searching, and summarizing results. The general structure of our search engine is shown below.
The input files represent the body of files that our search engine searches. The search engine works by creating two different kinds of indexes. Note that these indexes only need to be created once (although you will likely need to create them many times during debugging!). After the indexes are built the search query can operate directly on them.
The summary index has the following format:

term file-name_0|offset_0| summary_line_0^ file-name_1|offset_1| summary_line_1^ ... file-name_n|offset_n| summary_line_n^

A term is a searchable word such as "operating", and it is followed by a list of files, offsets, and copies of the line on which the term occurred. In this example, taken from our implementation,

volume hadoop-default.xml|16384| Reserved space in bytes per volume. Always leave this much space
waits hadoop-default.xml|24576| The time, in milliseconds, the tasktracker waits for sending a^ hadoop-default.xml|24576| The interval, in milliseconds, for which the tasktracker waits

the term "volume" occurs just once in the file "hadoop-default.xml" at byte offset 16384, while the term "waits" occurs in two places in that same file. The summary index is used by the Presenter component of the search engine when writing the final search results, to provide some summary context around the search terms found: this is a simplified version of how Internet search engines summarize the content of files when presenting search results.
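To make the record layout concrete, here is a minimal sketch of how a reducer could assemble one summary-index record, assuming a mapper has already emitted key = term and value = "file-name|offset| summary_line" strings. It is an illustration written against the classic org.apache.hadoop.mapred API, not the skeleton's actual code; adapt the class and type choices to whatever the skeleton uses.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: for each term, concatenate the "file|offset| line" entries
// produced by the mapper, terminating each entry with '^' as in the format above.
public class SummaryIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text term, Iterator<Text> entries,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder record = new StringBuilder();
        while (entries.hasNext()) {
            record.append(entries.next().toString()).append("^ ");
        }
        // Final record: term <TAB> file_0|offset_0| line_0^ file_1|offset_1| line_1^ ...
        output.collect(term, new Text(record.toString().trim()));
    }
}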
The N-gram index has the following format:

term1 term2 ... termN|file-name count

A group of N terms, separated by spaces, is followed by a list of files in which that N-gram occurs, along with a count of the number of times the N-gram occurred in that file. In the example below,

waits |hadoop-default.xml 2
to sleep between|hadoop-default.xml 1
to sleep between|hadoop-env.sh 1

the 1-gram "waits" occurs twice in the file "hadoop-default.xml" (which we know to be true from the summary index above as well), and the 3-gram "to sleep between" occurs once in "hadoop-default.xml" and once in "hadoop-env.sh". NOTE: To simplify the problem, we only count N-grams that do not cross line breaks in the input file. The N-gram index is used by the search component of the search engine to generate a list of candidate search matches, along with a relevance score.
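As a concrete illustration of the definition above (and of the rule that N-grams never cross line breaks), here is a small stand-alone helper that lists the N-grams of a single line. It is not part of the provided skeleton; in the real indexer each word should first be normalized with Utils.termify(), described later in this handout.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the skeleton): list the N-grams of one line.
// Because we only count N-grams that do not cross line breaks, it is enough to
// slide a window of N words across each line independently.
public class NGramExample {

    // Returns the N-grams of a line as "w1 w2 ... wN" strings.
    static List<String> ngramsOfLine(String line, int n) {
        String[] words = line.trim().split("\\s+");
        List<String> ngrams = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder gram = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                gram.append(' ').append(words[i + j]);
            }
            ngrams.add(gram.toString());
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // Prints: [to sleep between, sleep between retries]
        System.out.println(ngramsOfLine("to sleep between retries", 3));
    }
}

In the indexer itself, each such N-gram would then be paired with the file name (for example "to sleep between|hadoop-env.sh") to form the key whose occurrences are counted.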
The provided lab6.jar contains the Map Reduce infrastructure, a skeleton search engine program, and a searchable corpus of files. The skeleton search engine contains a version of the n-gram indexer that works for single words (1-grams) only, which you may find a useful pattern when developing the rest of your map-reduce algorithms.
To extract lab6.jar and create the search corpus, type:
$> jar -xvf lab6.jar
$> gunzip corpus.tar.gz
$> tar -xvf corpus.tar
To build the lab from the command line:
First, make sure you have the HADOOP_HOME environment variable defined. If you are working on a Linux CS machine, you can either edit the .bashrc file in your home directory to contain the line:
export HADOOP_HOME=/lusr/hadoop
Or, at the command line you can type
$> export HADOOP_HOME=/lusr/hadoop
Then building is a simple matter of typing:
$> make
To invoke the search engine, navigate to the hadoop base directory and type the following at the command line:
$> make run
The maximum number of searchable terms is 10 by default, but it can be changed by setting Search.MAX_SEARCH_TERMS.
To use eclipse as your development environment, you need to create a project "from existing source" after you extract the jar file, the same way you did for the previous projects. However, an additional step is required to put the Hadoop libraries on your build path. To do this, you want to select "Properties" from the project menu, and choose the "Java Build Path" item in the left pane of the properties dialog. Next, select the "Add External JARS..." button, and add the following libraries. The paths shown are for CS machines:
If you are working with your own installation of hadoop, replace "/lusr/hadoop" above with whatever directory your $HADOOP_HOME variable maps to. Note that this will only get you to the point where you can build and edit your project in eclipse. You still need to run the project from the command line. To do that, read the next section!
To debug your search engine in eclipse, you need to jump through a few more hoops to make eclipse and hadoop work together. Under your Run menu, choose "Open Debug Dialog", and choose a new "Java Application" when creating a configuration. These steps should be exactly the same as for your previous projects. After this do the following:
Alternatively, you can work using eclipse with the IBM alphaWorks hadoop plug-in, which will set up these things for you, but which may introduce a number of other complications!
Your indexer should build an N-gram index for each N from 1 up to Search.MAX_SEARCH_TERMS, defined in Search.java. There will be a separate directory under your HDFS directory ngram_index for each N; for example, the bigram index is stored in ngram_index/2-grams.
The skeleton implementation we provide is a correct indexer for single words (1-grams). You should expand this implementation so that it can index N-grams. You can also use it as a pattern for understanding how to implement your other mapper/reducer classes. Implement this part first, since you have a sample implementation to start from.
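For orientation only, here is one possible shape for an N-gram mapper, written against the classic org.apache.hadoop.mapred API. Everything specific in it is an assumption rather than the skeleton's code: the class name, the "ngram.size" configuration property, the use of the "map.input.file" job property to recover the input file name, and the reuse of the ngramsOfLine() helper sketched earlier. Start from the provided 1-gram indexer, not from this sketch.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: emit one ("t1 t2 ... tN|file-name", 1) pair per N-gram
// occurrence on a line.
public class NGramIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private int n;              // which N this pass is indexing (assumed job property)
    private String fileName;    // name of the file this split came from

    public void configure(JobConf job) {
        n = job.getInt("ngram.size", 1);
        // The classic API sets "map.input.file" for FileInputFormat jobs; the
        // skeleton may already hand you the file name some other way.
        String path = job.get("map.input.file", "unknown");
        fileName = path.substring(path.lastIndexOf('/') + 1);
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
        // N-grams never cross line breaks, so each line is handled on its own.
        // Remember to pass each word through Utils.termify() before joining.
        List<String> grams = NGramExample.ngramsOfLine(line.toString(), n);
        for (String gram : grams) {
            out.collect(new Text(gram + "|" + fileName), ONE);
        }
    }
}

The matching reducer simply sums the counts for each "N-gram|file-name" key, exactly like the classic word-count reducer.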
The summary index is stored in the HDFS directory summary_index.
The search phases are driven from the Search.run() method, and you are responsible for filling in some of this logic. See the comments in Search.java.
The number of files the winnower selects should not exceed Search.SHOW_RESULTS_MAX_FILES, defined in Search.java. The winnower also sorts the selected output files by relevance score.
The presenter takes each file selected by the winnower (a RelevantFile object) and uses the summary index to generate the final output of the search, which contains a list of relevant files, relevance scores, and line summaries where the search terms or N-grams occur. The number of summaries shown should not exceed Presenter.MAX_RESULTS_PER_FILE.
An example of the search engine output format is shown above.
To decide what counts as a legal search term, we provide the helper method Utils.termify. If you call this method on your search terms and on the terms from your searchable text, you should not have to worry about the details of what a legal search term is. In general, the termify method removes punctuation, converts everything to lower case, and requires that a searchable term either start with a letter (a-z) or be a number. We encourage those who are interested in the nitty-gritty details to look at the Utils.termify() method, since it is the de facto standard for what a legal search term is in this project.
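To give a feel for what this normalization does, here is a rough, hypothetical approximation written only from the description above. It is not the provided Utils.termify() and may disagree with it on corner cases, so always call the real method in your code.

// Hypothetical approximation of the normalization described above; the real
// rules are whatever Utils.termify() implements.
public class TermifyApprox {

    // Lower-case the token, strip punctuation, and accept the result only if it
    // starts with a letter (a-z) or is entirely a number; otherwise return null.
    static String termify(String raw) {
        String t = raw.toLowerCase().replaceAll("[^a-z0-9]", "");
        if (t.length() == 0) {
            return null;
        }
        if (Character.isLetter(t.charAt(0)) || t.matches("[0-9]+")) {
            return t;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(termify("Operating,"));   // prints "operating"
        System.out.println(termify("1024"));         // prints "1024"
        System.out.println(termify("$%#"));          // prints "null"
    }
}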
Debugging with System.out.println is complicated by the fact that such output may not be available at the command line where you invoked the program. When hadoop runs in standalone debug mode, however, System.out.println statements go to the console where you invoked the command. The hadoop server on the CS machines starts in this mode by default when you run your code, so System.out.println will work. You can also use Utils.DEBUG() and Utils.info().
Also, note that because the search works directly on the indexes produced by your indexers, it checks to see whether they are already present in HDFS and, if so, skips the step of building them (that is, after all, the point of an index). However, as you debug you may find that you need to rebuild an existing index. You can force this either by setting the rebuild_index property in system.properties to true, or by using the Cleaner class described in the next section to delete the index.
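For example, a system.properties containing the line below would force the indexes to be rebuilt on the next run (the property name comes from the paragraph above; the rest of the file is whatever the skeleton ships with):

# force the search engine to rebuild existing indexes
rebuild_index=true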
Because the indexes and intermediate files live in HDFS, you cannot simply list them with ls at the command line; instead, you need to go through hadoop by typing

bin/hadoop fs -ls

at the command line. Similarly, to view the file containing your search results, instead of using cat search_results you need to use

bin/hadoop fs -cat search_results/part-00000

This takes some getting used to. In the final version of the program all input and output goes to the console, but while debugging you may find it helpful to be able to view the files of intermediate results and the indexes.
However, when you are working on a CS machine with your hadoop server running in standalone debug mode (the default), you can look directly at your output without going through HDFS. Your intermediate and temporary files will be in /tmp/cs378-username/ and your search results will be in search_results/part-00000 under your working directory.
To remove files from HDFS, we provide Cleaner.java. This class can be used to remove temporary files and indexes.
To remove all temporary and results files used by the search engine, type:

$HADOOP_HOME/bin/hadoop jar search.jar Cleaner

To remove all temporary, results, and index files, type:

$HADOOP_HOME/bin/hadoop jar search.jar Cleaner -a

You can also just use the makefile to do this by typing 'make clean_index'.
We have provided sample correct output for the default Makefile (make run) settings in the files macbeth*.txt, which will be extracted when you unpack the jar file. The macbeth_winnow_results.txt and macbeth_search_results.txt files do not need to match your output exactly; they are provided to help you get a sense of what the output of the intermediate search phases looks like.
The following diff should yield no differences:
make run > myresults.txt
diff macbeth_presentation_results myresults.txt
As with the previous labs, we will be testing your solution on other inputs, so you will need to spend some time devising other test cases and verifying that your solution works for them!
If you pair program, follow these additional guidelines.
In your README file, include the following:
Name
Slip days used (this project): _______ Slip days used (total): ______
On my honor, name, this programming assignment is my own work.
...if applicable...
Pair Name
Slip days used (this project): _______ Slip days used (total): ______
On our honor, name and name, this programming assignment is our own
work. We spent 80% of our time on this project together, we split the
keyboard time evenly, and we both participated equally in the solution
design.
........
Total hours spent on this project:
To turn in the project, create a tarball of your work and submit it using Canvas.
Useful Links: