/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.
You can use the corpus of UTCS department faculty webpages in /u/mooney/ir-code/corpora/cs-faculty/ as a set of test documents. This corpus contains 800 pages spidered from the department web site. See the sample trace of using the system on this corpus.
For example, using the corpus of UTCS department faculty webpages, for the query "student paper" none of the top-10 results contain the phrase, even though every document has multiple occurrences of the word "student" or "paper". Similarly, for the queries "undergraduate research", "Microsoft research" and "IEEE conference", the top-ranked page and most of the top-10 pages contain only the two separate words a number of times, but not the actual relevant phrase.
In some situations, cosine similarity can prefer documents that contain a high density of some of the query words at the expense of completely ignoring other query words. In addition, cosine similarity never considers multi-word phrases or the proximity or ordering of words.
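The density effect described above can be seen in a small, self-contained sketch. It uses raw term counts as vector weights, which is a simplifying assumption and not necessarily the weighting InvertedIndex applies; the class name CosineDemo and the example vectors are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo class; not part of the ir.vsr package. */
final class CosineDemo {

    /** Cosine similarity between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double denominator = norm(a) * norm(b);
        // Guard against empty vectors so the result is never NaN.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double w : v.values()) {
            sumSquares += w * w;
        }
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("student", 1.0);
        query.put("paper", 1.0);

        // Mentions only "student", but five times.
        Map<String, Double> dense = new HashMap<>();
        dense.put("student", 5.0);

        // Mentions both query words once, plus another word three times.
        Map<String, Double> both = new HashMap<>();
        both.put("student", 1.0);
        both.put("paper", 1.0);
        both.put("seminar", 3.0);

        System.out.println("dense: " + cosine(query, dense));
        System.out.println("both:  " + cosine(query, both));
    }
}
```

Under these assumed counts, the document that never mentions "paper" scores about 0.71, while the document containing both query words scores about 0.43, so the bag-of-words ranking prefers the former.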
Appropriate retrieval for many such queries can be aided by noticing that certain phrases such as "student paper", "undergraduate research", "Microsoft research" and "IEEE conference" are important as multi-word phrases and are not well represented by a bag of words.
A simple statistical approach to discovering useful phrases is to look for frequently occurring sequences of words. In a first pass through the corpus, your system should find all two-word phrases in the corpus (so-called "bigrams") and determine the frequency of each bigram across the entire corpus. Treat a bigram as two indexed tokens produced in sequence by the current Document token generator; bigrams therefore do not include stop words. After finding all bigrams, your program should determine the set of most frequent bigrams and store them as known phrases. Your system should have a parameter, called maxPhrases, that determines the maximum number of phrases to be remembered (which should default to a value of 1,000). You may find the Java sorting methods Arrays.sort or Collections.sort useful.
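One way the first pass might look is sketched below. This is only an illustration under stated assumptions: the class BigramCounter and its methods are hypothetical and not part of the provided code, each document is assumed to arrive as a list of tokens already produced by the Document token generator (so stop words are gone), and ties in frequency are broken alphabetically, which the assignment does not specify.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the ir.vsr package. */
final class BigramCounter {
    // Bigram ("first second") -> number of occurrences across the whole corpus.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Counts every adjacent token pair in one document's token sequence. */
    void addDocument(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String bigram = tokens.get(i) + " " + tokens.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
    }

    /** Returns at most maxPhrases bigrams, most frequent first. */
    Map<String, Integer> topPhrases(int maxPhrases) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Assumed tie-break: alphabetical order of the phrase.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> top = new LinkedHashMap<>();
        int limit = Math.min(maxPhrases, entries.size());
        for (Map.Entry<String, Integer> e : entries.subList(0, limit)) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }
}
```

Returning a LinkedHashMap keeps the phrases in frequency order, which makes it easy to print them with their counts after the first pass.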
Then, when producing the vector representations of documents and queries, your system should notice instances of the known phrases (two tokens generated in order) and create a single token for the entire phrase, but not tokens for the individual words. For example, the query "undergraduate research" should result in a vector containing a single phrasal token "undergraduate research" that does not include the individual tokens "undergraduate" and "research".
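The token-merging step can be sketched as follows. The class PhraseMerger and its method are hypothetical names used for illustration only. The sketch assumes a greedy left-to-right scan, so when two known phrases overlap the earlier one wins; the assignment does not say how overlaps should be handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not part of the ir.vsr package. */
final class PhraseMerger {

    /**
     * Replaces each adjacent token pair that forms a known phrase with a
     * single phrasal token; all other tokens are passed through unchanged.
     */
    static List<String> mergePhrases(List<String> tokens, Set<String> knownPhrases) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String bigram = tokens.get(i) + " " + tokens.get(i + 1);
                if (knownPhrases.contains(bigram)) {
                    merged.add(bigram);
                    i += 2; // both words are consumed by the phrase token
                    continue;
                }
            }
            merged.add(tokens.get(i));
            i++;
        }
        return merged;
    }
}
```

With "undergraduate research" as the only known phrase, the token sequence "top", "undergraduate", "research", "award" would become "top", "undergraduate research", "award", so neither individual word reaches the vector.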
Here is a sample solution trace produced by my solution to this problem. After the first pass through the corpus, the system prints out the 1,000 most common phrases with their frequencies. You can verify that all of the retrieved documents now contain the complete two-word query phrases. Replicating the minute details of this trace is not important, but the trace for your system should be similar and should only retrieve documents that contain these complete common phrases. Your solution should obviously be a general-purpose phrase indexer and not just a hack that works with these specific queries.
Implement your new version as a specialized class of InvertedIndex called InvertedPhraseIndex that accepts the same command-line options as InvertedIndex. You may also need to add methods to other classes. In particular, my solution added methods to at least Document and HashMapVector.
In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Note especially what the prefix for a file name is and which command is used to generate the zip file so that it maintains the required directory structure.
Along with that, follow these specific instructions for Project 1. The following files should be submitted.
The grading breakdown for this assignment is: