ir.vsr
Class InvertedPosIndex

java.lang.Object
  |
  +--ir.vsr.InvertedIndex
        |
        +--ir.vsr.InvertedPosIndex

public class InvertedPosIndex
extends InvertedIndex

An inverted index for vector-space information retrieval. It provides methods for building an inverted index from a set of documents and for retrieving ranked matches to queries using standard TF/IDF weighting and cosine similarity, while also accounting for the proximity of the query terms within a document. Documents are ranked by the ratio of the ordinary vector similarity to a distance measure that captures how far apart the query terms appear in the document. The distance (proximity) measure is the average, over all pairs of query terms, of the average distance in the document between one word of the pair and the closest occurrence of the other word. Pairs that appear in the wrong order relative to the query incur an additional distance penalty.
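The ranking rule described above can be illustrated with a minimal sketch. It assumes the final score is simply the cosine similarity divided by the proximity measure. The class name ProximityRankingSketch and its method are hypothetical and are not part of ir.vsr.

```java
// Hypothetical illustration only; not the ir.vsr implementation.
final class ProximityRankingSketch {

    /**
     * Combines a cosine similarity with a proximity (distance) measure,
     * assuming the score is their plain ratio. A larger distance between
     * query terms therefore lowers the score.
     */
    static double proximityScore(double cosineSimilarity, double proximity) {
        if (proximity <= 0.0) {
            // Assumption: a usable proximity measure is strictly positive.
            throw new IllegalArgumentException("proximity must be positive");
        }
        return cosineSimilarity / proximity;
    }

    public static void main(String[] args) {
        // Same cosine similarity, different term spread in the document.
        System.out.println(proximityScore(0.5, 1.0));
        System.out.println(proximityScore(0.5, 4.0));
    }
}
```

Under this assumption, two documents with equal cosine similarity are ordered by how tightly the query terms cluster, with the tighter document ranked first.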


Field Summary
static double MAX_DISTANCE
          The maximum measurable distance that can separate two query words.
static double WRONG_ORDER_PENALTY_FACTOR
          The multiplicative penalty factor applied to the distance when query terms appear in the opposite order in the document.
 
Fields inherited from class ir.vsr.InvertedIndex
dirFile, docRefs, docType, feedback, MAX_RETRIEVALS, stem, tokenHash
 
Constructor Summary
InvertedPosIndex(java.io.File dirFile, short docType, boolean stem)
           
 
Method Summary
protected static double averageClosestDistance(int[] previousPositions, int[] positions)
          Returns the average closest positional distance between an occurrence of the current query token and an occurrence of a specified previous query token.
protected static void finalizeProximityScore(RetrievalPosInfo info, int queryLength)
          Finalize the proximity score for a RetrievalPosInfo by averaging, over all possible pairs of query tokens, the average closest distance score for each pair.
 double incorporateToken(java.lang.String token, int count, java.util.HashMap retrievalHash)
          Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running scores.
protected  void indexDocuments()
          Index the documents in dirFile.
protected  void indexToken(java.lang.String token, java.util.ArrayList positions, DocumentReference docRef)
          Add a token occurrence to the index.
static void main(java.lang.String[] args)
          Index a directory of files and then interactively accept retrieval queries.
protected  void printExtraTokenOccurrenceInfo(TokenOccurrence occ)
          A TokenOccurrence in an InvertedPosIndex should be a TokenPosOccurrence, so print the positional information for this occurrence.
 void processQueries()
          Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
 Retrieval[] retrieve(Document doc)
          Perform ranked retrieval on this input query Document.
 Retrieval[] retrieve(TokenPositionInfo[] vector)
          Perform ranked retrieval on this input query Document vector.
protected static void updateProximityScore(RetrievalPosInfo info, TokenPosOccurrence occ, java.lang.String token)
          Update the proximity score of the RetrievalPosInfo of this retrieved document based on how close the current query token appears to each of the previous query tokens found in this document.
 
Methods inherited from class ir.vsr.InvertedIndex
computeIDFandDocumentLengths, incorporateToken, indexDocuments, indexToken, presentRetrievals, print, printRetrievals, retrieve, retrieve, showRetrievals, size
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_DISTANCE

public static final double MAX_DISTANCE
The maximum measurable distance that can separate two query words.

WRONG_ORDER_PENALTY_FACTOR

public static final double WRONG_ORDER_PENALTY_FACTOR
The multiplicative penalty factor applied to the distance when query terms appear in the opposite order in the document.
Constructor Detail

InvertedPosIndex

public InvertedPosIndex(java.io.File dirFile,
                        short docType,
                        boolean stem)
Method Detail

indexDocuments

protected void indexDocuments()
Index the documents in dirFile.
Overrides:
indexDocuments in class InvertedIndex

indexToken

protected void indexToken(java.lang.String token,
                          java.util.ArrayList positions,
                          DocumentReference docRef)
Add a token occurrence to the index.
Parameters:
token - The token to index.
positions - An ArrayList of Integer positions of the token in the doc
docRef - A reference to the Document it occurs in.

printExtraTokenOccurrenceInfo

protected void printExtraTokenOccurrenceInfo(TokenOccurrence occ)
A TokenOccurrence in an InvertedPosIndex should be a TokenPosOccurrence, so print the positional information for this occurrence.

retrieve

public Retrieval[] retrieve(Document doc)
Perform ranked retrieval on this input query Document.
Overrides:
retrieve in class InvertedIndex

retrieve

public Retrieval[] retrieve(TokenPositionInfo[] vector)
Perform ranked retrieval on this input query Document vector.

incorporateToken

public double incorporateToken(java.lang.String token,
                               int count,
                               java.util.HashMap retrievalHash)
Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running scores.
Parameters:
token - The token in the query to incorporate.
count - The count of this token in the query.
retrievalHash - The hashtable of retrieved DocumentReferences and related RetrievalPosInfo
Returns:
The square of the weight of this token in the query vector for use in calculating the length of the query vector.
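The return value feeds the query vector length used by cosine similarity. A minimal sketch of that arithmetic follows, assuming a token's query weight is its count multiplied by its IDF. The class QueryLengthSketch, its method names, and the IDF values are hypothetical.

```java
// Hypothetical illustration only; not the ir.vsr implementation.
final class QueryLengthSketch {

    /** Squared TF/IDF weight of one query token, assuming weight = count * idf. */
    static double squaredQueryWeight(int count, double idf) {
        double weight = count * idf;
        return weight * weight;
    }

    /** Euclidean length of the query vector: square root of the summed squared weights. */
    static double queryLength(int[] counts, double[] idfs) {
        if (counts.length != idfs.length) {
            throw new IllegalArgumentException("counts and idfs must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < counts.length; i++) {
            sumOfSquares += squaredQueryWeight(counts[i], idfs[i]);
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Two query tokens with made-up IDF values: weights 3.0 and 4.0.
        System.out.println(queryLength(new int[] {1, 2}, new double[] {3.0, 2.0}));
    }
}
```

Returning the squared weight lets a caller accumulate the sum across tokens and take a single square root at the end.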

updateProximityScore

protected static void updateProximityScore(RetrievalPosInfo info,
                                           TokenPosOccurrence occ,
                                           java.lang.String token)
Update the proximity score of the RetrievalPosInfo of this retrieved document based on how close the current query token appears to each of the previous query tokens found in this document.

finalizeProximityScore

protected static void finalizeProximityScore(RetrievalPosInfo info,
                                             int queryLength)
Finalize the proximity score for a RetrievalPosInfo by averaging, over all possible pairs of query tokens, the average closest distance score for each pair.
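One way to picture this averaging step is the sketch below. It rests on several assumptions that this documentation does not confirm: queryLength counts distinct query tokens, a query of n tokens has n(n-1)/2 pairs, any pair with a token missing from the document contributes MAX_DISTANCE, and a query with no pairs gets a neutral score of 1.0. The class name, the method signature, and the value 100.0 are placeholders.

```java
// Hypothetical illustration only; not the ir.vsr implementation.
final class FinalizeProximitySketch {

    // Placeholder value; the real MAX_DISTANCE is not given in this documentation.
    static final double MAX_DISTANCE = 100.0;

    /**
     * Averages pair distances over every possible pair of query tokens.
     *
     * @param sumOfPairDistances sum of average closest distances for pairs found in the document
     * @param pairsFound number of query-token pairs for which both tokens occur in the document
     * @param queryLength number of distinct tokens in the query (assumed meaning)
     */
    static double finalizeProximity(double sumOfPairDistances, int pairsFound, int queryLength) {
        int totalPairs = queryLength * (queryLength - 1) / 2;
        if (totalPairs == 0) {
            // Assumption: with fewer than two tokens there is nothing to measure.
            return 1.0;
        }
        // Assumption: pairs not found in the document count as maximally distant.
        int missingPairs = totalPairs - pairsFound;
        return (sumOfPairDistances + missingPairs * MAX_DISTANCE) / totalPairs;
    }

    public static void main(String[] args) {
        // Three query tokens, all three pairs found, distances summing to 6.0.
        System.out.println(finalizeProximity(6.0, 3, 3));
    }
}
```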

averageClosestDistance

protected static double averageClosestDistance(int[] previousPositions,
                                               int[] positions)
Returns the average closest positional distance between an occurrence of the current query token and an occurrence of a specified previous query token. If the tokens appear in the wrong order in the text (i.e. the previous token appears after the current token), the distance is multiplied by WRONG_ORDER_PENALTY_FACTOR. The "closest" occurrence is chosen after this penalty has been applied, and that penalized distance is what counts toward the average closest distance.
Parameters:
previousPositions - The locations of the previous query token in the text.
positions - The locations of the current query token in the text.
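The description above can be sketched as follows. This is a minimal illustration, assuming the average is taken over the occurrences of the current token and that the penalty value is 2.0. Neither detail is stated in this documentation, and the class ClosestDistanceSketch is hypothetical.

```java
// Hypothetical illustration only; not the ir.vsr implementation.
final class ClosestDistanceSketch {

    // Placeholder value; the real WRONG_ORDER_PENALTY_FACTOR is not given in this documentation.
    static final double WRONG_ORDER_PENALTY_FACTOR = 2.0;

    /**
     * For each occurrence of the current token, finds the smallest penalized
     * distance to any occurrence of the previous token, then averages those
     * minima over the occurrences of the current token.
     */
    static double averageClosestDistance(int[] previousPositions, int[] positions) {
        if (previousPositions.length == 0 || positions.length == 0) {
            throw new IllegalArgumentException("both tokens must occur at least once");
        }
        double total = 0.0;
        for (int pos : positions) {
            double closest = Double.POSITIVE_INFINITY;
            for (int prev : previousPositions) {
                // Correct order: the previous query token precedes the current one.
                double distance = prev < pos
                        ? pos - prev
                        : (prev - pos) * WRONG_ORDER_PENALTY_FACTOR;
                closest = Math.min(closest, distance);
            }
            total += closest;
        }
        return total / positions.length;
    }

    public static void main(String[] args) {
        // Previous token at positions 2 and 10, current token at positions 3 and 9.
        System.out.println(averageClosestDistance(new int[] {2, 10}, new int[] {3, 9}));
    }
}
```

In the example, position 3 is closest to position 2 (distance 1), while position 9 is closest to position 10 in the wrong order (distance 1 doubled to 2), so the average is 1.5 under the assumed penalty.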

processQueries

public void processQueries()
Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
Overrides:
processQueries in class InvertedIndex

main

public static void main(java.lang.String[] args)
Index a directory of files and then interactively accept retrieval queries. Command format: "InvertedIndex [OPTION] [DIR]", where DIR is the name of the directory whose files should be indexed, and the optional OPTION can be "-html" to specify HTML files whose HTML tags should be removed and "-stem" to specify Porter stemming of tokens.