java.lang.Object
  |
  +--ir.vsr.InvertedIndex
        |
        +--ir.vsr.InvertedPosIndex
An inverted index for vector-space information retrieval. Contains methods for creating an inverted index from a set of documents and retrieving ranked matches to queries using standard TF/IDF weighting and cosine similarity, while also accounting for the "proximity" of the query terms in a document. Documents are ranked by the ratio of the normal vector similarity to a distance measure that captures how far apart the query terms appear in the document. The distance (proximity) measure is the average, across all pairs of terms in the query, of the average distance in the document between one word in the pair and the closest occurrence of the other word in the pair. Pairs appearing in the wrong order relative to the query incur an additional distance penalty.
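The ranking described above can be illustrated with a minimal sketch. It is not the class's actual implementation: the class name `ProximitySketch` and the helpers `proximity` and `score` are hypothetical, only `averageClosestDistance` borrows its name and parameter names from this class, and the sketch assumes every query term occurs in the document. The wrong-order penalty and any distance cap are left out for brevity.

```java
class ProximitySketch {

    /**
     * Average, over the occurrences of the current query token, of the
     * distance to the closest occurrence of a previous query token.
     * Assumes both arrays are non-empty.
     */
    static double averageClosestDistance(int[] previousPositions, int[] positions) {
        double total = 0.0;
        for (int pos : positions) {
            int closest = Integer.MAX_VALUE;
            for (int prev : previousPositions) {
                closest = Math.min(closest, Math.abs(pos - prev));
            }
            total += closest;
        }
        return total / positions.length;
    }

    /**
     * Average of the pairwise distances over all pairs of query terms.
     * positionsByQueryTerm[i] holds the document positions of the i-th
     * query term; at least two terms are assumed.
     */
    static double proximity(int[][] positionsByQueryTerm) {
        double total = 0.0;
        int pairs = 0;
        for (int j = 1; j < positionsByQueryTerm.length; j++) {
            for (int i = 0; i < j; i++) {
                total += averageClosestDistance(positionsByQueryTerm[i], positionsByQueryTerm[j]);
                pairs++;
            }
        }
        return total / pairs;
    }

    /** Final rank score: vector-space similarity divided by the distance measure. */
    static double score(double cosineSimilarity, double proximity) {
        return cosineSimilarity / proximity;
    }

    public static void main(String[] args) {
        // Positions of three query terms in one made-up document.
        int[][] positions = { {3, 40}, {4, 10}, {6} };
        double distance = proximity(positions);
        System.out.println("proximity = " + distance);
        System.out.println("score = " + score(0.75, distance));
    }
}
```

In this sketch a document whose query terms sit close together gets a small distance and therefore a larger score for the same cosine similarity, which matches the ratio-based ranking described above.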
Field Summary
static double MAX_DISTANCE
The maximum measurable distance that can separate two query words.
static double WRONG_ORDER_PENALTY_FACTOR
The multiplicative penalty factor for distance that is incurred when query terms are in the opposite order in the document.
Fields inherited from class ir.vsr.InvertedIndex
dirFile, docRefs, docType, feedback, MAX_RETRIEVALS, stem, tokenHash
Constructor Summary
InvertedPosIndex(java.io.File dirFile, short docType, boolean stem)
Method Summary
protected static double averageClosestDistance(int[] previousPositions, int[] positions)
Returns the average closest positional distance between an occurrence of the current query token and an occurrence of a specified previous query token.
protected static void finalizeProximityScore(RetrievalPosInfo info, int queryLength)
Finalize the proximity score for a RetrievalPosInfo by averaging the average closest distance score over all possible pairs of query tokens.
double incorporateToken(java.lang.String token, int count, java.util.HashMap retrievalHash)
Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running scores.
protected void indexDocuments()
Index the documents in dirFile.
protected void indexToken(java.lang.String token, java.util.ArrayList positions, DocumentReference docRef)
Add a token occurrence to the index.
static void main(java.lang.String[] args)
Index a directory of files and then interactively accept retrieval queries.
protected void printExtraTokenOccurrenceInfo(TokenOccurrence occ)
A TokenOccurrence in an InvertedPosIndex should be a TokenPosOccurrence, so print the positional information for this occurrence.
void processQueries()
Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
Retrieval[] retrieve(Document doc)
Perform ranked retrieval on this input query Document.
Retrieval[] retrieve(TokenPositionInfo[] vector)
Perform ranked retrieval on this input query Document vector.
protected static void updateProximityScore(RetrievalPosInfo info, TokenPosOccurrence occ, java.lang.String token)
Update the proximity score of the RetrievalPosInfo of this retrieved document based on how close the current query token appears to each of the previous query tokens found in this document.
Methods inherited from class ir.vsr.InvertedIndex
computeIDFandDocumentLengths, incorporateToken, indexDocuments, indexToken, presentRetrievals, print, printRetrievals, retrieve, retrieve, showRetrievals, size
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final double MAX_DISTANCE
public static final double WRONG_ORDER_PENALTY_FACTOR
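To illustrate how these two constants might interact, here is a minimal sketch. The numeric values, the class name `DistancePenaltySketch`, and the helper `penalizedDistance` are all hypothetical: this page does not state the real constant values, nor whether the cap is applied before or after the penalty. The sketch assumes the penalty is applied first and the result is then capped.

```java
class DistancePenaltySketch {
    // Hypothetical values; the documentation does not give the real ones.
    static final double MAX_DISTANCE = 1000.0;
    static final double WRONG_ORDER_PENALTY_FACTOR = 2.0;

    /**
     * Distance between an occurrence of the earlier query term (earlierPos)
     * and an occurrence of the later query term (laterPos) in the document.
     * If the later query term appears first in the document, the distance
     * is multiplied by the penalty factor. The result is capped at
     * MAX_DISTANCE (assumed order of operations).
     */
    static double penalizedDistance(int earlierPos, int laterPos) {
        double distance = Math.abs(laterPos - earlierPos);
        if (laterPos < earlierPos) {
            distance *= WRONG_ORDER_PENALTY_FACTOR;
        }
        return Math.min(distance, MAX_DISTANCE);
    }

    public static void main(String[] args) {
        System.out.println(penalizedDistance(5, 8));    // same order as the query: 3.0
        System.out.println(penalizedDistance(8, 5));    // reversed order: 6.0
        System.out.println(penalizedDistance(0, 5000)); // capped: 1000.0
    }
}
```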
Constructor Detail
public InvertedPosIndex(java.io.File dirFile, short docType, boolean stem)
Method Detail
protected void indexDocuments()
Overrides: indexDocuments in class InvertedIndex
protected void indexToken(java.lang.String token, java.util.ArrayList positions, DocumentReference docRef)
Parameters:
token - The token to index.
positions - An ArrayList of Integer positions of the token in the doc.
docRef - A reference to the Document it occurs in.
protected void printExtraTokenOccurrenceInfo(TokenOccurrence occ)
public Retrieval[] retrieve(Document doc)
Overrides: retrieve in class InvertedIndex
public Retrieval[] retrieve(TokenPositionInfo[] vector)
public double incorporateToken(java.lang.String token, int count, java.util.HashMap retrievalHash)
Parameters:
token - The token in the query to incorporate.
count - The count of this token in the query.
retrievalHash - The hashtable of retrieved DocumentReferences and related RetrievalPosInfo.
protected static void updateProximityScore(RetrievalPosInfo info, TokenPosOccurrence occ, java.lang.String token)
protected static void finalizeProximityScore(RetrievalPosInfo info, int queryLength)
protected static double averageClosestDistance(int[] previousPositions, int[] positions)
Parameters:
previousPositions - The locations of the previous query token in the text.
positions - The locations of the current query token in the text.
public void processQueries()
Overrides: processQueries in class InvertedIndex
public static void main(java.lang.String[] args)