InvertedIndex

java.lang.Object
- ir.vsr.InvertedIndex

```
public class InvertedIndex
extends java.lang.Object
```
An inverted index for vector-space information retrieval. Contains methods for creating an inverted index from a set of documents and retrieving ranked matches to queries using standard TF/IDF weighting and cosine similarity.

Field Summary

Fields
Modifier and Type	Field and Description
`java.io.File`	`dirFile` The directory from which the indexed documents come.
`java.util.List<DocumentReference>`	`docRefs` A list of all indexed documents.
`short`	`docType` The type of Documents (text, HTML).
`boolean`	`feedback` Whether relevance feedback using the Ide_regular algorithm is used.
`static int`	`MAX_RETRIEVALS` The maximum number of retrieved documents for a query to present to the user at a time.
`boolean`	`stem` Whether tokens should be stemmed with the Porter stemmer.
`java.util.Map<java.lang.String,TokenInfo>`	`tokenHash` A HashMap where tokens are indexed.

Constructor Summary

Constructors
Constructor and Description
`InvertedIndex(java.io.File dirFile, short docType, boolean stem, boolean feedback)` Create an inverted index of the documents in a directory.
`InvertedIndex(java.util.List<Example> examples)` Create an inverted index from a List of Example objects, each representing a document for text categorization.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`clear()` Clear all documents from the inverted index.
`protected void`	`computeIDFandDocumentLengths()` Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
`protected Retrieval`	`getRetrieval(double queryLength, DocumentReference docRef, double score)` Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.
`double`	`incorporateToken(java.lang.String token, double count, java.util.Map<DocumentReference,DoubleValue> retrievalHash)` Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.
`protected void`	`indexDocument(FileDocument doc, HashMapVector vector)` Index the given document using its corresponding vector.
`protected void`	`indexDocuments()` Index the documents in dirFile.
`void`	`indexDocuments(java.util.List<Example> examples)` Index the documents in the List of Examples for text categorization.
`protected void`	`indexToken(java.lang.String token, int count, DocumentReference docRef)` Add a token occurrence to the index.
`static void`	`main(java.lang.String[] args)` Index a directory of files and then interactively accept retrieval queries.
`void`	`presentRetrievals(HashMapVector queryVector, Retrieval[] retrievals)` Print out a ranked set of retrievals.
`void`	`print()` Print out an inverted index by listing each token and the documents it occurs in.
`void`	`printRetrievals(Retrieval[] retrievals, int start)` Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number.
`void`	`processQueries()` Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
`Retrieval[]`	`retrieve(Document doc)` Perform ranked retrieval on this input query Document.
`Retrieval[]`	`retrieve(HashMapVector vector)` Perform ranked retrieval on this input query Document vector.
`Retrieval[]`	`retrieve(java.lang.String input)` Perform ranked retrieval on this input query.
`boolean`	`showRetrievals(Retrieval[] retrievals)` Show the top retrievals to the user if there are any.
`int`	`size()` Return the number of tokens indexed.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - MAX_RETRIEVALS
```
public static final int MAX_RETRIEVALS
```
    The maximum number of retrieved documents for a query to present to the user at a time.
    
    See Also:
    Constant Field Values
  - tokenHash
```
public java.util.Map<java.lang.String,TokenInfo> tokenHash
```
    A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.
  - docRefs
```
public java.util.List<DocumentReference> docRefs
```
    A list of all indexed documents. Elements are DocumentReference objects.
  - dirFile
```
public java.io.File dirFile
```
    The directory from which the indexed documents come.
  - docType
```
public short docType
```
    The type of Documents (text, HTML). See docType in DocumentIterator.
  - stem
```
public boolean stem
```
    Whether tokens should be stemmed with the Porter stemmer.
  - feedback
```
public boolean feedback
```
    Whether relevance feedback using the Ide_regular algorithm is used.
- Constructor Detail
  - InvertedIndex
```
public InvertedIndex(java.io.File dirFile,
             short docType,
             boolean stem,
             boolean feedback)
```
    Create an inverted index of the documents in a directory.
    
    Parameters:
    dirFile - The directory of files to index.
    docType - The type of documents to index (See docType in DocumentIterator)
    stem - Whether tokens should be stemmed with the Porter stemmer.
    feedback - Whether relevance feedback should be used.
  - InvertedIndex
```
public InvertedIndex(java.util.List<Example> examples)
```
    Create an inverted index from a List of Example objects, each representing a document for text categorization.
    
    Parameters:
    examples - A List containing the Example objects for text categorization to index
- Method Detail
  - indexDocuments
```
protected void indexDocuments()
```
    Index the documents in dirFile.
  - indexDocuments
```
public void indexDocuments(java.util.List<Example> examples)
```
    Index the documents in the List of Examples for text categorization.
  - indexDocument
```
protected void indexDocument(FileDocument doc,
                 HashMapVector vector)
```
    Index the given document using its corresponding vector.
  - indexToken
```
protected void indexToken(java.lang.String token,
              int count,
              DocumentReference docRef)
```
    Add a token occurrence to the index.
    
    Parameters:
    token - The token to index.
    count - The number of times it occurs in the document.
    docRef - A reference to the Document it occurs in.
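    As a rough illustration of what adding a token occurrence involves, the sketch below keeps a map from each token to a list of occurrence records. The names `IndexTokenSketch` and `Occurrence` are hypothetical stand-ins (a plain document name replaces DocumentReference, and a bare list replaces TokenInfo); this is not the class's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexTokenSketch {
    /** Hypothetical stand-in for one occurrence record: a document name and a count. */
    static class Occurrence {
        final String docName;
        final int count;

        Occurrence(String docName, int count) {
            this.docName = docName;
            this.count = count;
        }
    }

    /** Append an occurrence to the token's list, creating the list the first time the token is seen. */
    static void indexToken(Map<String, List<Occurrence>> index, String token, int count, String docName) {
        index.computeIfAbsent(token, t -> new ArrayList<>()).add(new Occurrence(docName, count));
    }

    public static void main(String[] args) {
        Map<String, List<Occurrence>> index = new HashMap<>();
        indexToken(index, "index", 2, "a.txt");
        indexToken(index, "index", 1, "b.txt");
        indexToken(index, "query", 3, "b.txt");
        for (Map.Entry<String, List<Occurrence>> e : index.entrySet()) {
            System.out.println(e.getKey() + " occurs in " + e.getValue().size() + " document(s)");
        }
    }
}
```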
  - computeIDFandDocumentLengths
```
protected void computeIDFandDocumentLengths()
```
    Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
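    The two quantities can be sketched as follows, assuming the common definitions idf(t) = ln(N / df(t)) and document length = square root of the sum of (idf × count)² over the document's tokens. The exact formula used by this class is not stated here, so the formula, the class name `IdfSketch`, and the nested-map layout are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class IdfSketch {
    /** idf(t) = ln(N / df(t)), where df(t) is the number of documents containing t. */
    static Map<String, Double> computeIdf(Map<String, Map<String, Integer>> postings, int numDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            idf.put(e.getKey(), Math.log((double) numDocs / e.getValue().size()));
        }
        return idf;
    }

    /** Length of each document vector: sqrt of the sum of (idf * count)^2 over its tokens. */
    static Map<String, Double> computeLengths(Map<String, Map<String, Integer>> postings,
                                              Map<String, Double> idf) {
        Map<String, Double> lengths = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            double tokenIdf = idf.get(e.getKey());
            for (Map.Entry<String, Integer> occ : e.getValue().entrySet()) {
                double weight = tokenIdf * occ.getValue();
                // Accumulate the squared weight for this document.
                lengths.merge(occ.getKey(), weight * weight, Double::sum);
            }
        }
        // Turn each sum of squares into a Euclidean length.
        lengths.replaceAll((doc, sumSquares) -> Math.sqrt(sumSquares));
        return lengths;
    }

    public static void main(String[] args) {
        // token -> (document name -> count); two documents in total
        Map<String, Map<String, Integer>> postings = new HashMap<>();
        postings.computeIfAbsent("cat", t -> new HashMap<>()).put("d1", 2);
        postings.computeIfAbsent("dog", t -> new HashMap<>()).put("d1", 1);
        postings.computeIfAbsent("dog", t -> new HashMap<>()).put("d2", 1);

        Map<String, Double> idf = computeIdf(postings, 2);
        System.out.println("idf = " + idf);
        System.out.println("lengths = " + computeLengths(postings, idf));
    }
}
```

    Under this definition a token that appears in every document gets an IDF of zero and so contributes nothing to any document length.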
  - print
```
public void print()
```
    Print out an inverted index by listing each token and the documents it occurs in. Include info on IDF factors, occurrence counts, and document vector lengths.
  - size
```
public int size()
```
    Return the number of tokens indexed.
  - clear
```
public void clear()
```
    Clear all documents from the inverted index.
  - retrieve
```
public Retrieval[] retrieve(java.lang.String input)
```
    Perform ranked retrieval on this input query.
  - retrieve
```
public Retrieval[] retrieve(Document doc)
```
    Perform ranked retrieval on this input query Document.
  - retrieve
```
public Retrieval[] retrieve(HashMapVector vector)
```
    Perform ranked retrieval on this input query Document vector.
  - getRetrieval
```
protected Retrieval getRetrieval(double queryLength,
                     DocumentReference docRef,
                     double score)
```
    Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.
    
    Parameters:
    queryLength - The length of the query vector, incorporated into the final score
    docRef - The document reference for the document concerned
    score - The partially computed score
    
    Returns:
    The retrieval object for the document described by docRef and score under the query with length queryLength
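    With cosine similarity, the natural final step is to divide the accumulated dot product by the product of the two vector lengths. The sketch below shows that normalization in isolation; `FinalScoreSketch` and its `docLength` parameter are hypothetical (in the real class the document length would presumably come from docRef), and the sketch assumes both lengths are non-zero.

```java
class FinalScoreSketch {
    /**
     * Cosine normalization: partial dot-product score divided by the
     * product of the query vector length and the document vector length.
     */
    static double finalScore(double queryLength, double docLength, double score) {
        return score / (queryLength * docLength);
    }

    public static void main(String[] args) {
        // Query vector (1, 1) against document vector (1, 0): dot product 1,
        // query length sqrt(2), document length 1.
        System.out.println(finalScore(Math.sqrt(2.0), 1.0, 1.0));
    }
}
```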
  - incorporateToken
```
public double incorporateToken(java.lang.String token,
                      double count,
                      java.util.Map<DocumentReference,DoubleValue> retrievalHash)
```
    Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.
    
    Parameters:
    token - The token in the query to incorporate.
    count - The count of this token in the query.
    retrievalHash - The hash table of retrieved DocumentReferences and current scores.
    
    Returns:
    The square of the weight of this token in the query vector for use in calculating the length of the query vector.
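    A minimal sketch of this accumulation step, assuming TF/IDF weights of the form idf × count for both the query and the documents. `IncorporateTokenSketch` and `TokenEntry` are hypothetical stand-ins for the real class and TokenInfo, and a `Map<String, Double>` keyed by document name replaces the DocumentReference-to-DoubleValue map.

```java
import java.util.HashMap;
import java.util.Map;

class IncorporateTokenSketch {
    /** Hypothetical stand-in for TokenInfo: an IDF value plus per-document counts. */
    static class TokenEntry {
        final double idf;
        final Map<String, Integer> counts = new HashMap<>();

        TokenEntry(double idf) {
            this.idf = idf;
        }
    }

    /**
     * Add this query token's contribution to the running score of every document
     * that contains it, and return the squared query weight of the token.
     */
    static double incorporateToken(Map<String, TokenEntry> index, String token, double count,
                                   Map<String, Double> scores) {
        TokenEntry entry = index.get(token);
        if (entry == null) {
            return 0.0; // token not in the index: no document is affected
        }
        double queryWeight = entry.idf * count;
        for (Map.Entry<String, Integer> occ : entry.counts.entrySet()) {
            double docWeight = entry.idf * occ.getValue();
            // Insert the document on first sight, otherwise add to its running total.
            scores.merge(occ.getKey(), queryWeight * docWeight, Double::sum);
        }
        return queryWeight * queryWeight;
    }

    public static void main(String[] args) {
        Map<String, TokenEntry> index = new HashMap<>();
        TokenEntry cat = new TokenEntry(2.0);
        cat.counts.put("d1", 3);
        index.put("cat", cat);

        Map<String, Double> scores = new HashMap<>();
        double squaredWeight = incorporateToken(index, "cat", 1.0, scores);
        System.out.println("scores = " + scores + ", squared query weight = " + squaredWeight);
    }
}
```

    Summing the returned values over all query tokens and taking the square root gives the query vector length needed for the final normalization.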
  - processQueries
```
public void processQueries()
```
    Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
  - presentRetrievals
```
public void presentRetrievals(HashMapVector queryVector,
                     Retrieval[] retrievals)
```
    Print out a ranked set of retrievals. Show the file name and score for the top retrieved documents in order. Then allow user to see more or display individual documents.
  - showRetrievals
```
public boolean showRetrievals(Retrieval[] retrievals)
```
    Show the top retrievals to the user if there are any.
    
    Returns:
    true if retrievals are non-empty.
  - printRetrievals
```
public void printRetrievals(Retrieval[] retrievals,
                   int start)
```
    Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number. Include the rank number and the score.
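    The paging behaviour can be sketched as below. The value 10 for `MAX_RETRIEVALS`, the zero-based meaning of `start`, the output format, and the use of parallel name and score arrays in place of Retrieval objects are all assumptions made for illustration.

```java
class PagingSketch {
    // Assumed page size; the actual constant value is not given in this page.
    static final int MAX_RETRIEVALS = 10;

    /** Format at most MAX_RETRIEVALS results beginning at zero-based index start, with 1-based ranks. */
    static String formatPage(String[] names, double[] scores, int start) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(start + MAX_RETRIEVALS, names.length);
        for (int i = start; i < end; i++) {
            sb.append(i + 1).append(". ").append(names[i])
              .append(" Score: ").append(scores[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = new String[12];
        double[] scores = new double[12];
        for (int i = 0; i < names.length; i++) {
            names[i] = "doc" + i + ".txt";
            scores[i] = 1.0 / (i + 1);
        }
        System.out.print(formatPage(names, scores, 0));  // first page: ranks 1 to 10
        System.out.print(formatPage(names, scores, 10)); // second page: ranks 11 and 12
    }
}
```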
  - main
```
public static void main(java.lang.String[] args)
```
    Index a directory of files and then interactively accept retrieval queries. Command format: "InvertedIndex [OPTION]* [DIR]", where DIR is the name of the directory whose files should be indexed and each OPTION can be "-html" to specify HTML files whose HTML tags should be removed, "-stem" to specify that tokens should be stemmed with the Porter stemmer, or "-feedback" to allow relevance feedback from the user.
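    One way to parse a command line of this shape is sketched below. `ArgsSketch` and `Options` are hypothetical names, and rejecting unknown options with an exception is an assumption; how the real main method treats unrecognized arguments is not documented here.

```java
class ArgsSketch {
    /** Hypothetical holder for the parsed command line. */
    static class Options {
        boolean html;
        boolean stem;
        boolean feedback;
        String dir;
    }

    /** Parse "[OPTION]* [DIR]": flags set booleans, any other argument is taken as the directory. */
    static Options parse(String[] args) {
        Options options = new Options();
        for (String arg : args) {
            switch (arg) {
                case "-html":
                    options.html = true;
                    break;
                case "-stem":
                    options.stem = true;
                    break;
                case "-feedback":
                    options.feedback = true;
                    break;
                default:
                    if (arg.startsWith("-")) {
                        throw new IllegalArgumentException("Unknown option: " + arg);
                    }
                    options.dir = arg;
            }
        }
        return options;
    }

    public static void main(String[] args) {
        Options o = parse(new String[] {"-html", "-stem", "corpus"});
        System.out.println("html=" + o.html + " stem=" + o.stem
                + " feedback=" + o.feedback + " dir=" + o.dir);
    }
}
```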
