ir.vsr
Class Document

java.lang.Object
  |
  +--ir.vsr.Document
Direct Known Subclasses:
FileDocument, TextStringDocument

public abstract class Document
extends java.lang.Object

Document is an abstract class that tokenizes a document, removes stop words, and exposes an iterator-like interface similar to StringTokenizer. It also provides a method for converting a document into a vector-space bag of words in the form of a HashMap of tokens and their occurrence counts.


Field Summary
protected  java.lang.String nextToken
          The next token in the document
protected static int numStopWords
          The number of stopwords in the stopwords file
protected  int numTokens
          The number of tokens currently read from document
protected  boolean stem
          Whether to stem tokens with the Porter stemmer
protected static ir.utilities.Porter stemmer
          The Porter stemmer
protected static java.util.HashSet stopWords
          The hash set in which stopwords are indexed
protected static java.lang.String stopWordsFile
          The file in which the list of stopwords, one per line, is stored
 
Constructor Summary
Document(boolean stem)
          Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use.
 
Method Summary
protected abstract  java.lang.String getNextCandidateToken()
          Return the next possible token in the document.
 HashMapPosVector hashMapPosVector()
          Returns a hashmap version of the term-vector with positional info for this document, where each token is a key whose value is an ArrayList of Integers of the token positions (not counting stopwords) of this word in the document.
 HashMapVector hashMapVector()
          Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight.
 boolean hasMoreTokens()
          Returns true iff the document contains more tokens
protected static void loadStopWords()
          Load the stopwords from the file into the hash set where they are indexed.
 java.lang.String nextToken()
          Returns the next token in the document or null if there are none
 int numberOfTokens()
          Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.
 TokenPositionInfo[] positionOrderedTokenVector()
          Returns an array of TokenPositionInfo objects, one for each token in the Document, ordered by first appearance in the Document.
protected  void prepareNextToken()
          The nextToken slot is always precomputed and stored by this method.
 void printVector()
          Compute and print out (one line per term) the term-vector (bag of words) for this document
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stopWordsFile

protected static final java.lang.String stopWordsFile
The file in which the list of stopwords, one per line, is stored

numStopWords

protected static final int numStopWords
The number of stopwords in the stopwords file

stopWords

protected static java.util.HashSet stopWords
The hash set in which stopwords are indexed

stemmer

protected static ir.utilities.Porter stemmer
The Porter stemmer

nextToken

protected java.lang.String nextToken
The next token in the document

numTokens

protected int numTokens
The number of tokens currently read from document

stem

protected boolean stem
Whether to stem tokens with the Porter stemmer

Constructor Detail

Document

public Document(boolean stem)
Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use. Subclasses that create concrete instances MUST call prepareNextToken before finishing to ensure that the first token is precomputed and available.

Method Detail

hasMoreTokens

public boolean hasMoreTokens()
Returns true iff the document contains more tokens

nextToken

public java.lang.String nextToken()
Returns the next token in the document or null if there are none

prepareNextToken

protected void prepareNextToken()
The nextToken slot is always precomputed and stored by this method. Performs stop-word removal of candidate tokens.

getNextCandidateToken

protected abstract java.lang.String getNextCandidateToken()
Return the next possible token in the document. Each subclass must implement this method to produce candidate tokens for subsequent stop-word filtering.
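
The constructor note, prepareNextToken, and getNextCandidateToken together describe a look-ahead contract: the next token is always precomputed, stop words are filtered on the way, and a concrete subclass must prime the first token at the end of its constructor. The following is a minimal sketch of that contract, not the ir.vsr source. SketchDocument, WordListDocument, the STOP_WORDS list, and the use of null to signal the end of input are all assumptions made for illustration, and stemming is omitted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the contract described above; not the ir.vsr class.
abstract class SketchDocument {
    // Assumed tiny stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    protected String nextToken;   // precomputed next token, or null when exhausted
    protected int numTokens = 0;  // tokens handed out so far

    // Subclasses supply raw candidates; null is assumed to mean end of input.
    protected abstract String getNextCandidateToken();

    // Pull candidates until one survives stop-word filtering (or input ends).
    protected void prepareNextToken() {
        do {
            nextToken = getNextCandidateToken();
        } while (nextToken != null && STOP_WORDS.contains(nextToken));
    }

    public boolean hasMoreTokens() {
        return nextToken != null;
    }

    public String nextToken() {
        if (nextToken == null) return null;
        String token = nextToken;
        numTokens++;
        prepareNextToken(); // refill the look-ahead slot before returning
        return token;
    }
}

// Hypothetical concrete subclass backed by an in-memory word list.
class WordListDocument extends SketchDocument {
    private final Iterator<String> words;

    WordListDocument(List<String> input) {
        this.words = input.iterator();
        prepareNextToken(); // last constructor step: precompute the first token
    }

    @Override
    protected String getNextCandidateToken() {
        return words.hasNext() ? words.next() : null;
    }
}
```

If the call to prepareNextToken were left out of the subclass constructor in this sketch, hasMoreTokens would report false on a fresh instance, which is presumably why the constructor documentation insists on it.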

numberOfTokens

public int numberOfTokens()
Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.
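
The -1 convention can be illustrated with a small stand-alone counter. This is only a sketch under assumptions: CountingTokenSource is a hypothetical class, and the real method may decide "more tokens remain" differently.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical counter illustrating the "-1 until exhausted" convention.
class CountingTokenSource {
    private final Iterator<String> tokens;
    private int numTokens = 0; // tokens read so far

    CountingTokenSource(String... input) {
        this.tokens = Arrays.asList(input).iterator();
    }

    String nextToken() {
        if (!tokens.hasNext()) return null;
        numTokens++;
        return tokens.next();
    }

    // The total is only known once every token has been read.
    int numberOfTokens() {
        return tokens.hasNext() ? -1 : numTokens;
    }
}
```

Callers that need the total would therefore drain the token stream first and only then ask for the count.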

loadStopWords

protected static void loadStopWords()
Load the stopwords from the file into the hash set where they are indexed.
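
A loader for a one-word-per-line file might look roughly like the sketch below. StopWordLoaderSketch and its load method are hypothetical names; reading from a Reader rather than a fixed file path, trimming whitespace, and skipping blank lines are assumptions, not documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader; not the ir.vsr implementation.
class StopWordLoaderSketch {
    // Reads one stop word per line into a hash set.
    static Set<String> load(Reader source) {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {   // assumption: blank lines are ignored
                    stopWords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stopWords;
    }
}
```

For a real file, the same method could be called with a java.io.FileReader opened on the stopwords file.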

hashMapVector

public HashMapVector hashMapVector()
Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight.
See Also:
Weight
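
The bag-of-words idea behind hashMapVector can be sketched with a plain java.util.HashMap. This is an illustration only: BagOfWordsSketch is a hypothetical class, and it stores Integer counts, whereas the documented method returns a HashMapVector whose values are Weight objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bag-of-words counter using plain Integer counts.
class BagOfWordsSketch {
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            // Insert 1 for a new token, otherwise add 1 to the existing count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```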

positionOrderedTokenVector

public TokenPositionInfo[] positionOrderedTokenVector()
Returns an array of TokenPositionInfo objects, one for each token in the Document, ordered by first appearance in the Document.

hashMapPosVector

public HashMapPosVector hashMapPosVector()
Returns a hashmap version of the term-vector with positional info for this document, where each token is a key whose value is an ArrayList of Integers of the token positions (not counting stopwords) of this word in the document.
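
The phrase "not counting stopwords" means that a stop word does not consume a position. A rough sketch of such a positional index follows. PositionIndexSketch is a hypothetical class, and starting positions at 0 is an assumption; the documentation does not say whether the real positions are 0-based or 1-based.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical positional index; not the ir.vsr HashMapPosVector.
class PositionIndexSketch {
    static Map<String, List<Integer>> positions(List<String> candidates, Set<String> stopWords) {
        Map<String, List<Integer>> index = new HashMap<>();
        int position = 0; // assumption: positions start at 0
        for (String word : candidates) {
            if (stopWords.contains(word)) {
                continue; // stop words are dropped and do not advance the position
            }
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            position++;
        }
        return index;
    }
}
```

With the candidates "the cat sat the cat" and "the" as the only stop word, this sketch records cat at positions 0 and 2 and sat at position 1.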

printVector

public void printVector()
Compute and print out (one line per term) the term-vector (bag of words) for this document