ir.vsr
Class Document

java.lang.Object
  extended by ir.vsr.Document
Direct Known Subclasses:
FileDocument, TextStringDocument

public abstract class Document
extends java.lang.Object

Document is an abstract class that tokenizes a document, removes stop words, and exposes the resulting tokens through an iterator-like interface similar to StringTokenizer. It also provides a method for converting a document into a vector-space bag of words, in the form of a HashMap of tokens and their occurrence counts.


Field Summary
protected  java.lang.String nextToken
          The next token in the document
protected static int numStopWords
          The number of stopwords in the stopwords file
protected  int numTokens
          The number of tokens read so far from the document
protected  boolean stem
          Whether to stem tokens with the Porter stemmer
protected static Porter stemmer
          The Porter stemmer
protected static java.util.HashSet<java.lang.String> stopWords
          The hash set in which stopwords are indexed
protected static java.lang.String stopWordsFile
          The file in which a list of stopwords, one per line, is stored
 
Constructor Summary
Document(boolean stem)
          Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use.
 
Method Summary
protected  boolean allLetters(java.lang.String token)
          Checks whether this token consists entirely of Unicode letters, so that other bizarre tokens can be eliminated
protected abstract  java.lang.String getNextCandidateToken()
          Return the next possible token in the document.
 HashMapVector hashMapVector()
          Returns a hashmap version of the term-vector (bag of words) for this document, in which each token is a key and its value, stored in a Weight, is the number of times the token occurs in the document.
 boolean hasMoreTokens()
          Returns true iff the document contains more tokens
protected static void loadStopWords()
          Loads the stopwords from the file into the hash set where they are indexed.
 java.lang.String nextToken()
          Returns the next token in the document or null if there are none
 int numberOfTokens()
          Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.
protected  void prepareNextToken()
          The nextToken slot is always precomputed and stored by this method.
 void printVector()
          Compute and print out (one line per term) the term-vector (bag of words) for this document
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stopWordsFile

protected static final java.lang.String stopWordsFile
The file in which a list of stopwords, one per line, is stored

See Also:
Constant Field Values

numStopWords

protected static final int numStopWords
The number of stopwords in the stopwords file

See Also:
Constant Field Values

stopWords

protected static java.util.HashSet<java.lang.String> stopWords
The hash set in which stopwords are indexed


stemmer

protected static Porter stemmer
The Porter stemmer


nextToken

protected java.lang.String nextToken
The next token in the document


numTokens

protected int numTokens
The number of tokens read so far from the document


stem

protected boolean stem
Whether to stem tokens with the Porter stemmer

Constructor Detail

Document

public Document(boolean stem)
Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use. Subclasses that create concrete instances MUST call prepareNextToken before finishing to ensure that the first token is precomputed and available.
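
The constructor contract can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the actual ir.vsr source: SketchDocument mimics only the one-token lookahead, and WhitespaceDocument is an invented subclass that splits a string on whitespace. The point shown is that the subclass constructor calls prepareNextToken() last, after its own fields are set, so that hasMoreTokens() is accurate from the first call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the lookahead part of Document (not the real ir.vsr code).
abstract class SketchDocument {
    // The precomputed lookahead slot; null means no tokens remain.
    protected String nextToken = null;

    protected abstract String getNextCandidateToken();

    // Fills the lookahead slot. This sketch does no stop-word filtering.
    protected void prepareNextToken() {
        nextToken = getNextCandidateToken();
    }

    public boolean hasMoreTokens() {
        return nextToken != null;
    }

    // Hands out the precomputed token and immediately precomputes the following one.
    public String nextToken() {
        String token = nextToken;
        if (token != null) {
            prepareNextToken();
        }
        return token;
    }
}

// Invented concrete subclass: candidate tokens are whitespace-separated words.
class WhitespaceDocument extends SketchDocument {
    private final String[] words;
    private int position = 0;

    WhitespaceDocument(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        // Must come last: the token source has to be ready before the first lookahead.
        prepareNextToken();
    }

    @Override
    protected String getNextCandidateToken() {
        return position < words.length ? words[position++] : null;
    }

    // Convenience helper for the sketch: drains a document into a list.
    static List<String> tokensOf(String text) {
        SketchDocument doc = new WhitespaceDocument(text);
        List<String> tokens = new ArrayList<>();
        while (doc.hasMoreTokens()) {
            tokens.add(doc.nextToken());
        }
        return tokens;
    }
}
```

If the constructor omitted the final prepareNextToken() call in this sketch, hasMoreTokens() would return false on a non-empty document, because the lookahead slot would still be null.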

Method Detail

hasMoreTokens

public boolean hasMoreTokens()
Returns true iff the document contains more tokens


nextToken

public java.lang.String nextToken()
Returns the next token in the document or null if there are none


prepareNextToken

protected void prepareNextToken()
The nextToken slot is always precomputed and stored by this method. Performs stop-word removal of candidate tokens.
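
The stop-word removal step can be sketched as a loop that keeps pulling candidates until one survives the filter. StopWordFilterSketch and nextAcceptedToken are hypothetical names, and an Iterator stands in for repeated calls to getNextCandidateToken(); any further processing the real method may perform (such as stemming) is not shown.

```java
import java.util.Iterator;
import java.util.Set;

// Hypothetical helper illustrating stop-word removal of candidate tokens.
class StopWordFilterSketch {
    // Returns the next candidate that is not a stop word,
    // or null when the candidates run out.
    static String nextAcceptedToken(Iterator<String> candidates, Set<String> stopWords) {
        while (candidates.hasNext()) {
            String candidate = candidates.next();
            if (!stopWords.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Because the lookup is a hash-set membership test, the cost per candidate does not grow with the size of the stop-word list.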


allLetters

protected boolean allLetters(java.lang.String token)
Checks whether this token consists entirely of Unicode letters, so that other bizarre tokens can be eliminated
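
One plausible way to perform such a check is with Character.isLetter, as in the sketch below. AllLettersSketch is a hypothetical class, and the treatment of the empty string (accepted here) is an assumption rather than documented behavior.

```java
// Hypothetical sketch of an all-letters check using the standard Character API.
class AllLettersSketch {
    // True when every char of the token is a Unicode letter.
    // An empty token vacuously passes in this sketch.
    static boolean allLetters(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this check, tokens containing digits or punctuation, such as "mp3" or "e-mail", are rejected, while accented words are kept.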


getNextCandidateToken

protected abstract java.lang.String getNextCandidateToken()
Return the next possible token in the document. Each subclass must implement this method to produce candidate tokens for subsequent stop-word filtering.


numberOfTokens

public int numberOfTokens()
Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.


loadStopWords

protected static void loadStopWords()
Loads the stopwords from the file into the hash set where they are indexed.
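
A minimal sketch of reading a one-word-per-line list into a hash set follows. StopWordLoaderSketch and its load method are hypothetical; the method takes a Reader rather than a file name so that it can be exercised without a file on disk, and skipping blank lines is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader for a stop-word list stored one word per line.
class StopWordLoaderSketch {
    static Set<String> load(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                // Blank lines are skipped so the empty string never becomes a stop word.
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

A java.io.FileReader opened on the stopwords file could be passed in the same way as the in-memory reader used for testing.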


hashMapVector

public HashMapVector hashMapVector()
Returns a hashmap version of the term-vector (bag of words) for this document, in which each token is a key and its value, stored in a Weight, is the number of times the token occurs in the document.
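
The counting behind a bag-of-words vector can be sketched with a plain map. BagOfWordsSketch is a hypothetical class; a Map<String, Integer> stands in for HashMapVector and Weight, whose internals are not described on this page, and an Iterator stands in for the document's token stream.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: count how often each token occurs.
class BagOfWordsSketch {
    static Map<String, Integer> count(Iterator<String> tokens) {
        Map<String, Integer> vector = new HashMap<>();
        while (tokens.hasNext()) {
            // Insert 1 for a new token, otherwise add 1 to the existing count.
            vector.merge(tokens.next(), 1, Integer::sum);
        }
        return vector;
    }
}
```

Token order is discarded in this representation; only the per-token counts survive.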

See Also:
Weight

printVector

public void printVector()
Compute and print out (one line per term) the term-vector (bag of words) for this document