Document

ir.vsr
Class Document

java.lang.Object
  ir.vsr.Document

Direct Known Subclasses: FileDocument, TextStringDocument

public abstract class Document
extends java.lang.Object

Document is an abstract class that provides tokenization of a document with stop-word removal and an iterator-like interface similar to StringTokenizer. It also provides a method for converting a document into a vector-space bag-of-words representation in the form of a HashMap of tokens and their occurrence counts.

Field Summary
`protected java.lang.String`	`nextToken` The next token in the document
`protected static int`	`numStopWords` The number of stopwords in this file
`protected int`	`numTokens` The number of tokens currently read from document
`protected boolean`	`stem` Whether to stem tokens with the Porter stemmer
`protected static Porter`	`stemmer` The Porter stemmer
`protected static java.util.HashSet<java.lang.String>`	`stopWords` The hash set where stopwords are indexed
`protected static java.lang.String`	`stopWordsFile` The file where a list of stopwords, one per line, is stored

Constructor Summary
`Document(boolean stem)` Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use.

Method Summary
`protected boolean`	`allLetters(java.lang.String token)` Checks whether this token consists entirely of Unicode letters, so that other bizarre tokens can be eliminated
`protected abstract java.lang.String`	`getNextCandidateToken()` Return the next possible token in the document.
`HashMapVector`	`hashMapVector()` Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight.
`boolean`	`hasMoreTokens()` Returns true iff the document contains more tokens
`protected static void`	`loadStopWords()` Load the stopwords from file into the hash set where they are indexed.
`java.lang.String`	`nextToken()` Returns the next token in the document or null if there are none
`int`	`numberOfTokens()` Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.
`protected void`	`prepareNextToken()` The nextToken slot is always precomputed and stored by this method.
`void`	`printVector()` Compute and print out (one line per term) the term-vector (bag of words) for this document

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

stopWordsFile

protected static final java.lang.String stopWordsFile

The file where a list of stopwords, one per line, is stored

See Also: Constant Field Values

numStopWords

protected static final int numStopWords

The number of stopwords in this file

See Also: Constant Field Values

stopWords

protected static java.util.HashSet<java.lang.String> stopWords

The hash set where stopwords are indexed

stemmer

protected static Porter stemmer

The Porter stemmer

nextToken

protected java.lang.String nextToken

The next token in the document

numTokens

protected int numTokens

The number of tokens currently read from document

stem

protected boolean stem

Whether to stem tokens with the Porter stemmer

Constructor Detail

Document

public Document(boolean stem)

Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use. Subclasses that create concrete instances MUST call prepareNextToken before finishing to ensure that the first token is precomputed and available.
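The constructor contract above can be illustrated with a minimal, self-contained sketch. It does not use the ir.vsr source: `SketchDocument` and `ListSketchDocument` are hypothetical stand-ins, stop-word filtering and stemming are left out, and the `numberOfTokens` logic is an assumption based only on the method descriptions on this page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Demo entry point: iterates a document the way a StringTokenizer is iterated.
class LookaheadDemo {
    public static void main(String[] args) {
        SketchDocument doc = new ListSketchDocument(Arrays.asList("vector", "space", "model"));
        while (doc.hasMoreTokens()) {
            System.out.println(doc.nextToken());
        }
        System.out.println(doc.numberOfTokens()); // prints 3
    }
}

// Hypothetical stand-in for the abstract class described on this page.
abstract class SketchDocument {
    protected String nextToken = null; // one-token lookahead slot
    protected int numTokens = 0;       // tokens handed out so far

    // Each subclass supplies raw candidate tokens, or null when exhausted.
    protected abstract String getNextCandidateToken();

    // Refills the lookahead slot (filtering and stemming omitted in this sketch).
    protected void prepareNextToken() {
        nextToken = getNextCandidateToken();
    }

    public boolean hasMoreTokens() {
        return nextToken != null;
    }

    public String nextToken() {
        if (nextToken == null) {
            return null;
        }
        String token = nextToken;
        numTokens++;
        prepareNextToken(); // precompute the following token
        return token;
    }

    // Assumed behaviour: -1 while tokens remain, the final count afterwards.
    public int numberOfTokens() {
        return nextToken == null ? numTokens : -1;
    }
}

// Hypothetical concrete subclass backed by an in-memory list of words.
class ListSketchDocument extends SketchDocument {
    private final Iterator<String> words;

    ListSketchDocument(List<String> words) {
        this.words = words.iterator();
        prepareNextToken(); // without this call, hasMoreTokens() would start out false
    }

    @Override
    protected String getNextCandidateToken() {
        return words.hasNext() ? words.next() : null;
    }
}
```

If the `prepareNextToken()` call in the subclass constructor is dropped, the lookahead slot stays null and the document appears empty, which is presumably why the contract is stated so firmly.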

Method Detail

hasMoreTokens

public boolean hasMoreTokens()

Returns true iff the document contains more tokens

nextToken

public java.lang.String nextToken()

Returns the next token in the document or null if there are none

prepareNextToken

protected void prepareNextToken()

The nextToken slot is always precomputed and stored by this method. Performs stop-word removal of candidate tokens.
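As a rough illustration of the filtering step, the sketch below pulls candidates until one is not a stopword. The class name, the method name `nextAcceptedToken`, and the use of an `Iterator` as the candidate source are assumptions made for this example; the actual method may also apply stemming and the `allLetters` check, which are omitted here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class StopWordFilterSketch {
    // Returns the next candidate that is not a stopword, or null when candidates run out.
    static String nextAcceptedToken(Iterator<String> candidates, Set<String> stopWords) {
        while (candidates.hasNext()) {
            String candidate = candidates.next();
            if (!stopWords.contains(candidate)) {
                return candidate;
            }
            // Otherwise skip the stopword and try the next candidate.
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of"));
        Iterator<String> candidates =
                Arrays.asList("the", "tokenization", "of", "documents").iterator();
        String token;
        while ((token = nextAcceptedToken(candidates, stopWords)) != null) {
            System.out.println(token);
        }
    }
}
```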

allLetters

protected boolean allLetters(java.lang.String token)

Checks whether this token consists entirely of Unicode letters, so that other bizarre tokens can be eliminated
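One way such a check could be written is with `Character.isLetter` applied to every code point. This is a sketch, not the class's actual code; in particular, rejecting the empty string is a choice made for this example.

```java
class AllLettersSketch {
    // True only if the token is non-empty and every code point is a Unicode letter.
    static boolean allLetters(String token) {
        return !token.isEmpty() && token.codePoints().allMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        System.out.println(allLetters("token"));          // prints true
        System.out.println(allLetters("\u00e9t\u00e9"));  // accented letters: prints true
        System.out.println(allLetters("x86"));            // contains digits: prints false
    }
}
```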

getNextCandidateToken

protected abstract java.lang.String getNextCandidateToken()

Return the next possible token in the document. Each subclass must implement this method to produce candidate tokens for subsequent stop-word filtering.

numberOfTokens

public int numberOfTokens()

Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available.

loadStopWords

protected static void loadStopWords()

Load the stopwords from file into the hash set where they are indexed.
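A minimal sketch of loading a one-word-per-line list into a `HashSet` might look like the following. It reads from any `Reader` so the example can run without a file on disk; the class name, the method signature, and the trimming of blank lines are assumptions, not the class's documented behaviour.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

class StopWordLoaderSketch {
    // Reads one stopword per line, skipping blank lines, into a hash set.
    static Set<String> loadStopWords(Reader source) throws IOException {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the stopwords file in this example.
        Set<String> stopWords = loadStopWords(new StringReader("a\nthe\nof\n"));
        System.out.println(stopWords.size()); // prints 3
    }
}
```

To read from an actual file, a `java.io.FileReader` could be passed in place of the `StringReader`.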

hashMapVector

public HashMapVector hashMapVector()

Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight.
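The bag-of-words idea can be sketched with a plain `HashMap<String, Integer>`. The real method returns a `HashMapVector` whose values are `Weight` objects; neither class is reproduced here, so the map of integer counts below is only a simplified stand-in, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class BagOfWordsSketch {
    // Counts how many times each token occurs in the token stream.
    static Map<String, Integer> countTokens(Iterator<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        while (tokens.hasNext()) {
            counts.merge(tokens.next(), 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countTokens(Arrays.asList("to", "be", "or", "not", "to", "be").iterator());
        System.out.println(counts.get("to")); // prints 2
    }
}
```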

See Also: Weight

printVector

public void printVector()

Compute and print out (one line per term) the term-vector (bag of words) for this document
