java.lang.Object ir.vsr.Document
public abstract class Document
Document is an abstract class that provides for tokenization of a document with stop-word removal and an iterator-like interface similar to StringTokenizer. It also provides a method for converting a document into a vector-space bag-of-words in the form of a HashMap of tokens and their occurrence counts.
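The iterator-like interface described above can be illustrated with a minimal, self-contained sketch. It is not the source of ir.vsr.Document: `SketchDocument`, `WhitespaceDocument`, and the four-word stop list are hypothetical stand-ins, stemming is omitted, and the lower-casing step is an assumption. Only the method and field names mirror the ones listed on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical stand-in for the abstract class documented here (not the library source).
abstract class SketchDocument {
    // Assumed tiny stop list; the documented class loads its list from stopWordsFile.
    protected static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "a", "of"));

    protected String nextToken;   // precomputed lookahead token; null once exhausted
    protected int numTokens = 0;  // number of tokens handed out so far

    // Subclasses supply raw candidate tokens; null signals end of input.
    protected abstract String getNextCandidateToken();

    // Advance the lookahead to the next candidate that is all letters and not a stop word.
    protected void prepareNextToken() {
        do {
            nextToken = getNextCandidateToken();
            if (nextToken == null) return;        // end of document
            nextToken = nextToken.toLowerCase();  // assumption: case is normalized
        } while (!allLetters(nextToken) || STOP_WORDS.contains(nextToken));
    }

    protected boolean allLetters(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) return false;
        }
        return true;
    }

    public boolean hasMoreTokens() {
        return nextToken != null;
    }

    // Return the stored lookahead token and precompute the following one.
    public String nextToken() {
        String token = nextToken;
        if (token == null) return null;
        numTokens++;
        prepareNextToken();
        return token;
    }

    // -1 while tokens remain, otherwise the total number of tokens read.
    public int numberOfTokens() {
        return nextToken == null ? numTokens : -1;
    }
}

// Hypothetical concrete subclass that splits a string on whitespace.
class WhitespaceDocument extends SketchDocument {
    private final StringTokenizer tokenizer;

    WhitespaceDocument(String text) {
        tokenizer = new StringTokenizer(text);
        prepareNextToken();  // fill the lookahead slot before first use
    }

    @Override
    protected String getNextCandidateToken() {
        return tokenizer.hasMoreTokens() ? tokenizer.nextToken() : null;
    }
}

class IteratorSketchDemo {
    public static void main(String[] args) {
        SketchDocument doc = new WhitespaceDocument("The cat and the hat 42 cat");
        while (doc.hasMoreTokens()) {
            System.out.println(doc.nextToken());
        }
        System.out.println("tokens: " + doc.numberOfTokens());
    }
}
```

Keeping one token of lookahead is what lets `hasMoreTokens()` answer correctly even when the remaining input consists only of stop words or non-letter tokens.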
Field Summary | |
---|---|
protected java.lang.String | nextToken: The next token in the document |
protected static int | numStopWords: The number of stopwords in this file |
protected int | numTokens: The number of tokens currently read from document |
protected boolean | stem: Whether to stem tokens with the Porter stemmer |
protected static Porter | stemmer: The Porter stemmer |
protected static java.util.HashSet<java.lang.String> | stopWords: The hashtable where stopwords are indexed |
protected static java.lang.String | stopWordsFile: The file where a list of stopwords, one per line, is stored |
Constructor Summary | |
---|---|
Document(boolean stem) | Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use. |
Method Summary | |
---|---|
protected boolean | allLetters(java.lang.String token): Check if this token consists of all Unicode letters to eliminate other bizarre tokens |
protected abstract java.lang.String | getNextCandidateToken(): Return the next possible token in the document. |
HashMapVector | hashMapVector(): Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight. |
boolean | hasMoreTokens(): Returns true iff the document contains more tokens |
protected static void | loadStopWords(): Load the stopwords from file to the hashtable where they are indexed. |
java.lang.String | nextToken(): Returns the next token in the document or null if there are none |
int | numberOfTokens(): Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available. |
protected void | prepareNextToken(): The nextToken slot is always precomputed and stored by this method. |
void | printVector(): Compute and print out (one line per term) the term-vector (bag of words) for this document |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final java.lang.String stopWordsFile
protected static final int numStopWords
protected static java.util.HashSet<java.lang.String> stopWords
protected static Porter stemmer
protected java.lang.String nextToken
protected int numTokens
protected boolean stem
Constructor Detail |
---|
public Document(boolean stem)
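The constructor is described as making sure the stopwords are loaded and indexed before use. One way that could work is a load-once guard around a shared set, sketched below. `StopWordLoaderSketch` and `ensureLoaded` are hypothetical names, and reading from a `Reader` rather than from `stopWordsFile` is an assumption made to keep the example self-contained. The actual loading logic of the class is not shown on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a load-once stop-word index (not the library source).
class StopWordLoaderSketch {
    // Shared by all instances, like the static stopWords field; null until first load.
    private static Set<String> stopWords = null;

    // Read one stop word per line from source on the first call only;
    // later calls ignore source and return the already-loaded set.
    static synchronized Set<String> ensureLoaded(Reader source) {
        if (stopWords == null) {
            Set<String> loaded = new HashSet<>();
            try (BufferedReader in = new BufferedReader(source)) {
                String line;
                while ((line = in.readLine()) != null) {
                    String word = line.trim().toLowerCase();
                    if (!word.isEmpty()) {
                        loaded.add(word);  // skip blank lines
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            stopWords = loaded;
        }
        return stopWords;
    }

    public static void main(String[] args) {
        Set<String> words = ensureLoaded(new StringReader("the\nand\nof\n"));
        System.out.println(words.contains("and"));
    }
}
```

Because the set is static, constructing many documents pays the loading cost only once.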
Method Detail |
---|
public boolean hasMoreTokens()
public java.lang.String nextToken()
protected void prepareNextToken()
protected boolean allLetters(java.lang.String token)
protected abstract java.lang.String getNextCandidateToken()
public int numberOfTokens()
protected static void loadStopWords()
public HashMapVector hashMapVector()
public void printVector()
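As an illustration of the bag-of-words conversion that hashMapVector() and printVector() describe, here is a minimal sketch. It uses a plain `HashMap<String, Integer>` as a stand-in for the library's HashMapVector and Weight classes, whose APIs are not shown on this page. `BagOfWordsSketch` and `termCounts` are hypothetical names, and stop-word removal and stemming are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical sketch of a term-count vector (not HashMapVector itself).
class BagOfWordsSketch {
    // Map each lower-cased token to the number of times it occurs in text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Print one line per term; HashMap iteration order is unspecified.
    static void printVector(Map<String, Integer> counts) {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }

    public static void main(String[] args) {
        printVector(termCounts("cat hat Cat"));
    }
}
```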