public abstract class Document
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
protected java.lang.String | nextToken: The next token in the document. |
protected static int | numStopWords: The number of stopwords in the stopwords file. |
protected int | numTokens: The number of tokens read from the document so far. |
protected boolean | stem: Whether to stem tokens with the Porter stemmer. |
protected static Porter | stemmer: The Porter stemmer. |
protected static java.util.HashSet<java.lang.String> | stopWords: The hash set in which stopwords are indexed. |
protected static java.lang.String | stopWordsFile: The file in which the list of stopwords is stored, one per line. |
Constructor and Description |
---|
Document(boolean stem): Creates a new Document, making sure that the stopwords are loaded, indexed, and ready for use. |
Modifier and Type | Method and Description |
---|---|
protected boolean | allLetters(java.lang.String token): Checks whether the token consists entirely of Unicode letters, to filter out bizarre tokens. |
protected abstract java.lang.String | getNextCandidateToken(): Returns the next possible token in the document. |
HashMapVector | hashMapVector(): Returns a hashmap version of the term vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document, stored in a Weight. |
boolean | hasMoreTokens(): Returns true iff the document contains more tokens. |
protected static void | loadStopWords(): Loads the stopwords from file into the hash set where they are indexed. |
java.lang.String | nextToken(): Returns the next token in the document, or null if there are none. |
int | numberOfTokens(): Returns the total number of tokens in the document, or -1 if there are still more tokens to be read and the total count is not yet available. |
protected void | prepareNextToken(): Precomputes the next token and stores it in the nextToken field; that field is always set by this method. |
void | printVector(): Computes and prints (one line per term) the term vector (bag of words) for this document. |
protected static final java.lang.String stopWordsFile
protected static final int numStopWords
protected static java.util.HashSet<java.lang.String> stopWords
protected static Porter stemmer
protected java.lang.String nextToken
protected int numTokens
protected boolean stem
public Document(boolean stem)
public boolean hasMoreTokens()
public java.lang.String nextToken()
protected void prepareNextToken()
protected boolean allLetters(java.lang.String token)
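A minimal sketch of such a letters-only check, assuming it amounts to testing every character with `Character.isLetter`. The class name `AllLettersSketch` is hypothetical, and the actual implementation in Document may differ.

```java
// Hypothetical stand-alone helper; not the actual Document source.
final class AllLettersSketch {
    // Returns true only if every character of the token is a Unicode letter.
    // Note that an empty token trivially passes this check.
    static boolean allLetters(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allLetters("retrieval")); // true
        System.out.println(allLetters("mp3"));       // false
    }
}
```

Tokens containing digits, punctuation, or hyphens are rejected under this assumption, which matches the stated goal of eliminating bizarre tokens.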
protected abstract java.lang.String getNextCandidateToken()
public int numberOfTokens()
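The token methods above suggest a one-token lookahead design: the next token is computed in advance, so `hasMoreTokens()` only has to test whether the stored token is null. The sketch below illustrates that pattern under stated assumptions: candidates are lower-cased, stopwords and non-letter tokens are skipped, stemming is omitted, and the subclass primes the lookahead at the end of its constructor. `LookaheadDocSketch` and `ListDocSketch` are hypothetical names, not classes of this package.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the lookahead pattern; stemming is left out.
abstract class LookaheadDocSketch {
    protected final Set<String> stopWords;
    protected String nextToken;   // lookahead slot; null when no tokens remain
    protected int numTokens = 0;  // tokens handed out so far

    LookaheadDocSketch(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Subclasses supply raw candidates, or null at the end of the document.
    protected abstract String getNextCandidateToken();

    boolean hasMoreTokens() {
        return nextToken != null;
    }

    String nextToken() {
        if (nextToken == null) {
            return null;
        }
        String token = nextToken;
        numTokens++;
        prepareNextToken(); // refill the lookahead slot before returning
        return token;
    }

    // Skips candidates until one survives filtering (assumed filtering rules).
    protected void prepareNextToken() {
        String candidate;
        while ((candidate = getNextCandidateToken()) != null) {
            candidate = candidate.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(candidate) && allLetters(candidate)) {
                nextToken = candidate;
                return;
            }
        }
        nextToken = null;
    }

    // The total is only known once the lookahead slot is empty.
    int numberOfTokens() {
        return nextToken == null ? numTokens : -1;
    }

    static boolean allLetters(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical subclass that reads candidates from an in-memory list.
final class ListDocSketch extends LookaheadDocSketch {
    private final Iterator<String> candidates;

    ListDocSketch(List<String> words, Set<String> stopWords) {
        super(stopWords);
        this.candidates = words.iterator();
        prepareNextToken(); // prime the lookahead once the source is ready
    }

    @Override
    protected String getNextCandidateToken() {
        return candidates.hasNext() ? candidates.next() : null;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        ListDocSketch doc = new ListDocSketch(Arrays.asList("The", "quick", "x2", "fox"), stop);
        while (doc.hasMoreTokens()) {
            System.out.println(doc.nextToken());
        }
        System.out.println(doc.numberOfTokens());
    }
}
```

Priming in the subclass constructor rather than the base constructor avoids calling the abstract method before the subclass has set up its token source.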
protected static void loadStopWords()
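As an illustration of loading a one-word-per-line stopword list into a hash set, here is a minimal sketch. It assumes blank lines are ignored and surrounding whitespace is trimmed, and it reads from a `Reader` rather than from `stopWordsFile` so that it stays self-contained. `StopWordLoaderSketch` and its `load` method are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader; the real method fills the static stopWords set.
final class StopWordLoaderSketch {
    // Reads one stopword per line and indexes each in a hash set.
    static Set<String> load(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // assumption: blank lines are skipped
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopWords = load(new StringReader("the\nof\nand\n"));
        System.out.println(stopWords.contains("the")); // true
        System.out.println(stopWords.size());          // 3
    }
}
```

To read an actual file, a `java.io.FileReader` opened on the stopword file's path could be passed in place of the `StringReader`.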
public HashMapVector hashMapVector()
public void printVector()
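The bag-of-words representation described for `hashMapVector()` and `printVector()` can be sketched as a map from each token to its occurrence count. This sketch uses a plain `Map<String, Integer>` in place of `HashMapVector` and `Weight`, whose exact APIs are not shown here. `BagOfWordsSketch` and the `term:count` output format are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-in for HashMapVector: token -> occurrence count.
final class BagOfWordsSketch {
    // Consumes all tokens and counts how often each one occurs.
    static Map<String, Integer> termCounts(Iterator<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        while (tokens.hasNext()) {
            counts.merge(tokens.next(), 1, Integer::sum);
        }
        return counts;
    }

    // Prints one line per term; HashMap iteration order is unspecified.
    static void printVector(Map<String, Integer> counts) {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }

    public static void main(String[] args) {
        Iterator<String> tokens = Arrays.asList("fox", "dog", "fox").iterator();
        printVector(termCounts(tokens));
    }
}
```

Building the vector consumes the token stream, so under this design a document can be vectorized only once per pass over its tokens.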