java.lang.Object
  |
  +--ir.vsr.Document
Document is an abstract class that provides for tokenization of a document with stop-word removal and an iterator-like interface similar to StringTokenizer. It also provides a method for converting a document into a vector-space bag-of-words in the form of a HashMap of tokens and their occurrence counts.
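The iterator-like interface described above can be illustrated with a small stand-alone sketch. This is not the ir.vsr source: the class names SketchDocument, StringSketchDocument, and SketchDocumentDemo, the hard-coded stopword list, and the lower-casing step are all assumptions made for illustration, and stemming is left out. It only shows how a precomputed lookahead slot lets hasMoreTokens() and nextToken() skip stopwords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical stand-in for the pattern described above; NOT the ir.vsr code.
abstract class SketchDocument {
    // Assumed tiny stopword list; the documented class loads its list from a file.
    protected static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    protected String nextToken = null; // precomputed lookahead slot
    protected int numTokens = 0;       // tokens handed out so far

    // Subclasses supply raw candidates; null signals the end of the input.
    protected abstract String getNextCandidateToken();

    // Advance the lookahead slot to the next candidate that is not a stopword.
    protected void prepareNextToken() {
        do {
            nextToken = getNextCandidateToken();
        } while (nextToken != null && STOP_WORDS.contains(nextToken));
    }

    public boolean hasMoreTokens() {
        return nextToken != null;
    }

    public String nextToken() {
        if (nextToken == null) {
            return null;
        }
        String token = nextToken;
        numTokens++;
        prepareNextToken();
        return token;
    }

    // Returns -1 while tokens remain, then the total count.
    public int numberOfTokens() {
        return nextToken == null ? numTokens : -1;
    }
}

// Hypothetical concrete subclass that splits a string on whitespace.
class StringSketchDocument extends SketchDocument {
    private final StringTokenizer tokenizer;

    StringSketchDocument(String text) {
        tokenizer = new StringTokenizer(text.toLowerCase());
        prepareNextToken(); // fill the lookahead slot before first use
    }

    @Override
    protected String getNextCandidateToken() {
        return tokenizer.hasMoreTokens() ? tokenizer.nextToken() : null;
    }
}

class SketchDocumentDemo {
    // Drains a document through the StringTokenizer-style loop.
    static List<String> tokens(String text) {
        SketchDocument doc = new StringSketchDocument(text);
        List<String> out = new ArrayList<>();
        while (doc.hasMoreTokens()) {
            out.add(doc.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The cat and the hat")); // prints [cat, hat]
    }
}
```

A caller only ever sees tokens that survived stopword filtering, and the total count becomes available once the input is exhausted.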
Field Summary | |
protected java.lang.String |
nextToken
The next token in the document |
protected static int |
numStopWords
The number of stopwords in the stopwords file |
protected int |
numTokens
The number of tokens currently read from document |
protected boolean |
stem
Whether to stem tokens with the Porter stemmer |
protected static ir.utilities.Porter |
stemmer
The Porter stemmer |
protected static java.util.HashSet |
stopWords
The hash set where stopwords are indexed |
protected static java.lang.String |
stopWordsFile
The file where a list of stopwords, one per line, is stored |
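As a rough illustration of how a one-word-per-line stopword list can be indexed in a HashSet, here is a minimal sketch. The class name StopWordLoaderSketch and its load method are hypothetical, and the sketch reads from any Reader rather than from the actual stopWordsFile, whose path is not given here; trimming and skipping blank lines are also assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not the loadStopWords() implementation from ir.vsr.
class StopWordLoaderSketch {
    // Reads one stopword per line, trimming whitespace and skipping blank lines.
    static Set<String> load(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // An in-memory list stands in for the stopwords file.
        Set<String> stopWords = load(new StringReader("the\na\nof\n"));
        System.out.println(stopWords.contains("of")); // prints true
        System.out.println(stopWords.size());         // prints 3
    }
}
```

A hash set gives constant-time membership checks on average, which matters because every candidate token is tested against the list.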
Constructor Summary | |
Document(boolean stem)
Creates a new Document making sure that the stopwords are loaded, indexed, and ready for use. |
Method Summary | |
protected abstract java.lang.String |
getNextCandidateToken()
Return the next possible token in the document. |
HashMapPosVector |
hashMapPosVector()
Returns a hashmap version of the term-vector with positional info for this document, where each token is a key whose value is an ArrayList of Integers giving the positions (not counting stopwords) at which that token occurs in the document. |
HashMapVector |
hashMapVector()
Returns a hashmap version of the term-vector (bag of words) for this document, where each token is a key whose value is the number of times it occurs in the document as stored in a Weight. |
boolean |
hasMoreTokens()
Returns true iff the document contains more tokens |
protected static void |
loadStopWords()
Loads the stopwords from the file into the hash set where they are indexed. |
java.lang.String |
nextToken()
Returns the next token in the document or null if there are none |
int |
numberOfTokens()
Returns the total number of tokens in the document or -1 if there are still more tokens to be read and the total count is not yet available. |
TokenPositionInfo[] |
positionOrderedTokenVector()
Returns an array of TokenPositionInfo objects, one for each token in the Document, ordered by their first appearance in the Document. |
protected void |
prepareNextToken()
The nextToken slot is always precomputed and stored by this method. |
void |
printVector()
Computes and prints (one line per term) the term-vector (bag of words) for this document |
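The two term-vector forms returned by hashMapVector() and hashMapPosVector() can be sketched with plain collections. The class TermVectorSketch below is hypothetical: it uses Integer counts in place of Weight, ordinary maps in place of HashMapVector and HashMapPosVector, and assumes 0-based positions over an already stopword-filtered token list. None of these choices are confirmed by the documentation above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the two vector shapes; not the ir.vsr classes.
class TermVectorSketch {
    // Bag of words: token -> occurrence count (Integer stands in for Weight).
    static Map<String, Integer> bagOfWords(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Positional vector: token -> positions (assumed 0-based) in the token list.
    static Map<String, List<Integer>> positions(List<String> tokens) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            result.computeIfAbsent(tokens.get(i), k -> new ArrayList<>()).add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("cat", "hat", "cat", "sat");
        System.out.println(bagOfWords(tokens)); // prints {cat=2, hat=1, sat=1}
        System.out.println(positions(tokens));  // prints {cat=[0, 2], hat=[1], sat=[3]}
    }
}
```

LinkedHashMap is used here only so that keys print in order of first appearance, which makes the output deterministic.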
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected static final java.lang.String stopWordsFile
protected static final int numStopWords
protected static java.util.HashSet stopWords
protected static ir.utilities.Porter stemmer
protected java.lang.String nextToken
protected int numTokens
protected boolean stem
Constructor Detail |
public Document(boolean stem)
Method Detail |
public boolean hasMoreTokens()
public java.lang.String nextToken()
protected void prepareNextToken()
protected abstract java.lang.String getNextCandidateToken()
public int numberOfTokens()
protected static void loadStopWords()
public HashMapVector hashMapVector()
public TokenPositionInfo[] positionOrderedTokenVector()
public HashMapPosVector hashMapPosVector()
public void printVector()