public class HTMLFileDocument extends FileDocument
Modifier and Type | Field and Description |
---|---|
protected java.io.BufferedReader | textReader: The I/O reader for accessing the output of the HTML parser. |
protected java.util.StringTokenizer | tokenizer: The tokenizer for lines read from this document. |
static java.lang.String | tokenizerDelim: The StringTokenizer delimiter string for tokenizing only alphabetic strings. |
Fields inherited from class FileDocument: file, reader
Fields inherited from class Document: nextToken, numStopWords, numTokens, stem, stemmer, stopWords, stopWordsFile
Constructor and Description |
---|
HTMLFileDocument(java.io.File file, boolean stem): Create a new text document for the given file. |
HTMLFileDocument(java.lang.String fileName, boolean stem): Create a new text document for the given file name. |
Modifier and Type | Method and Description |
---|---|
protected java.lang.String | getNextCandidateToken(): Return the next purely alpha-character token in the document, or null if none left. |
static void | main(java.lang.String[] args): For testing, print the bag-of-words vector for a given HTML file. |
Methods inherited from class Document: allLetters, hashMapVector, hasMoreTokens, loadStopWords, nextToken, numberOfTokens, prepareNextToken, printVector
public static final java.lang.String tokenizerDelim
protected java.util.StringTokenizer tokenizer
protected java.io.BufferedReader textReader
public HTMLFileDocument(java.io.File file, boolean stem)
public HTMLFileDocument(java.lang.String fileName, boolean stem)
protected java.lang.String getNextCandidateToken()
Implements or overrides getNextCandidateToken in class Document.
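The way a reader, a per-line tokenizer, and a delimiter string can cooperate to produce alphabetic tokens may be easier to see in code. The following is a minimal sketch, not the actual HTMLFileDocument source: the class name AlphaTokenSketch, the helper tokens, and the contents of DELIM are all assumptions made for illustration, and the real value of tokenizerDelim is not given on this page. The sketch also reads plain text and skips the HTML parsing step entirely.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for illustration; not the real HTMLFileDocument.
class AlphaTokenSketch {
    // Assumed delimiter set: whitespace, digits, and common ASCII punctuation.
    // Characters not listed here (for example non-ASCII symbols) would remain
    // inside tokens, so this only approximates "purely alphabetic" tokens.
    static final String DELIM =
        " \t\n\r\f0123456789.,;:!?'\"()[]{}<>-_/\\|@#$%^&*+=~`";

    private final BufferedReader textReader;
    private StringTokenizer tokenizer; // tokenizer for the current line, if any

    AlphaTokenSketch(BufferedReader textReader) {
        this.textReader = textReader;
    }

    // Returns the next token, or null once the reader is exhausted.
    String getNextCandidateToken() throws IOException {
        // Keep reading lines until one yields a token or input runs out.
        while (tokenizer == null || !tokenizer.hasMoreTokens()) {
            String line = textReader.readLine();
            if (line == null) {
                return null;
            }
            tokenizer = new StringTokenizer(line, DELIM);
        }
        return tokenizer.nextToken();
    }

    // Hypothetical helper: collects every token of a string into a list.
    static List<String> tokens(String text) {
        AlphaTokenSketch doc =
            new AlphaTokenSketch(new BufferedReader(new StringReader(text)));
        List<String> out = new ArrayList<>();
        try {
            String t;
            while ((t = doc.getNextCandidateToken()) != null) {
                out.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Hello, world, Second, line, a, b]
        System.out.println(tokens("Hello, world 42!\n\nSecond line: a-b"));
    }
}
```

Returning null at end of input matches the contract described for getNextCandidateToken; blank lines and lines made up only of delimiters are skipped by the loop rather than producing empty tokens.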
public static void main(java.lang.String[] args) throws java.io.IOException
Throws: java.io.IOException
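For readers unfamiliar with the term, a bag-of-words vector maps each token to the number of times it occurs. The sketch below illustrates the idea on a plain string under stated assumptions: BagOfWordsSketch, bagOfWords, and DELIM are hypothetical names, lower-casing is an assumed normalization, and stop-word removal, stemming, and HTML parsing are all omitted. It does not use the hashMapVector or printVector methods listed above.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration of a bag-of-words count; not the library's code.
class BagOfWordsSketch {
    // Assumed delimiter set: whitespace, digits, and common ASCII punctuation.
    static final String DELIM =
        " \t\n\r\f0123456789.,;:!?'\"()[]{}<>-_/\\|@#$%^&*+=~`";

    // Counts how often each lower-cased token occurs in the text.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted for stable output
        StringTokenizer st = new StringTokenizer(text, DELIM);
        while (st.hasMoreTokens()) {
            String token = st.nextToken().toLowerCase();
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {cat=2, other=1, saw=1, the=2}
        System.out.println(bagOfWords("The cat saw the other cat."));
    }
}
```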