Class HTMLFileDocument

  extended by ir.vsr.Document
      extended by ir.vsr.FileDocument
          extended by ir.vsr.HTMLFileDocument

public class HTMLFileDocument
extends FileDocument

An HTML file document in which HTML markup is removed from the token stream. To keep HTML tags as tokens, create a TextFileDocument from the HTML file instead.
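
The difference between keeping and dropping markup can be illustrated with a small self-contained sketch. It does not use the ir.vsr classes: `TagStripSketch`, `stripTags`, `tokens`, and the `DELIM` string are hypothetical names chosen for illustration, and the regular expression is only a crude stand-in for the HTML parser this class relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TagStripSketch {
    // Hypothetical delimiter set (whitespace, digits, punctuation); not the library's actual value.
    static final String DELIM = " \t\n\r\f0123456789.,;:!?'\"()[]{}<>/\\-_+=*&^%$#@|`~";

    // Crude stand-in for an HTML parser: replaces anything between '<' and '>' with a space.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Splits the text on DELIM and collects the remaining tokens in order.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIM);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Vector <b>space</b> retrieval</p>";
        System.out.println(tokens(html));            // tag names appear as tokens
        System.out.println(tokens(stripTags(html))); // only the visible text remains
    }
}
```

Tokenizing the raw string lets tag names such as `p` and `b` leak into the token stream, which is roughly what treating the file as plain text would do; stripping the markup first leaves only the document's words.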

Field Summary
protected textReader
          The I/O reader for accessing the output of the HTML parser.
protected  java.util.StringTokenizer tokenizer
          The tokenizer for lines read from this document.
static java.lang.String tokenizerDelim
          StringTokenizer delimiters for tokenizing only alphabetic strings.
Fields inherited from class ir.vsr.FileDocument
file, reader
Fields inherited from class ir.vsr.Document
nextToken, numStopWords, numTokens, stem, stemmer, stopWords, stopWordsFile
Constructor Summary
HTMLFileDocument( file, boolean stem)
          Create a new HTML document for the given file.
HTMLFileDocument(java.lang.String fileName, boolean stem)
          Create a new HTML document for the given file name.
Method Summary
protected  java.lang.String getNextCandidateToken()
          Return the next purely alphabetic token in the document, or null if none are left.
static void main(java.lang.String[] args)
          For testing, print the bag-of-words vector for a given HTML file.
Methods inherited from class ir.vsr.Document
allLetters, hashMapVector, hasMoreTokens, loadStopWords, nextToken, numberOfTokens, prepareNextToken, printVector
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


public static final java.lang.String tokenizerDelim
StringTokenizer delimiters for tokenizing only alphabetic strings.

See Also:
Constant Field Values
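
As an illustration of how a delimiter string restricts tokens to alphabetic runs, here is a minimal sketch using `java.util.StringTokenizer`. The actual value of `tokenizerDelim` is not shown on this page, so the `DELIM` string below is an assumption (whitespace, digits, and common ASCII punctuation), and `DelimSketch` and `alphaTokens` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimSketch {
    // Assumed delimiter set; the real tokenizerDelim constant may differ.
    static final String DELIM = " \t\n\r\f0123456789.,;:!?'\"()[]{}<>/\\-_+=*&^%$#@|`~";

    // Every character in DELIM ends a token, so only runs of other characters survive.
    static List<String> alphaTokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, DELIM);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(alphaTokens("Version 2.0 of the parser, released in 2004!"));
    }
}
```

Because digits and punctuation are themselves delimiters, strings such as `2.0` and `2004` produce no tokens at all. Any character missing from the delimiter set (for example, a non-ASCII symbol) would remain part of a token, so the set has to list everything that should be excluded.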


protected java.util.StringTokenizer tokenizer
The tokenizer for lines read from this document.


protected textReader
The I/O reader for accessing the output of the HTML parser.

Constructor Detail


public HTMLFileDocument( file,
                        boolean stem)
Create a new HTML document for the given file.


public HTMLFileDocument(java.lang.String fileName,
                        boolean stem)
Create a new HTML document for the given file name.

Method Detail


protected java.lang.String getNextCandidateToken()
Return the next purely alphabetic token in the document, or null if none are left.

Specified by:
getNextCandidateToken in class Document
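
One plausible way to meet this contract is to tokenize the input one line at a time and refill the tokenizer whenever it runs dry. The sketch below is not the library's implementation: `CandidateTokenSketch`, `nextCandidateToken`, and `DELIM` are hypothetical, and the input comes from an in-memory string rather than the output of an HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class CandidateTokenSketch {
    // Assumed delimiter set; not the library's actual tokenizerDelim value.
    static final String DELIM = " \t\n\r\f0123456789.,;:!?'\"()[]{}<>/\\-_+=*&^%$#@|`~";

    private final BufferedReader reader;
    private StringTokenizer tokenizer = null; // tokenizer for the current line

    CandidateTokenSketch(String text) {
        this.reader = new BufferedReader(new StringReader(text));
    }

    // Returns the next token, or null once every line has been consumed.
    String nextCandidateToken() {
        try {
            // Skip over lines that yield no tokens (blank lines, digits only, and so on).
            while (tokenizer == null || !tokenizer.hasMoreTokens()) {
                String line = reader.readLine();
                if (line == null) {
                    return null; // end of input
                }
                tokenizer = new StringTokenizer(line, DELIM);
            }
            return tokenizer.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CandidateTokenSketch doc = new CandidateTokenSketch("Hello, world!\n\n42 more words");
        for (String t = doc.nextCandidateToken(); t != null; t = doc.nextCandidateToken()) {
            System.out.println(t);
        }
    }
}
```

The loop, rather than a single `if`, matters: a line may contain nothing but delimiters, in which case the next line has to be tried before giving up.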


public static void main(java.lang.String[] args)
For testing, print the bag-of-words vector for a given HTML file.
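
A bag-of-words vector maps each distinct token to the number of times it occurs. The following sketch shows the idea with plain JDK classes. It is an assumption-laden illustration, not what `main` does internally: `BagOfWordsSketch`, `bagOfWords`, and `DELIM` are hypothetical names, tokens are lowercased as a simplifying choice, and stop-word removal and stemming are omitted.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class BagOfWordsSketch {
    // Assumed delimiter set; not the library's actual tokenizerDelim value.
    static final String DELIM = " \t\n\r\f0123456789.,;:!?'\"()[]{}<>/\\-_+=*&^%$#@|`~";

    // Counts how often each lowercased token occurs in the text.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give stable printing
        StringTokenizer st = new StringTokenizer(text, DELIM);
        while (st.hasMoreTokens()) {
            counts.merge(st.nextToken().toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> bag = bagOfWords("The parser reads the file; the file is HTML.");
        for (Map.Entry<String, Integer> e : bag.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```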
