ir.vsr
Class HTMLFileDocument

java.lang.Object
  |
  +--ir.vsr.Document
        |
        +--ir.vsr.FileDocument
              |
              +--ir.vsr.HTMLFileDocument

public class HTMLFileDocument
extends FileDocument

An HTML file document whose HTML commands (tags) are removed from the token stream. To keep the HTML tokens, create a TextFileDocument from the HTML file instead.


Field Summary
protected  java.util.StringTokenizer tokenizer
          The tokenizer for lines read from this document.
static java.lang.String tokenizerDelim
          StringTokenizer delimiters chosen so that only alphabetic strings become tokens.
 
Fields inherited from class ir.vsr.FileDocument
file, reader
 
Fields inherited from class ir.vsr.Document
nextToken, numStopWords, numTokens, stem, stemmer, stopWords, stopWordsFile
 
Constructor Summary
HTMLFileDocument(java.io.File file, boolean stem)
          Create a new HTML document for the given file.
HTMLFileDocument(java.lang.String fileName, boolean stem)
          Create a new HTML document for the given file name.
 
Method Summary
protected  java.lang.String getNextCandidateToken()
          Return the next non-HTML-command token in the document, or null if none left.
static void main(java.lang.String[] args)
          For testing, print the bag-of-words vector for a given HTML file.
 
Methods inherited from class ir.vsr.Document
hashMapPosVector, hashMapVector, hasMoreTokens, loadStopWords, nextToken, numberOfTokens, positionOrderedTokenVector, prepareNextToken, printVector
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizerDelim

public static final java.lang.String tokenizerDelim
StringTokenizer delimiters chosen so that only alphabetic strings become tokens.
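
The actual value of tokenizerDelim is not shown on this page. As a minimal sketch, assuming the delimiter set is simply every ASCII character that is not a letter, a StringTokenizer built this way returns only alphabetic strings. The class name AlphaDelimSketch, the constant DELIM, and the method tokenize are hypothetical and are not part of ir.vsr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class AlphaDelimSketch {
    // Hypothetical delimiter set (an assumption, not the library's value):
    // every ASCII character that is not a letter.
    static final String DELIM = buildDelim();

    private static String buildDelim() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (!Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Split a line at every non-letter ASCII character.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, DELIM);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Web, based, IR, nd, ed]
        System.out.println(tokenize("Web-based IR, 2nd ed."));
    }
}
```

Note that digits and punctuation split words here, so "2nd" contributes only "nd".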

tokenizer

protected java.util.StringTokenizer tokenizer
The tokenizer for lines read from this document.

Constructor Detail

HTMLFileDocument

public HTMLFileDocument(java.io.File file,
                        boolean stem)
Create a new HTML document for the given file.

HTMLFileDocument

public HTMLFileDocument(java.lang.String fileName,
                        boolean stem)
Create a new HTML document for the given file name.

Method Detail

getNextCandidateToken

protected java.lang.String getNextCandidateToken()
Return the next non-HTML-command token in the document, or null if none left.
Overrides:
getNextCandidateToken in class Document
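
The source of getNextCandidateToken is not shown on this page. The following is only a sketch of one way to get the described behavior: read the document line by line, blank out everything between '<' and '>' (including tags that span lines), tokenize what remains, and return null once the input is exhausted. The class HtmlTokenSketch, its DELIM constant, and the helper dropTags are hypothetical. Only the reader and tokenizer fields mirror names listed above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class HtmlTokenSketch {
    // Hypothetical delimiters: whitespace, digits, and common punctuation.
    static final String DELIM = " \t\n\r\f0123456789.,;:!?\"'()[]{}<>/=&#-_";

    private final BufferedReader reader;
    private StringTokenizer tokenizer = new StringTokenizer("");
    private boolean inTag = false; // true between '<' and '>', even across lines

    HtmlTokenSketch(BufferedReader reader) {
        this.reader = reader;
    }

    // Return the next token outside any tag, or null at end of input.
    String nextCandidateToken() {
        try {
            while (!tokenizer.hasMoreTokens()) {
                String line = reader.readLine();
                if (line == null) {
                    return null;
                }
                tokenizer = new StringTokenizer(dropTags(line), DELIM);
            }
            return tokenizer.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replace '<', '>' and every character between them with a space.
    private String dropTags(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '<') {
                inTag = true;
                sb.append(' ');
            } else if (c == '>' && inTag) {
                inTag = false;
                sb.append(' ');
            } else {
                sb.append(inTag ? ' ' : c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body>\n<p class=\"x\">Hello <b>vector</b> space</p>\n</body></html>";
        HtmlTokenSketch doc = new HtmlTokenSketch(new BufferedReader(new StringReader(html)));
        // Prints Hello, vector, space on separate lines.
        for (String t = doc.nextCandidateToken(); t != null; t = doc.nextCandidateToken()) {
            System.out.println(t);
        }
    }
}
```

This sketch does not decode character entities, so a string such as `&amp;` would still yield the token "amp".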

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
For testing, print the bag-of-words vector for a given HTML file.
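
A bag-of-words vector maps each token to the number of times it occurs. As a rough illustration only, the sketch below counts tokens into a plain java.util.Map. It does not use the library's own vector classes, whose signatures are not shown on this page, and it leaves out stop-word removal and stemming. The class BagOfWordsSketch and the method bagOfWords are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BagOfWordsSketch {
    // Count how often each token occurs. A TreeMap keeps the keys sorted,
    // which makes the printed vector stable.
    static Map<String, Integer> bagOfWords(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("vector", "space", "vector", "model");
        // prints {model=1, space=1, vector=2}
        System.out.println(bagOfWords(tokens));
    }
}
```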