weka.datagenerators
Class TextSource

java.lang.Object
  extended by weka.datagenerators.Generator
      extended by weka.datagenerators.TextSource
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class TextSource
extends Generator
implements OptionHandler, java.io.Serializable

Reads a collection of text documents and transforms them into sparse vectors. The sparse vectors are then put into an ARFF file for further processing by WEKA.
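As a rough illustration of the output, the sketch below counts the tokens of one document and prints the counts as a sparse ARFF data row of the form {index value, index value}. The class name SparseRowSketch, the fixed vocabulary and the use of raw term counts are assumptions made for this example; it is not the code TextSource itself runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SparseRowSketch {

    /** Formats term counts as a sparse ARFF row with ascending attribute indices. */
    public static String toSparseRow(List<String> vocabulary, List<String> tokens) {
        // TreeMap keeps attribute indices sorted, as sparse ARFF rows expect.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts.merge(index, 1, Integer::sum);
            }
        }
        StringBuilder row = new StringBuilder("{");
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (row.length() > 1) {
                row.append(", ");
            }
            row.append(entry.getKey()).append(' ').append(entry.getValue());
        }
        return row.append('}').toString();
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary and document, chosen only for illustration.
        List<String> vocabulary = Arrays.asList("apple", "banana", "cherry", "date");
        List<String> tokens = Arrays.asList("cherry", "apple", "cherry");
        System.out.println(toSparseRow(vocabulary, tokens)); // prints {0 1, 2 2}
    }
}
```

Attributes whose count is zero are simply left out of the row, which is what keeps large vocabularies manageable.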

WEKA options:

The generic generator options -a, -c and -n are ignored.

Here are some sample command lines:

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y whitespace -o news.arff
 

The name of the dataset is news. We use the directory document reader. The directory being read is cmu-newsgroup-random-100/. We use the simple lexer and all tokens are delimited by whitespace. The output file is news.arff.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alphanum -o news.arff
 

In this case all tokens consist of only alphanumeric characters.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha -o news.arff
 

All tokens consist of only alphabetic characters.
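The difference between the three -y settings shown so far can be sketched with regular expressions. This is only an approximation: the class TokenModeSketch and its patterns are assumptions (ASCII only), and the simple lexer may treat edge cases differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenModeSketch {

    /** Splits text into tokens according to an assumed reading of the -y mode. */
    public static List<String> tokenize(String text, String mode) {
        String regex;
        switch (mode) {
            case "whitespace": regex = "\\S+"; break;          // runs of non-whitespace
            case "alphanum":   regex = "[A-Za-z0-9]+"; break;  // runs of letters and digits
            case "alpha":      regex = "[A-Za-z]+"; break;     // runs of letters only
            default: throw new IllegalArgumentException("unknown mode: " + mode);
        }
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Re: RFC822 e-mail, v2";
        for (String mode : new String[] {"whitespace", "alphanum", "alpha"}) {
            System.out.println(mode + " -> " + tokenize(text, mode));
        }
    }
}
```

Under these assumptions the token RFC822 survives intact in the whitespace and alphanum modes but is cut down to RFC in the alpha mode.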

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha -F lower_case -o news.arff
 

All tokens are converted to lower case before being indexed.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha -F lower_case:stop_word -o news.arff
 

All stop words are removed. The default SMART stop list is used.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha -F lower_case:stop_word:porter_stemmer -o news.arff
 

After removing the stop words, we apply the Porter stemmer.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha
    -F lower_case:stop_word:porter_stemmer:word_length -N 5 -o news.arff
 

After stemming the tokens, we throw away all tokens whose length is less than five.

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/
    -L simple -y alpha
    -F lower_case:stop_word:word_length:porter_stemmer -N 5 -o news.arff
 

We throw away tokens whose length is less than five before applying the Porter stemmer.
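The order of the names given to -F matters because each filter sees the output of the previous one. The sketch below makes this visible with a small filter chain. Everything in it is hypothetical: the class FilterOrderSketch, the three-word stop list, and in particular toy_stemmer, a crude suffix stripper that stands in for the Porter stemmer and does not produce the same stems.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

public class FilterOrderSketch {

    // Tiny illustrative stop list; the real default is the SMART list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    /** Returns a filter that maps a token to its replacement, or to null to drop it. */
    public static Function<String, String> filterFor(String name, int minLength) {
        switch (name) {
            case "lower_case":  return t -> t.toLowerCase(Locale.ROOT);
            case "stop_word":   return t -> STOP_WORDS.contains(t) ? null : t;
            case "word_length": return t -> t.length() < minLength ? null : t;
            case "toy_stemmer": return FilterOrderSketch::toyStem;
            default: throw new IllegalArgumentException("unknown filter: " + name);
        }
    }

    /** Crude stand-in for a stemmer: strips a trailing "ing" or "s". */
    static String toyStem(String t) {
        if (t.endsWith("ing") && t.length() > 5) return t.substring(0, t.length() - 3);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    /** Applies the colon-separated filters to every token, in the order given. */
    public static List<String> apply(String spec, int minLength, List<String> tokens) {
        List<Function<String, String>> chain = new ArrayList<>();
        for (String name : spec.split(":")) {
            chain.add(filterFor(name, minLength));
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String current = token;
            for (Function<String, String> filter : chain) {
                current = filter.apply(current);
                if (current == null) break; // token was dropped by this filter
            }
            if (current != null) kept.add(current);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Running", "the", "Tests");
        System.out.println(apply("lower_case:stop_word:toy_stemmer:word_length", 5, tokens));
        System.out.println(apply("lower_case:stop_word:word_length:toy_stemmer", 5, tokens));
    }
}
```

With a minimum length of five, stemming first shortens both remaining tokens to four characters, so the length filter then drops them and the first call returns an empty list. Filtering by length first keeps both tokens and the second call returns their stems.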

  java weka.datagenerators.TextSource
    -r news -R directory -D cmu-newsgroup-random-100/ -u 'talk.*'
    -L simple -y alpha
    -F lower_case:stop_word:word_length:porter_stemmer -N 5 -o news.arff
 

Read only documents that belong to the classes talk.*. The argument for -u can be any regular expression.
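Because the argument is a regular expression rather than a shell glob, the dot in talk.* matches any character. The sketch below shows the consequence using java.util.regex. It assumes that a class name must match the whole pattern, which this page does not state; ClassFilterSketch and the sample class names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ClassFilterSketch {

    /** Keeps the class names that match the regular expression in full. */
    public static List<String> select(String regex, List<String> classNames) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : classNames) {
            if (pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> classes = Arrays.asList(
            "talk.politics.guns", "comp.graphics", "talk.religion.misc", "talkback");
        // Unescaped dot: also accepts "talkback".
        System.out.println(select("talk.*", classes));
        // Escaped dot: only names that start with "talk." are accepted.
        System.out.println(select("talk\\..*", classes));
    }
}
```

On a command line the stricter pattern would be written as 'talk\..*' inside single quotes so that the shell passes the backslash through unchanged.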

See Also:
Serialized Form

Nested Class Summary
 class TextSource.DataRow
          Sparse map data row structure with public hash map.
 class TextSource.Int
          A simpler wrapper for int than Integer.
 class TextSource.Real
          A simpler wrapper for double than Double.
 class TextSource.Table
          Table that allows incremental addition of attributes.
protected  class TextSource.Token
          Information about a particular token.
 
Field Summary
protected  java.util.ArrayList m_aTokens
          An ordered list for looking up tokens.
protected  boolean m_bFormatDefined
          True iff defineDataFormat() has been called.
protected  boolean m_bTFIDF
          Collect TFIDF statistics instead of TF.
protected  double m_dNextClass
          The next class ID.
protected  java.util.LinkedHashMap m_hashClasses
          A map for looking up classes.
protected  java.util.HashMap m_hashTokens
          A map for looking up tokens.
protected  weka.datagenerators.Lexer m_lexer
          The lexer.
protected  java.util.LinkedList m_lstFilters
          The list of token filters which are applied in order.
protected  int m_nNextToken
          The next token ID.
protected  weka.datagenerators.DocumentReader m_reader
          The document reader.
protected  java.lang.String m_strDocReader
          The option string for document reader.
protected  java.lang.String m_strFilters
          The option string for token filters.
protected  java.lang.String m_strLexer
          The option string for lexer.
protected  TextSource.Table m_table
          The example table.
 
Constructor Summary
TextSource()
           
 
Method Summary
 Instances defineDataFormat()
           
 Instance generateExample()
           
 Instances generateExamples()
           
 java.lang.String generateFinished()
           
protected  TextSource.DataRow getInstance(TextSource.Real dClass)
          Tokenizes a document and transforms it into a sparse vector.
 java.lang.String[] getOptions()
          Gets the current option settings for the OptionHandler.
 boolean getSingleModeFlag()
           
 java.lang.String globalInfo()
           
 java.util.Enumeration listOptions()
          Returns an enumeration of all the available options.
static void main(java.lang.String[] argv)
           
protected  void readInstances()
          Reads all documents and converts them all to sparse vectors.
 TextSource.Real registerClass(java.lang.String strClass)
           
 void setOptions(java.lang.String[] options)
          Sets the OptionHandler's options using the given list.
 
Methods inherited from class weka.datagenerators.Generator
getDebug, getFormat, getNumAttributes, getNumClasses, getNumExamples, getNumExamplesAct, getOutput, getRelationName, makeData, setDebug, setFormat, setNumAttributes, setNumClasses, setNumExamples, setNumExamplesAct, setOutput, setRelationName, toStringFormat
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_table

protected TextSource.Table m_table
The example table.


m_hashTokens

protected java.util.HashMap m_hashTokens
A map for looking up tokens.


m_aTokens

protected java.util.ArrayList m_aTokens
An ordered list for looking up tokens.


m_nNextToken

protected int m_nNextToken
The next token ID.
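The three fields m_hashTokens, m_aTokens and m_nNextToken suggest a two-way token registry: a map from token text to ID, a list from ID back to token, and a counter for the next free ID. A minimal sketch of that pattern follows. The class TokenRegistrySketch and its method names are hypothetical, and sequential assignment on first sight is an assumption, not something this page confirms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenRegistrySketch {

    private final Map<String, Integer> idsByToken = new HashMap<>(); // text -> ID
    private final List<String> tokensById = new ArrayList<>();       // ID -> text
    private int nextId = 0;                                          // next free ID

    /** Returns the ID of the token, assigning the next free ID on first sight. */
    public int register(String token) {
        Integer id = idsByToken.get(token);
        if (id == null) {
            id = nextId++;
            idsByToken.put(token, id);
            tokensById.add(token);
        }
        return id;
    }

    /** Looks up the token text for a previously assigned ID. */
    public String tokenFor(int id) {
        return tokensById.get(id);
    }

    /** Number of distinct tokens registered so far. */
    public int size() {
        return nextId;
    }

    public static void main(String[] args) {
        TokenRegistrySketch registry = new TokenRegistrySketch();
        for (String token : new String[] {"weka", "arff", "weka"}) {
            System.out.println(token + " -> " + registry.register(token));
        }
        System.out.println("distinct tokens: " + registry.size());
    }
}
```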


m_hashClasses

protected java.util.LinkedHashMap m_hashClasses
A map for looking up classes.


m_dNextClass

protected double m_dNextClass
The next class ID.


m_bTFIDF

protected boolean m_bTFIDF
Collect TFIDF statistics instead of TF.
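For readers unfamiliar with the distinction, the sketch below contrasts a raw term frequency with one common TF-IDF weighting, tf * ln(N / df). This page does not say which TF-IDF variant TextSource computes, so the formula and the class TfIdfSketch are assumptions for illustration only.

```java
public class TfIdfSketch {

    /**
     * One common TF-IDF weighting: term count times ln(numDocs / docsWithTerm).
     * Assumed variant; other smoothing and normalisation schemes exist.
     */
    public static double tfIdf(int termCount, int numDocs, int docsWithTerm) {
        if (numDocs <= 0 || docsWithTerm <= 0 || docsWithTerm > numDocs) {
            throw new IllegalArgumentException("need 0 < docsWithTerm <= numDocs");
        }
        return termCount * Math.log((double) numDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term that occurs in every document carries no weight under this scheme.
        System.out.println(tfIdf(3, 100, 100));
        // A term that occurs in few documents is weighted up.
        System.out.println(tfIdf(3, 100, 2));
    }
}
```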


m_reader

protected weka.datagenerators.DocumentReader m_reader
The document reader.


m_lexer

protected weka.datagenerators.Lexer m_lexer
The lexer.


m_lstFilters

protected java.util.LinkedList m_lstFilters
The list of token filters which are applied in order.


m_strDocReader

protected java.lang.String m_strDocReader
The option string for document reader.


m_strLexer

protected java.lang.String m_strLexer
The option string for lexer.


m_strFilters

protected java.lang.String m_strFilters
The option string for token filters.


m_bFormatDefined

protected boolean m_bFormatDefined
True iff defineDataFormat() has been called.

Constructor Detail

TextSource

public TextSource()
Method Detail

registerClass

public TextSource.Real registerClass(java.lang.String strClass)

getInstance

protected TextSource.DataRow getInstance(TextSource.Real dClass)
                                  throws java.io.IOException
Tokenizes a document and transforms it into a sparse vector.

Parameters:
dClass - The class index of the document to be read.
Throws:
java.io.IOException

readInstances

protected void readInstances()
                      throws java.lang.Exception
Reads all documents and converts them all to sparse vectors.

Throws:
java.lang.Exception

globalInfo

public java.lang.String globalInfo()

listOptions

public java.util.Enumeration listOptions()
Description copied from interface: OptionHandler
Returns an enumeration of all the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Description copied from interface: OptionHandler
Sets the OptionHandler's options using the given list. All options will be set (or reset) during this call (i.e. incremental setting of options is not possible).

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Description copied from interface: OptionHandler
Gets the current option settings for the OptionHandler.

Specified by:
getOptions in interface OptionHandler
Returns:
the list of current option settings as an array of strings

defineDataFormat

public Instances defineDataFormat()
                           throws java.lang.Exception
Returns:
the format for the dataset
Throws:
java.lang.Exception - if generating the format fails

generateExample

public Instance generateExample()
                         throws java.lang.Exception
Returns:
the generated example
Throws:
java.lang.Exception - if the generator only works with generateExamples, that is, it is not in single mode

generateExamples

public Instances generateExamples()
                           throws java.lang.Exception
Returns:
the generated dataset
Throws:
java.lang.Exception - if the format of the dataset is not yet defined

generateFinished

public java.lang.String generateFinished()
                                  throws java.lang.Exception
Returns:
a string containing information about the generated rules
Throws:
java.lang.Exception - if generating the documentation fails

getSingleModeFlag

public boolean getSingleModeFlag()
                          throws java.lang.Exception
Returns:
single mode flag
Throws:
java.lang.Exception - if mode is not set yet

main

public static void main(java.lang.String[] argv)
                 throws java.lang.Exception
Throws:
java.lang.Exception