java.lang.Object
  weka.datagenerators.Generator
    weka.datagenerators.TextSource
Reads a collection of text documents and transforms them into sparse vectors. The sparse vectors are then put into an ARFF file for further processing by WEKA.
WEKA options:
-I
- Include TFIDF scores instead of TF.
-R <str>
- The document reader. Currently only one reader is supported, namely directory. This parameter is required and has no default value.
-L <str>
- The lexer. Currently only one lexer is supported, namely simple. This parameter is required and has no default value.
-F <str>[:<str>...]
- A colon-separated list of filters applied to the tokens. Four filters are supported, namely lower_case, porter_stemmer, stop_word, and word_length. The order of listing is significant. For example, if the value for filters is stop_word:porter_stemmer, then the stop_word filter is applied before porter_stemmer. By default the list is empty.
The generic generator options -a, -c and -n are ignored.
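To make the difference between TF and TFIDF scores (the -I option) concrete, here is a minimal sketch. The class TfIdfSketch and its methods are hypothetical, and the weighting tf * ln(N / df) is the textbook variant, chosen as an assumption; this page does not say which formula TextSource actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the TextSource implementation.
class TfIdfSketch {

    // Term frequency: how often each token occurs in one document.
    static Map<String, Double> tf(List<String> tokens) {
        Map<String, Double> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1.0, Double::sum);
        }
        return counts;
    }

    // TF-IDF with the textbook weighting tf * ln(N / df).
    // The formula is an assumption, not taken from TextSource.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        List<Map<String, Double>> tfs = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> v = tf(doc);
            tfs.add(v);
            // Each document counts once per distinct token.
            for (String t : v.keySet()) {
                df.merge(t, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> out = new ArrayList<>();
        for (Map<String, Double> v : tfs) {
            Map<String, Double> weighted = new TreeMap<>();
            for (Map.Entry<String, Double> e : v.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weighted.put(e.getKey(), e.getValue() * idf);
            }
            out.add(weighted);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("weka", "text", "text"),
            Arrays.asList("weka", "arff"));
        System.out.println("TF:    " + tf(docs.get(0)));
        System.out.println("TFIDF: " + tfIdf(docs).get(0));
    }
}
```

Under this weighting, a token that occurs in every document gets a score of zero, which is why TFIDF tends to suppress very common words even without a stop list.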
Here are some sample command lines:
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y whitespace -o news.arff
The name of the dataset is news. We use the directory document reader. The directory being read is cmu-newsgroup-random-100/. We use the simple lexer, and all tokens are delimited by whitespace. The output file is news.arff.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alphanum -o news.arff
In this case all tokens consist of only alphanumeric characters.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -o news.arff
All tokens consist of alphabetic characters only.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -F lower_case -o news.arff
All tokens are converted to lower case before being indexed.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -F lower_case:stop_word -o news.arff
All stop words are removed. The default SMART stop list is used.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -F lower_case:stop_word:porter_stemmer -o news.arff
After removing the stop words, we apply the Porter stemmer.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -F lower_case:stop_word:porter_stemmer:word_length -N 5 -o news.arff
After stemming the tokens, we throw away all tokens whose length is less than five.
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -L simple -y alpha -F lower_case:stop_word:word_length:porter_stemmer -N 5 -o news.arff
We throw away tokens whose length is less than five before applying the Porter stemmer.
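The last two command lines differ only in the order of word_length and porter_stemmer, and the resulting token sets can differ. The sketch below illustrates why, under stated assumptions: FilterOrderSketch is a hypothetical class, a filter is modelled as a function that returns null to drop a token, and TOY_STEMMER is a crude suffix stripper standing in for the Porter stemmer (it is not the Porter algorithm).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of an ordered filter chain; not TextSource code.
class FilterOrderSketch {

    // A filter returns the (possibly rewritten) token, or null to drop it.
    static final Function<String, String> LOWER_CASE = t -> t.toLowerCase();

    // Toy stand-in for a stemmer: strips a trailing "ing" or "s".
    // This is NOT the Porter algorithm.
    static final Function<String, String> TOY_STEMMER = t -> {
        if (t.endsWith("ing") && t.length() > 5) {
            return t.substring(0, t.length() - 3);
        }
        if (t.endsWith("s") && t.length() > 3) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    };

    // Drops tokens shorter than min characters.
    static Function<String, String> wordLength(int min) {
        return t -> t.length() < min ? null : t;
    }

    // Applies the filters to every token in the order given.
    static List<String> apply(List<String> tokens,
                              List<Function<String, String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String t = token;
            for (Function<String, String> f : filters) {
                t = f.apply(t);
                if (t == null) {
                    break; // token was dropped by this filter
                }
            }
            if (t != null) {
                out.add(t);
            }
        }
        return out;
    }

    static List<String> stemThenLength(List<String> tokens) {
        List<Function<String, String>> chain = new ArrayList<>();
        chain.add(LOWER_CASE);
        chain.add(TOY_STEMMER);
        chain.add(wordLength(5));
        return apply(tokens, chain);
    }

    static List<String> lengthThenStem(List<String> tokens) {
        List<Function<String, String>> chain = new ArrayList<>();
        chain.add(LOWER_CASE);
        chain.add(wordLength(5));
        chain.add(TOY_STEMMER);
        return apply(tokens, chain);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Reading", "Classes", "cat");
        System.out.println("stem, then length: " + stemThenLength(tokens));
        System.out.println("length, then stem: " + lengthThenStem(tokens));
    }
}
```

With the length check last, a token that is long enough before stemming can still be discarded once stemming has shortened it; with the length check first, the shortened stem survives.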
java weka.datagenerators.TextSource -r news -R directory -D cmu-newsgroup-random-100/ -u 'talk.*' -L simple -y alpha -F lower_case:stop_word:word_length:porter_stemmer -N 5 -o news.arff
Read only documents that belong to the classes talk.* (the argument for -u can be any regular expression).
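Because the -u argument is a regular expression rather than a shell glob, the dot in talk.* matches any character. The following sketch shows the consequence using java.util.regex. The class ClassFilterSketch, the sample class names, and the choice of whole-name matching (matches() rather than find()) are all assumptions; this page does not specify how TextSource applies the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of selecting class names by regular expression.
class ClassFilterSketch {

    // Keeps the names that the pattern matches in full.
    // Whole-name matching is an assumption made for this sketch.
    static List<String> select(List<String> classNames, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : classNames) {
            if (pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example class names, invented for illustration.
        List<String> names = Arrays.asList(
            "talk.politics.misc", "sci.space", "talkshow");
        // "." matches any character, so "talkshow" is selected as well.
        System.out.println(select(names, "talk.*"));
        // Escaping the dot restricts the match to names starting with "talk.".
        System.out.println(select(names, "talk\\..*"));
    }
}
```

On a command line the pattern should be quoted, as in the example above, so that the shell does not expand the asterisk.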
Nested Class Summary
class | TextSource.DataRow | Sparse map data row structure with public hash map.
class | TextSource.Int | A simpler wrapper for int than Integer.
class | TextSource.Real | A simpler wrapper for double than Double.
class | TextSource.Table | Table that allows incremental addition of attributes.
protected class | TextSource.Token | Information about a particular token.
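As a rough picture of how a sparse row with a public hash map and a mutable double wrapper might fit together, consider the sketch below. All names and members here (SparseRowSketch, its nested Real and DataRow, add, toSparseArff) are hypothetical and are not the actual members of TextSource.DataRow or TextSource.Real. The output follows the sparse ARFF convention of listing only non-zero values as index-value pairs inside braces, with indices starting at 0.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the real nested classes may look different.
class SparseRowSketch {

    // Mutable holder for a double, so a count can be updated in place
    // without re-boxing a java.lang.Double on every increment.
    static class Real {
        double value;

        Real(double value) {
            this.value = value;
        }
    }

    // Sparse row: only attributes with a non-zero value are stored.
    static class DataRow {
        public final Map<Integer, Real> values = new HashMap<>();

        // Adds amount to the value stored for the given attribute index.
        void add(int attributeIndex, double amount) {
            Real r = values.get(attributeIndex);
            if (r == null) {
                values.put(attributeIndex, new Real(amount));
            } else {
                r.value += amount;
            }
        }

        // Renders the row as a sparse ARFF data line, indices ascending.
        String toSparseArff() {
            StringBuilder sb = new StringBuilder("{");
            for (Map.Entry<Integer, Real> e : new TreeMap<>(values).entrySet()) {
                if (sb.length() > 1) {
                    sb.append(", ");
                }
                sb.append(e.getKey()).append(' ').append(e.getValue().value);
            }
            return sb.append('}').toString();
        }
    }

    public static void main(String[] args) {
        DataRow row = new DataRow();
        row.add(3, 1.0);
        row.add(0, 1.0);
        row.add(0, 1.0); // second occurrence of the token with index 0
        System.out.println(row.toSparseArff());
    }
}
```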
Field Summary
protected java.util.ArrayList | m_aTokens | An ordered list for looking up tokens.
protected boolean | m_bFormatDefined | True iff defineDataFormat() has been called.
protected boolean | m_bTFIDF | Collect TFIDF statistics instead of TF.
protected double | m_dNextClass | The next class ID.
protected java.util.LinkedHashMap | m_hashClasses | A map for looking up classes.
protected java.util.HashMap | m_hashTokens | A map for looking up tokens.
protected weka.datagenerators.Lexer | m_lexer | The lexer.
protected java.util.LinkedList | m_lstFilters | The list of token filters which are applied in order.
protected int | m_nNextToken | The next token ID.
protected weka.datagenerators.DocumentReader | m_reader | The document reader.
protected java.lang.String | m_strDocReader | The option string for document reader.
protected java.lang.String | m_strFilters | The option string for token filters.
protected java.lang.String | m_strLexer | The option string for lexer.
protected TextSource.Table | m_table | The example table.
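The combination of a hash map, an ordered list, and a next-ID counter (compare m_hashTokens, m_aTokens, and m_nNextToken) is a common way to assign stable integer IDs to tokens and map them back again. The sketch below shows that pattern in isolation. TokenRegistrySketch and its methods are hypothetical, and starting the IDs at 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the map + list + counter pattern.
class TokenRegistrySketch {

    private final Map<String, Integer> idsByToken = new HashMap<>();
    private final List<String> tokensById = new ArrayList<>();
    private int nextTokenId = 0; // starting at 0 is an assumption

    // Returns the ID of the token, assigning a new one on first sight.
    int register(String token) {
        Integer id = idsByToken.get(token);
        if (id == null) {
            id = nextTokenId++;
            idsByToken.put(token, id);
            tokensById.add(token); // list position equals the token's ID
        }
        return id;
    }

    // Reverse lookup: from ID back to the token text.
    String tokenAt(int id) {
        return tokensById.get(id);
    }

    int size() {
        return tokensById.size();
    }

    public static void main(String[] args) {
        TokenRegistrySketch registry = new TokenRegistrySketch();
        for (String t : new String[] {"weka", "arff", "weka"}) {
            System.out.println(t + " -> " + registry.register(t));
        }
        System.out.println("distinct tokens: " + registry.size());
    }
}
```

The map gives constant-time lookup from token to ID while documents are being read, and the list gives the attribute order needed when the dataset header is written.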
Constructor Summary
TextSource()
Method Summary
Instances | defineDataFormat()
Instance | generateExample()
Instances | generateExamples()
java.lang.String | generateFinished()
protected TextSource.DataRow | getInstance(TextSource.Real dClass) | Tokenizes a document and transforms it into a sparse vector.
java.lang.String[] | getOptions() | Gets the current option settings for the OptionHandler.
boolean | getSingleModeFlag()
java.lang.String | globalInfo()
java.util.Enumeration | listOptions() | Returns an enumeration of all the available options.
static void | main(java.lang.String[] argv)
protected void | readInstances() | Reads all documents and converts them all to sparse vectors.
TextSource.Real | registerClass(java.lang.String strClass)
void | setOptions(java.lang.String[] options) | Sets the OptionHandler's options using the given list.
Methods inherited from class weka.datagenerators.Generator
getDebug, getFormat, getNumAttributes, getNumClasses, getNumExamples, getNumExamplesAct, getOutput, getRelationName, makeData, setDebug, setFormat, setNumAttributes, setNumClasses, setNumExamples, setNumExamplesAct, setOutput, setRelationName, toStringFormat
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected TextSource.Table m_table
protected java.util.HashMap m_hashTokens
protected java.util.ArrayList m_aTokens
protected int m_nNextToken
protected java.util.LinkedHashMap m_hashClasses
protected double m_dNextClass
protected boolean m_bTFIDF
protected weka.datagenerators.DocumentReader m_reader
protected weka.datagenerators.Lexer m_lexer
protected java.util.LinkedList m_lstFilters
protected java.lang.String m_strDocReader
protected java.lang.String m_strLexer
protected java.lang.String m_strFilters
protected boolean m_bFormatDefined
Constructor Detail
public TextSource()
Method Detail
public TextSource.Real registerClass(java.lang.String strClass)
protected TextSource.DataRow getInstance(TextSource.Real dClass) throws java.io.IOException
Parameters:
dClass - The class index of the document to be read.
Throws:
java.io.IOException
protected void readInstances() throws java.lang.Exception
Throws:
java.lang.Exception
public java.lang.String globalInfo()
public java.util.Enumeration listOptions()
Specified by:
listOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
Specified by:
getOptions in interface OptionHandler
public Instances defineDataFormat() throws java.lang.Exception
Throws:
java.lang.Exception - if generating the format failed
public Instance generateExample() throws java.lang.Exception
Throws:
java.lang.Exception - if the generator only works with generateExamples, that is, in non-single mode
public Instances generateExamples() throws java.lang.Exception
Throws:
java.lang.Exception - if the format of the dataset is not yet defined
public java.lang.String generateFinished() throws java.lang.Exception
Throws:
java.lang.Exception - if generating the documentation fails
public boolean getSingleModeFlag() throws java.lang.Exception
Throws:
java.lang.Exception - if the mode is not set yet
public static void main(java.lang.String[] argv) throws java.lang.Exception
Throws:
java.lang.Exception