weka.filters.unsupervised.attribute
Class StringToWordVector

java.lang.Object
  extended by weka.filters.Filter
      extended by weka.filters.unsupervised.attribute.StringToWordVector
All Implemented Interfaces:
OptionHandler, java.io.Serializable, UnsupervisedFilter

public class StringToWordVector
extends Filter
implements UnsupervisedFilter, OptionHandler

Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
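The first-batch behaviour described above can be pictured with a small, library-free sketch. This is not Weka's implementation: the class name FirstBatchDictionarySketch, the method filterBatch, and the fixed delimiter set " .," are all hypothetical, chosen only to show that words first seen after the initial batch get no column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative sketch only; the names and structure are assumptions, not Weka code.
class FirstBatchDictionarySketch {
    // Maps each word seen in the first batch to its column index.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    private boolean dictionaryFixed = false;

    // Vectorizes one batch. The first call fixes the dictionary; later calls reuse it.
    List<int[]> filterBatch(List<String> documents) {
        if (!dictionaryFixed) {
            for (String doc : documents) {
                StringTokenizer tok = new StringTokenizer(doc, " .,");
                while (tok.hasMoreTokens()) {
                    dictionary.putIfAbsent(tok.nextToken(), dictionary.size());
                }
            }
            dictionaryFixed = true;
        }
        List<int[]> vectors = new ArrayList<>();
        for (String doc : documents) {
            int[] row = new int[dictionary.size()];
            StringTokenizer tok = new StringTokenizer(doc, " .,");
            while (tok.hasMoreTokens()) {
                Integer column = dictionary.get(tok.nextToken());
                if (column != null) {
                    row[column] = 1; // word presence; unknown words are skipped
                }
            }
            vectors.add(row);
        }
        return vectors;
    }
}
```

In this sketch, filtering "good film" and "bad film" first yields three columns (good, film, bad); a later batch containing "good unseen plot" is mapped onto those same three columns, and "unseen" and "plot" are dropped.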

See Also:
Serialized Form

Field Summary
protected  Range m_SelectedRange
          Range of columns to convert to word vectors
 
Fields inherited from class weka.filters.Filter
m_NewBatch
 
Constructor Summary
StringToWordVector()
          Default constructor.
StringToWordVector(int wordsToKeep)
          Constructor that allows specification of the target number of words in the output.
 
Method Summary
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
 java.lang.String getDelimiters()
          Get the value of delimiters.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 boolean getOutputWordCounts()
          Gets whether output instances contain 0 or 1 indicating word presence, or word counts.
 Range getSelectedRange()
          Get the value of m_SelectedRange.
 int getWordsToKeep()
          Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.
 boolean input(Instance instance)
          Input an instance for filtering.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 void setDelimiters(java.lang.String newDelimiters)
          Set the value of delimiters.
 boolean setInputFormat(Instances instanceInfo)
          Sets the format of the input instances.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setOutputWordCounts(boolean outputWordCounts)
          Sets whether output instances contain 0 or 1 indicating word presence, or word counts.
 void setSelectedRange(java.lang.String newSelectedRange)
          Set the value of m_SelectedRange.
 void setWordsToKeep(int newWordsToKeep)
          Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.
 
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputFormat, getOutputStringIndex, getStringIndices, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_SelectedRange

protected Range m_SelectedRange
Range of columns to convert to word vectors

Constructor Detail

StringToWordVector

public StringToWordVector()
Default constructor. Targets 1000 words in the output.


StringToWordVector

public StringToWordVector(int wordsToKeep)
Constructor that allows specification of the target number of words in the output.

Parameters:
wordsToKeep - the number of words in the output vector (per class if assigned).

Method Detail

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-C
Output word counts rather than boolean word presence.

-D delimiter_characters
Specify set of delimiter characters (default: " \n\t.,:'\\\"()?!")

-R index1,index2-index4,...
Specify list of string attributes to convert to words. (default: all string attributes)

-W number_of_words_to_keep
Specify number of word fields to create. Other, less useful words will be discarded. (default: 1000)

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
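As a rough illustration of the array layout that setOptions expects, the sketch below assembles an option array from the four flags listed above. The class OptionArraySketch and its build method are hypothetical helpers, not part of Weka; the sketch assumes the usual convention that a flag and its argument occupy two consecutive array elements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; only the flag names come from the option list.
class OptionArraySketch {
    // Builds an option array; null means "leave this option at its default".
    static String[] build(boolean wordCounts, String delimiters, String range, int wordsToKeep) {
        List<String> opts = new ArrayList<>();
        if (wordCounts) {
            opts.add("-C");                 // flag without an argument
        }
        if (delimiters != null) {
            opts.add("-D");                 // flag and argument are separate elements
            opts.add(delimiters);
        }
        if (range != null) {
            opts.add("-R");
            opts.add(range);
        }
        opts.add("-W");
        opts.add(String.valueOf(wordsToKeep));
        return opts.toArray(new String[0]);
    }
}
```

For example, build(true, null, "1,3-4", 500) produces the five elements -C, -R, 1,3-4, -W, 500, which is the kind of array one would hand to setOptions.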

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

setInputFormat

public boolean setInputFormat(Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception - if the input format can't be set successfully

input

public boolean input(Instance instance)
Input an instance for filtering. The filter requires all training instances to be read before producing output.

Overrides:
input in class Filter
Parameters:
instance - the input instance.
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.IllegalStateException - if no input structure has been defined.

batchFinished

public boolean batchFinished()
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class Filter
Returns:
true if there are instances pending output.
Throws:
java.lang.IllegalStateException - if no input structure has been defined.

getOutputWordCounts

public boolean getOutputWordCounts()
Gets whether output instances contain 0 or 1 indicating word presence, or word counts.

Returns:
true if word counts should be output.

setOutputWordCounts

public void setOutputWordCounts(boolean outputWordCounts)
Sets whether output instances contain 0 or 1 indicating word presence, or word counts.

Parameters:
outputWordCounts - true if word counts should be output.
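The difference between the two output modes can be sketched without the library. The helper below is hypothetical (WordCountModeSketch and valueFor are not Weka names) and assumes whitespace-separated words purely for brevity.

```java
import java.util.StringTokenizer;

// Illustrative sketch only; not Weka's implementation.
class WordCountModeSketch {
    // Returns the attribute value one word would receive for one text.
    static int valueFor(String word, String text, boolean outputWordCounts) {
        int count = 0;
        StringTokenizer tok = new StringTokenizer(text, " ");
        while (tok.hasMoreTokens()) {
            if (tok.nextToken().equals(word)) {
                count++;
            }
        }
        // Counts mode reports the raw count; presence mode clamps it to 0 or 1.
        return outputWordCounts ? count : Math.min(count, 1);
    }
}
```

With the text "spam eggs spam", the word "spam" gets the value 2 when outputWordCounts is true and 1 when it is false; a word that does not occur gets 0 in both modes.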

getDelimiters

public java.lang.String getDelimiters()
Get the value of delimiters.

Returns:
Value of delimiters.

setDelimiters

public void setDelimiters(java.lang.String newDelimiters)
Set the value of delimiters.

Parameters:
newDelimiters - Value to assign to delimiters.
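To give a feel for how a delimiter string splits text into words, here is a minimal sketch built on java.util.StringTokenizer. That the filter tokenizes in exactly this way is an assumption; DelimiterSketch and split are hypothetical names, and DEFAULT_DELIMITERS is the default from the -D option description read as a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; the tokenization strategy is an assumption.
class DelimiterSketch {
    // Default delimiter set as listed under -D, written as a Java literal.
    static final String DEFAULT_DELIMITERS = " \n\t.,:'\\\"()?!";

    // Every character in 'delimiters' separates words; empty tokens are skipped.
    static List<String> split(String text, String delimiters) {
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(text, delimiters);
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }
}
```

Here split("Hello, world! (again)", DelimiterSketch.DEFAULT_DELIMITERS) gives the three words Hello, world and again. A hyphen is not in that set, so "a-b" stays a single word unless "-" is added to the delimiter string.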


getSelectedRange

public Range getSelectedRange()
Get the value of m_SelectedRange.

Returns:
Value of m_SelectedRange.

setSelectedRange

public void setSelectedRange(java.lang.String newSelectedRange)
Set the value of m_SelectedRange.

Parameters:
newSelectedRange - Value to assign to m_SelectedRange.
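The range string follows the index1,index2-index4,... shape shown for the -R option. The sketch below expands such a string into individual indices; it is a simplified stand-in, not the Range class itself, and handles only plain numbers and dash-separated spans. RangeStringSketch and parse are hypothetical names.

```java
import java.util.Set;
import java.util.TreeSet;

// Simplified illustration; handles only numeric entries such as "1,3-5".
class RangeStringSketch {
    // Expands a comma-separated list of indices and spans into a sorted set.
    static Set<Integer> parse(String range) {
        Set<Integer> indices = new TreeSet<>();
        for (String part : range.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            if (dash < 0) {
                indices.add(Integer.parseInt(p));          // single index
            } else {
                int from = Integer.parseInt(p.substring(0, dash));
                int to = Integer.parseInt(p.substring(dash + 1));
                for (int i = from; i <= to; i++) {          // inclusive span
                    indices.add(i);
                }
            }
        }
        return indices;
    }
}
```

For instance, parse("1,3-5") yields the indices 1, 3, 4 and 5.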

getWordsToKeep

public int getWordsToKeep()
Gets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Returns:
the target number of words in the output vector (per class if assigned).

setWordsToKeep

public void setWordsToKeep(int newWordsToKeep)
Sets the number of words (per class if there is a class attribute assigned) to attempt to keep.

Parameters:
newWordsToKeep - the target number of words in the output vector (per class if assigned).
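One way to read "attempt to keep" is that the target is a goal rather than a hard limit. The sketch below assumes words are ranked by how often they occur and that ties at the cutoff are all retained, so the result can exceed the target; both points are assumptions about the selection rule, and WordsToKeepSketch and keep are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; the frequency-with-ties rule is an assumption.
class WordsToKeepSketch {
    // Returns the retained words in alphabetical order.
    static List<String> keep(Map<String, Integer> counts, int wordsToKeep) {
        if (wordsToKeep < 1) {
            throw new IllegalArgumentException("wordsToKeep must be at least 1");
        }
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort(Collections.reverseOrder());
        // Count of the word at the cutoff position; keep everything if there are few words.
        int threshold = sorted.size() > wordsToKeep
                ? sorted.get(wordsToKeep - 1)
                : Integer.MIN_VALUE;
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<String, Integer>(counts).entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

With counts a=5, b=3, c=3, d=1 and a target of 2, this sketch keeps a, b and c, because b and c tie at the cutoff.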

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help