weka.deduping.metrics
Class WordTokenizer

java.lang.Object
  extended by weka.deduping.metrics.Tokenizer
      extended by weka.deduping.metrics.WordTokenizer
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class WordTokenizer
extends Tokenizer
implements java.io.Serializable, OptionHandler

This class defines a tokenizer that converts strings into HashMapVectors using Java's built-in java.util.StringTokenizer.

See Also:
Serialized Form

Field Summary
protected  java.lang.String m_delimiters
          A default set of delimiters
protected  int m_minTokenLength
          The default minimum length of a token
 
Fields inherited from class weka.deduping.metrics.Tokenizer
m_caseInsensitive, m_stemmer, m_stemming, m_stopwordFilename, m_stopwordRemoval, m_stopwordSet
 
Constructor Summary
WordTokenizer()
          A default constructor
 
Method Summary
 java.lang.String getDelimiters()
          Get the delimiters
 int getMinTokenLength()
          Get the minimum token length
 java.lang.String[] getOptions()
          Gets the current settings of WordTokenizer.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setDelimiters(java.lang.String delimiters)
          Specify which delimiters to use
 void setMinTokenLength(int minTokenLength)
          Set the minimum token length
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 HashMapVector tokenize(java.lang.String string)
          Take a string and create a vector of tokens from it
 
Methods inherited from class weka.deduping.metrics.Tokenizer
getCaseInsensitive, getStemming, getStopwordRemoval, setCaseInsensitive, setStemming, setStopwordRemoval, stem
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_delimiters

protected java.lang.String m_delimiters
A default set of delimiters


m_minTokenLength

protected int m_minTokenLength
The default minimum length of a token

Constructor Detail

WordTokenizer

public WordTokenizer()
A default constructor

Method Detail

tokenize

public HashMapVector tokenize(java.lang.String string)
Take a string and create a vector of tokens from it

Specified by:
tokenize in class Tokenizer
Parameters:
string - a String to tokenize
Returns:
vector with individual tokens
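
The implementation is not shown on this page, so the following is only a minimal sketch of how a StringTokenizer-based tokenizer of this kind might work. It assumes that a HashMapVector behaves like a map from token to occurrence count and uses a plain java.util.Map in its place. The class name WordTokenizerSketch, the delimiter string, and the method parameters are hypothetical. Stemming and stopword removal are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for WordTokenizer.tokenize(); not the Weka implementation.
class WordTokenizerSketch {

    // Splits text on the given delimiter characters and counts every token
    // that has at least minTokenLength characters.
    // Assumption: a Map<String, Integer> approximates a HashMapVector.
    static Map<String, Integer> tokenize(String text, String delimiters,
                                         int minTokenLength, boolean caseInsensitive) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (caseInsensitive) {
                token = token.toLowerCase();
            }
            if (token.length() >= minTokenLength) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "a" is dropped because it is shorter than the minimum length of 2.
        Map<String, Integer> v = tokenize("The cat, the hat; a cat.", " ,;.", 2, true);
        System.out.println(v.get("the") + " " + v.get("cat") + " " + v.get("hat"));
    }
}
```

In this sketch the length filter runs after case folding; whether the real class applies stemming or stopword removal before or after the length check is not stated here.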

setDelimiters

public void setDelimiters(java.lang.String delimiters)
Specify which delimiters to use

Parameters:
delimiters - a string containing the delimiters to use

getDelimiters

public java.lang.String getDelimiters()
Get the delimiters

Returns:
a string containing the delimiters that are used
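
Because this class is described as relying on java.util.StringTokenizer, it may help to recall how that class reads a delimiter string: every character in the string is treated as a separate delimiter. The short sketch below shows this with the standard library only. DelimiterDemo and its split helper are hypothetical names, and whether WordTokenizer passes its delimiter string through unchanged is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo of StringTokenizer delimiter handling; not part of Weka.
class DelimiterDemo {

    // Collects the tokens produced by StringTokenizer for the given delimiters.
    static List<String> split(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Only the space is a delimiter, so the hyphenated word stays whole.
        System.out.println(split("state-of-the-art tool", " "));  // [state-of-the-art, tool]
        // Space and hyphen are both delimiters.
        System.out.println(split("state-of-the-art tool", " -")); // [state, of, the, art, tool]
    }
}
```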

setMinTokenLength

public void setMinTokenLength(int minTokenLength)
Set the minimum token length

Parameters:
minTokenLength - the minimum length of a token

getMinTokenLength

public int getMinTokenLength()
Get the minimum token length

Returns:
the minimum length of a token

getOptions

public java.lang.String[] getOptions()
Gets the current settings of WordTokenizer.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-S  use stemming
-R  remove stopwords
-m  minimum length of a token for it to be included

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
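
To make the option list above concrete, here is a minimal, self-contained sketch of parsing -S, -R, and -m from a string array and turning the settings back into an array. It is not the Weka implementation: the class TokenizerOptionsSketch, its fields, and the default minimum length of 1 are assumptions made for illustration, and it assumes that -m is followed by an integer value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical option holder mirroring the documented flags; not the Weka class.
class TokenizerOptionsSketch {
    boolean stemming = false;
    boolean stopwordRemoval = false;
    int minTokenLength = 1; // assumed default; the real default is not documented here

    // Parses -S, -R and -m <length>; rejects anything else.
    void setOptions(String[] options) throws Exception {
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    stemming = true;
                    break;
                case "-R":
                    stopwordRemoval = true;
                    break;
                case "-m":
                    if (i + 1 >= options.length) {
                        throw new Exception("-m requires a value");
                    }
                    minTokenLength = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new Exception("Unsupported option: " + options[i]);
            }
        }
    }

    // Returns the current settings in a form that setOptions() accepts.
    String[] getOptions() {
        List<String> out = new ArrayList<>();
        if (stemming) out.add("-S");
        if (stopwordRemoval) out.add("-R");
        out.add("-m");
        out.add(Integer.toString(minTokenLength));
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        TokenizerOptionsSketch t = new TokenizerOptionsSketch();
        t.setOptions(new String[] {"-S", "-m", "3"});
        System.out.println(String.join(" ", t.getOptions())); // -S -m 3
    }
}
```

Having getOptions() emit exactly what setOptions() accepts matches the documented contract that the returned array is suitable for passing to setOptions().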

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.