java.lang.Object
  weka.deduping.metrics.Tokenizer
      weka.deduping.metrics.WordTokenizer
This class defines a tokenizer that turns strings into HashMapVectors using Java's built-in java.util.StringTokenizer.
Field Summary
protected java.lang.String | m_delimiters | A default set of delimiters
protected int | m_minTokenLength | The default minimum length of a token
Fields inherited from class weka.deduping.metrics.Tokenizer
m_caseInsensitive, m_stemmer, m_stemming, m_stopwordFilename, m_stopwordRemoval, m_stopwordSet
Constructor Summary
WordTokenizer() | A default constructor
Method Summary
java.lang.String | getDelimiters() | Get the delimiters
int | getMinTokenLength() | Get the minimum token length
java.lang.String[] | getOptions() | Gets the current settings of WordTokenizer.
java.util.Enumeration | listOptions() | Returns an enumeration describing the available options.
void | setDelimiters(java.lang.String delimiters) | Specify which delimiters to use
void | setMinTokenLength(int minTokenLength) | Set the minimum token length
void | setOptions(java.lang.String[] options) | Parses a given list of options.
HashMapVector | tokenize(java.lang.String string) | Take a string and create a vector of tokens from it
Methods inherited from class weka.deduping.metrics.Tokenizer
getCaseInsensitive, getStemming, getStopwordRemoval, setCaseInsensitive, setStemming, setStopwordRemoval, stem
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected java.lang.String m_delimiters
protected int m_minTokenLength
Constructor Detail
public WordTokenizer()
Method Detail
public HashMapVector tokenize(java.lang.String string)
tokenize in class Tokenizer
string - a String to tokenize
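To illustrate what tokenize is described as doing, here is a minimal sketch and not the WordTokenizer source. It assumes that a HashMapVector behaves roughly like a map from token to occurrence count, that tokens shorter than the minimum length are dropped, and that case-insensitive mode lower-cases the input. The class TokenizeSketch and its static helper are hypothetical names introduced only for this example; stemming and stopword removal are left out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Illustrative stand-in only; this is NOT the weka.deduping.metrics.WordTokenizer source.
final class TokenizeSketch {

    // Hypothetical helper: a Map<String, Integer> stands in for HashMapVector (assumption).
    static Map<String, Integer> tokenize(String text, String delimiters,
                                         int minTokenLength, boolean caseInsensitive) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (caseInsensitive) {
            // Assumed meaning of the inherited case-insensitive setting.
            text = text.toLowerCase(Locale.ROOT);
        }
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // Assumption: a token is kept when its length is at least the minimum.
            if (token.length() >= minTokenLength) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {the=2, cat=2, hat=1}; "a" is shorter than the minimum length of 2.
        System.out.println(tokenize("The cat, the hat: a cat!", " ,:!", 2, true));
    }
}
```

With the real class, the equivalent configuration would presumably go through setDelimiters, setMinTokenLength and the inherited setCaseInsensitive before calling tokenize.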
public void setDelimiters(java.lang.String delimiters)
public java.lang.String getDelimiters()
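The delimiter setting is easier to reason about with java.util.StringTokenizer's behaviour in mind: every character of the delimiter string counts as a separate delimiter, and runs of delimiters never produce empty tokens. The sketch below shows only that standard-library behaviour. That WordTokenizer hands its delimiter string to StringTokenizer unchanged is an assumption based on the class description, and DelimiterSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Demonstrates java.util.StringTokenizer only; not part of weka.deduping.metrics.
final class DelimiterSketch {

    // Hypothetical helper: collects the tokens StringTokenizer yields for a delimiter set.
    static List<String> split(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Space and comma are delimiters, the hyphen is not: [red-green, blue]
        System.out.println(split("red-green, blue", " ,"));
        // Adding the hyphen to the delimiter set splits further: [red, green, blue]
        System.out.println(split("red-green, blue", " ,-"));
    }
}
```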
public void setMinTokenLength(int minTokenLength)
minTokenLength - the minimum length of a token
public int getMinTokenLength()
public java.lang.String[] getOptions()
getOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-S use stemming
-R remove stopwords
-m minimum length of a token for it to be included
setOptions in interface OptionHandler
options - the list of options as an array of strings
java.lang.Exception - if an option is not supported
public java.util.Enumeration listOptions()
listOptions in interface OptionHandler
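The option list documented for setOptions (-S, -R, -m) can be sketched as a small parser. This is a hedged illustration in plain Java, not the library's implementation: the class OptionSketch, its fields, and the default minimum length of 1 are hypothetical, and the real class may parse options differently, for example through Weka's own option utilities. The only documented behaviour mirrored here is that an unsupported option raises an Exception.

```java
// Illustrative stand-in only; this is NOT how WordTokenizer.setOptions is known to be written.
final class OptionSketch {
    boolean stemming;          // set by -S
    boolean stopwordRemoval;   // set by -R
    int minTokenLength = 1;    // set by -m; the default of 1 is an assumption for this sketch

    // Hypothetical parser for the three documented flags.
    static OptionSketch parse(String[] options) throws Exception {
        OptionSketch settings = new OptionSketch();
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    settings.stemming = true;
                    break;
                case "-R":
                    settings.stopwordRemoval = true;
                    break;
                case "-m":
                    if (i + 1 >= options.length) {
                        throw new Exception("-m requires a value");
                    }
                    settings.minTokenLength = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new Exception("Unsupported option: " + options[i]);
            }
        }
        return settings;
    }

    public static void main(String[] args) throws Exception {
        OptionSketch s = parse(new String[] {"-S", "-m", "3"});
        // Prints: true false 3
        System.out.println(s.stemming + " " + s.stopwordRemoval + " " + s.minTokenLength);
    }
}
```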