java.lang.Object
  weka.deduping.metrics.Tokenizer
    weka.deduping.metrics.NGramTokenizer
This class defines a tokenizer that turns strings into HashMapVectors of n-grams
Field Summary | |
protected int |
m_n
Length of an n-gram |
protected char[] |
m_spaceChars
|
protected java.lang.String |
m_spaceEquivalents
A default set of space-equivalent characters |
Fields inherited from class weka.deduping.metrics.Tokenizer |
m_caseInsensitive, m_stemmer, m_stemming, m_stopwordFilename, m_stopwordRemoval, m_stopwordSet |
Constructor Summary | |
NGramTokenizer()
A default constructor |
Method Summary | |
int |
getN()
Get the gram length |
java.lang.String[] |
getOptions()
Gets the current settings of NGramTokenizer. |
java.lang.String |
getSpaceEquivalents()
Get the characters that should be treated as spaces |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options. |
void |
setN(int n)
Set the gram length |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setSpaceEquivalents(java.lang.String spaceEquivalents)
Specify which characters should be treated as spaces |
HashMapVector |
tokenize(java.lang.String string)
Take a string and create a vector of n-gram tokens from it |
Methods inherited from class weka.deduping.metrics.Tokenizer |
getCaseInsensitive, getStemming, getStopwordRemoval, setCaseInsensitive, setStemming, setStopwordRemoval, stem |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected int m_n
protected java.lang.String m_spaceEquivalents
protected char[] m_spaceChars
Constructor Detail |
public NGramTokenizer()
Method Detail |
public HashMapVector tokenize(java.lang.String string)
tokenize
in class Tokenizer
string
- a String to tokenize
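The idea behind n-gram tokenization can be sketched as follows. This is only an illustration under stated assumptions, not the actual NGramTokenizer code: the class name NGramSketch is hypothetical, a plain java.util.Map of counts stands in for HashMapVector, and overlapping character n-grams are assumed. Stemming, stopword removal, and case handling are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: NOT the Weka implementation of NGramTokenizer.
public class NGramSketch {

    /**
     * Counts overlapping character n-grams of length n in s.
     * A Map<String, Integer> stands in for HashMapVector (assumption).
     */
    public static Map<String, Integer> tokenize(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        if (s == null || n <= 0) {
            return counts; // nothing to tokenize
        }
        // Slide a window of width n across the string.
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" with n = 2 yields the bigrams ba, an, na, an, na.
        System.out.println(tokenize("banana", 2));
    }
}
```

A string shorter than n produces an empty map in this sketch; how the real class handles that case is not stated in this page.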
public void setN(int n)
n
- the gram length
public int getN()
public void setSpaceEquivalents(java.lang.String spaceEquivalents)
spaceEquivalents
- a string containing space equivalents
public java.lang.String getSpaceEquivalents()
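One plausible reading of the space-equivalents setting is that every character in the string is replaced by a space before n-grams are extracted. The sketch below illustrates that reading only; the class and method names are hypothetical, and the page does not confirm that the real class works this way.

```java
// Hypothetical sketch: one possible interpretation of space equivalents.
public class SpaceEquivalentsSketch {

    /** Replaces each character listed in spaceEquivalents with a space. */
    public static String normalize(String s, String spaceEquivalents) {
        // Mirrors the String / char[] pair of fields (m_spaceEquivalents, m_spaceChars).
        char[] spaceChars = spaceEquivalents.toCharArray();
        String result = s;
        for (char c : spaceChars) {
            result = result.replace(c, ' ');
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("data-set_one", "-_"));
    }
}
```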
public java.lang.String[] getOptions()
getOptions
in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Valid options are:
-S use stemming
-R remove stopwords
-N gram size
setOptions
in interface OptionHandler
options
- the list of options as an array of strings
java.lang.Exception
- if an option is not supported
public java.util.Enumeration listOptions()
listOptions
in interface OptionHandler
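The option flags documented for setOptions (-S, -R, -N) could be handled along the following lines. This is a self-contained sketch, not the class's real parsing code: OptionsSketch and its fields are hypothetical, the default gram length of 2 is an assumption, and an unchecked IllegalArgumentException is thrown where the documented method declares java.lang.Exception.

```java
// Hypothetical sketch of parsing the documented -S, -R and -N options.
public class OptionsSketch {
    public boolean stemming = false;
    public boolean stopwordRemoval = false;
    public int n = 2; // assumed default; the real default is not stated

    public void setOptions(String[] options) {
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    stemming = true;
                    break;
                case "-R":
                    stopwordRemoval = true;
                    break;
                case "-N":
                    // -N takes the gram size as the following argument.
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("-N requires a value");
                    }
                    n = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
    }

    public static void main(String[] args) {
        OptionsSketch t = new OptionsSketch();
        t.setOptions(new String[] {"-S", "-N", "3"});
        System.out.println(t.stemming + " " + t.stopwordRemoval + " " + t.n);
    }
}
```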