weka.deduping.metrics
Class NGramTokenizer

java.lang.Object
  extended by weka.deduping.metrics.Tokenizer
      extended by weka.deduping.metrics.NGramTokenizer
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class NGramTokenizer
extends Tokenizer
implements java.io.Serializable, OptionHandler

This class defines a tokenizer that turns strings into HashMapVectors of n-grams.

See Also:
Serialized Form

Field Summary
protected  int m_n
          Length of an n-gram
protected  char[] m_spaceChars
           
protected  java.lang.String m_spaceEquivalents
          A default set of space-equivalent characters
 
Fields inherited from class weka.deduping.metrics.Tokenizer
m_caseInsensitive, m_stemmer, m_stemming, m_stopwordFilename, m_stopwordRemoval, m_stopwordSet
 
Constructor Summary
NGramTokenizer()
          A default constructor
 
Method Summary
 int getN()
          Get the gram length
 java.lang.String[] getOptions()
          Gets the current settings of NGramTokenizer.
 java.lang.String getSpaceEquivalents()
          Get the characters that should be treated as spaces
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setN(int n)
          Set the gram length
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSpaceEquivalents(java.lang.String spaceEquivalents)
          Specify which characters should be treated as spaces
 HashMapVector tokenize(java.lang.String string)
          Take a string and create a vector of n-gram tokens from it
 
Methods inherited from class weka.deduping.metrics.Tokenizer
getCaseInsensitive, getStemming, getStopwordRemoval, setCaseInsensitive, setStemming, setStopwordRemoval, stem
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_n

protected int m_n
Length of an n-gram


m_spaceEquivalents

protected java.lang.String m_spaceEquivalents
A default set of space-equivalent characters


m_spaceChars

protected char[] m_spaceChars

Constructor Detail

NGramTokenizer

public NGramTokenizer()
A default constructor

Method Detail

tokenize

public HashMapVector tokenize(java.lang.String string)
Take a string and create a vector of n-gram tokens from it

Specified by:
tokenize in class Tokenizer
Parameters:
string - a String to tokenize
Returns:
vector with individual tokens
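
The following is a minimal, self-contained sketch of character n-gram counting, included only to illustrate the idea behind tokenize. It is not the code of this class. NGramSketch and ngrams are hypothetical names, a plain java.util.Map stands in for HashMapVector, and the rule of replacing every space-equivalent character with a single space before windowing is an assumption, since this page does not document how the tokenizer applies m_spaceEquivalents, stemming, or stopword removal.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NGramSketch {

    // Hypothetical helper, not part of weka.deduping.metrics.
    // Counts every substring of length n after mapping each
    // space-equivalent character to a plain space (an assumed rule).
    static Map<String, Integer> ngrams(String text, int n, String spaceEquivalents) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }

        // Normalize: replace space-equivalent characters with ' '.
        StringBuilder normalized = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            normalized.append(spaceEquivalents.indexOf(c) >= 0 ? ' ' : c);
        }

        // Slide a window of length n over the normalized string
        // and count each n-gram in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Bigrams of "banana" with no space-equivalent characters.
        System.out.println(ngrams("banana", 2, ""));
    }
}
```

A string shorter than n yields an empty map in this sketch; whether the real tokenizer behaves the same way is not stated here.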

setN

public void setN(int n)
Set the gram length

Parameters:
n - the gram length

getN

public int getN()
Get the gram length

Returns:
the gram length

setSpaceEquivalents

public void setSpaceEquivalents(java.lang.String spaceEquivalents)
Specify which characters should be treated as spaces

Parameters:
spaceEquivalents - a string containing space equivalents

getSpaceEquivalents

public java.lang.String getSpaceEquivalents()
Get the characters that should be treated as spaces

Returns:
a string containing space equivalents

getOptions

public java.lang.String[] getOptions()
Gets the current settings of NGramTokenizer.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-S use stemming
-R remove stopwords
-N gram size

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
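
The sketch below shows one way an options array of this shape could be interpreted. It is an illustration under assumptions, not the parsing code of this class: OptionSketch and parse are hypothetical names, and the form "-N" followed by a separate integer element is assumed from the usual Weka convention for valued options rather than confirmed by this page. The default gram length is not documented here, so the sketch uses -1 to mean that -N was not given.

```java
public class OptionSketch {

    boolean stemming;         // set by -S
    boolean stopwordRemoval;  // set by -R
    int n = -1;               // set by -N; -1 means "-N was not given"

    // Hypothetical parser, not part of weka.deduping.metrics.
    static OptionSketch parse(String[] options) {
        OptionSketch parsed = new OptionSketch();
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    parsed.stemming = true;
                    break;
                case "-R":
                    parsed.stopwordRemoval = true;
                    break;
                case "-N":
                    // Assumed form: the gram size is the next array element.
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("-N requires a gram size");
                    }
                    parsed.n = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        OptionSketch parsed = parse(new String[] {"-S", "-N", "4"});
        System.out.println(parsed.stemming + " " + parsed.stopwordRemoval + " " + parsed.n);
    }
}
```

With the real class, an array of the same shape would be passed to setOptions, and getOptions is documented to return an array suitable for passing back.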

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.