weka.deduping.metrics
Class Tokenizer

java.lang.Object
  extended by weka.deduping.metrics.Tokenizer
Direct Known Subclasses:
NGramTokenizer, WordTokenizer

public abstract class Tokenizer
extends java.lang.Object

This abstract class defines a tokenizer that turns strings into HashMapVectors.


Field Summary
protected  boolean m_caseInsensitive
          Whether all tokens are converted to lowercase
protected  Porter m_stemmer
          Stemmer applied to tokens when stemming is on
protected  boolean m_stemming
          Whether stemming is applied to tokens
protected static java.lang.String m_stopwordFilename
          The name of the file with the stopword list
protected  boolean m_stopwordRemoval
          Whether stopwords are removed
protected static java.util.HashSet m_stopwordSet
          Hash set holding the stopwords
 
Constructor Summary
Tokenizer()
           
 
Method Summary
 boolean getCaseInsensitive()
          Find out whether the tokenizer is case-insensitive
 boolean getStemming()
          Find out whether stemming is on/off
 boolean getStopwordRemoval()
          Get whether stopword removal is on or off
 void setCaseInsensitive(boolean caseInsensitive)
          Turn case-insensitivity on/off
 void setStemming(boolean stemming)
          Turn stemming on/off
 void setStopwordRemoval(boolean stopwordRemoval)
          Turn stopword removal on/off and load the stopwords
 java.lang.String stem(java.lang.String token)
          Stem a given token
abstract  HashMapVector tokenize(java.lang.String string)
          Take a string and create a vector of tokens from it
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_caseInsensitive

protected boolean m_caseInsensitive
Whether all tokens are converted to lowercase


m_stemming

protected boolean m_stemming
Whether stemming is applied to tokens


m_stemmer

protected Porter m_stemmer
Stemmer applied to tokens when stemming is on


m_stopwordRemoval

protected boolean m_stopwordRemoval
Whether stopwords are removed


m_stopwordFilename

protected static java.lang.String m_stopwordFilename
The name of the file with the stopword list


m_stopwordSet

protected static java.util.HashSet m_stopwordSet
Hash set holding the stopwords

Constructor Detail

Tokenizer

public Tokenizer()

Method Detail

tokenize

public abstract HashMapVector tokenize(java.lang.String string)
Take a string and create a vector of tokens from it

Parameters:
string - a String to tokenize
Returns:
vector with individual tokens
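
As a rough illustration of this contract, the sketch below mimics a whitespace-splitting subclass. It is self-contained and therefore uses hypothetical stand-ins: SketchTokenizer takes the place of Tokenizer, and a Map<String, Double> of token counts takes the place of HashMapVector, whose API is not described on this page. Only the case-insensitivity option is modelled, and the splitting rule is an assumption, not the behaviour of WordTokenizer or NGramTokenizer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for Tokenizer; not the weka.deduping.metrics class.
abstract class SketchTokenizer {
    // Left at the Java default (false); the real default is not documented here.
    protected boolean m_caseInsensitive;

    public void setCaseInsensitive(boolean caseInsensitive) {
        m_caseInsensitive = caseInsensitive;
    }

    public boolean getCaseInsensitive() {
        return m_caseInsensitive;
    }

    // Map<String, Double> is an assumed substitute for HashMapVector.
    public abstract Map<String, Double> tokenize(String string);
}

// Hypothetical subclass: splits on whitespace and counts each token.
class SketchWordTokenizer extends SketchTokenizer {
    @Override
    public Map<String, Double> tokenize(String string) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : string.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty or all-whitespace input yields one empty piece
            }
            if (m_caseInsensitive) {
                token = token.toLowerCase(Locale.ROOT);
            }
            vector.merge(token, 1.0, Double::sum); // accumulate the token count
        }
        return vector;
    }
}
```

With case-insensitivity switched on, the tokens "The" and "the" fall into the same entry of the returned map; with it off, they are counted separately.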

setCaseInsensitive

public void setCaseInsensitive(boolean caseInsensitive)
Turn case-insensitivity on/off

Parameters:
caseInsensitive - if true, the tokenizer is case-insensitive

getCaseInsensitive

public boolean getCaseInsensitive()
Find out whether the tokenizer is case-insensitive

Returns:
true if the tokenizer is case-insensitive

setStemming

public void setStemming(boolean stemming)
Turn stemming on/off

Parameters:
stemming - if true, stemming is used

getStemming

public boolean getStemming()
Find out whether stemming is on/off

Returns:
true if stemming is used

stem

public java.lang.String stem(java.lang.String token)
Stem a given token

Parameters:
token - the token to be stemmed
Returns:
a new token resulting from applying the stemmer

setStopwordRemoval

public void setStopwordRemoval(boolean stopwordRemoval)
Turn stopword removal on/off and load the stopwords

Parameters:
stopwordRemoval - if true, stopwords from m_stopwordFilename will be removed
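
The loading step mentioned above might look roughly like the following sketch. StopwordSketch and its load method are hypothetical names, and the one-word-per-line file format is an assumption; the actual loading code and the layout of the stopword file are not shown on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not part of weka.deduping.metrics.
class StopwordSketch {
    // Reads stopwords into a hash set, assuming one stopword per line.
    // Surrounding whitespace is trimmed and blank lines are skipped.
    static Set<String> load(Reader source) {
        Set<String> stopwords = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    stopwords.add(word);
                }
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers need no throws clause in this sketch.
            throw new UncheckedIOException(e);
        }
        return stopwords;
    }
}
```

A tokenizer built along these lines could pass a java.io.FileReader opened on the stopword file name to load, then drop every token for which contains returns true on the resulting set.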

getStopwordRemoval

public boolean getStopwordRemoval()
Get whether stopword removal is on or off

Returns:
true if stopword removal is on