weka.deduping.metrics
Class KernelVSMetric

java.lang.Object
  extended by weka.deduping.metrics.StringMetric
      extended by weka.deduping.metrics.KernelVSMetric
All Implemented Interfaces:
java.lang.Cloneable, DataDependentStringMetric, LearnableStringMetric, OptionHandler, java.io.Serializable

public class KernelVSMetric
extends StringMetric
implements DataDependentStringMetric, LearnableStringMetric, OptionHandler, java.io.Serializable

This class defines a basic string kernel based on a vector space. Some code is borrowed from the ir.vsr package by Raymond J. Mooney.

See Also:
Serialized Form

Field Summary
static int CONVERSION_EXPONENTIAL
           
static int CONVERSION_LAPLACIAN
          We can have different ways of converting from similarity to distance
static int CONVERSION_UNIT
           
protected  DistributionClassifier m_classifier
          The classifier
protected  int m_conversionType
          The method of converting similarity to distance; Laplacian by default
protected  Instances m_instances
          The dataset for the vector space attributes
protected  int m_numStringParts
          The number of vector spaces
protected  java.util.HashMap m_stringRefHash
          Strings are mapped to StringReferences in this hash
 java.util.ArrayList m_stringRefs
          A list of all indexed strings.
protected  java.util.HashMap m_tokenAttrMap
          A HashMap where each token is mapped to the corresponding Attribute
protected  java.util.HashMap m_tokenHash
          A HashMap where tokens are indexed.
protected  Tokenizer m_tokenizer
          An underlying tokenizer that is used for converting strings into HashMapVectors
protected  boolean m_trained
          Has the classifier been trained?
protected  boolean m_useIDF
          Should IDF weighting be used?
static Tag[] TAGS_CONVERSION
           
 
Constructor Summary
KernelVSMetric()
          Construct an empty metric; the vector space itself is built from examples by buildMetric
 
Method Summary
 void buildMetric(java.util.List strings)
          Given a list of strings, build the vector space
 java.lang.Object clone()
          Create a copy of this metric
protected  void computeIDFandStringLengths()
          Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
protected  SparseInstance createPairInstance(java.lang.String s1, java.lang.String s2)
          Given a pair of strings and a label (same-class/different-class), create a diff-instance
 double distance(java.lang.String string1, java.lang.String string2)
          Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)
 DistributionClassifier getClassifier()
          Get the classifier
 SelectedTag getConversionType()
          Return the type of similarity to distance conversion
 int getNumStringParts()
          Get the number of string parts
 java.lang.String[] getOptions()
          Gets the current settings of this metric.
 Tokenizer getTokenizer()
          Get the tokenizer to use
 boolean getUseIDF()
          Check whether IDF weighting is on/off
protected  void indexString(java.lang.String string, HashMapVector vector)
          Index a given string using its corresponding vector
protected  void indexToken(java.lang.String token, int count, StringReference strRef)
          Add a token occurrence to the index.
protected  void initKernel()
          Provided that all features are known, initialize the feature space for the kernel
 boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setClassifier(DistributionClassifier classifier)
          Set the classifier
 void setConversionType(SelectedTag conversionType)
          Set the type of similarity to distance conversion.
 void setNumStringParts(int numStringParts)
          Specify the number of string parts
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setTokenizer(Tokenizer tokenizer)
          Set the tokenizer to use
 void setUseIDF(boolean useIDF)
          Turn IDF weighting on/off
 double similarity(java.lang.String s1, java.lang.String s2)
          Compute similarity between two strings
 int size()
          Return the number of tokens indexed.
 void trainMetric(java.util.ArrayList pairList)
          Train the metric given a set of aligned strings
 
Methods inherited from class weka.deduping.metrics.StringMetric
forName
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_stringRefHash

protected java.util.HashMap m_stringRefHash
Strings are mapped to StringReferences in this hash


m_tokenHash

protected java.util.HashMap m_tokenHash
A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.


m_tokenAttrMap

protected java.util.HashMap m_tokenAttrMap
A HashMap where each token is mapped to the corresponding Attribute


m_stringRefs

public java.util.ArrayList m_stringRefs
A list of all indexed strings. Elements are StringReference objects.


m_tokenizer

protected Tokenizer m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors


m_useIDF

protected boolean m_useIDF
Should IDF weighting be used?


CONVERSION_LAPLACIAN

public static final int CONVERSION_LAPLACIAN
We can have different ways of converting from similarity to distance

See Also:
Constant Field Values

CONVERSION_UNIT

public static final int CONVERSION_UNIT
See Also:
Constant Field Values

CONVERSION_EXPONENTIAL

public static final int CONVERSION_EXPONENTIAL
See Also:
Constant Field Values

TAGS_CONVERSION

public static final Tag[] TAGS_CONVERSION

m_conversionType

protected int m_conversionType
The method of converting similarity to distance; Laplacian by default


m_classifier

protected DistributionClassifier m_classifier
The classifier


m_numStringParts

protected int m_numStringParts
The number of vector spaces


m_trained

protected boolean m_trained
Has the classifier been trained?


m_instances

protected Instances m_instances
The dataset for the vector space attributes

Constructor Detail

KernelVSMetric

public KernelVSMetric()
Construct an empty metric; the vector space itself is built from examples by buildMetric

Method Detail

buildMetric

public void buildMetric(java.util.List strings)
                 throws java.lang.Exception
Given a list of strings, build the vector space

Specified by:
buildMetric in interface DataDependentStringMetric
Parameters:
strings - a list of strings that the metric is built on
Throws:
java.lang.Exception

indexString

protected void indexString(java.lang.String string,
                           HashMapVector vector)
Index a given string using its corresponding vector


indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          StringReference strRef)
Add a token occurrence to the index.

Parameters:
token - The token to index.
count - The number of times it occurs in the document.
strRef - A reference to the String it occurs in.
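
The indexing step described above can be pictured as adding a posting to an inverted index. The following is a minimal sketch only: the class InvertedIndexSketch, the nested Occurrence type, and the use of an integer string id in place of a StringReference are all hypothetical and are not taken from this package.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndexSketch {
    /** One posting: which string a token occurs in, and how often (hypothetical type). */
    static final class Occurrence {
        final int stringId;
        final int count;

        Occurrence(int stringId, int count) {
            this.stringId = stringId;
            this.count = count;
        }
    }

    // token -> list of postings; an assumed stand-in for the token hash
    final Map<String, List<Occurrence>> tokenIndex = new HashMap<>();

    /** Record that token occurs count times in the string identified by stringId. */
    void indexToken(String token, int count, int stringId) {
        tokenIndex.computeIfAbsent(token, t -> new ArrayList<>())
                  .add(new Occurrence(stringId, count));
    }

    /** Number of distinct tokens indexed so far. */
    int size() {
        return tokenIndex.size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.indexToken("main", 1, 0);
        index.indexToken("street", 2, 0);
        index.indexToken("main", 1, 1);
        System.out.println("distinct tokens: " + index.size());
        System.out.println("postings for 'main': " + index.tokenIndex.get("main").size());
    }
}
```

Keeping one postings list per token makes the document frequency of a token available as the length of its list.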

computeIDFandStringLengths

protected void computeIDFandStringLengths()
Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
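
As an illustration of what such a pass might compute, the sketch below derives an IDF factor per token and an IDF-weighted Euclidean length per string. It assumes the common formula idf = ln(N / df) and a whitespace tokenizer; neither is confirmed by this page, and the class IdfSketch and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdfSketch {
    /** Whitespace tokenizer producing token counts (assumed stand-in for a Tokenizer). */
    static Map<String, Integer> tokenize(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** IDF per token, assuming idf = ln(N / df) with N strings and document frequency df. */
    static Map<String, Double> computeIdf(List<Map<String, Integer>> vectors) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> vector : vectors) {
            for (String token : vector.keySet()) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        double n = vectors.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log(n / e.getValue()));
        }
        return idf;
    }

    /** Euclidean length of a token-count vector after IDF weighting. */
    static double length(Map<String, Integer> vector, Map<String, Double> idf) {
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : vector.entrySet()) {
            double weight = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            sumSquares += weight * weight;
        }
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> vectors =
            Arrays.asList(tokenize("main street"), tokenize("main road"));
        Map<String, Double> idf = computeIdf(vectors);
        System.out.println("idf: " + idf);
        System.out.println("length of first string: " + length(vectors.get(0), idf));
    }
}
```

Under this formula a token that occurs in every string gets an IDF of zero and so contributes nothing to any string's length.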


initKernel

protected void initKernel()
Provided that all features are known, initialize the feature space for the kernel


trainMetric

public void trainMetric(java.util.ArrayList pairList)
                 throws java.lang.Exception
Train the metric given a set of aligned strings

Specified by:
trainMetric in interface LearnableStringMetric
Parameters:
pairList - the training data as a list of StringPair objects
Throws:
java.lang.Exception

createPairInstance

protected SparseInstance createPairInstance(java.lang.String s1,
                                            java.lang.String s2)
Given a pair of strings and a label (same-class/different-class), create a diff-instance
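
This page does not say which features a pair instance holds. One plausible reading, shown here purely as an assumption, is a sparse vector with one feature per shared token, equal to the product of the token's normalized weights in the two strings. The class PairInstanceSketch and its methods are hypothetical and use plain maps rather than SparseInstance.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PairInstanceSketch {
    /** Token counts scaled to unit Euclidean length (whitespace tokenization assumed). */
    static Map<String, Double> normalizedVector(String s) {
        Map<String, Double> counts = new HashMap<>();
        for (String token : s.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1.0, Double::sum);
            }
        }
        double sumSquares = 0.0;
        for (double c : counts.values()) {
            sumSquares += c * c;
        }
        double norm = Math.sqrt(sumSquares);
        Map<String, Double> unit = new HashMap<>();
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            unit.put(e.getKey(), e.getValue() / norm);
        }
        return unit;
    }

    /** Sparse pair features: one entry per shared token, the product of its two weights. */
    static Map<String, Double> pairFeatures(String s1, String s2) {
        Map<String, Double> v1 = normalizedVector(s1);
        Map<String, Double> v2 = normalizedVector(s2);
        Map<String, Double> features = new TreeMap<>();
        for (Map.Entry<String, Double> e : v1.entrySet()) {
            Double other = v2.get(e.getKey());
            if (other != null) {
                features.put(e.getKey(), e.getValue() * other);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(pairFeatures("main street", "main road"));
    }
}
```

With this choice the features of a pair sum to the cosine similarity of the two strings, so a classifier trained on them can be read as learning a per-token reweighting of that similarity.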


similarity

public double similarity(java.lang.String s1,
                         java.lang.String s2)
                  throws java.lang.Exception
Compute similarity between two strings

Specified by:
similarity in class StringMetric
Parameters:
s1 - first string
s2 - second string
Throws:
java.lang.Exception
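
For orientation, the sketch below computes the plain vector-space baseline, the cosine of two token-count vectors. It is not the learned kernel: this class may instead score pairs with its trained classifier, which this page does not detail. The class CosineSketch and the whitespace tokenization are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class CosineSketch {
    /** Whitespace tokenizer producing token counts (assumed stand-in for a Tokenizer). */
    static Map<String, Integer> tokenize(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Euclidean norm of a token-count vector. */
    static double norm(Map<String, Integer> vector) {
        double sumSquares = 0.0;
        for (int count : vector.values()) {
            sumSquares += (double) count * count;
        }
        return Math.sqrt(sumSquares);
    }

    /** Cosine of the angle between two token-count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    static double similarity(String s1, String s2) {
        return cosine(tokenize(s1), tokenize(s2));
    }

    public static void main(String[] args) {
        System.out.println(similarity("main street", "main road"));
    }
}
```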

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity

Specified by:
isDistanceBased in class StringMetric

setTokenizer

public void setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use

Parameters:
tokenizer - the tokenizer that is used

getTokenizer

public Tokenizer getTokenizer()
Get the tokenizer to use

Returns:
the tokenizer that is used

setClassifier

public void setClassifier(DistributionClassifier classifier)
Set the classifier

Parameters:
classifier - the classifier

getClassifier

public DistributionClassifier getClassifier()
Get the classifier


setUseIDF

public void setUseIDF(boolean useIDF)
Turn IDF weighting on/off

Parameters:
useIDF - if true, all token weights will be weighted by IDF

getUseIDF

public boolean getUseIDF()
Check whether IDF weighting is on/off

Returns:
if true, all token weights are weighted by IDF

size

public int size()
Return the number of tokens indexed.

Returns:
the number of tokens indexed

distance

public double distance(java.lang.String string1,
                       java.lang.String string2)
                throws java.lang.Exception
Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)

Specified by:
distance in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if distance could not be estimated.
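
The formulas behind the three conversion types are not given on this page. The sketch below shows one conventional choice for each, purely as an assumption: 1 / (1 + s) for Laplacian, 2 * (1 - s) for unit, and exp(-s) for exponential. The class ConversionSketch and its enum are hypothetical and do not reuse the integer constants of this class.

```java
class ConversionSketch {
    /** Hypothetical stand-in for the CONVERSION_* constants. */
    enum Conversion { LAPLACIAN, UNIT, EXPONENTIAL }

    /** Map a similarity score to a distance; every formula here is an assumed example. */
    static double toDistance(double similarity, Conversion type) {
        switch (type) {
            case LAPLACIAN:
                return 1.0 / (1.0 + similarity);
            case UNIT:
                return 2.0 * (1.0 - similarity);
            case EXPONENTIAL:
                return Math.exp(-similarity);
        }
        throw new IllegalArgumentException("Unknown conversion type: " + type);
    }

    public static void main(String[] args) {
        for (Conversion type : Conversion.values()) {
            System.out.println(type + ": " + toDistance(0.5, type));
        }
    }
}
```

All three assumed mappings decrease as similarity grows, which is the one property any similarity-to-distance conversion needs.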

setConversionType

public void setConversionType(SelectedTag conversionType)
Set the type of similarity to distance conversion. Values other than CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL will be ignored


getConversionType

public SelectedTag getConversionType()
Return the type of similarity to distance conversion

Returns:
one of CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL

setNumStringParts

public void setNumStringParts(int numStringParts)
Specify the number of string parts

Parameters:
numStringParts - the number of "parts" in each string for which separate features are created

getNumStringParts

public int getNumStringParts()
Get the number of string parts

Returns:
the number of "parts" in each string for which separate features are created

clone

public java.lang.Object clone()
Create a copy of this metric

Specified by:
clone in class StringMetric
Returns:
another KernelVSMetric with the same exact parameters as this metric

getOptions

public java.lang.String[] getOptions()
Gets the current settings of this metric.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-S use stemming
-R remove stopwords
-N gram size

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
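
To show how a flag list of this shape can be parsed and round-tripped, here is a hand-rolled sketch. It does not use the Weka option utilities; the class OptionSketch, its fields, and the default gram size of 1 are hypothetical, and it throws an unchecked IllegalArgumentException instead of java.lang.Exception to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

class OptionSketch {
    boolean stemming = false;        // -S
    boolean removeStopwords = false; // -R
    int gramSize = 1;                // -N <size>; the default of 1 is an assumption

    /** Parse flags of the form -S, -R, -N <size>; reject anything else. */
    void setOptions(String[] options) {
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    stemming = true;
                    break;
                case "-R":
                    removeStopwords = true;
                    break;
                case "-N":
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("-N requires a value");
                    }
                    gramSize = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
    }

    /** Return the current settings in a form that setOptions accepts. */
    String[] getOptions() {
        List<String> options = new ArrayList<>();
        if (stemming) {
            options.add("-S");
        }
        if (removeStopwords) {
            options.add("-R");
        }
        options.add("-N");
        options.add(Integer.toString(gramSize));
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        OptionSketch sketch = new OptionSketch();
        sketch.setOptions(new String[] {"-S", "-N", "3"});
        System.out.println(String.join(" ", sketch.getOptions()));
    }
}
```

Feeding the array returned by getOptions back into setOptions on a fresh object should reproduce the same settings, which is the contract the getOptions description implies.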

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.