weka.deduping.metrics
Class VectorSpaceMetric

java.lang.Object
  extended by weka.deduping.metrics.StringMetric
      extended by weka.deduping.metrics.VectorSpaceMetric
All Implemented Interfaces:
java.lang.Cloneable, DataDependentStringMetric, OptionHandler, java.io.Serializable

public class VectorSpaceMetric
extends StringMetric
implements DataDependentStringMetric, OptionHandler, java.io.Serializable

This class uses a vector space to calculate the similarity between two strings. Some code is borrowed from the ir.vsr package by Raymond J. Mooney.

See Also:
Serialized Form

Field Summary
static int CONVERSION_EXPONENTIAL
           
static int CONVERSION_LAPLACIAN
          We can have different ways of converting from similarity to distance
static int CONVERSION_UNIT
           
protected  int m_conversionType
          The method of converting from similarity to distance; CONVERSION_LAPLACIAN by default
protected  java.util.HashMap m_stringRefHash
          Strings are mapped to StringReferences in this hash
 java.util.ArrayList m_stringRefs
          A list of all indexed strings.
protected  java.util.HashMap m_tokenHash
          A HashMap where tokens are indexed.
protected  Tokenizer m_tokenizer
          An underlying tokenizer that is used for converting strings into HashMapVectors
protected  boolean m_useIDF
          Should IDF weighting be used?
static Tag[] TAGS_CONVERSION
           
 
Constructor Summary
VectorSpaceMetric()
          Construct a vector space metric. The constructor takes no examples; the vector space is built by buildMetric(java.util.List).
 
Method Summary
 void buildMetric(java.util.List strings)
          Given a list of strings, build the vector space
 java.lang.Object clone()
          Create a copy of this metric
protected  void computeIDFandStringLengths()
          Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
 double distance(java.lang.String string1, java.lang.String string2)
          Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)
 SelectedTag getConversionType()
          Return the type of similarity-to-distance conversion
 java.lang.String[] getOptions()
          Gets the current settings of this metric.
 Tokenizer getTokenizer()
          Get the tokenizer to use
 boolean getUseIDF()
          Check whether IDF weighting is on or off
protected  void indexString(java.lang.String string, HashMapVector vector)
          Index a given string using its corresponding vector
protected  void indexToken(java.lang.String token, int count, StringReference strRef)
          Add a token occurrence to the index.
 boolean isDistanceBased()
          The computation of a metric can be based either on distance or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setConversionType(SelectedTag conversionType)
          Set the type of similarity to distance conversion.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setTokenizer(Tokenizer tokenizer)
          Set the tokenizer to use
 void setUseIDF(boolean useIDF)
          Turn IDF weighting on/off
 double similarity(java.lang.String s1, java.lang.String s2)
          Compute similarity between two strings
 int size()
          Return the number of tokens indexed.
 
Methods inherited from class weka.deduping.metrics.StringMetric
forName
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_stringRefHash

protected java.util.HashMap m_stringRefHash
Strings are mapped to StringReferences in this hash


m_tokenHash

protected java.util.HashMap m_tokenHash
A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.


m_stringRefs

public java.util.ArrayList m_stringRefs
A list of all indexed strings. Elements are StringReference objects.


m_tokenizer

protected Tokenizer m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors


m_useIDF

protected boolean m_useIDF
Should IDF weighting be used?


CONVERSION_LAPLACIAN

public static final int CONVERSION_LAPLACIAN
We can have different ways of converting from similarity to distance

See Also:
Constant Field Values

CONVERSION_UNIT

public static final int CONVERSION_UNIT
See Also:
Constant Field Values

CONVERSION_EXPONENTIAL

public static final int CONVERSION_EXPONENTIAL
See Also:
Constant Field Values

TAGS_CONVERSION

public static final Tag[] TAGS_CONVERSION

m_conversionType

protected int m_conversionType
The method of converting from similarity to distance; CONVERSION_LAPLACIAN by default

Constructor Detail

VectorSpaceMetric

public VectorSpaceMetric()
Construct a vector space metric. The constructor takes no examples; the vector space is built by buildMetric(java.util.List).

Method Detail

buildMetric

public void buildMetric(java.util.List strings)
                 throws java.lang.Exception
Given a list of strings, build the vector space

Specified by:
buildMetric in interface DataDependentStringMetric
Parameters:
strings - a list of strings that the metric is built on
Throws:
java.lang.Exception
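
To illustrate the first step of building a vector space, the sketch below turns each input string into a token-count vector. It is a minimal stand-in, not this class's implementation: BuildSketch, toCountVector and build are hypothetical names, whitespace splitting stands in for the configured Tokenizer, and a plain Map stands in for HashMapVector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BuildSketch {
    /** Counts whitespace-separated tokens (assumption: the real Tokenizer may differ). */
    static Map<String, Integer> toCountVector(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tok : s.trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum); // increment the count for this token
            }
        }
        return counts;
    }

    /** Produces one count vector per input string, in input order. */
    static List<Map<String, Integer>> build(List<String> strings) {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String s : strings) {
            vectors.add(toCountVector(s));
        }
        return vectors;
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("data mining data", "text mining")));
    }
}
```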

indexString

protected void indexString(java.lang.String string,
                           HashMapVector vector)
Index a given string using its corresponding vector


indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          StringReference strRef)
Add a token occurrence to the index.

Parameters:
token - The token to index.
count - The number of times it occurs in the document.
strRef - A reference to the String it occurs in.
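
The parameters above describe an inverted index: each token maps to the strings it occurs in, together with a count. A minimal sketch of that structure follows. IndexSketch and Occurrence are hypothetical names; an integer id stands in for StringReference and a list of postings stands in for TokenInfo, whose actual fields are not shown on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexSketch {
    /** One posting: which string the token occurs in, and how often. */
    static final class Occurrence {
        final int stringId; // hypothetical stand-in for StringReference
        final int count;

        Occurrence(int stringId, int count) {
            this.stringId = stringId;
            this.count = count;
        }
    }

    // token -> postings; a stand-in for m_tokenHash (token -> TokenInfo)
    final Map<String, List<Occurrence>> tokenHash = new HashMap<>();

    /** Records that the token occurs count times in the string with the given id. */
    void indexToken(String token, int count, int stringId) {
        tokenHash.computeIfAbsent(token, k -> new ArrayList<>())
                 .add(new Occurrence(stringId, count));
    }

    /** Number of distinct tokens indexed (assumption about what size() reports). */
    int size() {
        return tokenHash.size();
    }
}
```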

computeIDFandStringLengths

protected void computeIDFandStringLengths()
Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
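
As an illustration of the two quantities named above, the sketch below computes an IDF factor per token and the Euclidean length of an IDF-weighted vector. The formula idf(t) = ln(N / df(t)) is a common textbook choice and an assumption here; this page does not state the exact formula or logarithm base the class uses. IdfSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IdfSketch {
    /** idf(t) = ln(N / df(t)), where df(t) is the number of vectors containing t. */
    static Map<String, Double> idf(List<Map<String, Integer>> vectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> v : vectors) {
            for (String tok : v.keySet()) {
                df.merge(tok, 1, Integer::sum); // one document-frequency hit per vector
            }
        }
        Map<String, Double> idf = new HashMap<>();
        double n = vectors.size();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log(n / e.getValue()));
        }
        return idf;
    }

    /** Euclidean length of a count vector after weighting each count by its IDF. */
    static double length(Map<String, Integer> v, Map<String, Double> idf) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : v.entrySet()) {
            double w = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            sumSq += w * w;
        }
        return Math.sqrt(sumSq);
    }
}
```

Under this formula a token that occurs in every string gets an IDF of zero and so contributes nothing to the vector length.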


similarity

public double similarity(java.lang.String s1,
                         java.lang.String s2)
Compute similarity between two strings

Specified by:
similarity in class StringMetric
Parameters:
s1 - first string
s2 - second string
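
Vector space models usually measure similarity as the cosine of the angle between two weight vectors. The sketch below shows that computation. It assumes cosine similarity, which this page does not state explicitly; CosineSketch, norm and cosine are hypothetical names; and returning 0 for an empty vector is a choice made for the sketch only.

```java
import java.util.Map;

class CosineSketch {
    /** Euclidean norm of a sparse weight vector. */
    static double norm(Map<String, Double> v) {
        double sumSq = 0.0;
        for (double w : v.values()) {
            sumSq += w * w;
        }
        return Math.sqrt(sumSq);
    }

    /** Cosine similarity: dot(a, b) / (|a| * |b|). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0; // assumption: an empty vector is similar to nothing
        }
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey()); // only shared tokens contribute
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        return dot / (na * nb);
    }
}
```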

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be based either on distance or on similarity

Specified by:
isDistanceBased in class StringMetric

setTokenizer

public void setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use

Parameters:
tokenizer - the tokenizer that is used

getTokenizer

public Tokenizer getTokenizer()
Get the tokenizer to use

Returns:
the tokenizer that is used

setUseIDF

public void setUseIDF(boolean useIDF)
Turn IDF weighting on/off

Parameters:
useIDF - if true, all token weights will be weighted by IDF

getUseIDF

public boolean getUseIDF()
Check whether IDF weighting is on or off

Returns:
if true, all token weights are weighted by IDF

size

public int size()
Return the number of tokens indexed.

Returns:
the number of tokens indexed

distance

public double distance(java.lang.String string1,
                       java.lang.String string2)
                throws java.lang.Exception
Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)

Specified by:
distance in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if distance could not be estimated.
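
This page names three conversion types but does not give their formulas. The sketch below shows one plausible mapping from a similarity s to a distance for each type. The formulas and the constant values are assumptions made for illustration, and ConversionSketch and toDistance are hypothetical names.

```java
class ConversionSketch {
    // Hypothetical constant values; the real ones are listed under Constant Field Values.
    static final int LAPLACIAN = 1;
    static final int UNIT = 2;
    static final int EXPONENTIAL = 4;

    /** Converts a similarity into a distance under the given conversion type. */
    static double toDistance(double sim, int type) {
        switch (type) {
            case LAPLACIAN:
                return 1.0 / (1.0 + sim);   // assumed form: 1 / (1 + s)
            case UNIT:
                return 2.0 * (1.0 - sim);   // assumed form: 2 * (1 - s)
            case EXPONENTIAL:
                return Math.exp(-sim);      // assumed form: e^(-s)
            default:
                throw new IllegalArgumentException("Unknown conversion type: " + type);
        }
    }
}
```

In all three assumed forms, a larger similarity yields a smaller distance.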

setConversionType

public void setConversionType(SelectedTag conversionType)
Set the type of similarity to distance conversion. Values other than CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL will be ignored


getConversionType

public SelectedTag getConversionType()
Return the type of similarity-to-distance conversion

Returns:
one of CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL

clone

public java.lang.Object clone()
Create a copy of this metric

Specified by:
clone in class StringMetric
Returns:
another VectorSpaceMetric with exactly the same parameters as this metric

getOptions

public java.lang.String[] getOptions()
Gets the current settings of this metric.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-S use stemming
-R remove stopwords
-N gram size

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.