weka.deduping.metrics
Class JaccardMetric

java.lang.Object
  extended by weka.deduping.metrics.StringMetric
      extended by weka.deduping.metrics.JaccardMetric
All Implemented Interfaces:
java.lang.Cloneable, DataDependentStringMetric, OptionHandler, java.io.Serializable

public class JaccardMetric
extends StringMetric
implements DataDependentStringMetric, OptionHandler, java.io.Serializable

This class calculates the similarity between two strings using the Jaccard metric. Some code is borrowed from the ir.vsr package by Raymond J. Mooney.

See Also:
Serialized Form

Field Summary
static int CONVERSION_EXPONENTIAL
           
static int CONVERSION_LAPLACIAN
          We can have different ways of converting from similarity to distance
static int CONVERSION_UNIT
           
protected  int m_conversionType
          The similarity-to-distance conversion method; CONVERSION_LAPLACIAN by default
protected  java.util.HashMap m_stringRefHash
          Strings are mapped to StringReferences in this hash
 java.util.ArrayList m_stringRefs
          A list of all indexed strings.
protected  java.util.HashMap m_tokenHash
          A HashMap where tokens are indexed.
protected  Tokenizer m_tokenizer
          An underlying tokenizer that is used for converting strings into HashMapVectors
static Tag[] TAGS_CONVERSION
           
 
Constructor Summary
JaccardMetric()
          Construct a vector space from a given set of examples
 
Method Summary
 void buildMetric(java.util.List strings)
          Given a list of strings, build the vector space
 java.lang.Object clone()
          Create a copy of this metric
 double distance(java.lang.String string1, java.lang.String string2)
          Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)
 double distance1(java.lang.String s1, java.lang.String s2)
           
 SelectedTag getConversionType()
          Return the type of similarity-to-distance conversion
 java.lang.String[] getOptions()
          Gets the current settings of NGramTokenizer.
 Tokenizer getTokenizer()
          Get the tokenizer to use
protected  void indexString(java.lang.String string, HashMapVector vector)
          Index a given string using its corresponding vector
protected  void indexToken(java.lang.String token, int count, StringReference strRef)
          Add a token occurrence to the index.
 boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setConversionType(SelectedTag conversionType)
          Set the type of similarity to distance conversion.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setTokenizer(Tokenizer tokenizer)
          Set the tokenizer to use
 double similarity(java.lang.String s1, java.lang.String s2)
          Compute similarity between two strings
 int size()
          Return the number of tokens indexed.
 
Methods inherited from class weka.deduping.metrics.StringMetric
forName
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_stringRefHash

protected java.util.HashMap m_stringRefHash
Strings are mapped to StringReferences in this hash


m_tokenHash

protected java.util.HashMap m_tokenHash
A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.


m_stringRefs

public java.util.ArrayList m_stringRefs
A list of all indexed strings. Elements are StringReference objects.


m_tokenizer

protected Tokenizer m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors


CONVERSION_LAPLACIAN

public static final int CONVERSION_LAPLACIAN
We can have different ways of converting from similarity to distance

See Also:
Constant Field Values

CONVERSION_UNIT

public static final int CONVERSION_UNIT
See Also:
Constant Field Values

CONVERSION_EXPONENTIAL

public static final int CONVERSION_EXPONENTIAL
See Also:
Constant Field Values

TAGS_CONVERSION

public static final Tag[] TAGS_CONVERSION

m_conversionType

protected int m_conversionType
The similarity-to-distance conversion method; CONVERSION_LAPLACIAN by default

Constructor Detail

JaccardMetric

public JaccardMetric()
Construct a vector space from a given set of examples

Method Detail

buildMetric

public void buildMetric(java.util.List strings)
                 throws java.lang.Exception
Given a list of strings, build the vector space

Specified by:
buildMetric in interface DataDependentStringMetric
Parameters:
strings - a list of strings that the metric is built on
Throws:
java.lang.Exception

indexString

protected void indexString(java.lang.String string,
                           HashMapVector vector)
Index a given string using its corresponding vector


indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          StringReference strRef)
Add a token occurrence to the index.

Parameters:
token - The token to index.
count - The number of times it occurs in the document.
strRef - A reference to the String it occurs in.
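
The indexing step can be pictured as an inverted index from tokens to the strings that contain them. The following is a minimal sketch under stated assumptions, not the class's actual implementation: TokenIndexSketch is a hypothetical name, integer ids stand in for StringReference objects, a nested map stands in for TokenInfo, and whitespace splitting stands in for the configured Tokenizer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the token index; not the Weka class. */
final class TokenIndexSketch {

    // token -> (id of the string it occurs in -> occurrence count).
    // Assumption: the real class stores TokenInfo and StringReference objects instead.
    private final Map<String, Map<Integer, Integer>> tokenHash = new HashMap<>();

    /** Adds a token occurrence to the index. */
    void indexToken(String token, int count, int stringId) {
        tokenHash.computeIfAbsent(token, k -> new HashMap<>())
                 .merge(stringId, count, Integer::sum);
    }

    /** Counts whitespace-separated tokens in the text and indexes each one. */
    void indexString(int stringId, String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            indexToken(entry.getKey(), entry.getValue(), stringId);
        }
    }

    /** Number of distinct tokens indexed. */
    int size() {
        return tokenHash.size();
    }

    /** Occurrences of a token in a given string, or 0 if absent. */
    int count(String token, int stringId) {
        Map<Integer, Integer> occurrences = tokenHash.get(token);
        return occurrences == null ? 0 : occurrences.getOrDefault(stringId, 0);
    }
}
```

In this sketch, size() counts distinct tokens rather than indexed strings, which matches the description of size() on this page.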

similarity

public double similarity(java.lang.String s1,
                         java.lang.String s2)
Compute similarity between two strings

Specified by:
similarity in class StringMetric
Parameters:
s1 - first string
s2 - second string
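
For orientation, the Jaccard coefficient of two token sets is the size of their intersection divided by the size of their union. The sketch below illustrates that definition only. JaccardSimilaritySketch is a hypothetical name, the whitespace tokenizer is an assumption (the real class delegates to its configured Tokenizer), and returning 0 for two empty strings is a convention chosen here.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the metric; not the Weka class. */
final class JaccardSimilaritySketch {

    /** Assumed tokenizer: lower-case, then split on whitespace. */
    static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>();
        for (String token : s.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Returns |A intersect B| / |A union B| over the two token sets. */
    static double similarity(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

With these assumptions, "data mining tools" and "data mining" share two of three distinct tokens, so the similarity is 2/3.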

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity

Specified by:
isDistanceBased in class StringMetric

setTokenizer

public void setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use

Parameters:
tokenizer - the tokenizer that is used

getTokenizer

public Tokenizer getTokenizer()
Get the tokenizer to use

Returns:
the tokenizer that is used

size

public int size()
Return the number of tokens indexed.

Returns:
the number of tokens indexed

distance1

public double distance1(java.lang.String s1,
                        java.lang.String s2)
                 throws java.lang.Exception
Throws:
java.lang.Exception

distance

public double distance(java.lang.String string1,
                       java.lang.String string2)
                throws java.lang.Exception
Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...)

Specified by:
distance in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if distance could not be estimated.
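
This page does not state the conversion formulas, so the following is only a sketch of how a similarity in [0, 1] might be turned into a distance under each conversion type. The constants, their values, and all three formulas are assumptions made for illustration; consult the Constant Field Values page and the source for the actual definitions.

```java
/** Hypothetical similarity-to-distance conversions; not the Weka class. */
final class ConversionSketch {

    // Assumed placeholder values, not the real CONVERSION_* constants.
    static final int SKETCH_LAPLACIAN = 1;
    static final int SKETCH_UNIT = 2;
    static final int SKETCH_EXPONENTIAL = 4;

    /** Maps a similarity to a distance that shrinks as similarity grows. */
    static double toDistance(double similarity, int conversionType) {
        switch (conversionType) {
            case SKETCH_LAPLACIAN:
                return 1.0 / (1.0 + similarity);   // assumed formula
            case SKETCH_UNIT:
                return 1.0 - similarity;           // assumed: the usual Jaccard distance
            case SKETCH_EXPONENTIAL:
                return Math.exp(-similarity);      // assumed formula
            default:
                throw new IllegalArgumentException(
                        "Unknown conversion type: " + conversionType);
        }
    }
}
```

All three assumed formulas decrease monotonically in the similarity, so they rank string pairs identically and differ only in scale.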

setConversionType

public void setConversionType(SelectedTag conversionType)
Set the type of similarity to distance conversion. Values other than CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL will be ignored


getConversionType

public SelectedTag getConversionType()
Return the type of similarity-to-distance conversion

Returns:
one of CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL

clone

public java.lang.Object clone()
Create a copy of this metric

Specified by:
clone in class StringMetric
Returns:
another JaccardMetric with the same exact parameters as this metric

getOptions

public java.lang.String[] getOptions()
Gets the current settings of NGramTokenizer.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-S use stemming
-R remove stopwords
-N gram size

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
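
As a rough illustration of how an option array of this shape could be interpreted, the sketch below parses the three flags by hand. OptionParseSketch and its fields are hypothetical, the default gram size of 1 is an assumption, and the real method may well delegate to Weka's own option utilities instead.

```java
/** Hypothetical parser for the -S, -R and -N options; not the Weka class. */
final class OptionParseSketch {

    boolean stemming;         // set by -S
    boolean removeStopwords;  // set by -R
    int gramSize = 1;         // set by -N <size>; the default is an assumption

    static OptionParseSketch parse(String[] options) {
        OptionParseSketch parsed = new OptionParseSketch();
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    parsed.stemming = true;
                    break;
                case "-R":
                    parsed.removeStopwords = true;
                    break;
                case "-N":
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("-N requires a value");
                    }
                    parsed.gramSize = Integer.parseInt(options[++i]);
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Unsupported option: " + options[i]);
            }
        }
        return parsed;
    }
}
```

For example, parsing the array {"-S", "-N", "3"} in this sketch enables stemming, leaves stopword removal off, and sets the gram size to 3.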

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.