weka.deduping.metrics
Class AffineMetric

java.lang.Object
  extended by weka.deduping.metrics.StringMetric
      extended by weka.deduping.metrics.AffineMetric
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, java.io.Serializable

public class AffineMetric
extends StringMetric
implements OptionHandler, java.io.Serializable

A measure of distance between two strings based on edit distance with affine gap costs, in which opening a gap and extending a gap are charged separately. See D. Gusfield, "Algorithms on Strings, Trees and Sequences", Cambridge University Press, 1997.

See Also:
Serialized Form

Field Summary
static int CONVERSION_EXPONENTIAL
           
static int CONVERSION_LAPLACIAN
          We can have different ways of converting from distance to similarity
static int CONVERSION_UNIT
           
protected  int m_conversionType
          The method of converting, by default laplacian
protected  double m_gapExtendCost
          The cost of continuing a gap
protected  double m_gapStartCost
          The cost of opening a gap
protected  double m_matchCost
          The cost of matching two characters
protected  boolean m_normalized
          Should the distance be normalized by the lengths of the strings?
protected  double m_subCost
          The cost of substituting one character for another
static Tag[] TAGS_CONVERSION
           
 
Constructor Summary
AffineMetric()
          A default constructor that assigns the name of this distance
 
Method Summary
 java.lang.Object clone()
          Create a copy of this metric
 double distance(java.lang.String string1, java.lang.String string2)
          Obtain the distance between two strings
 double getGapExtendCost()
          Get the gap extension cost
 double getGapStartCost()
          Get the gap opening cost
 double getMatchCost()
          Get the match cost
 boolean getNormalized()
          Get whether the distance is normalized by the sum of the strings' lengths
 java.lang.String[] getOptions()
          Gets the current settings of AffineMetric.
 double getSubCost()
          Get the substitution cost
 boolean isDataDependent()
          A metric can be data-dependent (e.g. vector space for IDF)
 boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setGapExtendCost(double gapExtendCost)
          Set the gap extension cost
 void setGapStartCost(double gapStartCost)
          Set the gap opening cost
 void setMatchCost(double matchCost)
          Set the match cost
 void setNormalized(boolean normalized)
          Set the distance to be normalized by the sum of the strings' lengths
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSubCost(double subCost)
          Set the substitution cost
 double similarity(java.lang.String string1, java.lang.String string2)
          Returns a similarity estimate between two strings.
 
Methods inherited from class weka.deduping.metrics.StringMetric
forName
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_matchCost

protected double m_matchCost
The cost of matching two characters


m_subCost

protected double m_subCost
The cost of substituting one character for another


m_gapStartCost

protected double m_gapStartCost
The cost of opening a gap


m_gapExtendCost

protected double m_gapExtendCost
The cost of continuing a gap


m_normalized

protected boolean m_normalized
Should the distance be normalized by the lengths of the strings?


CONVERSION_LAPLACIAN

public static final int CONVERSION_LAPLACIAN
We can have different ways of converting from distance to similarity

See Also:
Constant Field Values

CONVERSION_UNIT

public static final int CONVERSION_UNIT
See Also:
Constant Field Values

CONVERSION_EXPONENTIAL

public static final int CONVERSION_EXPONENTIAL
See Also:
Constant Field Values

TAGS_CONVERSION

public static final Tag[] TAGS_CONVERSION

m_conversionType

protected int m_conversionType
The method of converting, by default laplacian

Constructor Detail

AffineMetric

public AffineMetric()
A default constructor that assigns the name of this distance

Method Detail

isDataDependent

public boolean isDataDependent()
A metric can be data-dependent (e.g. vector space for IDF)


distance

public double distance(java.lang.String string1,
                       java.lang.String string2)
                throws java.lang.Exception
Obtain the distance between two strings

Specified by:
distance in class StringMetric
Parameters:
string1 - first string
string2 - second string
Throws:
java.lang.Exception
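
The page does not spell out how the distance is computed. As a rough illustration only, the following standalone sketch shows one common way to compute an affine-gap alignment cost with three dynamic-programming tables (a Gotoh-style recurrence). The class name AffineGapSketch, the table names, and the convention that a gap of length k costs gapStartCost + (k - 1) * gapExtendCost are assumptions made for this example; they are not taken from the AffineMetric source, which may differ.

```java
import java.util.Arrays;

/** Standalone illustration of affine-gap alignment cost; not the AffineMetric source. */
class AffineGapSketch {

    /**
     * Minimum-cost alignment of a and b. Assumed convention: a gap of
     * length k costs gapStartCost + (k - 1) * gapExtendCost.
     */
    static double distance(String a, String b, double matchCost, double subCost,
                           double gapStartCost, double gapExtendCost) {
        final double INF = Double.POSITIVE_INFINITY;
        int n = a.length();
        int m = b.length();

        // match[i][j]:  best cost when a[i-1] is aligned with b[j-1]
        // gapInB[i][j]: best cost when a[i-1] is aligned with a gap
        // gapInA[i][j]: best cost when b[j-1] is aligned with a gap
        double[][] match = new double[n + 1][m + 1];
        double[][] gapInB = new double[n + 1][m + 1];
        double[][] gapInA = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++) {
            Arrays.fill(match[i], INF);
            Arrays.fill(gapInB[i], INF);
            Arrays.fill(gapInA[i], INF);
        }

        // Borders: an empty prefix can only be reached through a single gap.
        match[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            gapInB[i][0] = gapStartCost + (i - 1) * gapExtendCost;
        }
        for (int j = 1; j <= m; j++) {
            gapInA[0][j] = gapStartCost + (j - 1) * gapExtendCost;
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double pair = a.charAt(i - 1) == b.charAt(j - 1) ? matchCost : subCost;
                match[i][j] = pair + min3(match[i - 1][j - 1],
                                          gapInB[i - 1][j - 1],
                                          gapInA[i - 1][j - 1]);
                // Extending an existing gap is charged gapExtendCost;
                // switching into a gap from any other state is charged gapStartCost.
                gapInB[i][j] = min3(match[i - 1][j] + gapStartCost,
                                    gapInB[i - 1][j] + gapExtendCost,
                                    gapInA[i - 1][j] + gapStartCost);
                gapInA[i][j] = min3(match[i][j - 1] + gapStartCost,
                                    gapInA[i][j - 1] + gapExtendCost,
                                    gapInB[i][j - 1] + gapStartCost);
            }
        }
        return min3(match[n][m], gapInB[n][m], gapInA[n][m]);
    }

    private static double min3(double x, double y, double z) {
        return Math.min(x, Math.min(y, z));
    }

    public static void main(String[] args) {
        // "abcd" vs "ad": two matches plus one gap of length two.
        System.out.println(distance("abcd", "ad", 0.0, 1.0, 2.0, 0.5));
    }
}
```

With the costs used in main, the cheapest alignment deletes "bc" as a single gap, so the gap is charged once for opening and once for extending rather than twice for opening.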

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity

Specified by:
isDistanceBased in class StringMetric

similarity

public double similarity(java.lang.String string1,
                         java.lang.String string2)
                  throws java.lang.Exception
Returns a similarity estimate between two strings. Similarity is obtained by inverting the distance value using one of three methods: CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT.

Specified by:
similarity in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if similarity could not be estimated.
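
The three conversion methods are named here but their formulas are not given. The sketch below shows formulas that are commonly associated with these names: 1 / (1 + d) for Laplacian, exp(-d) for exponential, and 1 - d for unit. Both the formulas and the names ConversionSketch, LAPLACIAN, EXPONENTIAL and UNIT are assumptions for illustration; the constants do not mirror the values of the CONVERSION_* fields, and the actual AffineMetric conversions may differ.

```java
/** Standalone illustration of distance-to-similarity conversions; formulas are assumed. */
class ConversionSketch {
    // Hypothetical selectors; not the values of AffineMetric.CONVERSION_*.
    static final int LAPLACIAN = 0;
    static final int EXPONENTIAL = 1;
    static final int UNIT = 2;

    static double toSimilarity(double distance, int conversionType) {
        switch (conversionType) {
            case LAPLACIAN:
                return 1.0 / (1.0 + distance);   // maps [0, inf) onto (0, 1]
            case EXPONENTIAL:
                return Math.exp(-distance);      // maps [0, inf) onto (0, 1]
            case UNIT:
                return 1.0 - distance;           // meaningful only if distance is in [0, 1]
            default:
                throw new IllegalArgumentException(
                    "Unknown conversion type: " + conversionType);
        }
    }

    public static void main(String[] args) {
        System.out.println(toSimilarity(1.0, LAPLACIAN));
    }
}
```

Under these assumed formulas, the unit conversion is only sensible when the distance has already been scaled into [0, 1], for example by length normalization.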

setMatchCost

public void setMatchCost(double matchCost)
Set the match cost

Parameters:
matchCost - the cost of finding a matching pair of characters

getMatchCost

public double getMatchCost()
Get the match cost


setSubCost

public void setSubCost(double subCost)
Set the substitution cost

Parameters:
subCost - the cost of substituting one character for another

getSubCost

public double getSubCost()
Get the substitution cost


setGapStartCost

public void setGapStartCost(double gapStartCost)
Set the gap opening cost

Parameters:
gapStartCost - the cost of opening a gap

getGapStartCost

public double getGapStartCost()
Get the gap opening cost


setGapExtendCost

public void setGapExtendCost(double gapExtendCost)
Set the gap extension cost

Parameters:
gapExtendCost - the cost of extending a gap

getGapExtendCost

public double getGapExtendCost()
Get the gap extension cost


setNormalized

public void setNormalized(boolean normalized)
Set the distance to be normalized by the sum of the strings' lengths

Parameters:
normalized - if true, distance is normalized by the sum of the strings' lengths

getNormalized

public boolean getNormalized()
Get whether the distance is normalized by the sum of the strings' lengths

Returns:
true if the distance is normalized by the sum of the strings' lengths
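
Normalizing by the sum of the two lengths can be pictured as a single division. The helper below is a minimal sketch under that reading; the name NormalizationSketch and the choice to return 0 for two empty strings are assumptions, not documented behavior of AffineMetric.

```java
/** Standalone illustration of length normalization; not the AffineMetric source. */
class NormalizationSketch {

    /** Divides a raw distance by the combined length of the two strings. */
    static double normalize(double rawDistance, String string1, String string2) {
        int totalLength = string1.length() + string2.length();
        // Assumed convention: two empty strings are at distance 0.
        return totalLength == 0 ? 0.0 : rawDistance / totalLength;
    }

    public static void main(String[] args) {
        System.out.println(normalize(3.0, "abc", "abd"));
    }
}
```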

clone

public java.lang.Object clone()
Create a copy of this metric

Specified by:
clone in class StringMetric
Returns:
another AffineMetric with exactly the same parameters as this metric

getOptions

public java.lang.String[] getOptions()
Gets the current settings of AffineMetric.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-N
normalize by length
-m matchCost
-s subCost
-g gapStartCost
-e gapExtendCost

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
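
To make the option syntax concrete, the following standalone sketch parses the same five flags into a small holder object. It is a plain-Java stand-in written for this example: the class OptionSketch, its fields, and its error handling are hypothetical, the zero defaults are placeholders rather than the library's defaults, and the real setOptions implementation may parse its arguments differently.

```java
/** Standalone illustration of parsing the -N -m -s -g -e flags; not the AffineMetric source. */
class OptionSketch {
    // Placeholder defaults; the library's actual defaults are not documented here.
    boolean normalized = false;
    double matchCost = 0.0;
    double subCost = 0.0;
    double gapStartCost = 0.0;
    double gapExtendCost = 0.0;

    static OptionSketch parse(String[] options) {
        OptionSketch result = new OptionSketch();
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-N":
                    result.normalized = true;   // flag without a value
                    break;
                case "-m":
                    result.matchCost = Double.parseDouble(valueAt(options, ++i));
                    break;
                case "-s":
                    result.subCost = Double.parseDouble(valueAt(options, ++i));
                    break;
                case "-g":
                    result.gapStartCost = Double.parseDouble(valueAt(options, ++i));
                    break;
                case "-e":
                    result.gapExtendCost = Double.parseDouble(valueAt(options, ++i));
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
        return result;
    }

    /** Returns the value following a flag, or fails if the array ends early. */
    private static String valueAt(String[] options, int index) {
        if (index >= options.length) {
            throw new IllegalArgumentException(
                "Missing value after " + options[index - 1]);
        }
        return options[index];
    }

    public static void main(String[] args) {
        OptionSketch parsed =
            parse(new String[] {"-N", "-m", "0", "-s", "1", "-g", "2", "-e", "0.5"});
        System.out.println(parsed.normalized + " " + parsed.gapExtendCost);
    }
}
```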

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.