weka.deduping.metrics
Class AffineProbMetric

java.lang.Object
  extended by weka.deduping.metrics.StringMetric
      extended by weka.deduping.metrics.AffineProbMetric
All Implemented Interfaces:
java.lang.Cloneable, LearnableStringMetric, OptionHandler, java.io.Serializable

public class AffineProbMetric
extends StringMetric
implements LearnableStringMetric, java.io.Serializable, OptionHandler

AffineProbMetric implements a probabilistic string edit distance model with affine-cost gaps.

See Also:
Serialized Form

Field Summary
protected  char blank
          A handy constant for insertions/deletions, we treat them as substitution with a null character
static int CONVERSION_EXPONENTIAL
           
static int CONVERSION_LAPLACIAN
          We can have different ways of converting from distance to similarity
static int CONVERSION_UNIT
           
protected  double m_clampProb
          Minimal value of a probability parameter.
protected  int m_conversionType
          The method of converting, by default laplacian
protected  double[][] m_editopCosts
          parameters for the additive model, obtained from log-probs to speed up computations in the "testing" phase after weights have been learned
protected  double[][] m_editopLogProbs
           
protected  double[][] m_editopOccs
           
protected  double[][] m_editopProbs
           
protected  double m_endAtGapCost
           
protected  double m_endAtGapLogProb
           
protected  double m_endAtGapOccs
           
protected  double m_endAtGapProb
           
protected  double m_endAtSubCost
           
protected  double m_endAtSubLogProb
           
protected  double m_endAtSubOccs
           
protected  double m_endAtSubProb
           
protected  double m_gapEndCost
           
protected  double m_gapEndLogProb
           
protected  double m_gapEndOccs
           
protected  double m_gapEndProb
           
protected  double m_gapExtendCost
           
protected  double m_gapExtendLogProb
           
protected  double m_gapExtendOccs
           
protected  double m_gapExtendProb
           
protected  double m_gapStartCost
           
protected  double m_gapStartLogProb
           
protected  double m_gapStartOccs
           
protected  double m_gapStartProb
           
protected  double m_noopCost
           
protected  double m_noopLogProb
          Parameters for the generative model
protected  double m_noopOccs
          Parameters for the generative model
protected  double m_noopProb
          Parameters for the generative model
protected  boolean m_normalized
          Normalization of edit distance by string length; equivalent to using the posterior probability in the generative model
protected  int m_numIterations
          Maximum number of iterations for training the model; training usually converges in fewer than 10 iterations
protected  double m_subCost
           
protected  double m_subLogProb
           
protected  double m_subOccs
           
protected  double m_subProb
           
protected  char[] m_usedChars
          TODO: given a corpus, populate this array with the characters that are actually encountered
protected  boolean m_useGenerativeModel
          true if we are using a generative model for distance in the "testing" phase after learning the parameters. By default we want to use the additive model that uses probabilities converted to costs.
protected  boolean m_verbose
           
static Tag[] TAGS_CONVERSION
           
 
Constructor Summary
AffineProbMetric()
          set up an instance of AffineProbMetric
 
Method Summary
protected  double[][][] backward(java.lang.String _s1, java.lang.String _s2)
          Calculate the backward matrices
 java.lang.Object clone()
          Create a copy of this metric
 double costDistance(java.lang.String string1, java.lang.String string2)
          Calculate affine gapped distance using learned costs
 double distance(java.lang.String s1, java.lang.String s2)
          Get the distance between two strings
protected  double expectationStep(java.lang.String _s1, java.lang.String _s2, int lambda, boolean pos_training)
          Expectation part of the EM algorithm; accumulates expectations of editop probabilities over example pairs. The expectation is calculated from two examples which are either duplicates (pos_training=true) or non-duplicates (pos_training=false).
protected  double[][][] forward(java.lang.String _s1, java.lang.String _s2)
          Calculate the forward matrices
 double getClampProb()
          Get the clamping probability value
 SelectedTag getConversionType()
          return the type of similarity to distance conversion
 boolean getNormalized()
          Get whether the distance is normalized by the sum of the strings' lengths
 java.lang.String[] getOptions()
          Gets the current settings of AffineProbMetric.
 boolean getUseGenerativeModel()
          Do we use the generative model or convert back to the additive model?
protected  void initCosts()
          initialize the costs using current values of the probabilities
protected  void initProbs()
          initialize the probabilities to some startup values
 boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
protected  double logSum(double _logA, double _logB)
          Calculation of log(a+b) with a correction for machine precision
static void main(java.lang.String[] args)
           
protected  void maximizationStep()
          Maximization step of the EM algorithm
protected  void normalizeEmissionProbs()
          Normalize the probabilities of emission editops so that they sum to 1 for each state
protected  void normalizeTransitionProbs()
          Normalize the probabilities of transitions so that they sum to 1 for each state
static void print3dMatrix(double[][][] matrix)
           
 void printAlignmentMatrix(java.lang.String _s1, java.lang.String _s2, int idx, double[][][] matrix)
           
 void printMatrices(java.lang.String s1, java.lang.String s2)
          print out the three matrices
protected  void printOpProbs()
          print out some data in case things go wrong
protected  void resetOccurrences()
          reset the number of occurrences of all ops in the set
 void setClampProb(double clampProb)
          Set the clamping probability value
 void setConversionType(SelectedTag conversionType)
          Set the type of similarity to distance conversion.
 void setNormalized(boolean normalized)
          Set the distance to be normalized by the sum of the strings' lengths
 int setNumIterations()
          Get the number of training iterations
 void setNumIterations(int numIterations)
          Set the number of training iterations
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setUseGenerativeModel(boolean useGenerativeModel)
          Set the distance to use the generative model or convert back to the additive model
 double similarity(java.lang.String string1, java.lang.String string2)
          Returns a similarity estimate between two strings.
 void trainMetric(java.util.ArrayList pairList)
          Train the distance parameters using provided examples using EM
protected  void updateLogProbs()
          store logs of all probabilities in m_editopLogProbs
 
Methods inherited from class weka.deduping.metrics.StringMetric
forName
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_editopProbs

protected double[][] m_editopProbs

m_editopLogProbs

protected double[][] m_editopLogProbs

m_editopOccs

protected double[][] m_editopOccs

m_noopProb

protected double m_noopProb
Parameters for the generative model


m_noopLogProb

protected double m_noopLogProb
Parameters for the generative model


m_noopOccs

protected double m_noopOccs
Parameters for the generative model


m_endAtSubProb

protected double m_endAtSubProb

m_endAtSubLogProb

protected double m_endAtSubLogProb

m_endAtSubOccs

protected double m_endAtSubOccs

m_endAtGapProb

protected double m_endAtGapProb

m_endAtGapLogProb

protected double m_endAtGapLogProb

m_endAtGapOccs

protected double m_endAtGapOccs

m_gapStartProb

protected double m_gapStartProb

m_gapStartLogProb

protected double m_gapStartLogProb

m_gapStartOccs

protected double m_gapStartOccs

m_gapExtendProb

protected double m_gapExtendProb

m_gapExtendLogProb

protected double m_gapExtendLogProb

m_gapExtendOccs

protected double m_gapExtendOccs

m_gapEndProb

protected double m_gapEndProb

m_gapEndLogProb

protected double m_gapEndLogProb

m_gapEndOccs

protected double m_gapEndOccs

m_subProb

protected double m_subProb

m_subLogProb

protected double m_subLogProb

m_subOccs

protected double m_subOccs

m_editopCosts

protected double[][] m_editopCosts
parameters for the additive model, obtained from log-probs to speed up computations in the "testing" phase after weights have been learned


m_noopCost

protected double m_noopCost

m_endAtSubCost

protected double m_endAtSubCost

m_endAtGapCost

protected double m_endAtGapCost

m_gapStartCost

protected double m_gapStartCost

m_gapExtendCost

protected double m_gapExtendCost

m_gapEndCost

protected double m_gapEndCost

m_subCost

protected double m_subCost

m_useGenerativeModel

protected boolean m_useGenerativeModel
true if we are using a generative model for distance in the "testing" phase after learning the parameters. By default we want to use the additive model that uses probabilities converted to costs.


m_numIterations

protected int m_numIterations
Maximum number of iterations for training the model; training usually converges in fewer than 10 iterations


m_normalized

protected boolean m_normalized
Normalization of edit distance by string length; equivalent to using the posterior probability in the generative model


m_clampProb

protected double m_clampProb
Minimal value of a probability parameter. Particularly important when training sets are small to prevent zero probabilities.


blank

protected final char blank
A handy constant for insertions/deletions, we treat them as substitution with a null character

See Also:
Constant Field Values

m_usedChars

protected char[] m_usedChars
TODO: given a corpus, populate this array with the characters that are actually encountered


CONVERSION_LAPLACIAN

public static final int CONVERSION_LAPLACIAN
We can have different ways of converting from distance to similarity

See Also:
Constant Field Values

CONVERSION_UNIT

public static final int CONVERSION_UNIT
See Also:
Constant Field Values

CONVERSION_EXPONENTIAL

public static final int CONVERSION_EXPONENTIAL
See Also:
Constant Field Values

TAGS_CONVERSION

public static final Tag[] TAGS_CONVERSION

m_conversionType

protected int m_conversionType
The method of converting, by default laplacian


m_verbose

protected boolean m_verbose

Constructor Detail

AffineProbMetric

public AffineProbMetric()
set up an instance of AffineProbMetric

Method Detail

forward

protected double[][][] forward(java.lang.String _s1,
                               java.lang.String _s2)
Calculate the forward matrices

Parameters:
_s1 - first string
_s2 - second string
Returns:
m_endAtSubProb*matrix[l1][l2][0] + m_endAtGapProb*(matrix[l1][l2][1] + matrix[l1][l2][2]) contains the distance value

backward

protected double[][][] backward(java.lang.String _s1,
                                java.lang.String _s2)
Calculate the backward matrices

Parameters:
_s1 - first string
_s2 - second string
Returns:
matrix[0][0][0] contains the distance value

printMatrices

public void printMatrices(java.lang.String s1,
                          java.lang.String s2)
print out the three matrices


printAlignmentMatrix

public void printAlignmentMatrix(java.lang.String _s1,
                                 java.lang.String _s2,
                                 int idx,
                                 double[][][] matrix)

printOpProbs

protected void printOpProbs()
print out some data in case things go wrong


trainMetric

public void trainMetric(java.util.ArrayList pairList)
                 throws java.lang.Exception
Train the distance parameters using provided examples using EM

Specified by:
trainMetric in interface LearnableStringMetric
Parameters:
pairList - the training data as a list of StringPair's
Throws:
java.lang.Exception

expectationStep

protected double expectationStep(java.lang.String _s1,
                                 java.lang.String _s2,
                                 int lambda,
                                 boolean pos_training)
Expectation part of the EM algorithm; accumulates expectations of editop probabilities over example pairs. The expectation is calculated from two examples which are either duplicates (pos_training=true) or non-duplicates (pos_training=false). Lambda is a weighting parameter, 1 by default.

Parameters:
_s1 - first string
_s2 - second string
lambda - learning rate parameter, 1 by default
pos_training - true if strings are matched, false if mismatched

maximizationStep

protected void maximizationStep()
Maximization step of the EM algorithm


normalizeEmissionProbs

protected void normalizeEmissionProbs()
Normalize the probabilities of emission editops so that they sum to 1 for each state


normalizeTransitionProbs

protected void normalizeTransitionProbs()
Normalize the probabilities of transitions so that they sum to 1 for each state
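
The normalization described above can be illustrated with a minimal sketch. It is not the class's actual code: the class name NormalizeSketch, the method normalize, and the choice to apply the clamp before rescaling are all assumptions made for illustration, with clampProb playing the role that m_clampProb is documented to play (a lower bound that prevents zero probabilities).

```java
final class NormalizeSketch {
    /**
     * Hypothetical helper: raises every entry to at least clampProb
     * (assumed > 0), then rescales so the entries sum to 1.
     */
    static double[] normalize(double[] probs, double clampProb) {
        double[] out = new double[probs.length];
        double sum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            // Clamp first so no outgoing probability of a state is exactly zero.
            out[i] = Math.max(probs[i], clampProb);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] p = normalize(new double[] {2.0, 1.0, 1.0}, 1e-9);
        System.out.println(java.util.Arrays.toString(p));
    }
}
```

Note that after rescaling, a clamped entry can end up slightly below clampProb; whether the real implementation clamps before or after normalizing is not stated in this documentation.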


resetOccurrences

protected void resetOccurrences()
reset the number of occurrences of all ops in the set


initProbs

protected void initProbs()
initialize the probabilities to some startup values


initCosts

protected void initCosts()
initialize the costs using current values of the probabilities
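
One common way to turn probabilities into additive costs is to take negative logarithms, so that multiplying probabilities corresponds to adding costs. The sketch below assumes that convention; the field summary only says the costs are "obtained from log-probs", so the exact formula, the class name ProbToCostSketch, and the method toCost are assumptions rather than the class's actual code.

```java
final class ProbToCostSketch {
    /**
     * Hypothetical conversion: cost = -log(p), with p first raised to at
     * least clampProb (assumed > 0) so the cost stays finite.
     */
    static double toCost(double prob, double clampProb) {
        double p = Math.max(prob, clampProb);
        // Likely operations (p near 1) get costs near 0; rare ones get large costs.
        return -Math.log(p);
    }

    public static void main(String[] args) {
        System.out.println(toCost(0.5, 1e-6));
        System.out.println(toCost(0.0, 1e-6));
    }
}
```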


updateLogProbs

protected void updateLogProbs()
store logs of all probabilities in m_editopLogProbs


distance

public double distance(java.lang.String s1,
                       java.lang.String s2)
Get the distance between two strings

Specified by:
distance in class StringMetric
Parameters:
s1 - first string
s2 - second string
Returns:
a value of this distance between these two strings

logSum

protected double logSum(double _logA,
                        double _logB)
Calculation of log(a+b) with a correction for machine precision
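
A minimal sketch of how log(a+b) can be computed from log(a) and log(b) without leaving log space. This is the standard log-sum-exp identity, not necessarily the exact correction used by this class; the class name LogSumSketch is hypothetical.

```java
final class LogSumSketch {
    /** Returns log(a + b) given logA = log(a) and logB = log(b). */
    static double logSum(double logA, double logB) {
        // log(0) is negative infinity; adding zero leaves the other term unchanged.
        if (logA == Double.NEGATIVE_INFINITY) return logB;
        if (logB == Double.NEGATIVE_INFINITY) return logA;
        double max = Math.max(logA, logB);
        double min = Math.min(logA, logB);
        // log(a + b) = max + log(1 + exp(min - max)).
        // The exponent is <= 0, so exp() cannot overflow.
        return max + Math.log1p(Math.exp(min - max));
    }

    public static void main(String[] args) {
        System.out.println(Math.exp(logSum(Math.log(0.25), Math.log(0.5))));
    }
}
```

Factoring out the larger term matters because the probabilities of long alignments are tiny: exponentiating them directly would underflow to zero, whereas the difference min - max stays in a safe range.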


costDistance

public double costDistance(java.lang.String string1,
                           java.lang.String string2)
Calculate affine gapped distance using learned costs

Returns:
minimum number of deletions/insertions/substitutions to be performed to transform s1 into s2 (or vice versa)
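
For orientation, an affine-gap distance is typically computed with three dynamic-programming matrices, one per alignment state (substitution, gap in the second string, gap in the first string). The sketch below is a simplified illustration under stated assumptions, not this class's code: it uses four hypothetical cost parameters (matchCost, subCost, gapStart, gapExtend), charges gapStart for the first character of a gap and gapExtend for each further one, and omits the gap-end and end-of-alignment costs that the fields above suggest the real model has.

```java
final class AffineGapSketch {
    private static final double INF = Double.POSITIVE_INFINITY;

    static double distance(String s1, String s2, double matchCost,
                           double subCost, double gapStart, double gapExtend) {
        int n = s1.length(), m = s2.length();
        double[][] sub = new double[n + 1][m + 1]; // last op: match/substitution
        double[][] del = new double[n + 1][m + 1]; // last op: s1 char against a gap
        double[][] ins = new double[n + 1][m + 1]; // last op: s2 char against a gap
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                sub[i][j] = del[i][j] = ins[i][j] = INF;
            }
        }
        sub[0][0] = 0.0;
        // Borders: a single gap spanning the whole prefix.
        for (int i = 1; i <= n; i++) del[i][0] = gapStart + (i - 1) * gapExtend;
        for (int j = 1; j <= m; j++) ins[0][j] = gapStart + (j - 1) * gapExtend;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double c = s1.charAt(i - 1) == s2.charAt(j - 1) ? matchCost : subCost;
                sub[i][j] = c + min3(sub[i - 1][j - 1], del[i - 1][j - 1], ins[i - 1][j - 1]);
                // Either open a new gap from another state or extend the current one.
                del[i][j] = Math.min(gapStart + Math.min(sub[i - 1][j], ins[i - 1][j]),
                                     gapExtend + del[i - 1][j]);
                ins[i][j] = Math.min(gapStart + Math.min(sub[i][j - 1], del[i][j - 1]),
                                     gapExtend + ins[i][j - 1]);
            }
        }
        return min3(sub[n][m], del[n][m], ins[n][m]);
    }

    private static double min3(double a, double b, double c) {
        return Math.min(a, Math.min(b, c));
    }

    public static void main(String[] args) {
        System.out.println(distance("abcd", "ad", 0.0, 1.0, 2.0, 0.5));
    }
}
```

With these illustrative costs, deleting the two adjacent characters "bc" is charged as one gap (one start plus one extension) rather than two independent deletions, which is the point of the affine model.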

print3dMatrix

public static void print3dMatrix(double[][][] matrix)

setNormalized

public void setNormalized(boolean normalized)
Set the distance to be normalized by the sum of the strings' lengths

Parameters:
normalized - if true, distance is normalized by the sum of the strings' lengths

getNormalized

public boolean getNormalized()
Get whether the distance is normalized by the sum of the strings' lengths

Returns:
true if the distance is normalized by the sum of the strings' lengths

setUseGenerativeModel

public void setUseGenerativeModel(boolean useGenerativeModel)
Set the distance to use the generative model or convert back to the additive model

Parameters:
useGenerativeModel - if true, the generative model is used

getUseGenerativeModel

public boolean getUseGenerativeModel()
Do we use the generative model or convert back to the additive model?


setClampProb

public void setClampProb(double clampProb)
Set the clamping probability value

Parameters:
clampProb - a lower bound for all probability values to prevent underflow

getClampProb

public double getClampProb()
Get the clamping probability value

Returns:
a lower bound for all probability values to prevent underflow

setNumIterations

public void setNumIterations(int numIterations)
Set the number of training iterations

Parameters:
numIterations - the number of iterations

setNumIterations

public int setNumIterations()
Get the number of training iterations

Returns:
the number of training iterations

clone

public java.lang.Object clone()
Create a copy of this metric

Specified by:
clone in class StringMetric
Returns:
another AffineProbMetric with exactly the same parameters as this metric

getOptions

public java.lang.String[] getOptions()
Gets the current settings of AffineProbMetric.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions() TODO!!!!

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-N normalize by length
-m matchCost
-s subCost
-g gapStartCost
-e gapExtendCost

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options. TODO!!!

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity

Specified by:
isDistanceBased in class StringMetric

similarity

public double similarity(java.lang.String string1,
                         java.lang.String string2)
                  throws java.lang.Exception
Returns a similarity estimate between two strings. Similarity is obtained by inverting the distance value using one of three methods: CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT.

Specified by:
similarity in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if similarity could not be estimated.
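
This page names the three conversion methods but does not give their formulas. The sketch below shows one plausible form for each, purely as an assumption: 1/(1+d) for the Laplacian conversion, 1-d for the unit conversion (which presumes a distance already scaled to [0, 1]), and exp(-d) for the exponential conversion. The class ConversionSketch and its enum are hypothetical and do not correspond to the CONVERSION_* integer constants.

```java
final class ConversionSketch {
    /** Hypothetical stand-in for the three conversion types. */
    enum Conversion { LAPLACIAN, UNIT, EXPONENTIAL }

    /** Maps a non-negative distance to a similarity; smaller distance gives larger similarity. */
    static double toSimilarity(double distance, Conversion type) {
        switch (type) {
            case LAPLACIAN:
                return 1.0 / (1.0 + distance); // assumed form; result in (0, 1]
            case UNIT:
                return 1.0 - distance;         // assumed form; expects distance in [0, 1]
            case EXPONENTIAL:
                return Math.exp(-distance);    // assumed form; result in (0, 1]
            default:
                throw new IllegalArgumentException("Unknown conversion: " + type);
        }
    }

    public static void main(String[] args) {
        for (Conversion c : Conversion.values()) {
            System.out.println(c + ": " + toSimilarity(0.25, c));
        }
    }
}
```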

setConversionType

public void setConversionType(SelectedTag conversionType)
Set the type of similarity to distance conversion. Values other than CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL will be ignored


getConversionType

public SelectedTag getConversionType()
return the type of similarity to distance conversion

Returns:
one of CONVERSION_LAPLACIAN, CONVERSION_UNIT, or CONVERSION_EXPONENTIAL

main

public static void main(java.lang.String[] args)