weka.deduping.metrics
Class InstanceMetric

java.lang.Object
  extended by weka.deduping.metrics.InstanceMetric
Direct Known Subclasses:
ClassifierInstanceMetric, SumInstanceMetric

public abstract class InstanceMetric
extends java.lang.Object

Abstract base class for metrics that calculate the distance between instances describing database records.


Field Summary
protected  int[] m_attrIdxs
          Indices of the attributes that the metric works on
protected  int m_classIndex
          index of the class attribute
protected  StringMetric[][] m_metrics
           
protected  int m_numActualNegPairs
           
protected  int m_numActualPosPairs
          The actual number of positive (duplicate) training pairs used in the last training round
 
Constructor Summary
InstanceMetric()
           
 
Method Summary
abstract  void buildInstanceMetric(int[] attrIdxs)
          Generates a new InstanceMetric based on specified attributes.
abstract  double distance(Instance instance1, Instance instance2)
          Returns a distance value between two instances.
static InstanceMetric forName(java.lang.String metricName, java.lang.String[] options)
          Creates a new instance of a metric given its class name and (optional) arguments to pass to its setOptions method.
 int[] getAttrIdxs(Instances instances)
          This function takes instances, and returns an array of integers 0..(num_attributes-1)
 int[] getAttrIdxsWithoutLastClass(Instances instances)
          It is often the case that the last attribute of the data is the class.
 int[] getAttrIndxs()
          Returns an array of attribute indices which will be used by the metric
 int getClassIndex(int classIndex)
          Get the index of the attribute that is the class attribute
 int getNumActualNegPairs()
          Return the actual number of negative training pairs used in the last training round
 int getNumActualPosPairs()
          Return the actual number of positive training pairs used in the last training round
 int getNumAttributes()
          Get the number of attributes that the metric uses
abstract  boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 void setAttrIdxs(int[] attrIdxs)
          Specifies a list of attributes which will be used by the metric
 void setAttrIdxs(int startIdx, int endIdx)
          Specifies an interval of attributes which will be used by the metric
 void setClassIndex(int classIndex)
          Specify which attribute is the class attribute
abstract  double similarity(Instance instance1, Instance instance2)
          Returns a similarity estimate between two instances.
abstract  void trainInstanceMetric(Instances trainData, Instances testData)
          Create a new metric for operating on specified instances
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_attrIdxs

protected int[] m_attrIdxs
Indices of the attributes that the metric works on


m_metrics

protected StringMetric[][] m_metrics

m_classIndex

protected int m_classIndex
index of the class attribute


m_numActualPosPairs

protected int m_numActualPosPairs
The actual number of positive (duplicate) training pairs used in the last training round


m_numActualNegPairs

protected int m_numActualNegPairs
Constructor Detail

InstanceMetric

public InstanceMetric()
Method Detail

buildInstanceMetric

public abstract void buildInstanceMetric(int[] attrIdxs)
                                  throws java.lang.Exception
Generates a new InstanceMetric based on the specified attributes. Implementations must initialize all fields of the metric with default values.

Throws:
java.lang.Exception - if the distance metric has not been generated successfully.

trainInstanceMetric

public abstract void trainInstanceMetric(Instances trainData,
                                         Instances testData)
                                  throws java.lang.Exception
Create a new metric for operating on specified instances

Parameters:
trainData - instances that the metric will be trained on
testData - instances that the metric will be used on
Throws:
java.lang.Exception

setAttrIdxs

public void setAttrIdxs(int[] attrIdxs)
Specifies a list of attributes which will be used by the metric


getAttrIndxs

public int[] getAttrIndxs()
Returns an array of attribute indices which will be used by the metric

Returns:
an array of attribute indices

setAttrIdxs

public void setAttrIdxs(int startIdx,
                        int endIdx)
Specifies an interval of attributes which will be used by the metric
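
The two setAttrIdxs overloads can be pictured with the minimal sketch below. It is a hypothetical stand-in class (AttrIntervalSketch), not the actual weka.deduping.metrics source, and it assumes that endIdx is inclusive; the documentation above does not say whether the real method treats the end of the interval as inclusive or exclusive.

```java
import java.util.Arrays;

// Hypothetical stand-in; not the actual weka.deduping.metrics source.
class AttrIntervalSketch {
    private int[] attrIdxs = new int[0];

    // Explicit list: keep a defensive copy of the caller's array.
    void setAttrIdxs(int[] idxs) {
        attrIdxs = idxs.clone();
    }

    // Interval form. ASSUMPTION: endIdx is treated as inclusive here.
    void setAttrIdxs(int startIdx, int endIdx) {
        if (endIdx < startIdx) {
            throw new IllegalArgumentException("endIdx must be >= startIdx");
        }
        int[] idxs = new int[endIdx - startIdx + 1];
        for (int i = 0; i < idxs.length; i++) {
            idxs[i] = startIdx + i;
        }
        attrIdxs = idxs;
    }

    int[] getAttrIndxs() {
        return attrIdxs.clone();
    }

    int getNumAttributes() {
        return attrIdxs.length;
    }

    public static void main(String[] args) {
        AttrIntervalSketch m = new AttrIntervalSketch();
        m.setAttrIdxs(2, 5);
        System.out.println(Arrays.toString(m.getAttrIndxs())); // [2, 3, 4, 5]
        System.out.println(m.getNumAttributes());              // 4
    }
}
```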


distance

public abstract double distance(Instance instance1,
                                Instance instance2)
                         throws java.lang.Exception
Returns a distance value between two instances.

Parameters:
instance1 - First instance.
instance2 - Second instance.
Throws:
java.lang.Exception - if distance could not be estimated.

similarity

public abstract double similarity(Instance instance1,
                                  Instance instance2)
                           throws java.lang.Exception
Returns a similarity estimate between two instances.

Parameters:
instance1 - First instance.
instance2 - Second instance.
Throws:
java.lang.Exception - if similarity could not be estimated.

getAttrIdxsWithoutLastClass

public int[] getAttrIdxsWithoutLastClass(Instances instances)
It is often the case that the last attribute of the data is the class. This function takes instances and returns an array of integers 0..(num_attributes-2), which excludes the class attribute

Returns:
array of integer indices of attributes, excluding the last one, which is the class index

getAttrIdxs

public int[] getAttrIdxs(Instances instances)
This function takes instances, and returns an array of integers 0..(num_attributes-1)

Returns:
array of integer indices of attributes
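
The index ranges produced by getAttrIdxsWithoutLastClass and getAttrIdxs can be illustrated with the following sketch. The class and method names are hypothetical, and a plain attribute count stands in for the Instances argument so that the example runs without Weka on the classpath.

```java
import java.util.Arrays;

// Hypothetical helper class; the attribute count stands in for a Weka Instances object.
class AttrIndexHelpersSketch {

    // Returns 0..(numAttributes-1): every attribute, class attribute included.
    static int[] allAttrIdxs(int numAttributes) {
        int[] idxs = new int[numAttributes];
        for (int i = 0; i < numAttributes; i++) {
            idxs[i] = i;
        }
        return idxs;
    }

    // Returns 0..(numAttributes-2): every attribute except the last one,
    // which is assumed to be the class attribute.
    static int[] attrIdxsWithoutLastClass(int numAttributes) {
        return allAttrIdxs(Math.max(0, numAttributes - 1));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(allAttrIdxs(4)));              // [0, 1, 2, 3]
        System.out.println(Arrays.toString(attrIdxsWithoutLastClass(4))); // [0, 1, 2]
    }
}
```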

setClassIndex

public void setClassIndex(int classIndex)
Specify which attribute is the class attribute


getClassIndex

public int getClassIndex(int classIndex)
Get the index of the attribute that is the class attribute


getNumAttributes

public int getNumAttributes()
Get the number of attributes that the metric uses


isDistanceBased

public abstract boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity
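
A caller may want to consult isDistanceBased before deciding whether to compare records by distance or by similarity. The sketch below shows one way this could look. All types here are hypothetical stand-ins: records are plain String arrays rather than Weka Instance objects, the toy metric simply counts mismatching attributes, and the 1/(1+d) conversion from distance to similarity is an assumption made for illustration, not something this class prescribes.

```java
// Hypothetical stand-in for the abstract metric; records are String[] instead of Instance.
abstract class ToyInstanceMetric {
    abstract double distance(String[] a, String[] b);
    abstract double similarity(String[] a, String[] b);
    abstract boolean isDistanceBased();
}

// Distance-based toy metric: counts the selected attributes whose values differ.
class MismatchCountMetric extends ToyInstanceMetric {
    private final int[] attrIdxs;

    MismatchCountMetric(int[] attrIdxs) {
        this.attrIdxs = attrIdxs.clone();
    }

    double distance(String[] a, String[] b) {
        int mismatches = 0;
        for (int idx : attrIdxs) {
            if (!a[idx].equals(b[idx])) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // ASSUMPTION: similarity is derived from distance as 1 / (1 + d).
    double similarity(String[] a, String[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    boolean isDistanceBased() {
        return true;
    }
}

class MetricKindDemo {
    // True if record a is at least as close to the query as record b,
    // using whichever measure the metric computes natively.
    static boolean atLeastAsClose(ToyInstanceMetric m, String[] query, String[] a, String[] b) {
        return m.isDistanceBased()
                ? m.distance(query, a) <= m.distance(query, b)
                : m.similarity(query, a) >= m.similarity(query, b);
    }

    public static void main(String[] args) {
        ToyInstanceMetric m = new MismatchCountMetric(new int[]{0, 1});
        String[] query = {"john", "smith", "x"};
        String[] a = {"john", "smith", "y"};
        String[] b = {"jon", "smith", "x"};
        System.out.println(atLeastAsClose(m, query, a, b)); // true
    }
}
```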


getNumActualPosPairs

public int getNumActualPosPairs()
Return the actual number of positive training pairs used in the last training round

Returns:
the true number of duplicate pairs used for training in the last round

getNumActualNegPairs

public int getNumActualNegPairs()
Return the actual number of negative training pairs used in the last training round

Returns:
the true number of non-duplicate pairs used for training in the last round

forName

public static InstanceMetric forName(java.lang.String metricName,
                                     java.lang.String[] options)
                              throws java.lang.Exception
Creates a new instance of a metric given its class name and (optional) arguments to pass to its setOptions method. If the metric implements OptionHandler and the options parameter is non-null, the metric will have its options set.

Parameters:
metricName - the fully qualified class name of the metric
options - an array of options suitable for passing to setOptions. May be null.
Returns:
the newly created metric ready for use.
Throws:
java.lang.Exception - if the metric name is invalid, or the options supplied are not acceptable to the metric
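
The behavior described for forName resembles a common reflection-based factory pattern, which might be sketched as follows. This is not the actual implementation: OptionHandlerSketch, MetricSketch, ConfigurableMetricSketch and MetricFactorySketch are hypothetical stand-ins so that the example compiles and runs without Weka, and the error messages are made up for illustration.

```java
// Hypothetical stand-in for an option-handling interface (only the method used here).
interface OptionHandlerSketch {
    void setOptions(String[] options) throws Exception;
}

// Hypothetical stand-in for the abstract metric base class.
abstract class MetricSketch {
}

// A concrete metric that accepts options.
class ConfigurableMetricSketch extends MetricSketch implements OptionHandlerSketch {
    String[] options = new String[0];

    public void setOptions(String[] options) {
        this.options = options.clone();
    }
}

class MetricFactorySketch {
    // Loads the named class, instantiates it, and passes options along
    // only if the object handles options and options is non-null.
    static MetricSketch forName(String metricName, String[] options) throws Exception {
        Class<?> c;
        try {
            c = Class.forName(metricName);
        } catch (ClassNotFoundException e) {
            throw new Exception("Cannot find class: " + metricName);
        }
        if (!MetricSketch.class.isAssignableFrom(c)) {
            throw new Exception(metricName + " is not a metric class");
        }
        Object o = c.getDeclaredConstructor().newInstance();
        if (o instanceof OptionHandlerSketch && options != null) {
            ((OptionHandlerSketch) o).setOptions(options);
        }
        return (MetricSketch) o;
    }

    public static void main(String[] args) throws Exception {
        String name = ConfigurableMetricSketch.class.getName();
        MetricSketch m = forName(name, new String[]{"-N", "3"});
        System.out.println(m.getClass().getName());
    }
}
```

Passing null for options skips the setOptions call entirely, which matches the "May be null" note on the options parameter.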