weka.deduping.metrics
Class SumInstanceMetric

java.lang.Object
  extended by weka.deduping.metrics.InstanceMetric
      extended by weka.deduping.metrics.SumInstanceMetric
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class SumInstanceMetric
extends InstanceMetric
implements OptionHandler, java.io.Serializable

SumInstanceMetric simply adds up the values returned by StringMetrics on the individual fields of two instances.
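
The summing idea can be illustrated with a minimal, self-contained sketch. It does not use the Weka classes: records are plain String arrays, each per-field metric is a java.util.function.ToDoubleBiFunction, and the names SumOfFieldDistancesSketch, sumDistance and exactMismatch are hypothetical, introduced only for this example.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-in, not the Weka implementation.
final class SumOfFieldDistancesSketch {

    /** Adds up the per-field distances; field i of both records is compared by fieldMetrics.get(i). */
    static double sumDistance(String[] first, String[] second,
                              List<ToDoubleBiFunction<String, String>> fieldMetrics) {
        if (first.length != second.length || first.length != fieldMetrics.size()) {
            throw new IllegalArgumentException("field counts must match");
        }
        double total = 0.0;
        for (int i = 0; i < first.length; i++) {
            total += fieldMetrics.get(i).applyAsDouble(first[i], second[i]);
        }
        return total;
    }

    /** Toy per-field metric (an assumption): 0 when equal ignoring case, 1 otherwise. */
    static double exactMismatch(String s1, String s2) {
        return s1.equalsIgnoreCase(s2) ? 0.0 : 1.0;
    }
}
```

A real StringMetric would return a graded value rather than 0 or 1; the toy metric only keeps the arithmetic easy to follow.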

See Also:
Serialized Form

Field Summary
protected  StringMetric m_metric
           
protected  int m_minCommonTokens
          We may require objects to have a minimum number of common tokens for them to be considered for distance computation
protected  int m_numNegPairs
           
protected  int m_numPosPairs
          The number of positive pairs desired for training
 StringMetric[] m_stringMetrics
          An array of StringMetrics that are to be used on each attribute
 
Fields inherited from class weka.deduping.metrics.InstanceMetric
m_attrIdxs, m_classIndex, m_metrics, m_numActualNegPairs, m_numActualPosPairs
 
Constructor Summary
SumInstanceMetric()
          A default constructor
 
Method Summary
 void buildInstanceMetric(int[] attrIdxs)
          Generates a new SumInstanceMetric based on specified attributes.
static java.lang.String concatStringArray(java.lang.String[] strings)
          A little helper to create a single String from an array of Strings
 double distance(Instance instance1, Instance instance2)
          Returns distance between two instances without using the weights.
 StringMetric getMetric()
          Get the baseline metric
 int getMinCommonTokens()
          Get the minimum number of common tokens that is required from objects to be considered for distance computation
 int getNumNegPairs()
          Get the number of different-class training pairs
 int getNumPosPairs()
          Get the number of same-class training pairs
 java.lang.String[] getOptions()
          Gets the current settings of this metric
 PairwiseSelector getSelector()
          Get the pairwise selector for this metric
protected  java.util.ArrayList getStringList(Instances trainData, Instances testData, int attrIdx)
          An internal method for creating a list of strings for a particular attribute from two sets of instances: training and test data
protected static java.lang.String getTimestamp()
          Gets a string containing current date and time.
 boolean isDistanceBased()
          The computation of a metric can be either based on distance, or on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
static int numCommonTokens(java.lang.String s1, java.lang.String s2)
          Returns the number of tokens that two strings have in common
 void setMetric(StringMetric metric)
          Set the baseline metric
 void setMinCommonTokens(int minCommonTokens)
          Set the minimum number of common tokens that is required from objects to be considered for distance computation
 void setNumNegPairs(int numNegPairs)
          Set the number of different-class training pairs
 void setNumPosPairs(int numPosPairs)
          Set the number of same-class training pairs
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSelector(PairwiseSelector selector)
          Set the pairwise selector for this metric
 double similarity(Instance instance1, Instance instance2)
          Returns similarity between two instances without using the weights.
 void trainInstanceMetric(Instances trainData, Instances testData)
          Create a new metric for operating on specified instances
 
Methods inherited from class weka.deduping.metrics.InstanceMetric
forName, getAttrIdxs, getAttrIdxsWithoutLastClass, getAttrIndxs, getClassIndex, getNumActualNegPairs, getNumActualPosPairs, getNumAttributes, setAttrIdxs, setAttrIdxs, setClassIndex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_stringMetrics

public StringMetric[] m_stringMetrics
An array of StringMetrics that are to be used on each attribute


m_metric

protected StringMetric m_metric

m_numPosPairs

protected int m_numPosPairs
The number of positive pairs desired for training


m_numNegPairs

protected int m_numNegPairs

m_minCommonTokens

protected int m_minCommonTokens
We may require objects to have a minimum number of common tokens for them to be considered for distance computation

Constructor Detail

SumInstanceMetric

public SumInstanceMetric()
A default constructor

Method Detail

buildInstanceMetric

public void buildInstanceMetric(int[] attrIdxs)
                         throws java.lang.Exception
Generates a new SumInstanceMetric based on specified attributes. Has to initialize all fields of the metric with default values.

Specified by:
buildInstanceMetric in class InstanceMetric
Throws:
java.lang.Exception - if the distance metric has not been generated successfully.

trainInstanceMetric

public void trainInstanceMetric(Instances trainData,
                                Instances testData)
                         throws java.lang.Exception
Create a new metric for operating on specified instances

Specified by:
trainInstanceMetric in class InstanceMetric
Parameters:
trainData - instances that the metric will be trained on
testData - instances that the metric will be used on
Throws:
java.lang.Exception

getStringList

protected java.util.ArrayList getStringList(Instances trainData,
                                            Instances testData,
                                            int attrIdx)
An internal method for creating a list of strings for a particular attribute from two sets of instances: training and test data
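
As a rough illustration of what such a helper might do, the sketch below collects the values of one column from two row sets. It assumes rows are plain String arrays rather than Weka Instances, and that training values come before test values; StringListSketch is a hypothetical name and the actual method may behave differently (for example with respect to null or missing values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: rows are String[] instead of Weka Instance objects.
final class StringListSketch {

    /** Collects column attrIdx from the training rows first, then from the test rows. */
    static List<String> getStringList(List<String[]> trainRows,
                                      List<String[]> testRows,
                                      int attrIdx) {
        List<String> values = new ArrayList<>(trainRows.size() + testRows.size());
        for (String[] row : trainRows) {
            values.add(row[attrIdx]);
        }
        for (String[] row : testRows) {
            values.add(row[attrIdx]);
        }
        return values;
    }
}
```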


distance

public double distance(Instance instance1,
                       Instance instance2)
                throws java.lang.Exception
Returns distance between two instances without using the weights.

Specified by:
distance in class InstanceMetric
Parameters:
instance1 - First instance.
instance2 - Second instance.
Throws:
java.lang.Exception - if distance could not be estimated.

similarity

public double similarity(Instance instance1,
                         Instance instance2)
                  throws java.lang.Exception
Returns similarity between two instances without using the weights.

Specified by:
similarity in class InstanceMetric
Parameters:
instance1 - First instance.
instance2 - Second instance.
Throws:
java.lang.Exception - if similarity could not be estimated.

isDistanceBased

public boolean isDistanceBased()
The computation of a metric can be either based on distance, or on similarity

Specified by:
isDistanceBased in class InstanceMetric
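
When a metric is natively one kind and a caller needs the other, some conversion is required. This page does not say which conversion the class uses, so the sketch below shows one common choice, s = 1 / (1 + d), purely as an assumption; DistanceSimilaritySketch and its methods are hypothetical names.

```java
// Hypothetical helper; the conversion formula is an assumption, not taken from the class.
final class DistanceSimilaritySketch {

    /** Maps a non-negative distance to a similarity in (0, 1]. */
    static double similarityFromDistance(double distance) {
        if (distance < 0) {
            throw new IllegalArgumentException("distance must be non-negative");
        }
        return 1.0 / (1.0 + distance);
    }

    /** Inverse mapping: a similarity in (0, 1] back to a non-negative distance. */
    static double distanceFromSimilarity(double similarity) {
        if (similarity <= 0 || similarity > 1) {
            throw new IllegalArgumentException("similarity must be in (0, 1]");
        }
        return 1.0 / similarity - 1.0;
    }
}
```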

setMetric

public void setMetric(StringMetric metric)
Set the baseline metric

Parameters:
metric - the string metric to be used as the baseline on each string attribute

getMetric

public StringMetric getMetric()
Get the baseline metric


setSelector

public void setSelector(PairwiseSelector selector)
Set the pairwise selector for this metric

Parameters:
selector - a new pairwise selector

getSelector

public PairwiseSelector getSelector()
Get the pairwise selector for this metric


setNumPosPairs

public void setNumPosPairs(int numPosPairs)
Set the number of same-class training pairs

Parameters:
numPosPairs - the number of same-class training pairs to create for training

getNumPosPairs

public int getNumPosPairs()
Get the number of same-class training pairs

Returns:
the number of same-class training pairs to create for training

setNumNegPairs

public void setNumNegPairs(int numNegPairs)
Set the number of different-class training pairs

Parameters:
numNegPairs - the number of different-class training pairs to create for training

getNumNegPairs

public int getNumNegPairs()
Get the number of different-class training pairs

Returns:
the number of different-class training pairs to create for training

setMinCommonTokens

public void setMinCommonTokens(int minCommonTokens)
Set the minimum number of common tokens that is required from objects to be considered for distance computation

Parameters:
minCommonTokens - the minimum number of tokens in common that is required from objects to be considered for distance computation

getMinCommonTokens

public int getMinCommonTokens()
Get the minimum number of common tokens that is required from objects to be considered for distance computation

Returns:
the minimum number of tokens in common that is required from objects to be considered for distance computation
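
A threshold like this is typically used to skip pairs that are too dissimilar to be worth comparing. The sketch below shows that filtering step under stated assumptions: tokens are whitespace-separated, repeated tokens count once, and MinCommonTokensFilterSketch and candidatePairs are hypothetical names that do not appear in the class.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of filtering candidate pairs by shared tokens.
final class MinCommonTokensFilterSketch {

    /** Returns index pairs {i, j} with i < j whose strings share at least minCommonTokens distinct tokens. */
    static List<int[]> candidatePairs(List<String> values, int minCommonTokens) {
        List<Set<String>> tokenSets = new ArrayList<>();
        for (String value : values) {
            tokenSets.add(tokens(value));
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            for (int j = i + 1; j < values.size(); j++) {
                Set<String> shared = new HashSet<>(tokenSets.get(i));
                shared.retainAll(tokenSets.get(j));
                if (shared.size() >= minCommonTokens) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    /** Splits on whitespace (an assumed tokenization) and drops empty tokens. */
    private static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }
}
```

With a threshold of 0 every pair passes, so the filter only has an effect for positive values.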

getTimestamp

protected static java.lang.String getTimestamp()
Gets a string containing current date and time.

Returns:
a string containing the date and time.
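
The exact format of the returned string is not documented here. A minimal sketch of such a helper, assuming the pattern yyyy-MM-dd HH:mm:ss and the system default time zone, could look like this (TimestampSketch is a hypothetical name):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; the output pattern is an assumption.
final class TimestampSketch {

    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats the current local date and time. */
    static String getTimestamp() {
        return LocalDateTime.now().format(FORMAT);
    }
}
```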

concatStringArray

public static java.lang.String concatStringArray(java.lang.String[] strings)
A little helper to create a single String from an array of Strings

Parameters:
strings - an array of strings
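
The separator placed between the strings is not specified on this page. The sketch below assumes a single space and uses String.join; ConcatSketch is a hypothetical name and the real helper may format its output differently.

```java
// Hypothetical helper; the single-space separator is an assumption.
final class ConcatSketch {

    /** Joins the array elements into one String, separated by single spaces. */
    static String concatStringArray(String[] strings) {
        return String.join(" ", strings);
    }
}
```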

numCommonTokens

public static int numCommonTokens(java.lang.String s1,
                                  java.lang.String s2)
Returns the number of tokens that two strings have in common
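
How tokens are delimited and whether repeated tokens are counted more than once is not stated here. As a minimal sketch, assuming whitespace-delimited, case-sensitive tokens that each count once, the computation is a set intersection (CommonTokensSketch is a hypothetical name):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; tokenization rules are assumptions.
final class CommonTokensSketch {

    /** Counts the distinct whitespace-separated tokens present in both strings. */
    static int numCommonTokens(String s1, String s2) {
        Set<String> shared = tokenize(s1);
        shared.retainAll(tokenize(s2)); // keep only tokens that also occur in s2
        return shared.size();
    }

    private static Set<String> tokenize(String s) {
        Set<String> tokens = new HashSet<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```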


listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-M metric options

StringMetric used

-C classifier options

Classifier used

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
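
For readers unfamiliar with this option style, the sketch below shows how a flag and its value can be picked out of an options array. It is a plain-Java illustration, not the parser used by the class; OptionParsingSketch, getOption and the class name some.pkg.MyStringMetric in the usage comment are hypothetical.

```java
// Hypothetical flag lookup over a String[] of the form {"-M", "value", "-C", "value"}.
final class OptionParsingSketch {

    /** Returns the value that follows "-flag", or an empty string when the flag is absent. */
    static String getOption(char flag, String[] options) {
        String key = "-" + flag;
        for (int i = 0; i < options.length - 1; i++) {
            if (key.equals(options[i])) {
                return options[i + 1];
            }
        }
        return "";
    }
}

// Usage (hypothetical values):
//   String[] opts = {"-M", "some.pkg.MyStringMetric", "-C", "some.pkg.MyClassifier"};
//   String metricName = OptionParsingSketch.getOption('M', opts);
```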

getOptions

public java.lang.String[] getOptions()
Gets the current settings of this metric

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()