weka.deduping.metrics
Class ClassifierInstanceMetric

java.lang.Object
  extended by weka.deduping.metrics.InstanceMetric
      extended by weka.deduping.metrics.ClassifierInstanceMetric
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class ClassifierInstanceMetric
extends InstanceMetric
implements OptionHandler, java.io.Serializable

The ClassifierInstanceMetric class employs a classifier that uses the values returned by various StringMetrics on individual fields as features, and outputs a confidence value that corresponds to the similarity between records.
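
The idea described above can be illustrated with a minimal, self-contained sketch. It does not use the weka.deduping classes: the Metric interface, the two toy metrics, and the fixed logistic weights are all assumptions made for illustration, whereas the real class relies on StringMetric implementations and a trained DistributionClassifier.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in classes; none of these names come from weka.deduping.
class FieldFeatureSketch {

    // Stand-in for a StringMetric: maps two field values to a similarity in [0, 1].
    interface Metric {
        double similarity(String a, String b);
    }

    // Case-insensitive token-level Jaccard overlap.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().trim().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().trim().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        ta.retainAll(tb); // ta now holds the intersection
        return (double) ta.size() / union.size();
    }

    // 1.0 when the two values are equal ignoring case, 0.0 otherwise.
    static double exactMatch(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    static final Metric[] METRICS = { FieldFeatureSketch::jaccard, FieldFeatureSketch::exactMatch };

    // One feature per (field, metric) combination.
    static double[] features(String[] rec1, String[] rec2) {
        double[] f = new double[rec1.length * METRICS.length];
        int k = 0;
        for (int field = 0; field < rec1.length; field++) {
            for (Metric m : METRICS) {
                f[k++] = m.similarity(rec1[field], rec2[field]);
            }
        }
        return f;
    }

    // Assumed combiner: a logistic function with hand-set weights,
    // standing in for the trained classifier's confidence output.
    static double confidence(double[] features, double[] weights, double bias) {
        double z = bias;
        for (int i = 0; i < features.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        String[] r1 = {"John Smith", "12 Main St"};
        String[] r2 = {"john smith", "12 Main Street"};
        double[] w = {1.0, 1.0, 1.0, 1.0};
        System.out.println(confidence(features(r1, r2), w, -2.0));
    }
}
```

With positive weights, records that agree on more fields receive a higher confidence, which is the behaviour the class description implies.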

See Also:
Serialized Form

Field Summary
protected  DistributionClassifier m_classifier
          Classifier that is used for estimating similarity between records
protected  Instances m_diffInstances
          A temporary dataset that contains diff-instances for training the classifier
protected  StringMetric[][] m_fieldMetrics
          The actual array of metrics
protected  int m_numNegPairs
          The desired number of different-class training pairs
protected  int m_numPosPairs
          The desired number of same-class training pairs
protected  StringMetric[] m_stringMetrics
          StringMetric prototypes that are to be used on each field
 
Fields inherited from class weka.deduping.metrics.InstanceMetric
m_attrIdxs, m_classIndex, m_metrics, m_numActualNegPairs, m_numActualPosPairs
 
Constructor Summary
ClassifierInstanceMetric()
          A default constructor
 
Method Summary
 void buildInstanceMetric(int[] attrIdxs)
          Generates a new ClassifierInstanceMetric that computes similarity between records using the specified attributes.
static java.lang.String concatStringArray(java.lang.String[] strings)
          A little helper to create a single String from an array of Strings
 double distance(Instance instance1, Instance instance2)
          Returns distance between two records
 DistributionClassifier getClassifier()
          Get the classifier
 int getNumNegPairs()
          Get the number of different-class training pairs
 int getNumPosPairs()
          Get the number of same-class training pairs
 java.lang.String[] getOptions()
          Gets the current settings of this ClassifierInstanceMetric
 PairwiseSelector getSelector()
          Get the pairwise selector for this metric
protected  java.util.ArrayList getStringList(Instances trainData, Instances testData, int attrIdx)
          An internal method for creating a list of strings for a particular attribute from two sets of instances: training and test data
 StringMetric[] getStringMetrics()
          Get the baseline string metrics
protected static java.lang.String getTimestamp()
          Gets a string containing current date and time.
 boolean isDistanceBased()
          Indicates whether the computation is based on distance rather than on similarity
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
 void setClassifier(DistributionClassifier classifier)
          Set the classifier
 void setNumNegPairs(int numNegPairs)
          Set the number of different-class training pairs
 void setNumPosPairs(int numPosPairs)
          Set the number of same-class training pairs that is desired
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSelector(PairwiseSelector selector)
          Set the pairwise selector for this metric
 void setStringMetrics(StringMetric[] metrics)
          Set the baseline string metrics
 double similarity(Instance instance1, Instance instance2)
          Returns similarity between two records
 void trainInstanceMetric(Instances trainData, Instances testData)
          Create a new metric for operating on the specified instances
 
Methods inherited from class weka.deduping.metrics.InstanceMetric
forName, getAttrIdxs, getAttrIdxsWithoutLastClass, getAttrIndxs, getClassIndex, getNumActualNegPairs, getNumActualPosPairs, getNumAttributes, setAttrIdxs, setAttrIdxs, setClassIndex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_classifier

protected DistributionClassifier m_classifier
Classifier that is used for estimating similarity between records


m_numPosPairs

protected int m_numPosPairs
The desired number of same-class training pairs


m_numNegPairs

protected int m_numNegPairs
The desired number of different-class training pairs

m_stringMetrics

protected StringMetric[] m_stringMetrics
StringMetric prototypes that are to be used on each field


m_fieldMetrics

protected StringMetric[][] m_fieldMetrics
The actual array of metrics


m_diffInstances

protected Instances m_diffInstances
A temporary dataset that contains diff-instances for training the classifier

Constructor Detail

ClassifierInstanceMetric

public ClassifierInstanceMetric()
A default constructor

Method Detail

buildInstanceMetric

public void buildInstanceMetric(int[] attrIdxs)
                         throws java.lang.Exception
Generates a new ClassifierInstanceMetric that computes similarity between records using the specified attributes. The implementation has to initialize all metric fields with the default string metrics.

Specified by:
buildInstanceMetric in class InstanceMetric
Parameters:
attrIdxs - the indices of the attributes that the metric will use
Throws:
java.lang.Exception - if the distance metric has not been generated successfully.
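
The initialization step can be sketched as follows, assuming that each prototype in m_stringMetrics is copied once per attribute and that m_fieldMetrics is laid out as [field][metric]. Both points are inferred from the field descriptions, not confirmed by the source. The Metric class and its copy() method are hypothetical stand-ins for StringMetric.

```java
// Hypothetical stand-in; not part of weka.deduping.
class FieldMetricInitSketch {

    // Stand-in for a StringMetric prototype that can be duplicated.
    static class Metric {
        final String name;

        Metric(String name) {
            this.name = name;
        }

        Metric copy() {
            return new Metric(name);
        }
    }

    // Gives every attribute its own copy of every prototype metric,
    // so that per-field state is never shared between attributes.
    static Metric[][] build(int[] attrIdxs, Metric[] prototypes) {
        Metric[][] fieldMetrics = new Metric[attrIdxs.length][prototypes.length];
        for (int f = 0; f < attrIdxs.length; f++) {
            for (int m = 0; m < prototypes.length; m++) {
                fieldMetrics[f][m] = prototypes[m].copy();
            }
        }
        return fieldMetrics;
    }

    public static void main(String[] args) {
        Metric[] prototypes = { new Metric("edit-distance"), new Metric("token-overlap") };
        Metric[][] fieldMetrics = build(new int[] {0, 2, 3}, prototypes);
        System.out.println(fieldMetrics.length + " fields x " + fieldMetrics[0].length + " metrics");
    }
}
```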

trainInstanceMetric

public void trainInstanceMetric(Instances trainData,
                                Instances testData)
                         throws java.lang.Exception
Create a new metric for operating on the specified instances

Specified by:
trainInstanceMetric in class InstanceMetric
Parameters:
trainData - instances for training the metric
testData - instances that will be used for testing
Throws:
java.lang.Exception
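
Training needs labelled record pairs: same-class pairs as positive examples and different-class pairs as negative examples. The sketch below shows one simple way to collect up to a desired number of each. The enumeration order and the int[] {i, j, label} encoding are assumptions; the selection strategy actually used by PairwiseSelector is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; not part of weka.deduping.
class PairSelectionSketch {

    // Returns entries {i, j, label}: label 1 for a same-class pair, 0 for a different-class pair.
    // At most numPosPairs positive and numNegPairs negative pairs are returned.
    static List<int[]> selectPairs(int[] classLabels, int numPosPairs, int numNegPairs) {
        List<int[]> pairs = new ArrayList<>();
        int pos = 0;
        int neg = 0;
        for (int i = 0; i < classLabels.length; i++) {
            for (int j = i + 1; j < classLabels.length; j++) {
                boolean same = classLabels[i] == classLabels[j];
                if (same && pos < numPosPairs) {
                    pairs.add(new int[] {i, j, 1});
                    pos++;
                } else if (!same && neg < numNegPairs) {
                    pairs.add(new int[] {i, j, 0});
                    neg++;
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 1, 1, 2};
        for (int[] p : selectPairs(labels, 5, 3)) {
            System.out.println(p[0] + "," + p[1] + " -> " + p[2]);
        }
    }
}
```

Note that the data may contain fewer same-class pairs than requested, so the number of pairs actually produced can fall short of the desired number. This is presumably why InstanceMetric keeps m_numActualPosPairs and m_numActualNegPairs alongside the desired counts.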

getStringList

protected java.util.ArrayList getStringList(Instances trainData,
                                            Instances testData,
                                            int attrIdx)
An internal method for creating a list of strings for a particular attribute from two sets of instances: training and test data

Parameters:
trainData - a dataset of records in the training fold
testData - a dataset of records in the testing fold
attrIdx - the index of the attribute for which strings are to be collected
Returns:
a list of strings that occur for this attribute; duplicates are allowed
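
A minimal sketch of this collection step, with records modelled as String[] rows instead of Weka Instances. The order (training strings first, then test strings) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in; not part of weka.deduping.
class StringListSketch {

    // Collects the value of one attribute from every record in both folds.
    // Duplicates are kept, matching the documented return value.
    static List<String> collect(List<String[]> trainData, List<String[]> testData, int attrIdx) {
        List<String> strings = new ArrayList<>(trainData.size() + testData.size());
        for (String[] record : trainData) {
            strings.add(record[attrIdx]);
        }
        for (String[] record : testData) {
            strings.add(record[attrIdx]);
        }
        return strings;
    }

    public static void main(String[] args) {
        List<String[]> train = Arrays.asList(new String[] {"a", "x"}, new String[] {"b", "y"});
        List<String[]> test = Arrays.asList(new String[] {"a", "z"}, new String[] {"c", "w"});
        System.out.println(collect(train, test, 0));
    }
}
```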

distance

public double distance(Instance instance1,
                       Instance instance2)
                throws java.lang.Exception
Returns distance between two records

Specified by:
distance in class InstanceMetric
Parameters:
instance1 - First record.
instance2 - Second record.
Throws:
java.lang.Exception - if distance could not be calculated.

similarity

public double similarity(Instance instance1,
                         Instance instance2)
                  throws java.lang.Exception
Returns similarity between two records

Specified by:
similarity in class InstanceMetric
Parameters:
instance1 - First instance.
instance2 - Second instance.
Throws:
java.lang.Exception - if similarity could not be calculated.

isDistanceBased

public boolean isDistanceBased()
Indicates whether the computation is based on distance rather than on similarity

Specified by:
isDistanceBased in class InstanceMetric
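
The relationship between distance(), similarity() and isDistanceBased() is not spelled out in this page. The sketch below shows one common convention, assumed here for illustration only: a metric that natively computes a similarity in [0, 1] reports false and derives its distance as 1 - similarity.

```java
// Hypothetical stand-in; not part of weka.deduping.
class SimilarityBasedSketch {

    // This toy metric computes similarity natively.
    boolean isDistanceBased() {
        return false;
    }

    // Toy similarity: 1.0 for values equal ignoring case, 0.0 otherwise.
    double similarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    // Assumed conversion; the real class may derive distance differently.
    double distance(String a, String b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        SimilarityBasedSketch metric = new SimilarityBasedSketch();
        System.out.println(metric.distance("Smith", "smith"));
    }
}
```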

setClassifier

public void setClassifier(DistributionClassifier classifier)
Set the classifier

Parameters:
classifier - the classifier

getClassifier

public DistributionClassifier getClassifier()
Get the classifier


setStringMetrics

public void setStringMetrics(StringMetric[] metrics)
Set the baseline string metrics

Parameters:
metrics - string metrics that will be used on each string attribute

getStringMetrics

public StringMetric[] getStringMetrics()
Get the baseline string metrics

Returns:
the string metrics that are used for each field

setSelector

public void setSelector(PairwiseSelector selector)
Set the pairwise selector for this metric

Parameters:
selector - a new pairwise selector

getSelector

public PairwiseSelector getSelector()
Get the pairwise selector for this metric

Returns:
the pairwise selector

setNumPosPairs

public void setNumPosPairs(int numPosPairs)
Set the number of same-class training pairs that is desired

Parameters:
numPosPairs - the number of same-class training pairs to be created for training the classifier

getNumPosPairs

public int getNumPosPairs()
Get the number of same-class training pairs

Returns:
the number of same-class training pairs to create for training the classifier

setNumNegPairs

public void setNumNegPairs(int numNegPairs)
Set the number of different-class training pairs

Parameters:
numNegPairs - the number of different-class training pairs to create for training the classifier

getNumNegPairs

public int getNumNegPairs()
Get the number of different-class training pairs

Returns:
the number of different-class training pairs to create for training the classifier

getTimestamp

protected static java.lang.String getTimestamp()
Gets a string containing current date and time.

Returns:
a string containing the date and time.
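
A helper of this kind can be sketched with java.time. The output format is not documented in this page, so the pattern below is an assumption.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in; not part of weka.deduping.
class TimestampSketch {

    // Assumed format; the real method may format the date differently.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String getTimestamp() {
        return LocalDateTime.now().format(FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(getTimestamp());
    }
}
```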

concatStringArray

public static java.lang.String concatStringArray(java.lang.String[] strings)
A little helper to create a single String from an array of Strings

Parameters:
strings - an array of strings
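
A minimal sketch of such a helper. The separator is not documented in this page; a single space is assumed here, and the real method may quote or delimit the elements differently.

```java
// Hypothetical stand-in; not part of weka.deduping.
class ConcatSketch {

    // Joins the array elements with a single space (assumed separator).
    static String concat(String[] strings) {
        return String.join(" ", strings);
    }

    public static void main(String[] args) {
        System.out.println(concat(new String[] {"-M", "example.MyStringMetric"}));
    }
}
```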

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-M metric options

StringMetric used

-C classifier options

Classifier used

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
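
The option layout can be sketched without Weka as follows. The assumption is that each flag is followed by a single string holding a class name and that class's own options. The class names example.MyStringMetric and example.MyClassifier are hypothetical, and the real method may parse the options differently (for example through weka.core.Utils).

```java
// Hypothetical stand-in; not part of weka.deduping.
class OptionSketch {

    // Returns the value that follows "-<flag>", or "" when the flag is absent.
    static String getOption(char flag, String[] options) {
        String key = "-" + flag;
        for (int i = 0; i + 1 < options.length; i++) {
            if (options[i].equals(key)) {
                return options[i + 1];
            }
        }
        return "";
    }

    // Splits "ClassName opt1 opt2" into the class name followed by its options.
    static String[] splitSpec(String spec) {
        return spec.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] options = {
            "-M", "example.MyStringMetric -x 1",
            "-C", "example.MyClassifier"
        };
        String[] metricSpec = splitSpec(getOption('M', options));
        System.out.println("metric class: " + metricSpec[0]);
        System.out.println("classifier spec: " + getOption('C', options));
    }
}
```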

getOptions

public java.lang.String[] getOptions()
Gets the current settings of this ClassifierInstanceMetric

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()