weka.deduping
Class BasicDeduper

java.lang.Object
  extended by weka.deduping.Deduper
      extended by weka.deduping.BasicDeduper
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, java.io.Serializable

public class BasicDeduper
extends Deduper
implements OptionHandler, java.io.Serializable

A basic deduper class that takes a set of objects and identifies disjoint subsets of duplicates

See Also:
Serialized Form

Field Summary
protected  int[] m_attrIdxs
          the attribute indices on which to do deduping
protected  double[] m_classValues
          An array containing class values for instances (for faster statistics)
protected  int[] m_clusterAssignments
          temporary variable holding cluster assignments
protected  java.util.ArrayList m_clusters
          holds the clusters
protected  boolean m_debug
          Whether to print verbose debugging output
protected  double[][] m_distanceMatrix
          distance matrix containing the distance between each pair
protected  java.util.HashMap m_instancesHash
          instance hash, where each Integer index is hashed to an instance
protected  int m_numActualDupePairsTrain
           
protected  int m_numActualNonDupePairsTrain
           
protected  int m_numCurrentObjects
          The current number of clusters during the merging process
protected  int m_numGoodPairs
           
protected  int m_numObjects
          The total number of true objects
protected  int m_numPotentialDupePairsTrain
           
protected  int m_numPotentialNonDupePairsTrain
           
protected  int m_numTotalPairs
          Statistics
protected  int m_numTotalPairsTest
           
protected  int m_numTotalPairsTrain
           
protected  int m_numTruePairs
           
protected  java.util.HashMap m_reverseInstancesHash
          reverse instance hash, where each instance is hashed to its Integer index
protected  Instances m_testInstances
          A set of instances to dedupe
protected  double m_testTimeStart
           
protected  double m_trainProportion
          The proportion of the training fold that should be used for training
protected  double m_trainTime
           
protected  boolean m_useBlocking
          Whether to use blocking
 
Fields inherited from class weka.deduping.Deduper
m_statistics
 
Constructor Summary
BasicDeduper()
           
 
Method Summary
protected  void accumulateStatistics()
          Add the current state of things to statistics
 void buildDeduper(Instances trainFold, Instances testInstances)
          Given training data, build the metrics required by the deduper
protected  double clusterDistance(Cluster cluster1, Cluster cluster2)
          internal method that returns the distance between two clusters
static java.lang.String concatStringArray(java.lang.String[] strings)
          A little helper to create a single String from an array of Strings
protected  void createDistanceMatrix()
          Fill the distance matrix with values using the metric
 void findDuplicates(Instances testInstances, int numObjects)
          Identify duplicates within the testing data
 boolean getDebug()
          See whether debugging output is on/off
 InstanceMetric getMetric()
          Get the InstanceMetric that is used
 java.lang.String[] getOptions()
          Gets the current settings of the deduper (greedy agglomerative clustering)
 double getTrainProportion()
          Get the proportion of the training fold that is used for training
 boolean getUseBlocking()
          See whether blocking is on/off
protected  void hashInstances(Instances data)
          Create the hashtable from given Instances; keys are numeric indices, values are actual Instances
 java.util.ArrayList initIntClusters()
          Computes the clusters from the cluster assignments
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
protected  Cluster mergeClusters(int cluster1Idx, int cluster2Idx)
          Internal method to merge two clusters and update distances
protected  void mergeStep()
          Internal method that finds two most similar clusters and merges them
protected  int numCrossClusterTruePairs(Cluster cluster1, Cluster cluster2)
          Given two clusters, calculate the number of true pairs that will be added when the clusters are merged
protected  int numTruePairs(Instances instances)
          Given a test set, calculate the number of true pairs
 void printIntClusters()
          Outputs the current clustering
protected  void resetStatistics()
          Reset the current statistics
 void setDebug(boolean debug)
          Turn debugging output on/off
 void setMetric(InstanceMetric metric)
          Set the InstanceMetric that is used
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setTrainProportion(double trainProportion)
          Set the proportion of the training fold that is used for training
 void setUseBlocking(boolean useBlocking)
          Turn blocking on/off
 
Methods inherited from class weka.deduping.Deduper
forName, getStatistics
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_trainProportion

protected double m_trainProportion
The proportion of the training fold that should be used for training


m_distanceMatrix

protected double[][] m_distanceMatrix
distance matrix containing the distance between each pair


m_instancesHash

protected java.util.HashMap m_instancesHash
instance hash, where each Integer index is hashed to an instance


m_reverseInstancesHash

protected java.util.HashMap m_reverseInstancesHash
reverse instance hash, where each instance is hashed to its Integer index


m_attrIdxs

protected int[] m_attrIdxs
the attribute indices on which to do deduping


m_numObjects

protected int m_numObjects
The total number of true objects


m_classValues

protected double[] m_classValues
An array containing class values for instances (for faster statistics)


m_numCurrentObjects

protected int m_numCurrentObjects
The current number of clusters during the merging process


m_clusters

protected java.util.ArrayList m_clusters
holds the clusters


m_testInstances

protected Instances m_testInstances
A set of instances to dedupe


m_useBlocking

protected boolean m_useBlocking
Whether to use blocking


m_clusterAssignments

protected int[] m_clusterAssignments
temporary variable holding cluster assignments


m_debug

protected boolean m_debug
Whether to print verbose debugging output


m_numTotalPairs

protected int m_numTotalPairs
Statistics


m_numGoodPairs

protected int m_numGoodPairs

m_numTruePairs

protected int m_numTruePairs

m_numTotalPairsTrain

protected int m_numTotalPairsTrain

m_numTotalPairsTest

protected int m_numTotalPairsTest

m_numPotentialDupePairsTrain

protected int m_numPotentialDupePairsTrain

m_numActualDupePairsTrain

protected int m_numActualDupePairsTrain

m_numPotentialNonDupePairsTrain

protected int m_numPotentialNonDupePairsTrain

m_numActualNonDupePairsTrain

protected int m_numActualNonDupePairsTrain

m_trainTime

protected double m_trainTime

m_testTimeStart

protected double m_testTimeStart

Constructor Detail

BasicDeduper

public BasicDeduper()

Method Detail

buildDeduper

public void buildDeduper(Instances trainFold,
                         Instances testInstances)
                  throws java.lang.Exception
Given training data, build the metrics required by the deduper

Specified by:
buildDeduper in class Deduper
Throws:
java.lang.Exception

findDuplicates

public void findDuplicates(Instances testInstances,
                           int numObjects)
                    throws java.lang.Exception
Identify duplicates within the testing data

Specified by:
findDuplicates in class Deduper
Parameters:
testInstances - a set of instances among which to identify duplicates
numObjects - the number of "true object" sets to create
Throws:
java.lang.Exception
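
How the duplicate sets are formed is not spelled out on this page. The mergeStep description (find the two most similar clusters and merge them) and the mention of greedy agglomerative clustering under getOptions suggest a loop like the one below. This is a minimal sketch under those assumptions, not the BasicDeduper implementation: the class GreedyAgglomerativeSketch, the single-link cluster distance, and the use of plain integer indices in place of Instances and Cluster objects are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of weka.deduping.
final class GreedyAgglomerativeSketch {

    // Single-link distance: the smallest pairwise distance between members.
    // Assumption: the linkage used by BasicDeduper.clusterDistance is not documented here.
    static double singleLink(double[][] dist, List<Integer> a, List<Integer> b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                min = Math.min(min, dist[i][j]);
            }
        }
        return min;
    }

    // Start from singleton clusters and merge the closest pair
    // until only numObjects clusters remain.
    static List<List<Integer>> cluster(double[][] dist, int numObjects) {
        List<List<Integer>> clusters = new ArrayList<List<Integer>>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> singleton = new ArrayList<Integer>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > Math.max(numObjects, 1)) {
            int bestA = 0;
            int bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = singleLink(dist, clusters.get(a), clusters.get(b));
                    if (d < best) {
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestA < bestB, so removing bestB leaves bestA's position unchanged.
            List<Integer> absorbed = clusters.remove(bestB);
            clusters.get(bestA).addAll(absorbed);
        }
        return clusters;
    }
}
```

Each inner list holds the indices of instances judged to be duplicates of one another, so the lists are disjoint by construction.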

initIntClusters

public java.util.ArrayList initIntClusters()
                                    throws java.lang.Exception
Computes the clusters from the cluster assignments

Throws:
java.lang.Exception - if clusters could not be computed successfully
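
As a rough illustration of turning cluster assignments into clusters, the sketch below groups instance indices by their assigned cluster. It assumes that entry i of an assignment array such as m_clusterAssignments holds the cluster index of instance i; the class IntClustersSketch and the numClusters parameter are hypothetical and do not appear in the BasicDeduper API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of weka.deduping.
final class IntClustersSketch {

    // Assumption: assignments[i] is the cluster index (0..numClusters-1) of instance i.
    static List<List<Integer>> fromAssignments(int[] assignments, int numClusters) {
        List<List<Integer>> clusters = new ArrayList<List<Integer>>();
        for (int c = 0; c < numClusters; c++) {
            clusters.add(new ArrayList<Integer>());
        }
        for (int i = 0; i < assignments.length; i++) {
            clusters.get(assignments[i]).add(i);
        }
        return clusters;
    }
}
```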

mergeStep

protected void mergeStep()
                  throws java.lang.Exception
Internal method that finds two most similar clusters and merges them

Throws:
java.lang.Exception

clusterDistance

protected double clusterDistance(Cluster cluster1,
                                 Cluster cluster2)
internal method that returns the distance between two clusters


mergeClusters

protected Cluster mergeClusters(int cluster1Idx,
                                int cluster2Idx)
                         throws java.lang.Exception
Internal method to merge two clusters and update distances

Throws:
java.lang.Exception

hashInstances

protected void hashInstances(Instances data)
Create the hashtable from given Instances; keys are numeric indices, values are actual Instances

Parameters:
data - Instances
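
The pairing of m_instancesHash (index to instance) with m_reverseInstancesHash (instance to index) can be pictured with the sketch below. It is only an illustration: the generic class InstanceIndexSketch and its field names are hypothetical, and a plain List stands in for Instances.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of weka.deduping.
final class InstanceIndexSketch<T> {
    // Forward map: numeric index -> instance.
    final Map<Integer, T> byIndex = new HashMap<Integer, T>();
    // Reverse map: instance -> numeric index.
    // Note: this relies on equals/hashCode, so two equal elements would share one entry.
    final Map<T, Integer> byInstance = new HashMap<T, Integer>();

    void hashInstances(List<T> data) {
        byIndex.clear();
        byInstance.clear();
        for (int i = 0; i < data.size(); i++) {
            T instance = data.get(i);
            byIndex.put(i, instance);
            byInstance.put(instance, i);
        }
    }
}
```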

createDistanceMatrix

protected void createDistanceMatrix()
                             throws java.lang.Exception
Fill the distance matrix with values using the metric

Throws:
java.lang.Exception
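
A minimal sketch of filling a pairwise distance matrix follows. It assumes the metric is symmetric and that an instance is at distance zero from itself, so only the upper triangle is computed and mirrored; neither property is stated on this page. The class DistanceMatrixSketch is hypothetical, and java.util.function.ToDoubleBiFunction stands in for InstanceMetric.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical stand-in for illustration; not part of weka.deduping.
final class DistanceMatrixSketch {

    // Assumption: metric is symmetric, so m[i][j] == m[j][i]; the diagonal stays 0.
    static <T> double[][] fill(List<T> items, ToDoubleBiFunction<T, T> metric) {
        int n = items.size();
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double d = metric.applyAsDouble(items.get(i), items.get(j));
                m[i][j] = d;
                m[j][i] = d;
            }
        }
        return m;
    }
}
```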

printIntClusters

public void printIntClusters()
                      throws java.lang.Exception
Outputs the current clustering

Throws:
java.lang.Exception - if something goes wrong

setTrainProportion

public void setTrainProportion(double trainProportion)
Set the proportion of the training fold that is used for training

Parameters:
trainProportion - the proportion of the training set that will be used for learning

getTrainProportion

public double getTrainProportion()
Get the proportion of the training fold that is used for training

Returns:
the proportion of the training set that will be used for learning

numTruePairs

protected int numTruePairs(Instances instances)
Given a test set, calculate the number of true pairs

Parameters:
instances - a set of objects, class has the true object ID
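
Since the class value carries the true object ID, a "true pair" is presumably a pair of instances that share a class value. Under that reading, a class with n members contributes n(n-1)/2 pairs. The sketch below counts them from an array of class values (in the spirit of m_classValues); the class TruePairsSketch and its signature are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of weka.deduping.
final class TruePairsSketch {

    // Assumption: two instances form a true pair when their class values are equal.
    static int numTruePairs(double[] classValues) {
        Map<Double, Integer> counts = new HashMap<Double, Integer>();
        for (double c : classValues) {
            counts.merge(c, 1, Integer::sum);
        }
        int pairs = 0;
        for (int n : counts.values()) {
            pairs += n * (n - 1) / 2;
        }
        return pairs;
    }
}
```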

numCrossClusterTruePairs

protected int numCrossClusterTruePairs(Cluster cluster1,
                                       Cluster cluster2)
Given two clusters, calculate the number of true pairs that will be added when the clusters are merged

Parameters:
cluster1 - the first cluster to merge
cluster2 - the second cluster to merge
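
The pairs added by a merge are those with one member in each cluster, so one way to picture this count is the sketch below. It assumes clusters are given as arrays of instance indices and that a true pair is two instances with equal class values; the class CrossClusterPairsSketch and its signature are hypothetical and differ from the Cluster-based method documented here.

```java
// Hypothetical stand-in for illustration; not part of weka.deduping.
final class CrossClusterPairsSketch {

    // Counts pairs (i, j) with i in cluster1, j in cluster2 and matching class values.
    static int count(int[] cluster1, int[] cluster2, double[] classValues) {
        int pairs = 0;
        for (int i : cluster1) {
            for (int j : cluster2) {
                if (classValues[i] == classValues[j]) {
                    pairs++;
                }
            }
        }
        return pairs;
    }
}
```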

accumulateStatistics

protected void accumulateStatistics()
Add the current state of things to statistics


resetStatistics

protected void resetStatistics()
Reset the current statistics


setMetric

public void setMetric(InstanceMetric metric)
Set the InstanceMetric that is used

Parameters:
metric - the InstanceMetric that is used to dedupe

getMetric

public InstanceMetric getMetric()
Get the InstanceMetric that is used

Returns:
the InstanceMetric that is used to dedupe

setDebug

public void setDebug(boolean debug)
Turn debugging output on/off

Parameters:
debug - if true, debugging info will be printed

getDebug

public boolean getDebug()
See whether debugging output is on/off


setUseBlocking

public void setUseBlocking(boolean useBlocking)
Turn blocking on/off


getUseBlocking

public boolean getUseBlocking()
See whether blocking is on/off


listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-M metric options

InstanceMetric used

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the deduper (greedy agglomerative clustering)

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

concatStringArray

public static java.lang.String concatStringArray(java.lang.String[] strings)
A little helper to create a single String from an array of Strings

Parameters:
strings - an array of strings
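
The separator this helper places between elements is not documented. The sketch below assumes a single space, which is a guess; the class ConcatSketch is hypothetical.

```java
// Hypothetical stand-in for illustration; not part of weka.deduping.
final class ConcatSketch {

    // Assumption: elements are joined with a single space.
    static String concat(String[] strings) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < strings.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(strings[i]);
        }
        return sb.toString();
    }
}
```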