weka.deduping.blocking
Class Blocking

java.lang.Object
  extended by weka.deduping.blocking.Blocking
All Implemented Interfaces:
OptionHandler, java.io.Serializable

public class Blocking
extends java.lang.Object
implements OptionHandler, java.io.Serializable

This class takes a set of records, amalgamates each record into a single string, and builds an inverted index over that collection. It can then return the pairs of strings that are most similar. Largely borrowed from VectorSpaceMetric.
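
The indexing idea described above can be illustrated with a minimal, self-contained sketch. It is not the Blocking source: the class InvertedIndexSketch, its methods, the whitespace tokenization and the lower-casing are all assumptions made for illustration, whereas the real class delegates tokenization to its configured Tokenizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; hypothetical names, not the weka.deduping.blocking API.
final class InvertedIndexSketch {

    /** Joins the fields of one record into a single lower-cased string. */
    static String amalgamate(String[] fields) {
        return String.join(" ", fields).toLowerCase();
    }

    /** Maps each whitespace-delimited token to the ids of the records containing it. */
    static Map<String, List<Integer>> build(List<String[]> records) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            for (String token : amalgamate(records.get(id)).split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Ids arrive in increasing order, so checking the last entry
                // is enough to record each id at most once per token.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != id) {
                    postings.add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"John", "Smith", "Boston"});
        records.add(new String[] {"Jon", "Smith", "Boston"});
        records.add(new String[] {"Mary", "Jones", "Austin"});
        // Records 0 and 1 share the token "smith".
        System.out.println(build(records).get("smith"));
    }
}
```

Only records that appear together in some posting list need to be compared later, which is what makes an inverted index useful for blocking.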

See Also:
Serialized Form

Field Summary
protected  java.util.HashMap m_instanceRefHash
          Strings are mapped to StringReferences in this hash
 java.util.ArrayList m_instanceRefs
          A list of all indexed instances.
protected  Instances m_instances
          The dataset that contains the instances
protected  java.util.TreeSet m_pairSet
          A TreeSet where the InstancePairs are stored for subsequent retrieval
protected  java.util.HashMap m_tokenHash
          A HashMap where tokens are indexed.
protected  Tokenizer m_tokenizer
          An underlying tokenizer that is used for converting strings into HashMapVectors
protected  boolean m_useIDF
          Should IDF weighting be used?
 
Constructor Summary
Blocking()
          Construct a vector space from a given set of examples
 
Method Summary
 void buildIndex(Instances instances)
          Given a set of instances, build the vector space
protected  void computeIDFandStringLengths()
          Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
 void createPairSet()
          Populate m_pairSet with all the InstancePairs that share common tokens, so that they can later be retrieved in order of decreasing similarity
 InstancePair[] getMostSimilarPairs(int numPairs)
          Return the numPairs most similar pairs
 java.lang.String[] getOptions()
          Gets the current settings of Blocking
protected static java.lang.String getTimestamp()
          Gets a string containing current date and time.
 Tokenizer getTokenizer()
          Get the tokenizer to use
 boolean getUseIDF()
          Check whether IDF weighting is on or off
protected  void indexInstance(Instance instance, int idx, java.lang.String string, HashMapVector vector)
          Index a given Instance using its corresponding vector
protected  void indexToken(java.lang.String token, int count, InstanceReference instRef)
          Add a token occurrence to the index.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setTokenizer(Tokenizer tokenizer)
          Set the tokenizer to use
 void setUseIDF(boolean useIDF)
          Turn IDF weighting on/off
 double similarity(InstanceReference iRef1, InstanceReference iRef2)
          Compute similarity between two strings
 int size()
          Return the number of tokens indexed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_instances

protected Instances m_instances
The dataset that contains the instances


m_instanceRefHash

protected java.util.HashMap m_instanceRefHash
Strings are mapped to StringReferences in this hash


m_tokenHash

protected java.util.HashMap m_tokenHash
A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.


m_pairSet

protected java.util.TreeSet m_pairSet
A TreeSet where the InstancePairs are stored for subsequent retrieval


m_instanceRefs

public java.util.ArrayList m_instanceRefs
A list of all indexed instances. Elements are InstanceReferences.


m_tokenizer

protected Tokenizer m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors


m_useIDF

protected boolean m_useIDF
Should IDF weighting be used?

Constructor Detail

Blocking

public Blocking()
Construct a vector space from a given set of examples

Method Detail

buildIndex

public void buildIndex(Instances instances)
                throws java.lang.Exception
Given a set of instances, build the vector space

Throws:
java.lang.Exception

indexInstance

protected void indexInstance(Instance instance,
                             int idx,
                             java.lang.String string,
                             HashMapVector vector)
Index a given Instance using its corresponding vector


indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          InstanceReference instRef)
Add a token occurrence to the index.

Parameters:
token - The token to index.
count - The number of times it occurs in the document.
instRef - A reference to the Instance it occurs in.

computeIDFandStringLengths

protected void computeIDFandStringLengths()
Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
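
As a rough illustration of the two quantities named above, the sketch below computes an IDF factor per token and a Euclidean length per document vector. The exact formula used by Blocking is not stated on this page; the sketch assumes the common form idf = ln(N / df) and weights of tf * idf. The class IdfSketch and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; assumes idf = ln(N / df), which this page does not confirm.
final class IdfSketch {

    /** Computes idf(token) = ln(N / df), where df counts the documents containing the token. */
    static Map<String, Double> idf(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String token : doc.keySet()) {
                df.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    /** Euclidean length of one document vector whose weights are tf * idf. */
    static double length(Map<String, Integer> doc, Map<String, Double> idf) {
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double weight = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            sumSquares += weight * weight;
        }
        return Math.sqrt(sumSquares);
    }
}
```

Under this assumption a token that occurs in every document gets an IDF of zero and therefore contributes nothing to any vector length.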


createPairSet

public void createPairSet()
Populate m_pairSet with all the InstancePairs that share common tokens, so that they can later be retrieved in order of decreasing similarity
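
A minimal sketch of this step, under stated assumptions: candidate pairs are taken from the posting lists of an inverted index, scored, and kept in a TreeSet ordered by decreasing score so that the best pairs come first. The class PairSetSketch, its Pair type and the use of a shared-token count as the score are hypothetical stand-ins; the real class stores InstancePair objects and scores them with similarity(). Record ids are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch only; hypothetical names, not the weka.deduping.blocking API.
final class PairSetSketch {

    /** A candidate pair of record ids (first < second) with its score. */
    static final class Pair {
        final int first;
        final int second;
        final double score;

        Pair(int first, int second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    // Highest score first. The ids break ties so that the TreeSet does not
    // discard distinct pairs that happen to have equal scores.
    static final Comparator<Pair> BY_SCORE_DESC = (a, b) -> {
        int c = Double.compare(b.score, a.score);
        if (c != 0) {
            return c;
        }
        c = Integer.compare(a.first, b.first);
        return c != 0 ? c : Integer.compare(a.second, b.second);
    };

    /** Builds the ordered pair set; the score here is the number of shared tokens. */
    static TreeSet<Pair> createPairSet(Map<String, List<Integer>> index) {
        Map<Long, Integer> shared = new HashMap<>();
        for (List<Integer> postings : index.values()) {
            for (int i = 0; i < postings.size(); i++) {
                for (int j = i + 1; j < postings.size(); j++) {
                    int a = Math.min(postings.get(i), postings.get(j));
                    int b = Math.max(postings.get(i), postings.get(j));
                    if (a == b) {
                        continue; // ignore a record paired with itself
                    }
                    // Pack both non-negative ids into one long key.
                    shared.merge(((long) a << 32) | b, 1, Integer::sum);
                }
            }
        }
        TreeSet<Pair> pairs = new TreeSet<>(BY_SCORE_DESC);
        for (Map.Entry<Long, Integer> e : shared.entrySet()) {
            long key = e.getKey();
            pairs.add(new Pair((int) (key >>> 32), (int) key, e.getValue()));
        }
        return pairs;
    }

    /** Returns up to n pairs in order of decreasing score. */
    static List<Pair> mostSimilar(TreeSet<Pair> pairs, int n) {
        List<Pair> top = new ArrayList<>();
        for (Pair p : pairs) {
            if (top.size() == n) {
                break;
            }
            top.add(p);
        }
        return top;
    }
}
```

The tie-breaking comparator matters because a TreeSet treats elements that compare as equal as duplicates; ordering by score alone would silently drop pairs.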


similarity

public double similarity(InstanceReference iRef1,
                         InstanceReference iRef2)
Compute similarity between two strings
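
This page does not say which measure similarity() uses. Since the class is described as a vector space borrowed from VectorSpaceMetric, cosine similarity between weighted token vectors is a plausible reading, and the sketch below shows that measure under this assumption. The class CosineSketch and its map-based vectors are hypothetical; the real method takes InstanceReference arguments.

```java
import java.util.Map;

// Illustrative sketch only; cosine similarity is an assumption, not confirmed by this page.
final class CosineSketch {

    /** Cosine of the angle between two sparse vectors; 0 if either vector has zero length. */
    static double similarity(Map<String, Double> v1, Map<String, Double> v2) {
        // Iterate over the smaller vector and look tokens up in the larger one.
        Map<String, Double> small = v1.size() <= v2.size() ? v1 : v2;
        Map<String, Double> large = small == v1 ? v2 : v1;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm = length(v1) * length(v2);
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    /** Euclidean length of a sparse vector. */
    static double length(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double weight : v.values()) {
            sumSquares += weight * weight;
        }
        return Math.sqrt(sumSquares);
    }
}
```

With non-negative weights the result lies between 0 (no shared tokens) and 1 (identical direction), which gives a natural ordering for ranking pairs.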


getMostSimilarPairs

public InstancePair[] getMostSimilarPairs(int numPairs)
Return the numPairs most similar pairs


size

public int size()
Return the number of tokens indexed.

Returns:
the number of tokens indexed

setTokenizer

public void setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use

Parameters:
tokenizer - the tokenizer that is used

getTokenizer

public Tokenizer getTokenizer()
Get the tokenizer to use

Returns:
the tokenizer that is used

setUseIDF

public void setUseIDF(boolean useIDF)
Turn IDF weighting on/off

Parameters:
useIDF - if true, all token weights will be weighted by IDF

getUseIDF

public boolean getUseIDF()
Check whether IDF weighting is on or off

Returns:
if true, all token weights are weighted by IDF

getOptions

public java.lang.String[] getOptions()
Gets the current settings of Blocking

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

getTimestamp

protected static java.lang.String getTimestamp()
Gets a string containing current date and time.

Returns:
a string containing the date and time.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.