java.lang.Object
  weka.deduping.blocking.Blocking
This class takes a set of records, concatenates each record into a single string, and builds an inverted index over the resulting collection. It can then return the pairs of strings that are most similar. Largely borrowed from VectorSpaceMetric.
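The indexing idea described above can be illustrated with a minimal, self-contained sketch. It is not the Blocking source: the class name InvertedIndexSketch, the whitespace tokenizer, and the "i,j" pair encoding are assumptions made for illustration, and plain strings stand in for Weka Instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; not the weka.deduping.blocking.Blocking implementation. */
class InvertedIndexSketch {

    /** Maps each lower-cased whitespace token to the ids of the records containing it. */
    static Map<String, Set<Integer>> buildIndex(List<String> records) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            for (String token : records.get(id).toLowerCase().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty record yields one empty token; skip it
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    /** Returns every unordered pair "i,j" (i < j) of records that share at least one token. */
    static Set<String> candidatePairs(Map<String, Set<Integer>> index) {
        Set<String> pairs = new TreeSet<>();
        for (Set<Integer> posting : index.values()) {
            // Postings are TreeSets, so ids come out in ascending order.
            List<Integer> ids = new ArrayList<>(posting);
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    pairs.add(ids.get(a) + "," + ids.get(b));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
                "john smith austin", "jon smith austin tx", "mary jones boston");
        System.out.println(candidatePairs(buildIndex(records))); // prints [0,1]
    }
}
```

Only records that share a token are ever paired, which is what makes an inverted index useful for blocking: the third record above is never compared with the first two.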
Field Summary | |
protected java.util.HashMap |
m_instanceRefHash
Strings are mapped to StringReferences in this hash |
java.util.ArrayList |
m_instanceRefs
A list of all indexed instances. |
protected Instances |
m_instances
The dataset that contains the instances |
protected java.util.TreeSet |
m_pairSet
A TreeSet where the InstancePairs are stored for subsequent retrieval |
protected java.util.HashMap |
m_tokenHash
A HashMap where tokens are indexed. |
protected Tokenizer |
m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors |
protected boolean |
m_useIDF
Should IDF weighting be used? |
Constructor Summary | |
Blocking()
Construct a vector space from a given set of examples |
Method Summary | |
void |
buildIndex(Instances instances)
Given a list of strings, build the vector space |
protected void |
computeIDFandStringLengths()
Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index. |
void |
createPairSet()
Populate m_pairSet with all InstancePairs that share at least one token, so that they can later be retrieved in order of decreasing similarity |
InstancePair[] |
getMostSimilarPairs(int numPairs)
Return the numPairs most similar pairs |
java.lang.String[] |
getOptions()
Gets the current settings of Blocking |
protected static java.lang.String |
getTimestamp()
Gets a string containing current date and time. |
Tokenizer |
getTokenizer()
Get the tokenizer to use |
boolean |
getUseIDF()
Check whether IDF weighting is on or off |
protected void |
indexInstance(Instance instance,
int idx,
java.lang.String string,
HashMapVector vector)
Index a given Instance using its corresponding vector |
protected void |
indexToken(java.lang.String token,
int count,
InstanceReference instRef)
Add a token occurrence to the index. |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use |
void |
setUseIDF(boolean useIDF)
Turn IDF weighting on/off |
double |
similarity(InstanceReference iRef1,
InstanceReference iRef2)
Compute similarity between two strings |
int |
size()
Return the number of tokens indexed. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
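The summaries of createPairSet() and getMostSimilarPairs(int) suggest a sorted set that is filled once and then read from the top. A hedged sketch of that pattern follows. ScoredPair is a hypothetical stand-in for InstancePair, whose fields are not shown on this page, and the tie-breaking comparator is an assumption about one way to keep equally similar pairs from being dropped by a TreeSet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of ordered pair retrieval; not the Blocking source. */
class PairSetSketch {

    /** Stand-in for InstancePair: two record ids and their similarity. */
    static final class ScoredPair {
        final int first;
        final int second;
        final double similarity;

        ScoredPair(int first, int second, double similarity) {
            this.first = first;
            this.second = second;
            this.similarity = similarity;
        }

        @Override
        public String toString() {
            return "(" + first + "," + second + "):" + similarity;
        }
    }

    /** Descending similarity; ids break ties so equally similar pairs are both kept. */
    static final Comparator<ScoredPair> BY_DECREASING_SIMILARITY =
            Comparator.comparingDouble((ScoredPair p) -> p.similarity).reversed()
                    .thenComparingInt(p -> p.first)
                    .thenComparingInt(p -> p.second);

    /** Returns up to numPairs pairs from the front of the sorted set. */
    static List<ScoredPair> mostSimilar(TreeSet<ScoredPair> pairSet, int numPairs) {
        List<ScoredPair> result = new ArrayList<>();
        for (ScoredPair p : pairSet) {
            if (result.size() >= numPairs) {
                break;
            }
            result.add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<ScoredPair> pairSet = new TreeSet<>(BY_DECREASING_SIMILARITY);
        pairSet.add(new ScoredPair(0, 1, 0.9));
        pairSet.add(new ScoredPair(0, 2, 0.5));
        pairSet.add(new ScoredPair(1, 2, 0.9));
        System.out.println(mostSimilar(pairSet, 2));
    }
}
```

A comparator that looked only at similarity would treat the two 0.9 pairs as duplicates, and the TreeSet would silently keep just one of them.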
Field Detail |
protected Instances m_instances
protected java.util.HashMap m_instanceRefHash
protected java.util.HashMap m_tokenHash
protected java.util.TreeSet m_pairSet
public java.util.ArrayList m_instanceRefs
protected Tokenizer m_tokenizer
protected boolean m_useIDF
Constructor Detail |
public Blocking()
Method Detail |
public void buildIndex(Instances instances) throws java.lang.Exception
Throws:
java.lang.Exception
protected void indexInstance(Instance instance, int idx, java.lang.String string, HashMapVector vector)
protected void indexToken(java.lang.String token, int count, InstanceReference instRef)
Parameters:
token - The token to index.
count - The number of times it occurs in the document.
instRef - A reference to the Instance it occurs in.
protected void computeIDFandStringLengths()
public void createPairSet()
public double similarity(InstanceReference iRef1, InstanceReference iRef2)
public InstancePair[] getMostSimilarPairs(int numPairs)
public int size()
public void setTokenizer(Tokenizer tokenizer)
Parameters:
tokenizer - the tokenizer that is used
public Tokenizer getTokenizer()
public void setUseIDF(boolean useIDF)
Parameters:
useIDF - if true, all token weights will be weighted by IDF
public boolean getUseIDF()
public java.lang.String[] getOptions()
Specified by:
getOptions in interface OptionHandler
protected static java.lang.String getTimestamp()
public void setOptions(java.lang.String[] options) throws java.lang.Exception
Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.util.Enumeration listOptions()
Specified by:
listOptions in interface OptionHandler
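The summaries of computeIDFandStringLengths(), setUseIDF(boolean) and similarity(InstanceReference, InstanceReference) describe IDF-weighted vector lengths and a pairwise similarity. This page does not give the formulas, so the sketch below assumes the common choices: idf(t) = ln(N / df(t)) and cosine similarity over token-count vectors. The class TfIdfCosineSketch, its method names, and the whitespace tokenizer are hypothetical and are not part of the Blocking API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of IDF weighting and cosine similarity; formulas are assumed. */
class TfIdfCosineSketch {

    /** Counts lower-cased whitespace tokens in one string. */
    static Map<String, Integer> termCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : s.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                counts.merge(t, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Assumed form: idf(t) = ln(N / df(t)), df(t) = number of strings containing t. */
    static Map<String, Double> idf(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String t : doc.keySet()) {
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            result.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return result;
    }

    /** Token weight: raw count, optionally multiplied by the token's IDF. */
    static double weight(String token, int count, Map<String, Double> idf, boolean useIDF) {
        return useIDF ? count * idf.getOrDefault(token, 0.0) : count;
    }

    /** Euclidean length of the weighted token vector. */
    static double length(Map<String, Integer> doc, Map<String, Double> idf, boolean useIDF) {
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double w = weight(e.getKey(), e.getValue(), idf, useIDF);
            sumSquares += w * w;
        }
        return Math.sqrt(sumSquares);
    }

    /** Cosine similarity of two weighted token vectors; 0 if either has zero length. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b,
                             Map<String, Double> idf, boolean useIDF) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer countInB = b.get(e.getKey());
            if (countInB != null) {
                dot += weight(e.getKey(), e.getValue(), idf, useIDF)
                        * weight(e.getKey(), countInB, idf, useIDF);
            }
        }
        double denom = length(a, idf, useIDF) * length(b, idf, useIDF);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = Arrays.asList(
                termCounts("john smith"), termCounts("john smith"), termCounts("mary jones"));
        Map<String, Double> idf = idf(docs);
        System.out.println(similarity(docs.get(0), docs.get(1), idf, true));
        System.out.println(similarity(docs.get(0), docs.get(2), idf, true));
    }
}
```

Under this assumed weighting, a token that occurs in every string gets an IDF of zero and stops contributing to similarity, which is the usual motivation for switching IDF on.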