java.lang.Object
  weka.deduping.metrics.StringMetric
    weka.deduping.metrics.VectorSpaceMetric
This class uses a vector space to calculate the similarity between two strings. Some code is borrowed from the ir.vsr package by Raymond J. Mooney.
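For orientation, vector-space similarity between two strings is commonly computed as the cosine of the angle between their token-count vectors. The following is a minimal sketch under that assumption; `CosineSketch`, its whitespace tokenization, and its lowercasing are hypothetical stand-ins and are not taken from the weka.deduping sources.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the weka.deduping implementation. */
final class CosineSketch {

    /** Assumed tokenization: lowercase, split on whitespace, count occurrences. */
    static Map<String, Double> toVector(String s) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : s.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Euclidean length of a sparse vector. */
    static double length(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double weight : v.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine similarity; returns 0 when either string has no tokens. */
    static double similarity(String s1, String s2) {
        Map<String, Double> v1 = toVector(s1);
        Map<String, Double> v2 = toVector(s2);
        double dot = 0.0;
        for (Map.Entry<String, Double> e : v1.entrySet()) {
            dot += e.getValue() * v2.getOrDefault(e.getKey(), 0.0);
        }
        double norm = length(v1) * length(v2);
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    public static void main(String[] args) {
        System.out.println(similarity("john smith", "smith john"));
        System.out.println(similarity("john smith", "jane doe"));
    }
}
```

Strings with the same tokens in a different order score close to 1, and strings with no tokens in common score 0, which is why this kind of metric tolerates word reordering in duplicate detection.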
Field Summary
Type | Field | Description
static int | CONVERSION_EXPONENTIAL |
static int | CONVERSION_LAPLACIAN | We can have different ways of converting from similarity to distance.
static int | CONVERSION_UNIT |
protected int | m_conversionType | The conversion method; Laplacian by default.
protected java.util.HashMap | m_stringRefHash | Strings are mapped to StringReferences in this hash.
java.util.ArrayList | m_stringRefs | A list of all indexed strings.
protected java.util.HashMap | m_tokenHash | A HashMap where tokens are indexed.
protected Tokenizer | m_tokenizer | An underlying tokenizer that is used for converting strings into HashMapVectors.
protected boolean | m_useIDF | Should IDF weighting be used?
static Tag[] | TAGS_CONVERSION |
Constructor Summary
VectorSpaceMetric() | Construct a vector space from a given set of examples.
Method Summary
Return type | Method | Description
void | buildMetric(java.util.List strings) | Given a list of strings, build the vector space.
java.lang.Object | clone() | Create a copy of this metric.
protected void | computeIDFandStringLengths() | Compute the IDF factor for every token in the index and the length of the string vector for every string referenced in the index.
double | distance(java.lang.String string1, java.lang.String string2) | Returns the distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...).
SelectedTag | getConversionType() | Return the type of similarity-to-distance conversion.
java.lang.String[] | getOptions() | Gets the current settings of NGramTokenizer.
Tokenizer | getTokenizer() | Get the tokenizer to use.
boolean | getUseIDF() | Check whether IDF weighting is on or off.
protected void | indexString(java.lang.String string, HashMapVector vector) | Index a given string using its corresponding vector.
protected void | indexToken(java.lang.String token, int count, StringReference strRef) | Add a token occurrence to the index.
boolean | isDistanceBased() | The computation of a metric can be based either on distance or on similarity.
java.util.Enumeration | listOptions() | Returns an enumeration describing the available options.
void | setConversionType(SelectedTag conversionType) | Set the type of similarity-to-distance conversion.
void | setOptions(java.lang.String[] options) | Parses a given list of options.
void | setTokenizer(Tokenizer tokenizer) | Set the tokenizer to use.
void | setUseIDF(boolean useIDF) | Turn IDF weighting on or off.
double | similarity(java.lang.String s1, java.lang.String s2) | Compute the similarity between two strings.
int | size() | Return the number of tokens indexed.
Methods inherited from class weka.deduping.metrics.StringMetric
forName
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected java.util.HashMap m_stringRefHash
protected java.util.HashMap m_tokenHash
public java.util.ArrayList m_stringRefs
protected Tokenizer m_tokenizer
protected boolean m_useIDF
public static final int CONVERSION_LAPLACIAN
public static final int CONVERSION_UNIT
public static final int CONVERSION_EXPONENTIAL
public static final Tag[] TAGS_CONVERSION
protected int m_conversionType
Constructor Detail
public VectorSpaceMetric()
Method Detail
public void buildMetric(java.util.List strings) throws java.lang.Exception
buildMetric in interface DataDependentStringMetric
Parameters:
strings - a list of strings that the metric is built on
Throws:
java.lang.Exception
protected void indexString(java.lang.String string, HashMapVector vector)
protected void indexToken(java.lang.String token, int count, StringReference strRef)
Parameters:
token - The token to index.
count - The number of times it occurs in the document.
strRef - A reference to the String it occurs in.
protected void computeIDFandStringLengths()
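The indexing methods above add token occurrences to an index, after which computeIDFandStringLengths() derives an IDF factor per token and a vector length per string. The sketch below shows one common way to do this, assuming IDF is the natural log of N / df (N indexed strings, df strings containing the token) and that each weight is count times IDF. The formula, the class `IdfIndexSketch`, and its fields are assumptions for illustration; this page does not state how the real class computes them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the weka.deduping implementation. */
final class IdfIndexSketch {
    /** token -> (string id -> occurrence count), a simple inverted index. */
    final Map<String, Map<Integer, Integer>> tokenIndex = new HashMap<>();
    /** All indexed strings; a string's id is its position in this list. */
    final List<String> strings = new ArrayList<>();
    /** token -> IDF factor, filled in by computeIdfAndLengths(). */
    final Map<String, Double> idf = new HashMap<>();
    /** Euclidean length of each string's IDF-weighted vector. */
    double[] lengths = new double[0];

    /** Index every string (assumed whitespace tokenization), then compute IDF and lengths. */
    void build(List<String> input) {
        for (String s : input) {
            int id = strings.size();
            strings.add(s);
            for (String token : s.trim().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                tokenIndex.computeIfAbsent(token, k -> new HashMap<>())
                          .merge(id, 1, Integer::sum);
            }
        }
        computeIdfAndLengths();
    }

    /** Assumed weighting: idf = ln(N / df), weight = count * idf. */
    void computeIdfAndLengths() {
        int n = strings.size();
        lengths = new double[n];
        idf.clear();
        for (Map.Entry<String, Map<Integer, Integer>> entry : tokenIndex.entrySet()) {
            Map<Integer, Integer> postings = entry.getValue();
            double tokenIdf = Math.log((double) n / postings.size());
            idf.put(entry.getKey(), tokenIdf);
            for (Map.Entry<Integer, Integer> posting : postings.entrySet()) {
                double weight = tokenIdf * posting.getValue();
                lengths[posting.getKey()] += weight * weight; // accumulate squared weights
            }
        }
        for (int i = 0; i < n; i++) {
            lengths[i] = Math.sqrt(lengths[i]);
        }
    }
}
```

Under this weighting a token that appears in every indexed string gets an IDF of 0 and contributes nothing to similarity, so the set of strings passed at build time directly shapes the resulting scores.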
public double similarity(java.lang.String s1, java.lang.String s2)
similarity in class StringMetric
Parameters:
s1 - first string
s2 - second string
public boolean isDistanceBased()
isDistanceBased in class StringMetric
public void setTokenizer(Tokenizer tokenizer)
Parameters:
tokenizer - the tokenizer that is used
public Tokenizer getTokenizer()
public void setUseIDF(boolean useIDF)
Parameters:
useIDF - if true, all token weights will be weighted by IDF
public boolean getUseIDF()
public int size()
public double distance(java.lang.String string1, java.lang.String string2) throws java.lang.Exception
distance in class StringMetric
Parameters:
string1 - First string.
string2 - Second string.
Throws:
java.lang.Exception - if distance could not be estimated.
public void setConversionType(SelectedTag conversionType)
public SelectedTag getConversionType()
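The distance method converts a similarity into a distance according to the selected conversion type, but this page does not give the formulas. The sketch below uses three commonly seen conversions as assumptions: 1 / (1 + s) for Laplacian, 1 - s for unit, and exp(-s) for exponential. The class `ConversionSketch` and its enum are hypothetical; the real class uses the int constants and TAGS_CONVERSION listed above, and its formulas may differ.

```java
/** Hypothetical illustration only; the formulas are assumptions, not the documented behavior. */
final class ConversionSketch {

    /** Stand-in for the CONVERSION_* int constants. */
    enum Conversion { LAPLACIAN, UNIT, EXPONENTIAL }

    /** Map a similarity value to a distance; each conversion decreases as similarity grows. */
    static double toDistance(double similarity, Conversion type) {
        switch (type) {
            case LAPLACIAN:
                return 1.0 / (1.0 + similarity);
            case UNIT:
                return 1.0 - similarity;
            case EXPONENTIAL:
                return Math.exp(-similarity);
            default:
                throw new IllegalArgumentException("Unknown conversion: " + type);
        }
    }

    public static void main(String[] args) {
        for (Conversion c : Conversion.values()) {
            System.out.println(c + ": " + toDistance(0.5, c));
        }
    }
}
```

Of these three, only the unit conversion reaches a distance of 0 for a similarity of 1; the other two stay strictly positive, which matters if downstream code tests for exact zero distance.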
public java.lang.Object clone()
clone in class StringMetric
public java.lang.String[] getOptions()
getOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-S use stemming
-R remove stopwords
-N gram size
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
public java.util.Enumeration listOptions()
listOptions in interface OptionHandler
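To illustrate the option list above, here is a minimal, self-contained sketch of parsing -S, -R, and -N from a string array. The class `OptionsSketch`, its field names, and the default gram size are hypothetical. The documented setOptions declares java.lang.Exception, whereas this sketch throws the unchecked IllegalArgumentException to stay short.

```java
/** Hypothetical illustration only; not the weka.deduping option handling. */
final class OptionsSketch {
    boolean useStemming;      // set by -S
    boolean removeStopwords;  // set by -R
    int gramSize = 1;         // set by -N <size>; the default of 1 is an assumption

    /** Parse the options, rejecting anything that is not -S, -R, or -N <size>. */
    void setOptions(String[] options) {
        for (int i = 0; i < options.length; i++) {
            switch (options[i]) {
                case "-S":
                    useStemming = true;
                    break;
                case "-R":
                    removeStopwords = true;
                    break;
                case "-N":
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("-N requires a gram size");
                    }
                    gramSize = Integer.parseInt(options[++i]); // consume the value
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
    }

    public static void main(String[] args) {
        OptionsSketch sketch = new OptionsSketch();
        sketch.setOptions(new String[] {"-S", "-N", "3"});
        System.out.println(sketch.useStemming + " " + sketch.removeStopwords + " " + sketch.gramSize);
    }
}
```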