java.lang.Object
  weka.deduping.metrics.StringMetric
    weka.deduping.metrics.JaccardMetric
This class calculates similarity between two strings using the Jaccard metric. Some code is borrowed from the ir.vsr package by Raymond J. Mooney.
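As a rough illustration of the metric named above: the Jaccard coefficient of two token sets is the size of their intersection divided by the size of their union. The sketch below is not this class's implementation. `JaccardSketch`, `tokenize`, and `jaccard` are hypothetical names, lower-cased whitespace splitting is assumed in place of the configurable Tokenizer, and tokens are treated as a set (this page does not say whether JaccardMetric weights tokens by count).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not part of weka.deduping.metrics.
class JaccardSketch {

    // Assumption: whitespace tokenization stands in for the configured Tokenizer.
    static Set<String> tokenize(String s) {
        Set<String> tokens = new HashSet<>();
        for (String t : s.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Jaccard coefficient |A ∩ B| / |A ∪ B|; returns 0 when both sets are empty.
    static double jaccard(String s1, String s2) {
        Set<String> a = tokenize(s1);
        Set<String> b = tokenize(s2);

        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);

        Set<String> union = new HashSet<>(a);
        union.addAll(b);

        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two of the three distinct tokens are shared.
        System.out.println(jaccard("John A. Smith", "John Smith"));
    }
}
```

Identical strings score 1 and strings with no tokens in common score 0, so the value behaves as a similarity rather than a distance.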
Field Summary | |
static int |
CONVERSION_EXPONENTIAL
|
static int |
CONVERSION_LAPLACIAN
We can have different ways of converting from similarity to distance |
static int |
CONVERSION_UNIT
|
protected int |
m_conversionType
The method of converting, by default laplacian |
protected java.util.HashMap |
m_stringRefHash
Strings are mapped to StringReferences in this hash |
java.util.ArrayList |
m_stringRefs
A list of all indexed strings. |
protected java.util.HashMap |
m_tokenHash
A HashMap where tokens are indexed. |
protected Tokenizer |
m_tokenizer
An underlying tokenizer that is used for converting strings into HashMapVectors |
static Tag[] |
TAGS_CONVERSION
|
Constructor Summary | |
JaccardMetric()
Construct a vector space from a given set of examples |
Method Summary | |
void |
buildMetric(java.util.List strings)
Given a list of strings, build the vector space |
java.lang.Object |
clone()
Create a copy of this metric |
double |
distance(java.lang.String string1,
java.lang.String string2)
Returns distance between two strings using the current conversion type (CONVERSION_LAPLACIAN, CONVERSION_EXPONENTIAL, CONVERSION_UNIT, ...) |
double |
distance1(java.lang.String s1,
java.lang.String s2)
|
SelectedTag |
getConversionType()
Return the type of similarity-to-distance conversion |
java.lang.String[] |
getOptions()
Gets the current settings of NGramTokenizer. |
Tokenizer |
getTokenizer()
Get the tokenizer to use |
protected void |
indexString(java.lang.String string,
HashMapVector vector)
Index a given string using its corresponding vector |
protected void |
indexToken(java.lang.String token,
int count,
StringReference strRef)
Add a token occurrence to the index. |
boolean |
isDistanceBased()
The computation of a metric can be either based on distance, or on similarity |
java.util.Enumeration |
listOptions()
Returns an enumeration describing the available options. |
void |
setConversionType(SelectedTag conversionType)
Set the type of similarity to distance conversion. |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setTokenizer(Tokenizer tokenizer)
Set the tokenizer to use |
double |
similarity(java.lang.String s1,
java.lang.String s2)
Compute similarity between two strings |
int |
size()
Return the number of tokens indexed. |
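The indexing methods summarized above (buildMetric, indexString, indexToken, size) suggest an inverted index from tokens to the strings that contain them. The following is a minimal sketch of that idea only. `TokenIndexSketch`, `build`, and `stringsContaining` are hypothetical names, integer ids stand in for StringReference, and whitespace splitting stands in for the Tokenizer; none of this is taken from the actual class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a token index; not the JaccardMetric implementation.
class TokenIndexSketch {

    // token -> ids of the strings containing it (plays the role of m_tokenHash)
    private final Map<String, Set<Integer>> tokenIndex = new HashMap<>();

    // all indexed strings, in insertion order (plays the role of m_stringRefs)
    private final List<String> strings = new ArrayList<>();

    // Rebuild the index from scratch for the given strings.
    void build(List<String> input) {
        tokenIndex.clear();
        strings.clear();
        for (String s : input) {
            int id = strings.size();
            strings.add(s);
            for (String token : s.toLowerCase().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokenIndex.computeIfAbsent(token, k -> new HashSet<>()).add(id);
                }
            }
        }
    }

    // Number of distinct tokens indexed.
    int size() {
        return tokenIndex.size();
    }

    // Ids of the strings containing the token, or an empty set if it is unknown.
    Set<Integer> stringsContaining(String token) {
        return tokenIndex.getOrDefault(token, Collections.emptySet());
    }
}
```

With such an index, candidate matches for a string can be found by looking up its tokens instead of comparing it against every indexed string.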
Methods inherited from class weka.deduping.metrics.StringMetric |
forName |
Methods inherited from class java.lang.Object |
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected java.util.HashMap m_stringRefHash
protected java.util.HashMap m_tokenHash
public java.util.ArrayList m_stringRefs
protected Tokenizer m_tokenizer
public static final int CONVERSION_LAPLACIAN
public static final int CONVERSION_UNIT
public static final int CONVERSION_EXPONENTIAL
public static final Tag[] TAGS_CONVERSION
protected int m_conversionType
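The CONVERSION_* constants above select how a similarity value is turned into a distance. This page does not give the formulas or the constant values, so the sketch below uses assumed ones: Laplacian as 1 / (1 + sim), unit as 1 - sim, and exponential as exp(-sim). The class name `ConversionSketch`, the method `toDistance`, and the integer values are hypothetical, and the actual class may scale the results differently.

```java
// Hypothetical illustration; formulas and constant values are assumptions.
class ConversionSketch {

    static final int CONVERSION_LAPLACIAN = 1;   // assumed value
    static final int CONVERSION_UNIT = 2;        // assumed value
    static final int CONVERSION_EXPONENTIAL = 4; // assumed value

    // Convert a similarity in [0, 1] to a distance using the chosen scheme.
    static double toDistance(double similarity, int conversionType) {
        switch (conversionType) {
            case CONVERSION_LAPLACIAN:
                return 1.0 / (1.0 + similarity);
            case CONVERSION_UNIT:
                return 1.0 - similarity;
            case CONVERSION_EXPONENTIAL:
                return Math.exp(-similarity);
            default:
                throw new IllegalArgumentException(
                        "Unknown conversion type: " + conversionType);
        }
    }
}
```

All three mappings decrease as similarity grows, so more similar strings end up closer together. Under these assumed formulas only the unit form maps a similarity of 1 to a distance of exactly 0.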
Constructor Detail |
public JaccardMetric()
Method Detail |
public void buildMetric(java.util.List strings) throws java.lang.Exception
buildMetric in interface DataDependentStringMetric
strings - a list of strings that the metric is built on
java.lang.Exception
protected void indexString(java.lang.String string, HashMapVector vector)
protected void indexToken(java.lang.String token, int count, StringReference strRef)
token - The token to index.
count - The number of times it occurs in the document.
strRef - A reference to the String it occurs in.
public double similarity(java.lang.String s1, java.lang.String s2)
similarity in class StringMetric
s1 - first string
s2 - second string
public boolean isDistanceBased()
isDistanceBased in class StringMetric
public void setTokenizer(Tokenizer tokenizer)
tokenizer - the tokenizer that is used
public Tokenizer getTokenizer()
public int size()
public double distance1(java.lang.String s1, java.lang.String s2) throws java.lang.Exception
java.lang.Exception
public double distance(java.lang.String string1, java.lang.String string2) throws java.lang.Exception
distance in class StringMetric
string1 - First string.
string2 - Second string.
java.lang.Exception - if distance could not be estimated.
public void setConversionType(SelectedTag conversionType)
public SelectedTag getConversionType()
public java.lang.Object clone()
clone in class StringMetric
public java.lang.String[] getOptions()
getOptions in interface OptionHandler
public void setOptions(java.lang.String[] options) throws java.lang.Exception
-S use stemming
-R remove stopwords
-N gram size
setOptions in interface OptionHandler
options - the list of options as an array of strings
java.lang.Exception - if an option is not supported
public java.util.Enumeration listOptions()
listOptions in interface OptionHandler