Package weka.deduping.metrics

Interface Summary
DataDependentStringMetric An interface for data-dependent metrics that are built on supplied data
LearnableStringMetric An interface for learnable string metrics
 

Class Summary
AffineMetric A measure of distance between two strings based on affine distance.
AffineProbMetric Implements a probabilistic model of string edit distance with affine-cost gaps
ClassifierInstanceMetric Employs a classifier that uses the values returned by various StringMetrics on individual fields as features and outputs a confidence value corresponding to the similarity between records
HashMapVector A data structure for a document's term vector, stored as a HashMap that maps tokens to Weights holding the weight of each token in the document.
InstanceMetric Abstract InstanceMetric class for writing metrics that calculate distance between instances describing database records
JaccardMetric This class calculates similarity between two strings using the Jaccard metric. Some code borrowed from ir.vsr package by Raymond J.
KernelVSMetric This class defines a basic string kernel based on vector space. Some code borrowed from ir.vsr package by Raymond J.
NGramTokenizer This class defines a tokenizer that turns strings into HashMapVectors of n-grams
Porter The Porter stemmer for reducing words to their base stem form.
StringMetric An abstract class that returns a measure of similarity between strings
StringReference A simple data structure for storing a reference to a document file that includes information on the length of its document vector.
SumInstanceMetric Adds the values returned by StringMetrics on individual fields
TokenInfo A lightweight object for storing information about a token (a.k.a. word, term) in an inverted index.
Tokenizer This abstract class defines a tokenizer that turns strings into HashMapVectors
TokenOccurrence A lightweight object for storing information about an occurrence of a token (a.k.a. word, term) in a Document.
TokenString  
VectorSpaceMetric This class uses a vector space to calculate similarity between two strings. Some code borrowed from ir.vsr package by Raymond J.
Weight A simple wrapper data structure for storing a double weight as an Object that can be put into lists, maps, etc.
WordTokenizer This class defines a tokenizer that turns strings into HashMapVectors using the native Java StringTokenizer
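
The summaries above describe tokenizers that turn strings into token-to-weight maps and metrics that compare those token sets. As a rough illustration of the idea only, the following sketch computes Jaccard similarity over character n-grams. It does not use this package's API: the class name NGramJaccardSketch, both method names, the lower-casing step, and the convention that two strings with no n-grams score 1.0 are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical, standalone sketch; not part of weka.deduping.metrics.
final class NGramJaccardSketch {

    /** Counts the overlapping character n-grams of s after lower-casing it. */
    static Map<String, Integer> ngrams(String s, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String t = s.toLowerCase(Locale.ROOT);
        for (int i = 0; i + n <= t.length(); i++) {
            // merge() inserts 1 for a new n-gram, otherwise adds 1 to its count
            counts.merge(t.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Jaccard similarity: size of intersection / size of union of distinct n-grams. */
    static double jaccard(String a, String b, int n) {
        Set<String> setA = ngrams(a, n).keySet();
        Set<String> setB = ngrams(b, n).keySet();
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // assumption of this sketch: two empty token sets count as identical
        }
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // "night" and "nacht" share one bigram ("ht") out of seven distinct bigrams
        System.out.println(jaccard("night", "nacht", 2));
    }
}
```

The sketch keeps n-gram counts in the map even though Jaccard only needs the distinct keys, because a weighted comparison such as a vector-space cosine would start from the same token-to-count structure.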