UTCS Artificial Intelligence
Employing Trainable String Similarity Metrics for Information Integration (2003)
Mikhail Bilenko and Raymond J. Mooney
Identifying approximately duplicate records in databases is an essential step in information integration. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each data field, and introduce an extended variant of learnable string edit distance based on an Expectation-Maximization (EM) training algorithm. Experimental results on a range of datasets show that this similarity metric is capable of adapting to the specific notions of similarity that are appropriate for different domains. Our overall system, MARLIN, uses support vector machines to combine multiple similarity metrics; these are shown to perform better than the ensembles of decision trees previously employed for this task.
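The idea of a separate, tunable distance function for each data field can be illustrated with a minimal sketch. This is not MARLIN's implementation: the function names, the cost tables, and the example records are hypothetical, and the substitution costs are set by hand here, whereas the abstract describes learning them with EM. The sketch only shows how per-field costs change the distance and how the per-field distances form a vector that a classifier such as a support vector machine could then combine.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical type: cost of substituting one character for another.
Costs = Dict[Tuple[str, str], float]


def weighted_edit_distance(
    s: str,
    t: str,
    sub_costs: Optional[Costs] = None,
    default_sub: float = 1.0,
    gap: float = 1.0,
) -> float:
    """Edit distance with per-character-pair substitution costs.

    In a trainable setting these costs would be estimated from labeled
    duplicate pairs; here they are supplied by hand (an assumption).
    """
    sub_costs = sub_costs or {}
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap]
        for j, b in enumerate(t, start=1):
            if a == b:
                sub = 0.0
            else:
                # Treat the cost table as symmetric.
                sub = sub_costs.get((a, b), sub_costs.get((b, a), default_sub))
            curr.append(min(prev[j] + gap,        # delete a
                            curr[j - 1] + gap,    # insert b
                            prev[j - 1] + sub))   # substitute or match
        prev = curr
    return prev[-1]


def field_distance_vector(
    rec_a: Dict[str, str],
    rec_b: Dict[str, str],
    fields: List[str],
    field_costs: Dict[str, Costs],
) -> List[float]:
    """One distance per field, each computed with that field's own costs."""
    return [
        weighted_edit_distance(rec_a[f], rec_b[f], field_costs.get(f))
        for f in fields
    ]
```

For example, calling `field_distance_vector` on two records with fields `["name", "city"]` yields a two-element list of distances; in the framework described above, vectors of this kind would be the input to the classifier that decides whether the pair is a duplicate.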
View: PDF, PS
Citation: In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, pp. 67-72, Acapulco, Mexico, August 2003.
Bibtex:
@inproceedings{bilenko:ijcai03-wkshp,
  title={Employing Trainable String Similarity Metrics for Information Integration},
  author={Mikhail Bilenko and Raymond J. Mooney},
  booktitle={Proceedings of the IJCAI-03 Workshop on Information Integration on the Web},
  month={August},
  address={Acapulco, Mexico},
  pages={67-72},
  url="http://www.cs.utexas.edu/users/ai-lab?bilenko:ijcai03-wkshp",
  year={2003}
}
People
Mikhail Bilenko
Ph.D. Alumni
mbilenko [at] microsoft com
Raymond J. Mooney
Faculty
mooney [at] cs utexas edu
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Text Data Mining
Labs
Machine Learning