UTCS Artificial Intelligence
Adaptive Duplicate Detection Using Learnable String Similarity Measures (2003)
Mikhail Bilenko and Raymond J. Mooney
Identifying approximately duplicate records in databases is an essential step in data cleaning and data integration. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ a learnable text distance function for each database field, and show that such measures can adapt to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space-based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
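As a rough illustration of what a trainable string distance has to parameterize, the sketch below computes an edit distance whose substitution costs can differ per character pair. It is not the model from the paper, which the abstract describes only as an extended variant of learnable string edit distance. The costs here are set by hand rather than learned, and the names `weighted_edit_distance`, `sub_cost`, and `indel_cost` are hypothetical.

```python
def weighted_edit_distance(s, t, sub_cost=None, indel_cost=1.0):
    """Edit distance with per-character-pair substitution costs.

    sub_cost maps (char_a, char_b) pairs to a cost; pairs that are not
    listed cost 1.0. In a learnable setting these costs would be fit to
    labeled duplicate pairs; here they are fixed by hand (an assumption).
    """
    sub_cost = sub_cost or {}
    m, n = len(s), len(t)

    # d[i][j] holds the cheapest cost of turning s[:i] into t[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s[i - 1], t[j - 1]
            if a == b:
                sub = 0.0
            else:
                # Treat substitution costs as symmetric.
                sub = sub_cost.get((a, b), sub_cost.get((b, a), 1.0))
            d[i][j] = min(
                d[i - 1][j] + indel_cost,   # delete a from s
                d[i][j - 1] + indel_cost,   # insert b into s
                d[i - 1][j - 1] + sub,      # match or substitute
            )
    return d[m][n]


# Generic unit costs versus a field-specific cost table.
generic = weighted_edit_distance("kitten", "sitting")
adapted = weighted_edit_distance("kitten", "sitting", sub_cost={("k", "s"): 0.1})
```

With unit costs this reduces to ordinary Levenshtein distance. Lowering the cost of one substitution makes strings that differ only by that substitution look more similar, which is the kind of field-specific adaptation the abstract refers to.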
View: PDF, PS
Citation: In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003.
Bibtex:
@inproceedings{bilenko:kdd03,
  title={Adaptive Duplicate Detection Using Learnable String Similarity Measures},
  author={Mikhail Bilenko and Raymond J. Mooney},
  booktitle={Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003)},
  month={August},
  address={Washington, DC},
  pages={39-48},
  url="http://www.cs.utexas.edu/users/ai-lab?bilenko:kdd03",
  year={2003}
}
People
Mikhail Bilenko
Ph.D. Alumni
mbilenko [at] microsoft com
Raymond J. Mooney
Faculty
mooney [at] cs utexas edu
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Text Data Mining
Labs
Machine Learning