UTCS Artificial Intelligence
Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases (2002)
Mikhail Bilenko and Raymond J. Mooney
The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field's domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.
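The two-stage idea in the abstract (several distance metrics per field, then a classifier over the resulting distance features) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the metrics are fixed rather than trained (unit-cost edit distance and token Jaccard distance stand in for the learnable string distance described in the report), a simple perceptron stands in for whatever classifier the authors used, and the records, field names, and function names are invented toy values.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (a fixed metric, not a learned one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """One minus the Jaccard similarity of the whitespace-token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


# Hypothetical choice of "several diverse metrics" applied to every field.
METRICS = (normalized_edit_distance, jaccard_distance)


def pair_features(r1: Record, r2: Record, fields: Sequence[str]) -> List[float]:
    """One distance feature per (field, metric) combination."""
    return [m(r1.get(f, ""), r2.get(f, "")) for f in fields for m in METRICS]


def score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(
    examples: List[Tuple[List[float], int]], epochs: int = 100, lr: float = 0.1
) -> Tuple[List[float], float]:
    """Perceptron stand-in for the pair classifier (label 1 = duplicate)."""
    w, b = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in examples:
            err = y - (1 if score(x, w, b) > 0 else 0)
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def is_duplicate(x: Sequence[float], w: Sequence[float], b: float) -> bool:
    return score(x, w, b) > 0


# Toy labelled record pairs (invented for this sketch).
FIELDS = ("name", "address")
r_a = {"name": "john smith", "address": "12 main st"}
r_b = {"name": "jon smith", "address": "12 main st"}
r_c = {"name": "mary jones", "address": "5 oak avenue"}
r_d = {"name": "mary jones", "address": "5 oak ave"}
labelled_pairs = [(r_a, r_b, 1), (r_c, r_d, 1), (r_a, r_c, 0), (r_b, r_d, 0)]

examples = [(pair_features(p, q, FIELDS), y) for p, q, y in labelled_pairs]
weights, bias = train_perceptron(examples)
print(is_duplicate(pair_features(r_a, r_b, FIELDS), weights, bias))
```

In this sketch the classifier only ever sees distances, never the raw strings, so swapping a fixed metric for a trained one changes the feature values but not the surrounding pipeline.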
View: PDF, PS
Citation:
Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin.
Bibtex:
@TechReport{bilenko:tr02,
  title={Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases},
  author={Mikhail Bilenko and Raymond J. Mooney},
  number={AI 02-296},
  month={February},
  address={Austin, TX},
  institution={Artificial Intelligence Laboratory, University of Texas at Austin},
  url="http://www.cs.utexas.edu/users/ai-lab?bilenko:tr02",
  year={2002}
}
People
Mikhail Bilenko
Ph.D. Alumni
mbilenko [at] microsoft com
Raymond J. Mooney
Faculty
mooney [at] cs utexas edu
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Text Data Mining
Labs
Machine Learning