UTCS Artificial Intelligence
Adaptive Blocking: Learning to Scale Up Record Linkage (2006)
Mikhail Bilenko, Beena Kamath, Raymond J. Mooney
Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
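To make the idea of predicate-based blocking concrete, the following is a minimal sketch and not the paper's learning algorithm. It assumes records are plain dictionaries and that the blocking predicates are fixed by hand; the names `candidate_pairs`, `same_zip`, and `common_name_token`, as well as the sample records, are hypothetical and chosen only for illustration. Two records become a candidate pair if they share a key under at least one predicate, so only those pairs would be passed to the expensive similarity function.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, predicates):
    """Return index pairs of records sharing a key under at least one predicate."""
    pairs = set()
    for predicate in predicates:
        # Group record indices by every key the predicate emits for them.
        blocks = defaultdict(list)
        for idx, record in enumerate(records):
            for key in predicate(record):
                blocks[key].append(idx)
        # Only records that fall into the same block are compared later.
        for members in blocks.values():
            pairs.update(combinations(sorted(set(members)), 2))
    return pairs


# Hypothetical hand-written predicates (the paper learns such choices instead).
def same_zip(record):
    return {record["zip"]}


def common_name_token(record):
    return set(record["name"].lower().split())


# Made-up records for illustration only.
records = [
    {"name": "John Smith", "zip": "78712"},
    {"name": "J. Smith", "zip": "78712"},
    {"name": "Maria Garcia", "zip": "98052"},
    {"name": "Maria Garcia-Lopez", "zip": "10001"},
    {"name": "Wei Chen", "zip": "60601"},
]

selected = candidate_pairs(records, [same_zip, common_name_token])
total = len(records) * (len(records) - 1) // 2
print(f"{len(selected)} of {total} possible pairs kept: {sorted(selected)}")
```

In this sketch the predicates are combined as a simple disjunction. Which predicates to use, and how to combine them, is the part that an adaptive approach would learn from labeled data rather than fix by hand.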
View: PDF, PS
Citation: In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM-06), pp. 87--96, Hong Kong, December 2006.
Bibtex:
@inproceedings{bilenko:icdm06,
  title={Adaptive Blocking: Learning to Scale Up Record Linkage},
  author={Mikhail Bilenko and Beena Kamath and Raymond J. Mooney},
  booktitle={Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM-06)},
  month={December},
  address={Hong Kong},
  pages={87--96},
  url="http://www.cs.utexas.edu/users/ai-lab?bilenko:iiweb06",
  year={2006}
}
People
Mikhail Bilenko
Ph.D. Alumni
mbilenko [at] microsoft com
Raymond J. Mooney
Faculty
mooney [at] cs utexas edu
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Labs
Machine Learning