UTCS Artificial Intelligence
On Evaluation and Training-Set Construction for Duplicate Detection (2003)
Mikhail Bilenko and Raymond J. Mooney
A variety of experimental methodologies have been used to evaluate the accuracy of duplicate-detection systems. We advocate presenting precision-recall curves as the most informative evaluation methodology. We also discuss a number of issues that arise when evaluating and assembling training data for adaptive systems that use machine learning to tune themselves to specific applications. We consider several different application scenarios and experimentally examine the effectiveness of alternative methods of collecting training data under each scenario. We propose two new approaches to collecting training data called static-active learning and weakly-labeled non-duplicates, and present experimental results on their effectiveness.
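As an illustration of the evaluation the abstract advocates, here is a minimal Python sketch of how a precision-recall curve could be computed for a duplicate detector that assigns a similarity score to each candidate record pair. It is not taken from the paper: the function name `precision_recall_curve`, the input layout of `(similarity, is_true_duplicate)` tuples, and the toy data are all assumptions made for this example.

```python
from typing import List, Tuple


def precision_recall_curve(
    scored_pairs: List[Tuple[float, bool]]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) points, one per distinct score threshold.

    scored_pairs holds one (similarity, is_true_duplicate) tuple per
    candidate record pair (a hypothetical input format for this sketch).
    """
    total_dups = sum(1 for _, is_dup in scored_pairs if is_dup)
    if total_dups == 0:
        raise ValueError("need at least one true duplicate pair")

    # Rank pairs from most to least similar; lowering the threshold
    # corresponds to walking down this ranking.
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)

    points = []
    true_pos = 0
    for i, (score, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            true_pos += 1
        # Emit a point only after the last pair of a group of tied scores,
        # since a threshold cannot separate pairs with equal similarity.
        if i == len(ranked) or ranked[i][0] != score:
            points.append((true_pos / total_dups, true_pos / i))
    return points


# Toy data (invented for illustration): five candidate pairs, three true duplicates.
pairs = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.40, False)]
curve = precision_recall_curve(pairs)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Reporting the whole list of points, rather than a single accuracy or F-measure value at one threshold, shows how precision trades off against recall across all operating points.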
View: PDF, PS
Citation: In Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7-12, Washington, DC, August 2003.
Bibtex:
@inproceedings{bilenko:kdd03-wkshp,
  title={On Evaluation and Training-Set Construction for Duplicate Detection},
  author={Mikhail Bilenko and Raymond J. Mooney},
  booktitle={Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation},
  month={August},
  address={Washington, DC},
  pages={7-12},
  url="http://www.cs.utexas.edu/users/ai-lab?bilenko:kdd03-wkshp",
  year={2003}
}
People
Mikhail Bilenko, Ph.D. Alumni, mbilenko [at] microsoft com
Raymond J. Mooney, Faculty, mooney [at] cs utexas edu
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Text Data Mining
Labs
Machine Learning