UTCS Artificial Intelligence
Two Approaches to Handling Noisy Variation in Text Mining (2002)
Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney
Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to "hardening" noisy databases by identifying duplicate records, and (2) mining "soft" association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
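The partial-matching idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm; it only shows, under stated assumptions, how the support of an itemset might be counted when an item is allowed to match any record field whose normalized edit-distance similarity reaches a threshold. The function names, the 0.8 threshold, the normalization by the longer string, and the sample records are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def soft_contains(record: Sequence[str], item: str, threshold: float) -> bool:
    """True if some field of the record is similar enough to the item."""
    return any(similarity(item, field) >= threshold for field in record)


def soft_support(records: List[Sequence[str]], itemset: Sequence[str],
                 threshold: float = 0.8) -> float:
    """Fraction of records that softly contain every item in the itemset."""
    if not records:
        return 0.0
    hits = sum(all(soft_contains(r, item, threshold) for item in itemset)
               for r in records)
    return hits / len(records)


# Hypothetical noisy records: the same publisher and topic spelled inconsistently.
RECORDS = [
    ["Addison Wesley", "databases"],
    ["Addison-Wesley", "databases"],
    ["Adison Wesley", "data bases"],
    ["O'Reilly", "networking"],
]

if __name__ == "__main__":
    itemset = ["Addison-Wesley", "databases"]
    print(soft_support(RECORDS, itemset, threshold=0.8))  # soft matching
    print(soft_support(RECORDS, itemset, threshold=1.0))  # exact matching only
```

With these sample records, exact matching (threshold 1.0) finds the itemset in only one of four records, while the soft threshold of 0.8 also counts the two misspelled variants. A cosine-similarity metric over token vectors could be substituted for `similarity` without changing the rest of the sketch.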
View: PDF, PS
Citation: In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, pp. 18-27, Sydney, Australia, July 2002.
Bibtex:
@InProceedings{nahm:icml-wkshp02,
  title={Two Approaches to Handling Noisy Variation in Text Mining},
  author={Un Yong Nahm and Mikhail Bilenko and Raymond J. Mooney},
  booktitle={Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning},
  month={July},
  address={Sydney, Australia},
  pages={18-27},
  url="http://www.cs.utexas.edu/users/ai-lab?nahm:icml-wkshp02",
  year={2002}
}
People
Mikhail Bilenko
Ph.D. Alumni
mbilenko [at] microsoft com
Raymond J. Mooney
Faculty
mooney [at] cs utexas edu
Un Yong Nahm
Ph.D. Alumni
pebronia [at] acm org
Areas of Interest
Machine Learning
Record Linkage & Duplicate Detection
Text Data Mining
Labs
Machine Learning