UTCS Artificial Intelligence
Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages (2013)
Dan Garrette, Jason Mielens, and Jason Baldridge
Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.
Citation:
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013), pp. 583--592, 2013.
Bibtex:
@inproceedings{garrette:acl13,
  title={Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages},
  author={Dan Garrette and Jason Mielens and Jason Baldridge},
  booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013)},
  month={August},
  address={Sofia, Bulgaria},
  pages={583--592},
  url={http://www.cs.utexas.edu/users/ai-lab?garrette:acl13},
  year={2013}
}
People
Dan Garrette
Ph.D. Alumni
dhg [at] cs utexas edu
Areas of Interest
Machine Learning
Natural Language Processing
Semi-Supervised Learning
Labs
Machine Learning