UTCS Artificial Intelligence
Using HTML Structure and Linked Pages to Improve Learning for Text Categorization (1999)
Michael B. Cline
Classifying web pages is an important task in automating the organization of information on the WWW, and learning for text categorization can help automate the development of such systems. This project explores two aspects of HTML that can improve learning for text categorization: (1) using HTML tags such as titles, links, and headings to partition the text on a page, and (2) using the pages linked from a given page to augment its description. Initial experimental results on 26 categories from the Yahoo hierarchy demonstrate the promise of these two methods for improving the accuracy of a bag-of-words text classifier that uses a simple Bayesian learning algorithm.
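The two ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the tag-to-field mapping `FIELD_TAGS`, the `field:word` feature naming, the single `linked:` partition for linked pages, and the Laplace-smoothed multinomial naive Bayes model are all assumptions chosen for illustration, and every name below is hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Assumed mapping from HTML tags to text partitions (not taken from the thesis).
FIELD_TAGS = {"title": "title", "a": "link",
              "h1": "heading", "h2": "heading", "h3": "heading"}


class FieldTokenizer(HTMLParser):
    """Collects words, prefixing each with the partition it appears in."""

    def __init__(self):
        super().__init__()
        self.stack = ["body"]  # innermost tracked partition is last
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_TAGS:
            self.stack.append(FIELD_TAGS[tag])

    def handle_endtag(self, tag):
        if tag in FIELD_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        field = self.stack[-1]
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.tokens.append(f"{field}:{word}")


def tokenize(html):
    parser = FieldTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def page_features(html, linked_pages=()):
    """Bag of partitioned words, augmented with words from linked pages."""
    bag = Counter(tokenize(html))
    for linked_html in linked_pages:
        # Assumption: all text from a linked page goes into one "linked" partition.
        bag.update("linked:" + tok.split(":", 1)[1] for tok in tokenize(linked_html))
    return bag


class NaiveBayes:
    """Multinomial naive Bayes over feature bags, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, bag, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(bag)
        self.vocab.update(bag)

    def classify(self, bag):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token, n in bag.items():
                score += n * math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because each word is prefixed with its partition, the same word in a title and in body text becomes two distinct features, so the classifier can weight them differently. Passing the HTML of linked pages as `linked_pages` adds their words to the page's description under the separate `linked:` prefix.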
View: PDF, PS
Citation:
Technical Report AI 98-270, Department of Computer Sciences, University of Texas at Austin. Undergraduate Honors Thesis.
Bibtex:
@techreport{cline:honers99,
  title={Using HTML Structure and Linked Pages to Improve Learning for Text Categorization},
  author={Michael B. Cline},
  number={AI 98-270},
  month={May},
  school={Department of Computer Sciences, University of Texas at Austin},
  address={Austin, TX},
  institution={Department of Computer Sciences, University of Texas at Austin},
  pages={21 pages},
  note={Undergraduate Honors Thesis},
  url="http://www.cs.utexas.edu/users/ai-lab?cline:honers99",
  year={1999}
}
Areas of Interest
Machine Learning
Text Categorization and Clustering
Labs
Machine Learning