Publications: Text Categorization and Clustering
The ability to categorize natural-language documents and web pages into known
categories using supervised learning or to cluster them into meaningful new
categories using unsupervised learning has important applications in
information retrieval, information filtering, knowledge management, and
recommender systems. Our research has focused on applications of text learning
to recommender systems and on semi-supervised clustering of documents.
- Do Human Rationales Improve Machine Explanations?
[Details] [PDF] [Poster]
Julia Strout, Ye Zhang, Raymond J. Mooney
In Proceedings of the Second BlackboxNLP Workshop at ACL, 56-62, Florence, Italy, August 2019.
Work on “learning with rationales” shows that humans providing explanations to a machine learning system can improve the system’s predictive accuracy. However, this work has not been connected to work in “explainable AI” which concerns machines explaining their reasoning to humans. In this work, we show that learning with rationales can also improve the quality of the machine’s explanations as evaluated by human judges. Specifically, we present experiments showing that, for CNN-based text classification, explanations generated using “supervised attention” are judged superior to explanations generated using normal unsupervised attention.
ML ID: 373
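A minimal sketch of the “supervised attention” idea this paper evaluates: the model's attention distribution is pushed toward human-marked rationale tokens via an extra loss term. The function names, the uniform target distribution, and the cross-entropy form are illustrative assumptions, not the paper's code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention logits.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_supervision_loss(logits, rationale_mask, eps=1e-9):
    """Cross-entropy between the model's attention distribution and a target
    distribution spread uniformly over human-marked rationale tokens. This is
    one common way to supervise attention; the paper's exact loss may differ."""
    attn = softmax(logits)
    n_marked = sum(rationale_mask)
    target = [m / n_marked for m in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn) if t > 0)

# Tokens 1 and 2 are human-marked rationales; minimizing this loss
# concentrates attention mass on them.
logits = [0.1, 2.0, 1.5, -0.3]
mask = [0, 1, 1, 0]
loss = attention_supervision_loss(logits, mask)
```

During training, this term would be added (with some weight) to the usual classification loss.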
- Knowledge Transfer Using Latent Variable Models
[Details] [PDF] [Slides (PDF)]
Ayan Acharya
PhD Thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, August 2015.
In several applications, scarcity of labeled data is a challenging problem that hinders the predictive capabilities of machine learning algorithms. Additionally, the distribution of the data changes over time, rendering models trained with older data less capable of discovering useful structure from the newly available data. Transfer learning is a convenient framework to overcome such problems, where the learning of a model specific to a domain can benefit the learning of other models in other domains through either simultaneous training of domains or sequential transfer of knowledge from one domain to the others. This thesis explores the opportunities of knowledge transfer in the context of a few applications pertaining to object recognition from images, text analysis, network modeling and recommender systems, using probabilistic latent variable models as building blocks. Both simultaneous and sequential knowledge transfer are achieved through the latent variables, either by sharing these across multiple related domains (for simultaneous learning) or by adapting their distributions to fit data from a new domain (for sequential learning).
ML ID: 322
- Detecting Promotional Content in Wikipedia
[Details] [PDF] [Slides (PPT)]
Shruti Bhosale and Heath Vinicombe and Raymond J. Mooney
In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 1851-1857, Seattle, WA, October 2013.
This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and meta-features.
ML ID: 292
- Spherical Topic Models
[Details] [PDF] [Slides (PDF)]
Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney
In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 2010.
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary L2-normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocation (LDA), but models documents as points on a high-dimensional spherical manifold, allowing a natural likelihood parameterization in terms of cosine distance. Furthermore, SAM can model word absence/presence at the document level, and unlike previous models can assign explicit negative weight to topic terms. Performance is evaluated empirically, both through human ratings of topic quality and through diverse classification tasks from natural language processing and computer vision. In these experiments, SAM consistently outperforms existing models.
ML ID: 248
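The spherical likelihood SAM builds on can be illustrated in a few lines: documents are L2-normalized vectors on the unit sphere, and a topic is a mean direction whose von Mises-Fisher log-likelihood grows with cosine similarity. This is a hedged sketch of the underlying distribution only (the vMF normalizing constant is omitted), not the full admixture model.

```python
import math

def l2_normalize(v):
    # Project a nonzero vector onto the unit sphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def vmf_log_likelihood(doc, mu, kappa):
    """Unnormalized von Mises-Fisher log-likelihood: kappa * <mu, doc>.
    For unit vectors this is kappa times cosine similarity, so cosine
    distance arises naturally. The normalizing constant (omitted) depends
    only on kappa and the dimensionality."""
    return kappa * sum(m * d for m, d in zip(mu, doc))

doc = l2_normalize([3.0, 0.0, 1.0])    # e.g. tf-idf counts, L2-normalized
topic = l2_normalize([1.0, 0.0, 0.0])  # a topic's mean direction on the sphere
ll = vmf_log_likelihood(doc, topic, kappa=10.0)
```

Higher concentration kappa makes the topic peakier around its mean direction.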
- Multi-Prototype Vector-Space Models of Word Meaning
[Details] [PDF] [Slides (PDF)]
Joseph Reisinger, Raymond J. Mooney
In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2010), 109-117, 2010.
Current vector-space models of lexical semantics create a single “prototype” vector to represent the meaning of a word. However, due to lexical ambiguity, encoding word meaning with a single vector is problematic. This paper presents a method that uses clustering to produce multiple “sense-specific” vectors for each word. This approach provides a context-dependent vector representation of word meaning that naturally accommodates homonymy and polysemy. Experimental comparisons to human judgements of semantic similarity for both isolated words and words in sentential contexts demonstrate the superiority of this approach over both prototype and exemplar-based vector-space models.
ML ID: 241
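The multi-prototype idea can be sketched with a toy clustering: collect context vectors for each occurrence of an ambiguous word, cluster them, and keep one centroid per cluster as a sense-specific vector. The two-dimensional “context features” and the minimal k-means below are purely illustrative; the paper uses much richer bag-of-words context representations.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Minimal k-means with squared-Euclidean assignment; real multi-prototype
    # models cluster high-dimensional word-occurrence contexts.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [[sum(d) / len(cl) for d in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# Toy context vectors for an ambiguous word like "bank": two senses,
# river-related vs. finance-related (feature axes are hypothetical).
contexts = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # river-sense contexts
            [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]   # finance-sense contexts
prototypes = kmeans(contexts, k=2)  # one sense-specific vector per cluster
```

A new occurrence is then compared against the nearest prototype rather than a single averaged vector.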
- Spherical Topic Models
[Details] [PDF]
Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond Mooney
In NIPS'09 workshop: Applications for Topic Models: Text and Beyond, 2009.
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model over arbitrary L2-normalized data. SAM models documents as points on a high-dimensional spherical manifold, and is capable of representing negative word-topic correlations and word presence/absence, unlike models with multinomial document likelihood, such as LDA. In this paper, we evaluate SAM as a topic browser, focusing on its ability to model “negative” topic features, and also as a dimensionality reduction method, using topic proportions as features for difficult classification tasks in natural language processing and computer vision.
ML ID: 237
- Probabilistic Semi-Supervised Clustering with Constraints
[Details] [PDF]
Sugato Basu, Mikhail Bilenko, Arindam Banerjee and Raymond J. Mooney
In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, Cambridge, MA, 2006. MIT Press.
In certain clustering tasks it is possible to obtain limited supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. The resulting problem is known as semi-supervised clustering, an instance of semi-supervised learning stemming from a traditional unsupervised learning setting. Several algorithms exist for enhancing clustering quality by using supervision in the form of constraints. These algorithms typically utilize the pairwise constraints to either modify the clustering objective function or to learn the clustering distortion measure. This chapter describes an approach that employs Hidden Markov Random Fields (HMRFs) as a probabilistic generative model for semi-supervised clustering, thereby providing a principled framework for incorporating constraint-based supervision into prototype-based clustering. The HMRF-based model allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., squared Euclidean distance, KL divergence) and directional distance measures (e.g., cosine distance), making it applicable to a number of domains. The model leads to the HMRF-KMeans algorithm which minimizes an objective function derived from the joint probability of the model, and allows unification of constraint-based and distance-based semi-supervised clustering methods. Additionally, a two-phase active learning algorithm for selecting informative pairwise constraints in a query-driven framework is derived from the HMRF model, facilitating improved clustering performance with relatively small amounts of supervision from the user.
ML ID: 176
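The flavor of the HMRF-KMeans objective can be shown concretely: a distortion term for assigning points to cluster centers, plus penalties for violated must-link and cannot-link constraints. This sketch uses a fixed penalty weight and squared Euclidean distance for simplicity; the full model supports per-constraint weights and other Bregman or directional distortion measures.

```python
def pckmeans_objective(points, labels, centers, must_link, cannot_link, w=1.0):
    """Objective in the spirit of HMRF-KMeans: sum of squared distances to
    assigned centers, plus a penalty w for each violated pairwise constraint.
    Simplified sketch; not the paper's exact parameterization."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cost = sum(sqdist(p, centers[labels[i]]) for i, p in enumerate(points))
    # Must-link pairs should share a label; cannot-link pairs should not.
    cost += w * sum(1 for i, j in must_link if labels[i] != labels[j])
    cost += w * sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return cost

points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
centers = [[0.05, 0.0], [1.0, 1.0]]
# Constraint-respecting assignment vs. one that violates the must-link (0,1).
good = pckmeans_objective(points, [0, 0, 1], centers, [(0, 1)], [(0, 2)])
bad = pckmeans_objective(points, [0, 1, 1], centers, [(0, 1)], [(0, 2)])
```

The KMeans-style algorithm alternates between minimizing this objective over assignments and over centers.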
- Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments
[Details] [PDF]
Sugato Basu
PhD Thesis, University of Texas at Austin, 2005.
Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models and empirically validate their performances by extensive experiments on data sets from different domains, e.g., text analysis, hand-written character recognition, and bioinformatics.
In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in same or different clusters). In this thesis, we first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that can improve their performance with limited labeled data.
We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model of a Hidden Markov Random Field. We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees of our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).
Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints.
Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of both constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
ML ID: 174
- Model-based Overlapping Clustering
[Details] [PDF]
A. Banerjee, C. Krumpelman, S. Basu, Raymond J. Mooney and Joydeep Ghosh
In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-05), 2005.
While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. The recent explosion of analysis on biological datasets, which are frequently overlapping, has led to new clustering models that allow hard assignment of data points to multiple clusters. One particularly appealing model was proposed by Segal et al. in the context of probabilistic relational models (PRMs) applied to the analysis of gene microarray data. In this paper, we start with the basic approach of Segal et al. and provide an alternative interpretation of the model as a generalization of mixture models, which makes it easily interpretable. While the original model maximized likelihood over constant variance Gaussians, we generalize it to work with any regular exponential family distribution, and corresponding Bregman divergences, thereby making the model applicable for a wide variety of clustering distance functions, e.g., KL-divergence, Itakura-Saito distance, I-divergence. The general model is applicable to several domains, including high-dimensional sparse domains, such as text and recommender systems. We additionally offer several algorithmic modifications that improve both the performance and applicability of the model. We demonstrate the effectiveness of our algorithm through experiments on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.
ML ID: 163
- A Probabilistic Framework for Semi-Supervised Clustering
[Details] [PDF]
Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney
In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), 59-68, Seattle, WA, August 2004.
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
ML ID: 154
- Semi-supervised Clustering with Limited Background Knowledge
[Details] [PDF]
Sugato Basu
In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, 979-980, San Jose, CA, July 2004.
ML ID: 149
- Active Semi-Supervision for Pairwise Constrained Clustering
[Details] [PDF]
Sugato Basu, Arindam Banerjee, and Raymond J. Mooney
In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), April 2004.
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
ML ID: 141
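One ingredient of active constraint selection is choosing which points to ask the user about. A standard heuristic for this, farthest-first traversal, greedily picks points far from those already chosen, so queries about them tend to uncover distinct clusters. The sketch below is a generic farthest-first implementation under squared Euclidean distance, offered as an illustration rather than the paper's exact procedure.

```python
def farthest_first(points, n):
    """Greedy farthest-first traversal: start from the first point, then
    repeatedly add the point whose distance to the nearest already-chosen
    point is largest. Useful for seeding pairwise-constraint queries."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    chosen = [points[0]]
    while len(chosen) < n:
        nxt = max(points, key=lambda p: min(sqdist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

# Three well-separated groups; the traversal visits one point from each.
pts = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [0.0, 5.0]]
queries = farthest_first(pts, 3)
```

Pairwise same-cluster/different-cluster queries among the selected points then yield highly informative must-link and cannot-link constraints.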
- Semi-supervised Clustering: Learning with Limited User Feedback
[Details] [PDF]
Sugato Basu
Technical Report, Cornell University, 2004.
In many machine learning domains (e.g. text processing, bioinformatics), there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest. In the proposed thesis, our research focus is on semi-supervised clustering, which uses a small amount of supervised data in the form of class labels or pairwise constraints on some examples to aid unsupervised clustering. Semi-supervised clustering can be either search-based, i.e., changes are made to the clustering objective to satisfy user-specified labels/constraints, or similarity-based, i.e., the clustering similarity metric is trained to satisfy the given labels/constraints. Our main goal in the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to different domains.
In our initial work, we have shown how supervision can be provided to clustering in the form of labeled data points or pairwise constraints. We have also developed an active learning framework for selecting informative constraints in the pairwise constrained semi-supervised clustering model, and proposed a method for unifying search-based and similarity-based techniques in semi-supervised clustering.
In this thesis, we want to study other aspects of semi-supervised clustering. Some of the issues we want to investigate include: (1) effect of noisy, probabilistic or incomplete supervision in clustering; (2) model selection techniques for automatic selection of number of clusters in semi-supervised clustering; (3) ensemble semi-supervised clustering. In our work so far, we have mainly focused on generative clustering models, e.g. KMeans and EM, and ran experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets. In the future, we want to study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering framework. We also want to study the effectiveness of our semi-supervised clustering algorithms on other domains, e.g., web search engines (clustering of search results), astronomy (clustering of Mars spectral images) and bioinformatics (clustering of gene microarray data).
ML ID: 134
- Semi-supervised Clustering by Seeding
[Details] [PDF]
Sugato Basu, Arindam Banerjee, and Raymond J. Mooney
In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 19-26, 2002.
Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It introduces two semi-supervised variants of KMeans clustering that can be viewed as instances of the EM algorithm, where labeled data provides prior information about the conditional distributions of hidden category labels. Experimental results demonstrate the advantages of these methods over standard random seeding and COP-KMeans, a previously developed semi-supervised clustering algorithm.
ML ID: 113
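The seeding idea is simple enough to show directly: instead of random initialization, each initial centroid is the mean of the labeled seed examples for its cluster, after which ordinary KMeans iterations proceed. This sketch covers only the initialization step; the paper's Constrained-KMeans variant additionally keeps the seed assignments fixed during clustering.

```python
def seeded_centers(seed_points, seed_labels, k):
    """Seeded-KMeans initialization: the initial centroid for cluster c is
    the mean of the seed examples labeled c. Assumes every cluster has at
    least one seed (illustrative sketch)."""
    centers = []
    for c in range(k):
        members = [p for p, l in zip(seed_points, seed_labels) if l == c]
        centers.append([sum(dim) / len(members) for dim in zip(*members)])
    return centers

seeds = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
labels = [0, 0, 1, 1]
init = seeded_centers(seeds, labels, k=2)
```

Because the seeds inject prior information about the hidden category labels, the subsequent EM-style iterations start from a much better point than random seeding.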
- Content-Based Book Recommending Using Learning for Text Categorization
[Details] [PDF]
Raymond J. Mooney and Loriene Roy
In Proceedings of the Fifth ACM Conference on Digital Libraries, 195-204, San Antonio, TX, June 2000.
Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use collaborative filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.
ML ID: 98
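The text-categorization core of a content-based recommender can be sketched as a like/dislike classifier over item descriptions. The multinomial naive Bayes below (with Laplace smoothing) is a standard choice for this kind of task and is offered as an illustration; the paper's LIBRA system works over semi-structured slots extracted from the web, not raw bags of words, and the book titles here are made up.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    # Multinomial naive Bayes: class priors plus per-class word counts.
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, lab in zip(docs, labels):
        counts[lab].update(doc.split())
    vocab = {w for c in classes for w in counts[c]}
    return priors, counts, vocab

def predict(doc, priors, counts, vocab):
    # Pick the class maximizing log P(class) + sum of smoothed word log-probs.
    def logpost(c):
        total = sum(counts[c].values())
        lp = math.log(priors[c])
        for w in doc.split():
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        return lp
    return max(priors, key=logpost)

# Toy rated "book descriptions" (hypothetical data).
docs = ["wizard magic quest", "dragon magic spell",
        "stock market finance", "bond market trading"]
labels = ["like", "like", "dislike", "dislike"]
priors, counts, vocab = train_nb(docs, labels)
rec = predict("magic dragon adventure", priors, counts, vocab)
```

Unrated items classified as "like" with high posterior become the recommendations, and the most probable words per class can serve as explanations.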
- Content-Based Book Recommending Using Learning for Text Categorization
[Details] [PDF]
Raymond J. Mooney and Loriene Roy
In Proceedings of the SIGIR-99 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, CA, August 1999.
Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations. These experiments are based on ratings from random samplings of items and we discuss problems with previous experiments that employ skewed samples of user-selected examples to evaluate performance.
ML ID: 96
- Using HTML Structure and Linked Pages to Improve Learning for Text Categorization
[Details] [PDF]
Michael B. Cline
Technical Report AI 98-270, Department of Computer Sciences, University of Texas at Austin, Austin, TX, May 1999. Undergraduate Honors Thesis.
Classifying web pages is an important task in automating the organization of information on the WWW, and learning for text categorization can help automate the development of such systems. This project explores using two aspects of HTML to improve learning for text categorization: 1) Using HTML tags such as titles, links, and headings to partition the text on a page and 2) Using the pages linked from a given page to augment its description. Initial experimental results on 26 categories from the Yahoo hierarchy demonstrate the promise of these two methods for improving the accuracy of a bag-of-words text classifier using a simple Bayesian learning algorithm.
ML ID: 91
- Book Recommending Using Text Categorization with Extracted Information
[Details] [PDF]
Raymond J. Mooney, Paul N. Bennett, and Loriene Roy
In Proceedings of the AAAI-98 Workshop on Recommender Systems, 70-74, Madison, WI, 1998.
Content-based recommender systems suggest documents, items, and services to users based on learning a profile of the user from rated examples containing information about the given items. Text categorization methods are very useful for this task but generally rely on unstructured text. We have developed a book-recommending system that utilizes semi-structured information about items gathered from the web using simple information extraction techniques. Initial experimental results demonstrate that this approach can produce fairly accurate recommendations.
ML ID: 86
- Text Categorization Through Probabilistic Learning: Applications to Recommender Systems
[Details] [PDF]
Paul N. Bennett
Honors Thesis, Department of Computer Sciences, The University of Texas at Austin, 1998.
With the growth of the World Wide Web, recommender systems have received an increasing amount of attention. Many recommender systems in use today are based on collaborative filtering. This project has focused on LIBRA, a content-based book recommending system. By utilizing text categorization methods and the information available for each book, the system determines a user profile which is used as the basis of recommendations made to the user. Instead of the bag-of-words approach used in many other statistical text categorization approaches, LIBRA parses each text sample into a semi-structured representation. We have used standard Machine Learning techniques to analyze the performance of several algorithms on this learning task. In addition, we analyze the utility of several methods of feature construction and selection (i.e. methods of choosing the representation of an item that the learning algorithm actually uses). After analyzing the system we conclude that good recommendations are produced after a relatively small number of training examples. We also conclude that the feature selection method tested does not improve the performance of these algorithms in any systematic way, though the results indicate other feature selection methods may prove useful. Feature construction, however, while not providing a large increase in performance with the particular construction methods used here, holds promise of providing performance improvements for the algorithms investigated. This text assumes only minor familiarity with concepts of artificial intelligence and should be readable by the upper division computer science undergraduate familiar with basic concepts of probability theory and set theory.
ML ID: 85