CS 391L Machine Learning
Project Suggestions
1 Page Project Proposal Due: Nov. 2, 2006
Final Project Report Due: 5PM December 15, 2006
General Information and Resources
These are just suggestions, gathered from various on-going UT research projects
related to machine learning. Feel free to propose your own idea, particularly
one that relates to your own on-going research interests and projects. I
encourage you to talk to me about your project ideas during office hours. The
only real restriction is that projects should involve doing some
research in machine learning rather than writing a survey or discussion
paper. The ideal goal is to produce a research result that could be published
as a scientific conference paper like the one you read for Homework 2 (perhaps
with some additional follow-up work).
In addition to Weka Java ML
code, there is C4.5 C code for decision trees in
/u/mooney/C4.5/
, FOIL C code for ILP in
/u/mooney/foil/
, Prolog code for ILP available for
Aleph, and ML systems in C++ available in MLC++.
Also check out data resources available at the Irvine ML
Repository, the KDD
Cup repository and the CoNLL shared task
descriptions for the Conference on Natural Language Learning.
Markov Logic Networks
The Alchemy system,
recently developed at the Univ. of Washington, combines statistical and
relational learning to construct Markov Logic Networks, a cross between logic
programs and Markov nets. Local contact is CS Ph.D. student Lily Mihalkova
(lilyanam@cs.utexas.edu).
- Improve the speed or accuracy of Alchemy's current greedy beam search
algorithm. Implement and test various ways of limiting the search space or
improving the output of the beam search algorithm.
-
Implement and test cost-sensitive discriminative weight training of MLNs
by adapting the algorithm of Sen and Getoor (ICML-06) for use in Alchemy.
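To make the representation concrete, here is a minimal sketch in Python of how an
MLN scores a possible world: each logical formula carries a weight, and a world's
probability is proportional to the exponentiated sum of the weights of its
satisfied ground formulas. The formulas and weights below are made-up toy
examples, not Alchemy's actual API or data.

```python
import math

def score_world(world, weighted_formulas):
    """Unnormalized MLN score: exp(sum of weights of satisfied ground formulas)."""
    total = sum(w for formula, w in weighted_formulas if formula(world))
    return math.exp(total)

# world: a truth assignment to ground atoms (toy example)
world = {"Smokes(A)": True, "Cancer(A)": True}

weighted_formulas = [
    (lambda w: not w["Smokes(A)"] or w["Cancer(A)"], 1.5),  # Smokes(A) => Cancer(A)
    (lambda w: not w["Cancer(A)"], 0.5),                    # weak prior against Cancer(A)
]

# Normalizing this score over all possible worlds gives P(world) = score / Z.
print(score_world(world, weighted_formulas))
```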
Bioinformatics
There is an active local group of researchers working on data mining for
bioinformatics.
In particular, a group competing in a scientific competition called MouseFunc I,
which involves predicting the function of mouse genes from various sources of
information about each gene, is interested in help. Local contacts are Prof. Edward
Marcotte in biochemistry (marcotte@icmb.utexas.edu) and ECE Ph.D. student Chase
Krumpleman (krump@lans.ece.utexas.edu).
Below is another project idea from Prof. Edward Marcotte and biology post-doc
Christine Vogel (cvogel@mail.utexas.edu); contact them for details: We are
interested in a particular kind of genetic interaction called 'synthetic lethal
interactions', in which either gene of a pair can be removed individually without
affecting the cell, but the removal of both kills the cell. In particular, the
hubs in a synthetic lethal interaction network are important: they compensate
for the loss of many other genes. I suspect these may often be genes critical
for suppressing tumor formation, although we haven't formally tested this idea.
A good project would be to try to predict which genes are hubs in these
networks--that is, learn what properties separate hubs from non-hubs (with
features based on expression data, interaction data, functions, etc). This
would require playing with various classifiers, etc. on functional genomics
data. If anyone gets nice results on this, we have a collaborator in England
who would be willing to experimentally assay the genes for their participation
in synthetic lethal interactions, so the project would continue on beyond the
class with real results.
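To give a feel for the hub-prediction setup, here is a minimal sketch in Python.
The feature matrix and labels are random placeholders standing in for per-gene
features (expression, interaction, and functional data) and hub/non-hub labels;
scikit-learn is used purely as an example toolkit, and Weka or any other
classifier package from the resources above would work just as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: one row of features per gene, one binary hub/non-hub label.
X = np.random.rand(200, 30)
y = np.random.randint(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```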
Computer Graphics
Below is an idea from CS Prof. Okan Arikan (okan@cs.utexas.edu)
for applying machine learning to computer graphics. Please contact
him for more details.
Graphics applications (films, games, etc.) require very detailed
digital content such as geometric models, textures, animations. Creating
such content often involves repetitive operations. For example, if we're
creating an old decrepit house, the artist would have to create the same
dusty, eroded appearance (with spider webs distributed accordingly)
everywhere in the house. The idea of this project is to learn a model of the
user's edits and generalize these edits to cut down the creation time for
digital environments. In our old decrepit house example, if we let the user
create a single room of this house, can we learn from this what it means to
be old and decrepit and apply it to the rest of the house automatically?
Can we learn and help the user as the user is performing these edits? Can we
ever reach a state where after sufficient use of this system, we can develop
a model of the user's desired appearance and recreate it from very simple inputs
on novel environments?
Computer Architecture
There is a large architecture project in the department called TRIPS.
CS Prof. Kathryn McKinley (mckinley@cs.utexas.edu) has proposed two potential
topics for machine learning research in this area. Please contact
her for details.
-
We have a bunch of data from simulated annealing of different schedules on
TRIPS. We examined this data by hand to extract scheduling heuristics. It
would be interesting to see if learning could do better (a rough
supervised-learning framing is sketched after this list).
-
I would also like to use the same framework to generate register bank
allocation and scheduling data from simulated annealing, and see if we can
derive some register bank assignment heuristics either independently or
together with schedules.
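One hedged way to frame the first idea as supervised learning (the features and
targets below are placeholders, not a description of the actual TRIPS data):
treat each scheduling decision found by simulated annealing as a labeled example
and fit an interpretable model, whose structure can then be read off as a
candidate heuristic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Placeholder encoding: each row describes one instruction (e.g. critical-path
# length, number of consumers, operand distances); the target is the priority
# or placement the simulated annealer settled on.
X = np.random.rand(500, 6)
y = np.random.rand(500)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# A shallow tree can be inspected directly as a human-readable heuristic.
print(export_text(tree, feature_names=[f"f{i}" for i in range(6)]))
```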
Business Applications
Prof. Maytal Saar-Tsechansky (Maytal.Saar-Tsechansky@mccombs.utexas.edu) in the
School of Business works in machine learning and data mining and has
proposed the following topics. Please see her for details.
Active Information Acquisition:
Predictive models play a dominant role in numerous business intelligence
tasks. A critical factor affecting the knowledge captured by such a model is
the quality of the information, i.e., the training data from which the model is
induced. For many tasks potentially pertinent information is not immediately
available, but can be acquired at a cost. Traditionally, information
acquisition and modeling are addressed independently. Data are collected
irrespective of the modeling objectives. However, information acquisition and
predictive modeling in fact are mutually dependent: newly acquired information
affects the model induced from the data, and the knowledge captured by the
model can help determine what new information would be most useful to
acquire. Information acquisition policies take advantage of this relationship
to produce acquisition schedules. An acquisition schedule is a ranking of
potential information acquisitions, in this case currently unknown feature
values or missing labels (target variables). An ideal acquisition schedule
would rank most highly those acquisitions that would yield the largest
improvement in model quality per unit cost. Prior work proposed algorithms for
acquisition of either class labels (i.e., dependent/target variables) of
training examples, or of feature values. In this project we propose and
evaluate new comprehensive approaches for information acquisition to rank the
acquisition of both labels and feature values that may be missing. We will
evaluate new approaches on several business data sets where feature values and
class labels are costly to acquire.
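A minimal sketch of what an acquisition schedule looks like (the benefit
estimates below are placeholders; how to estimate them well is exactly the
research question): score each candidate acquisition, whether a missing label or
a missing feature value, by its estimated improvement in model quality per unit
cost and rank accordingly.

```python
# Candidate acquisitions with (hypothetical) estimated benefits and known costs.
candidates = [
    {"id": "label: example 17",           "est_benefit": 0.030, "cost": 5.0},
    {"id": "feature: example 4 / income", "est_benefit": 0.012, "cost": 1.0},
    {"id": "label: example 52",           "est_benefit": 0.021, "cost": 5.0},
]

# Rank by estimated improvement in model quality per unit cost.
schedule = sorted(candidates, key=lambda c: c["est_benefit"] / c["cost"],
                  reverse=True)
for c in schedule:
    print(c["id"], round(c["est_benefit"] / c["cost"], 4))
```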
Information acquisition for compliance management domains: Compliance
management pertains to the selection of tax reports to be audited, or health
care claims to be scrutinized for fraud. Non-compliance is a substantial source
of revenue loss that afflicts a variety of industries. The U.S. Department of
Health and Human Services reported in 2001 that Medicare alone lost $11.9
billion to fraud or other improper payments to providers in 2000. The Internal
Revenue Service reported recently that the gap between taxes paid and taxes
owed was between $312 billion and $353 billion in 2001 (the latest year for
which figures are available), with about one sixth of this amount eventually
collected. Substantial losses have also been reported by the auto insurance
industry, adding billions of dollars to auto premiums each year. In all these
scenarios, at each point in time, an analyst must decide whether to audit a
case in order to recover or prevent a revenue loss. Such decisions are
increasingly being guided by predictive models built based on historical audit
outcomes. Due to the vast transaction volume, review of all cases that might be
related to fraud is prohibitively expensive. Thus effective predictive models
that identify strong leads for further investigation are essential. However,
there exists a hurdle shared by all compliance management sectors that renders
model induction in this domain particularly challenging. As in all supervised
learning settings, in order to induce a model it is necessary for training
examples to be labeled, meaning that the value of the target variable (e.g.,
whether or not a particular claim is fraudulent) must be acquired. However,
audits are carefully chosen to target only cases which are predicted to have a
high probability of being fraudulent (and thus leading to revenue
recovery). This leads to a severely biased training sample which cripples model
induction and revenue collection. Such models do not detect new pockets of non-
compliance effectively. Furthermore, rather than helping improve the model, new
audit outcomes merely reinforce existing perceptions and do not provide useful
information.
An alternative approach is to complement the biased sample with additional
audits carefully selected to significantly improve the model itself (rather
than to avoid imminent losses). Because audits are costly, it is essential to
devise a selective information acquisition mechanism that would identify
informative audits that will particularly improve model performance for a given
acquisition cost. Existing active learning policies assume either that no data
are available ex-ante and that the sample is constructed exclusively by the
active learner, or that a representative sample of the underlying population is
available ex-ante. These assumptions are fundamental to the acquisition
policies' subsequent actions and are severely violated in the compliance
management domain due to the biased audit data. Thus the active learner ought
to leverage knowledge of this bias to help identify new, informative
acquisitions that can improve the model's performance. We have obtained a real
data set on companies' sales tax audits which we will use in this study to
evaluate new policies. The data include information about firms in a given
state, sales tax audit results and the amounts of money paid by companies
following the audits.
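As a rough illustration of the contrast (uncertainty sampling is used here only
as a standard active learning baseline, not as the policy proposed for this
project): an exploitation-only policy audits the cases with the highest predicted
fraud probability, while an exploratory policy also selects cases the current
model is least certain about, which are the ones most likely to improve it.

```python
import numpy as np

# Predicted fraud probability for each unaudited case (toy numbers).
p_fraud = np.array([0.95, 0.90, 0.51, 0.49, 0.10, 0.05])
budget = 2

# Exploitation only: audit the cases most likely to recover revenue now.
exploit = np.argsort(-p_fraud)[:budget]
# Exploration (uncertainty sampling): audit the cases nearest the decision
# boundary, which are the most informative for improving the model.
explore = np.argsort(np.abs(p_fraud - 0.5))[:budget]

print("audit for revenue: ", exploit)
print("audit for learning:", explore)
```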
Computational Linguistics / Natural Language Processing
Prof. Katrin Erk (katrin.erk@gmail.com) in the Linguistics department has
suggested the following two projects; contact her for details.
Guess word senses for items outside the lexicon:
Suppose you want to tag each content word in your text with a sense. But your
lexicon is lacking entries: a lot of words are not covered. However, the
senses that you have in the lexicon are actually described as sets of
words. For example, for the word "bank", you might have the senses "depository
financial institution, bank, banking concern, banking company" and "bank, cant,
camber". So if you encounter a word that is not covered by the lexicon, what
you can do is to try to find another, similar word that is covered by the
lexicon, and just use its sense(s). The idea would be to use a semantic
similarity measure learned from corpus data for this, e.g. the one proposed by
Lin (1998). You can evaluate your method on words that are covered by the
lexicon, by pretending that they are not covered and seeing which similar words
are proposed.
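A minimal sketch of the similarity step, using cosine similarity over
context-count vectors as a simple stand-in for Lin's (1998) measure; the vectors
and lexicon below are toy placeholders.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy context-count vectors as they might be learned from a corpus.
vectors = {
    "bank":   np.array([5.0, 1.0, 0.0, 2.0]),
    "lender": np.array([4.0, 1.5, 0.0, 1.0]),
    "river":  np.array([0.0, 0.5, 6.0, 1.0]),
}
# The lexicon covers only some words, each with a set of senses.
lexicon = {
    "bank":  ["depository financial institution, ...", "bank, cant, camber"],
    "river": ["a large natural stream of water"],
}

def guess_senses(unknown_word):
    # Borrow the sense(s) of the most similar word that the lexicon does cover.
    best = max(lexicon, key=lambda w: cosine(vectors[unknown_word], vectors[w]))
    return lexicon[best]

print(guess_senses("lender"))  # should borrow the senses of "bank"
```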
Detect occurrences of an idiom: It is becoming more and more obvious
lately that multi-word expressions and idioms are not weird exceptions. They
occur a lot. And if you cannot spot them, your meaning analysis of the text
will go wrong. A text talking about "letting the cat out of the bag" need not
be about cats. Now suppose you have training data labeled for whether a
multi-word expression or idiom is present: Can you build a classifier that will
spot the idiom in new text? The idea would be to use an existing parser as a
basis. Occurrences of an idiom may vary syntactically, either because they are
syntactically variable (let the cat out of the bag/cat is out of the bag) or
because the parser doesn't treat occurrences uniformly. Idioms may also vary
lexically, for example you will find occurrences of "let the secret out of the
bag" or even "let Catbert out of the bag" (this is an attested occurrence).
Features of the classifier could be syntactic subtrees local to the idiom in
the training data, and semantic classes of words that tend to occur with the
idiom.
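A rough sketch of the classifier setup; the features below are simplified
stand-ins for the syntactic-subtree and semantic-class features described above,
and parser output is assumed to be available already.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each instance: features around a candidate occurrence of the idiom
# "let X out of the bag", plus an idiomatic (1) / literal (0) label.
train = [
    ({"verb": "let", "dobj": "cat",    "pp_obj": "bag",   "dobj_class": "animal"}, 1),
    ({"verb": "let", "dobj": "cat",    "pp_obj": "house", "dobj_class": "animal"}, 0),
    ({"verb": "let", "dobj": "secret", "pp_obj": "bag",   "dobj_class": "info"},   1),
]

vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in train])
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

test = {"verb": "let", "dobj": "Catbert", "pp_obj": "bag", "dobj_class": "person"}
print(clf.predict(vec.transform([test])))  # hopefully 1 (idiomatic)
```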
Data Mining
The following suggestions come from Prof. Inderjit Dhillon's data mining
research group.
Non-negative matrix factorization:
Non-negative matrix factorization attempts to find a factorization of a matrix A
into the product of two matrices B and C, where the entries of B and C are
non-negative. This has been used recently in unsupervised learning for various
applications. One project possibility is to code up and use non-negative matrix
factorization for some application; possibilities include text analysis (modelling
topics in text), image/face processing, or for problems in bioinformatics. A
second idea would be to explore sparse non-negative matrix factorization, which has
been recently proposed. The contact for these projects is Suvrit Sra
(suvrit@cs.utexas.edu).
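For the coding option, here is a minimal sketch of the classic multiplicative
update algorithm of Lee and Seung for minimizing the squared reconstruction error
||A - BC||; the input matrix below is random toy data, where a real project would
use, say, a document-term or gene-expression matrix.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Factor a nonnegative m x n matrix A as B (m x k) times C (k x n)
    using Lee and Seung's multiplicative updates for squared error."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    B, C = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        C *= (B.T @ A) / (B.T @ B @ C + eps)
        B *= (A @ C.T) / (B @ C @ C.T + eps)
    return B, C

A = np.random.rand(20, 10)   # toy nonnegative matrix
B, C = nmf(A, k=3)
print("reconstruction error:", np.linalg.norm(A - B @ C))
```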
Power-Law Graphs:
Much large-scale graph data has node degrees that are power-law distributed (a
standard example is a web graph). Clustering of such graphs is somewhat difficult
due to these distributions---one project idea would be to explore clustering such
graphs. Some work has been done recently using min balanced cuts to cluster these
graphs, and normalized cuts have also shown promise in this domain. An
option would be to compare Graclus, software for computing the normalized cut in a
graph, to other graph clustering methods to determine the most effective ways to
compute clusters in power-law graphs. The contact for this project is Brian
Kulis (kulis@cs.utexas.edu).
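For reference, here is a minimal sketch of the two-way normalized cut objective
on a toy graph; Graclus itself optimizes this with a multilevel kernel k-means
method rather than by direct evaluation as below.

```python
import numpy as np

def normalized_cut(W, labels):
    """cut(A,B)/vol(A) + cut(A,B)/vol(B) for a binary partition of a
    weighted adjacency matrix W."""
    A, B = labels == 0, labels == 1
    cut = W[np.ix_(A, B)].sum()
    return cut / W[A, :].sum() + cut / W[B, :].sum()

# Toy graph: two triangles joined by a single edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

labels = np.array([0, 0, 0, 1, 1, 1])
print("normalized cut:", normalized_cut(W, labels))  # 1/7 + 1/7
```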
Experiments with SVMs: Several projects are possible for support vector
machines. One possibility is to find a good application, create a support
vector machine implementation, and perform experiments. Another possibility is
to compare different existing implementations of SVMs (including SVM code
developed at UT) in terms of speed and accuracy. A third idea is to explore
different methods of regularization for SVMs (this would require some knowledge
of optimization), and compare performance. Contacts for this project are
Dongmin Kim (dmkim@cs.utexas.edu), Suvrit Sra (suvrit@cs.utexas.edu), and Brian
Kulis (kulis@cs.utexas.edu).
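A minimal sketch of the comparison idea, using two off-the-shelf scikit-learn
implementations purely as stand-ins; the UT-developed SVM code mentioned above,
or any other implementation, would be plugged in the same way.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Synthetic benchmark data; a real comparison would use several data sets.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for name, clf in [("SVC (libsvm, RBF kernel)", SVC()),
                  ("LinearSVC (liblinear)", LinearSVC(max_iter=5000))]:
    start = time.time()
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: accuracy={acc:.3f}  time={time.time() - start:.1f}s")
```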