CS 391L Machine Learning
Project Suggestions
1 Page Project Proposal Due: Nov. 2, 2006
Final Project Report Due: 5PM December 15, 2006
General Information and Resources
These are just suggestions, gathered from various on-going UT research projects
related to machine learning. Feel free to propose your own idea, particularly
one that relates to your own on-going research interests and projects. I
encourage you to talk to me about your project ideas during office hours. The
only real restriction is that projects should involve doing some
research in machine learning rather than writing a survey or discussion
paper. The ideal goal is to produce a research result that could be published
as a scientific conference paper like the one you read for Homework 2 (perhaps
with some additional follow-up work).
In addition to Weka Java ML
code, there is C4.5 C code for decision trees in
/u/mooney/C4.5/
, FOIL C code for ILP in
/u/mooney/foil/
, Prolog code for ILP available for
Aleph, and ML systems in C++ available in MLC++.
Also check out data resources available at the Irvine ML
Repository, the KDD
Cup repository and the CoNLL shared task
descriptions for the Conference on Natural Language Learning.
Markov Logic Networks
The Alchemy system,
recently developed at the Univ. of Washington, combines statistical and
relational learning to construct Markov Logic Networks, a cross between logic
programs and Markov nets. Local contact is CS Ph.D. student Lily Mihalkova
(lilyanam@cs.utexas.edu).
- Improve the speed or accuracy of Alchemy's current greedy beam search
algorithm. Implement and test various ways of limiting the search space or
improving the output of the beam search algorithm.
-
Implement and test cost-sensitive discriminative weight training of MLNs
by adapting the algorithm of Sen and Getoor (ICML-06) for use in Alchemy.
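To make the representation concrete, here is a minimal sketch in Python of how an
MLN scores a possible world: each logical formula carries a weight, and a world's
probability is proportional to the exponentiated sum of the weights of its
satisfied ground formulas. The formulas and weights below are made-up toy
examples, not Alchemy's actual API or data.

```python
import math

def score_world(world, weighted_formulas):
    """Unnormalized MLN score: exp(sum of weights of satisfied ground formulas)."""
    total = sum(w for formula, w in weighted_formulas if formula(world))
    return math.exp(total)

# world: a truth assignment to ground atoms (toy example)
world = {"Smokes(A)": True, "Cancer(A)": True}

weighted_formulas = [
    (lambda w: not w["Smokes(A)"] or w["Cancer(A)"], 1.5),  # Smokes(A) => Cancer(A)
    (lambda w: not w["Cancer(A)"], 0.5),                    # weak prior against Cancer(A)
]

# Normalizing this score over all possible worlds gives P(world) = score / Z.
print(score_world(world, weighted_formulas))
```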
Bioinformatics
There is an active local group of researchers working on data mining for
bioinformatics.
In particular, a group competing in a scientific competition called MouseFunc I,
which involves predicting the function of mouse genes from various sources of
information about each gene, is interested in help. Local contacts are Prof. Edward
Marcotte in biochemistry (marcotte@icmb.utexas.edu) and ECE Ph.D. student Chase
Krumpleman (krump@lans.ece.utexas.edu).
Below is another project idea from Prof. Edward Marcotte and biology post-doc
Christine Vogel (cvogel@mail.utexas.edu); contact them for details: We are
interested in a particular kind of genetic interaction called 'synthetic lethal
interactions', in which either gene of a pair can be removed individually without
affecting the cell, but the removal of both kills the cell. In particular, the
hubs in a synthetic lethal interaction network are important: they compensate
for the loss of many other genes. I suspect these may often be genes critical
for suppressing tumor formation, although we haven't formally tested this idea.
A good project would be to try to predict which genes are hubs in these
networks--that is, learn what properties separate hubs from non-hubs (with
features based on expression data, interaction data, functions, etc). This
would require playing with various classifiers, etc. on functional genomics
data. If anyone gets nice results on this, we have a collaborator in England
who would be willing to experimentally assay the genes for their participation
in synthetic lethal interactions, so the project would continue on beyond the
class with real results.
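To give a feel for the hub-prediction setup, here is a minimal sketch in Python.
The feature matrix and labels are random placeholders standing in for per-gene
features (expression, interaction, and functional data) and hub/non-hub labels;
scikit-learn is used purely as an example toolkit, and Weka or any other
classifier package from the resources above would work just as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: one row of features per gene, one binary hub/non-hub label.
X = np.random.rand(200, 30)
y = np.random.randint(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())
```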
Computer Graphics
Below is an idea from CS Prof. Okan Arikan (okan@cs.utexas.edu)
for applying machine learning to computer graphics. Please contact
him for more details.
Graphics applications (films, games, etc.) require very detailed
digital content such as geometric models, textures, animations. Creating
such content often involves repetitive operations. For example, if we're
creating an old decrepit house, the artist would have to create the same
dusty, eroded appearance (with spider webs distributed accordingly)
everywhere in the house. The idea of this project is to learn a model of the
user's edits and generalize these edits to cut down the creation time for
digital environments. In our old decrepit house example, if we let the user
create a single room of this house, can we learn from this what it means to
be old and decrepit and apply it to the rest of the house automatically?
Can we learn and help the user as the user is performing these edits? Can we
ever reach a state where after sufficient use of this system, we can develop
a model of the user's desired appearance and recreate it from very simple inputs
on novel environments?
Computer Architecture
There is a large architecture project in the department called TRIPS.
CS Prof. Kathryn McKinley (mckinley@cs.utexas.edu) has proposed two potential
topics for machine learning research in this area. Please contact
her for details.
-
We have a bunch of data from simulated annealing of different schedules on
TRIPS. We examined this data by hand to extract scheduling heuristics. It
would be interesting to see if learning could do better (a rough
supervised-learning framing is sketched after this list).
-
I would also like to use the same framework to generate register bank
allocation and scheduling data from simulated annealing, and see if we can
derive some register bank assignment heuristics either independently or
together with schedules.
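One hedged way to frame the first idea as supervised learning (the features and
targets below are placeholders, not a description of the actual TRIPS data):
treat each scheduling decision found by simulated annealing as a labeled example
and fit an interpretable model, whose structure can then be read off as a
candidate heuristic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Placeholder encoding: each row describes one instruction (e.g. critical-path
# length, number of consumers, operand distances); the target is the priority
# or placement the simulated annealer settled on.
X = np.random.rand(500, 6)
y = np.random.rand(500)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# A shallow tree can be inspected directly as a human-readable heuristic.
print(export_text(tree, feature_names=[f"f{i}" for i in range(6)]))
```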
Business Applications
Prof. Maytal Saar-Tsechansky (Maytal.Saar-Tsechansky@mccombs.utexas.edu) in the
School of Business works in machine learning and data mining and has
proposed the following topics. Please see her for details.
Active Information Acquisition:
Predictive models play a dominant role in numerous business intelligence
tasks. A critical factor affecting the knowledge captured by such a model is
the quality of the information, i.e., the training data from which the model is
induced. For many tasks potentially pertinent information is not immediately
available, but can be acquired at a cost. Traditionally, information
acquisition and modeling are addressed independently. Data are collected
irrespective of the modeling objectives. However, information acquisition and
predictive modeling in fact are mutually dependent: newly acquired information
affects the model induced from the data, and the knowledge captured by the
model can help determine what new information would be most useful to
acquire. Information acquisition policies take advantage of this relationship
to produce acquisition schedules. An acquisition schedule is a ranking of
potential information acquisitions, in this case currently unknown feature
values or missing labels (target variables). An ideal acquisition schedule
would rank most highly those acquisitions that would yield the largest
improvement in model quality per unit cost. Prior work proposed algorithms for
acquisition of either class labels (i.e., dependent/target variables) of
training examples, or of feature values. In this project we propose and
evaluate new comprehensive approaches for information acquisition to rank the
acquisition of both labels and feature values that may be missing. We will
evaluate new approaches on several business data sets where feature values and
class labels are costly to acquire.
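A minimal sketch of what an acquisition schedule looks like (the benefit
estimates below are placeholders; how to estimate them well is exactly the
research question): score each candidate acquisition, whether a missing label or
a missing feature value, by its estimated improvement in model quality per unit
cost and rank accordingly.

```python
# Candidate acquisitions with (hypothetical) estimated benefits and known costs.
candidates = [
    {"id": "label: example 17",           "est_benefit": 0.030, "cost": 5.0},
    {"id": "feature: example 4 / income", "est_benefit": 0.012, "cost": 1.0},
    {"id": "label: example 52",           "est_benefit": 0.021, "cost": 5.0},
]

# Rank by estimated improvement in model quality per unit cost.
schedule = sorted(candidates, key=lambda c: c["est_benefit"] / c["cost"],
                  reverse=True)
for c in schedule:
    print(c["id"], round(c["est_benefit"] / c["cost"], 4))
```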
Information acquisition for compliance management domains: Compliance
management pertains to the selection of tax reports to be audited, or health
care claims to be scrutinized for fraud. Non-compliance is a substantial source
of revenue loss that afflicts a variety of industries. The U.S. Department of
Health and Human Services reported in 2001 that Medicare alone lost $11.9
billion to fraud or other improper payments to providers in 2000. The Internal
Revenue Service reported recently that the gap between taxes paid and taxes
owed was between $312 billion and $353 billion in 2001 (the latest year for
which figures are available), with about one sixth of this amount eventually
collected. Substantial losses have also been reported by the auto insurance
industry, adding billions of dollars to auto premiums each year. In all these
scenarios, at each point in time, an analyst must decide whether to audit a
case in order to recover or prevent a revenue loss. Such decisions are
increasingly being guided by predictive models built based on historical audit
outcomes. Due to the vast transaction volume, review of all cases that might be
related to fraud is prohibitively expensive. Thus effective predictive models
that identify strong leads for further investigation are essential. However,
there exists a hurdle shared by all compliance management sectors that renders
model induction in this domain particularly challenging. As in all supervised
learning settings, in order to induce a model it is necessary for training
examples to be labeled, meaning that the value of the target variable (e.g.,
whether or not a particular claim is fraudulent) must be acquired. However,
audits are carefully chosen to target only cases which are predicted to have a
high probability of being fraudulent (and thus leading to revenue
recovery). This leads to a severely biased training sample which cripples model
induction and revenue collection. Such models do not detect new pockets of non-
compliance effectively. Furthermore, rather than helping improve the model, new
audit outcomes merely reinforce existing perceptions and do not provide useful
information.
An alternative approach is to complement the biased sample with additional
audits carefully selected to significantly improve the model itself (rather
than to avoid imminent losses). Because audits are costly, it is essential to
devise a selective information acquisition mechanism that would identify
informative audits that will particularly improve model performance for a given
acquisition cost. Existing active learning policies assume either that no data
are available ex-ante and that the sample is constructed exclusively by the
active learner, or that a representative sample of the underlying population is
available ex-ante. These assumptions are fundamental to the acquisition
policies' subsequent actions and are severely violated in the compliance
management domain due to the biased audit data. Thus the active learner ought
to leverage knowledge of this bias to help identify new, informative
acquisitions that can improve the model's performance. We have obtained a real
data set on companies' sales tax audits which we will use in this study to
evaluate new policies. The data include information about firms in a given
state, sales tax audit results and the amounts of money paid by companies
following the audits.
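As a rough illustration of the contrast (uncertainty sampling is used here only
as a standard active learning baseline, not as the policy proposed for this
project): an exploitation-only policy audits the cases with the highest predicted
fraud probability, while an exploratory policy also selects cases the current
model is least certain about, which are the ones most likely to improve it.

```python
import numpy as np

# Predicted fraud probability for each unaudited case (toy numbers).
p_fraud = np.array([0.95, 0.90, 0.51, 0.49, 0.10, 0.05])
budget = 2

# Exploitation only: audit the cases most likely to recover revenue now.
exploit = np.argsort(-p_fraud)[:budget]
# Exploration (uncertainty sampling): audit the cases nearest the decision
# boundary, which are the most informative for improving the model.
explore = np.argsort(np.abs(p_fraud - 0.5))[:budget]

print("audit for revenue: ", exploit)
print("audit for learning:", explore)
```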
Computational Linguistics / Natural Language Processing
Prof. Katrin Erk (katrin.erk@gmail.com) in the Linguistics department has
suggested the following two projects; contact her for details.
Guess word senses for items outside the lexicon:
Suppose you want to tag each content word in your text with a sense. But your
lexicon is lacking entries: a lot of words are not covered. However, the
senses that you have in the lexicon are actually described as sets of
words. For example, for the word "bank", you might have the senses "depository
financial institution, bank, banking concern, banking company" and "bank, cant,
camber". So if you encounter a word that is not covered by the lexicon, what
you can do is to try to find another, similar word that is covered by the
lexicon, and just use its sense(s). The idea would be to use a semantic
similarity measure learned from corpus data for this, e.g. the one proposed by
Lin (1998). You can evaluate your method on words that are covered by the
lexicon, by pretending that they are not covered and seeing which similar words
are proposed.
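A minimal sketch of the similarity step, using cosine similarity over
context-count vectors as a simple stand-in for Lin's (1998) measure; the vectors
and lexicon below are toy placeholders.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy context-count vectors as they might be learned from a corpus.
vectors = {
    "bank":   np.array([5.0, 1.0, 0.0, 2.0]),
    "lender": np.array([4.0, 1.5, 0.0, 1.0]),
    "river":  np.array([0.0, 0.5, 6.0, 1.0]),
}
# The lexicon covers only some words, each with a set of senses.
lexicon = {
    "bank":  ["depository financial institution, ...", "bank, cant, camber"],
    "river": ["a large natural stream of water"],
}

def guess_senses(unknown_word):
    # Borrow the sense(s) of the most similar word that the lexicon does cover.
    best = max(lexicon, key=lambda w: cosine(vectors[unknown_word], vectors[w]))
    return lexicon[best]

print(guess_senses("lender"))  # should borrow the senses of "bank"
```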
Detect occurrences of an idiom: It is becoming more and more obvious
lately that multi-word expressions and idioms are not weird exceptions. They
occur a lot. And if you cannot spot them, your meaning analysis of the text
will go wrong. A text talking about "letting the cat out of the bag" need not
be about cats. Now suppose you have training data labeled for whether a
multi-word expression or idiom is present: Can you build a classifier that will
spot the idiom in new text? The idea would be to use an existing parser as a
basis. Occurrences of an idiom may vary syntactically, either because they are
syntactically variable (let the cat out of the bag/cat is out of the bag) or
because the parser doesn't treat occurrences uniformly. Idioms may also vary
lexically, for example you will find occurrences of "let the secret out of the
bag" or even "let Catbert out of the bag" (this is an attested occurrence).
Features of the classifier could be syntactic subtrees local to the idiom in
the training data, and semantic classes of words that tend to occur with the
idiom.
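A rough sketch of the classifier setup; the features below are simplified
stand-ins for the syntactic-subtree and semantic-class features described above,
and parser output is assumed to be available already.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each instance: features around a candidate occurrence of the idiom
# "let X out of the bag", plus an idiomatic (1) / literal (0) label.
train = [
    ({"verb": "let", "dobj": "cat",    "pp_obj": "bag",   "dobj_class": "animal"}, 1),
    ({"verb": "let", "dobj": "cat",    "pp_obj": "house", "dobj_class": "animal"}, 0),
    ({"verb": "let", "dobj": "secret", "pp_obj": "bag",   "dobj_class": "info"},   1),
]

vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in train])
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

test = {"verb": "let", "dobj": "Catbert", "pp_obj": "bag", "dobj_class": "person"}
print(clf.predict(vec.transform([test])))  # hopefully 1 (idiomatic)
```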
Data Mining
The following suggestions come from Prof. Inderjit Dhillon's data mining
research group.
Non-negative matrix factorization:
Non-negative matrix factorization attempts to find a factorization of a matrix A
into the product of two matrices B and C, where the entries of B and C are
non-negative. This has been used recently in unsupervised learning for various
applications. One project possibility is to code up and use non-negative matrix
factorization for some application; possibilities include text analysis (modelling
topics in text), image/face processing, or for problems in bioinformatics. A
second idea would be to explore sparse non-negative matrix factorization, which has
been recently proposed. The contact for these projects is Suvrit Sra
(suvrit@cs.utexas.edu).
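For the coding option, here is a minimal sketch of the classic multiplicative
update algorithm of Lee and Seung for minimizing the squared reconstruction error
||A - BC||; the input matrix below is random toy data, where a real project would
use, say, a document-term or gene-expression matrix.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9):
    """Factor a nonnegative m x n matrix A as B (m x k) times C (k x n)
    using Lee and Seung's multiplicative updates for squared error."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    B, C = rng.random((m, k)), rng.random((k, n))
    for _ in range(iters):
        C *= (B.T @ A) / (B.T @ B @ C + eps)
        B *= (A @ C.T) / (B @ C @ C.T + eps)
    return B, C

A = np.random.rand(20, 10)   # toy nonnegative matrix
B, C = nmf(A, k=3)
print("reconstruction error:", np.linalg.norm(A - B @ C))
```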
Power-Law Graphs:
Much large-scale graph data has node degrees that are power-law distributed (a
standard example is a web graph). Clustering of such graphs is somewhat difficult
due to these distributions---one project idea would be to explore clustering such
graphs. Some work has been done recently using min balanced cuts to cluster these
graphs, and normalized cuts have also shown promise in this domain. An
option would be to compare Graclus, software for computing the normalized cut in a
graph, to other graph clustering methods to determine the most effective ways to
compute clusters in power-law graphs. The contact for this project is Brian
Kulis (kulis@cs.utexas.edu).
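For reference, here is a minimal sketch of the two-way normalized cut objective
on a toy graph; Graclus itself optimizes this with a multilevel kernel k-means
method rather than by direct evaluation as below.

```python
import numpy as np

def normalized_cut(W, labels):
    """cut(A,B)/vol(A) + cut(A,B)/vol(B) for a binary partition of a
    weighted adjacency matrix W."""
    A, B = labels == 0, labels == 1
    cut = W[np.ix_(A, B)].sum()
    return cut / W[A, :].sum() + cut / W[B, :].sum()

# Toy graph: two triangles joined by a single edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

labels = np.array([0, 0, 0, 1, 1, 1])
print("normalized cut:", normalized_cut(W, labels))  # 1/7 + 1/7
```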
Experiments with SVMs: Several projects are possible for support vector
machines. One possibility is to find a good application, create a support
vector machine implementation, and perform experiments. Another possibility is
to compare different existing implementations of SVMs (including SVM code
developed at UT) in terms of speed and accuracy. A third idea is to explore
different methods of regularization for SVMs (this would require some knowledge
of optimization), and compare performance. Contacts for this project are
Dongmin Kim (dmkim@cs.utexas.edu), Suvrit Sra (suvrit@cs.utexas.edu), and Brian
Kulis (kulis@cs.utexas.edu).
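A minimal sketch of the comparison idea, using two off-the-shelf scikit-learn
implementations purely as stand-ins; the UT-developed SVM code mentioned above,
or any other implementation, would be plugged in the same way.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Synthetic benchmark data; a real comparison would use several data sets.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

for name, clf in [("SVC (libsvm, RBF kernel)", SVC()),
                  ("LinearSVC (liblinear)", LinearSVC(max_iter=5000))]:
    start = time.time()
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: accuracy={acc:.3f}  time={time.time() - start:.1f}s")
```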