Class Projects

Here are a few ideas for class projects. Send me email if you are interested in one of the projects.

P1. Latent Semantic Indexing for Effective Query Retrieval.

A large collection of text documents can be condensed to a matrix A , where each column represents a document and each row stands for a word. A_{ij} is nonzero if the word i occurs in document j. The total number of rows equals all the distinct words in all the documents (non-content bearing words such as "and", "the", "a" etc. are deleted), while the total number of columns in A equals the total number of documents. The matrix A will be sparse when the number of documents is large, since each document will typically contain only a small subset of the total number of words.

This project will involve the use of the SVD (Singular Value Decomposition) and another recent decomposition, Concept Decomposition, in order to do query retrieval that is better than keyword search. By exploiting statistical co-occurrences of words, the above mathematical techniques can help in alleviating the problems of synonymy (many words for one concept) and polysemy (one word used in different contexts).

Goal of Project: Understand the basic ideas, and implement a few in software. For the software implementations, you would need some experience in using tools such as Lex (for parsing) and implementing hash tables .

Resources

  • Introductory paper shows how matrix decompositions may be used for this application.
  • LSI page at Bellcore.
  • Paper on Text Clustering and Concept Decompositions.
  • LSI page at Univ. of Tennessee, Knoxville.
  • Bibliography.
  • The SMART data sets.
  • The TREC home page.
  • Common resources .
  • P2. Distributed Creation of Vector-Space Model.

    Vector-Space Models for text are frequently used, for example, see description of Project P1. Very large collections of text documents are becoming increasingly common. Distributed or parallel computing is essential to handle these large data sets.

    This project would involve parallelizing the creation of a vector-space model. The input would be a large set of text documents (more than can be handled on 1 processor), and the output needs to be the term-document matrix (see project P1). Data structures, such as distributed hash tables, will need to be carefully implemented for this purpose.

    P3. Clustering with Concept Vectors.

    Given a large unstructured collection of text documents, an important problem is to cluster the documents into disjoint conceptual categories. An example of such categorization is the Yahoo! hierarchy of documents. Yahoo!'s hierarchy is created manually. This project will explore automatic methods for clustering.

    Resources

  • Paper on Text Clustering and Concept Decompositions.
  • P4. Mining Newsgroup Data.

    Take newsgroup text data, and do content (LSI, clustering) and link analysis (hubs and authorities, PageRank) in order to summarize the contents of the newsgroup, and identify the important postings and posters to the newsgroup. See Project P1 and P3.

    P5. Email Classification, Mining.

    Automatically sort email into folders (by rules specified by the owner, by content, etc.). Use the metadata (email address, organization) etc to enhance the classifier. Create a link graph (who sends email to whom) to identify clusters or communities. Identify important emailers (hubs & authorities, pagerank).

    P6. Web Traffic Analysis.

    Analyze traffic on the CS (www.cs.utexas.edu) domain. Come up with a page ranking scheme for each page on the CS domain based on the web traffic (it should be possible to get data logs for traffic on the CS machines).

    P7. Web Crawling.

    Resources

  • Efficient crawling through URL ordering by Junghoo Cho, Hector Garcia-Molina and Lawrence Page.
  • P8. Focused Crawling.

    Although you can search for information on the World-Wide Web using search engines such as AltaVista, Google, HotBot etc., often you may desire to create a website that contains exhaustive information about one particular topic. For example, Disney might want to create a website that contains educational and entertainment links for kids. Or as a researcher in data mining, I might be interested in maintaining a repository of documents containing papers and web pages just about data mining. For these purposes, only a small fraction of the web needs to be considered. Focused crawling is about data structures (databses?), efficient web crawling, classification algorithms in order to enable efficient sifting of the entire internet. Once you have the subset of the web you are interested in, you may want to organize it into categories such as Yahoo!, enable special-purpose querying, automatically scan the relevant websites to detect new material, etc.

    Resources

  • Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery by Soumen Chakrabarti, Martin van den Berg and Byron Dom.
  • Automatic construction of Internet Portals with Machine Learning by A. K. McCallum, K. Nigam, J. Rennie and K. Seymore.
  • Using Reinforcement Learning to Spider the Web Efficiently. Jason Rennie and Andrew McCallum.
  • P9. Spectral Graph Partitioning.

    A graph G = (V,E) is a set of vertices and edges. A classical hard (NP-complete) problem is to partition V into two parts such that the number of edges that cross the two partitions is minimum among all possible 2^|V| partitions.

    Many heuristics exist for approximating the optimal solution. One successful global heuristic partitions the graph based on eigenvectors of the associated adjacency or Laplacian matrix associated with the graph. Most large graphs that arise in applications lead to sparse matrices, so Lanczos based algorithms for computing eigenvectors are commonly used. There is a lot of existing literature on using eigenvectors for spectral graph partitioning. An overview of graph partitioning, and various algorithms is available in lectures notes ( [1] & [2] ) by Jim Demmel .

    Goal of Project: Use existing Lanczos based software for computing eigenvectors of both adjacency and Laplacian matrices. Compare with other graph partitioning algorithms. Some experience with Lanczos algorithms is essential.

    Additional Resources

  • METIS consists of multilevel graph partitioning algorithms. Papers and software is available.
  • Some graph partitioning codes.
  • Some benchmark circuit graphs.
  • Common resources .
  • P11. Visualizing High-Dimensional Data.

    Clustering and classification are two important problems in machine learning. Clustering is a method in unsupervised learning and is the grouping of similar objects into clusters. Classification is a technique in supervised learning, where given class labels for some data points, the problem is to deduce class labels for newly arriving data points. For example, given class labels for handwritten digit samples (0,1,2,...,9), recognize the digit given a digit sample.

    Many clustering and classification algorithms abound. On a particular data set, some algorithms perform better than others. This performance crucially depends on the data distributions, which can vastly vary depending on the application. Thus, to evaluate clusterings or to design a good classifier, a good understanding of the data is imperative.

    Visualization is an effective tool for this task. A popular scheme for understanding high dimensional data is to take projections onto 2-dimensional planes. The trick is to choose these two dimensional projections to alleviate the inevitable loss of information in projecting from high (10-1000) dimensions to just two dimensions. One scheme for cluster or class visualization is given here .

    Goal of Project: Understand existing projection schemes, such as, Kohonen's self-organizing maps, projection pursuit, techniques in support vector machines (SVMs), multidimensional scaling(MDS). Implement some of these schemes. Experience in graphics and visualization would be handy, but is not essential.

    Additional Resources

  • CViz software.
  • Bibliography.
  • XGobi is a system for multivariate data visualization by Deborah Swayne, Di Cook, Andreas Buja at Bellcore. The same page contains XGvis that can draw discrete graphs using MDS(Multidimensional Scaling) and was developed by Andreas Buja, Deborah F. Swayne, Michael L. Littman, Nathaniel Dean. Free Software is available from the provided link.
  • WEBSOM can plot 2-d maps of tect documents using Kohonen's Self-Organizing Maps for Internet Exploration. The above link has a demo for visually browsing newsgroup data.
  • P12. Visualizing Text Data.

    This project would produce a visualization tool in JAVA that would enable visual browsing and navigation of a text data set. Ideas from project P11 will be used to visualize text as high-dimensional data.

  • CViz software can visualize text too.
  • WEBSOM can plot 2-d maps of tect documents using Kohonen's Self-Organizing Maps for Internet Exploration. The above link has a demo for visually browsing newsgroup data.
  • P13. Text Classification using Support Vector Machines.

    P14. Face Recognition.

    This project would use the SVD (and other techniques?) for the problem of face recognition, i.e., given a database of face images and a query face, what face is most similar to the query face?

    Resources

  • Eigenfaces and Face Recognition at the MIT Media Lab.
  • Papers and Faces Database by Larry Sirovich.
  • Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection by Peter Belhumeur and Jo Hespanha and David Kriegman, July 1997.
  • Class Project that contains a report and some code too.
  • Face Database at AT&T Labs, Cambridge.
  • Papers about fast computation of eigenvectors of images (1 and 2).
  • Common Resources

  • Here is a simple of representing sparse matrices very similar to the Harwell-Boeing format. Please use this format for this course.
  • More general sparse matrix formats are given here.
  • Even more sparse matrix resources.
  • Some sample Text Data sets, Associated Matrices and even their Clusterings are available here.
  • Software to compute the SVD of a large, sparse matrix. This code is essentially the same as Michael Berry's las2 subroutine from SVDPACK.
  • C++ classes implementing various data structures, such as, hash tables may be found at SGI's STL site.
  • A standard list of English stopwords is available from the SMART ftp site. Other stopword lists may vary slightly.