A large collection of text documents can be condensed to a matrix A, where each column represents a document and each row represents a word: A_{ij} is nonzero if word i occurs in document j. The number of rows equals the number of distinct words across all the documents (non-content-bearing words such as "and", "the", "a", etc. are deleted), while the number of columns of A equals the number of documents. The matrix A will be sparse when the collection is large, since each document typically contains only a small subset of all the words.
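The construction above can be sketched in a few lines. This is a minimal illustration, not production text processing: the stop-word list is a tiny placeholder, tokenization is plain whitespace splitting, and a real system would store A in a sparse format.

```python
from collections import Counter

# A tiny placeholder stop-word list; real systems use much larger ones.
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}

def term_document_matrix(documents):
    """Build a term-document matrix A where A[i][j] counts occurrences
    of word i in document j, with stop words removed."""
    counts = [Counter(w for w in doc.lower().split() if w not in STOP_WORDS)
              for doc in documents]
    # Rows = distinct words (sorted for reproducibility), columns = documents.
    vocab = sorted(set(w for c in counts for w in c))
    A = [[c[w] for c in counts] for w in vocab]
    return vocab, A

vocab, A = term_document_matrix(["the cat sat", "the cat and the dog"])
```

Here `vocab` is `["cat", "dog", "sat"]` and the row for "cat" is `[1, 1]`, since "cat" occurs in both documents; "the" and "and" are dropped as non-content-bearing.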
By using techniques from linear algebra, it is possible to "intelligently" process a large collection of documents. Two such applications are described in the projects below.
An important problem in many applications is to compute the leading singular values and singular vectors (the SVD) of a matrix A. However, this computation can be too expensive when A is large. One way to approximate the SVD of A is to compute the SVD of an appropriately chosen submatrix of A. Recently, this idea has been applied to processing large collections of text documents (see Project 1 above); see the papers ([1] & [2]) by Ravi Kannan and colleagues.
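One concrete sub-sampling strategy of this kind is length-squared column sampling, in which columns are drawn with probability proportional to their squared norms and rescaled so the sampled submatrix approximates A A^T in expectation. The sketch below assumes that particular strategy (the cited papers discuss several variants) and uses a dense SVD for simplicity:

```python
import numpy as np

def sampled_svd(A, s, seed=None):
    """Approximate the leading singular values of A by sampling s columns
    with probability proportional to their squared column norms, then
    rescaling each sampled column j by 1/sqrt(s * p_j) so that the sampled
    matrix S satisfies E[S S^T] = A A^T."""
    rng = np.random.default_rng(seed)
    col_norms = np.sum(A * A, axis=0)
    p = col_norms / col_norms.sum()
    idx = rng.choice(A.shape[1], size=s, p=p)
    S = A[:, idx] / np.sqrt(s * p[idx])
    # Singular values of the small matrix S approximate the leading
    # singular values of the full matrix A.
    return np.linalg.svd(S, compute_uv=False)

A = np.outer(np.ones(50), np.ones(100))   # rank-1 test matrix
svals = sampled_svd(A, s=10, seed=0)
```

For this rank-1 example every column is identical, so the approximation recovers the top singular value sqrt(50 * 100) exactly; for general matrices the quality depends on the sample size s.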
Goal of Project: Understand the sub-sampling strategy outlined in the above papers. Implement the strategy and apply it to the matrices that arise from handling large text collections.

A graph G = (V,E) is a set of vertices and edges. A classical hard (NP-complete) problem is to partition V into two parts such that the number of edges crossing between the two parts is minimized over all 2^|V| possible partitions.
Many heuristics exist for approximating the optimal solution. One successful global heuristic partitions the graph using eigenvectors of the adjacency or Laplacian matrix associated with the graph. Most large graphs that arise in applications lead to sparse matrices, so Lanczos-based algorithms are commonly used to compute the eigenvectors. There is a large literature on using eigenvectors for spectral graph partitioning. An overview of graph partitioning and various algorithms is available in lecture notes ([1] & [2]) by Jim Demmel.
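The simplest version of this heuristic is spectral bisection: compute the eigenvector of the second-smallest Laplacian eigenvalue (the Fiedler vector) and split the vertices by its sign. The sketch below uses a dense eigensolver for clarity; on large sparse graphs one would substitute a Lanczos routine such as `scipy.sparse.linalg.eigsh`, as the project intends.

```python
import numpy as np

def spectral_bisection(adj):
    """Partition the vertices of a graph by the sign of the Fiedler vector,
    i.e. the eigenvector of the second-smallest eigenvalue of the graph
    Laplacian L = D - A."""
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A      # Laplacian: degree matrix minus adjacency
    w, V = np.linalg.eigh(L)            # eigenvalues in ascending order
    fiedler = V[:, 1]                   # eigenvector for second-smallest eigenvalue
    return fiedler >= 0                 # boolean labels for the two parts
```

On a "barbell" graph made of two triangles joined by a single bridge edge, the sign pattern of the Fiedler vector separates the two triangles, cutting only the bridge edge.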
Goal of Project: Use existing Lanczos-based software to compute eigenvectors of both adjacency and Laplacian matrices. Compare with other graph partitioning algorithms. Some experience with Lanczos algorithms is essential.
Clustering and classification are two important problems in machine learning. Clustering, a method in unsupervised learning, is the grouping of similar objects into clusters. Classification is a technique in supervised learning: given class labels for some data points, the problem is to deduce class labels for newly arriving data points. For example, given labeled samples of handwritten digits (0,1,2,...,9), recognize the digit in a new sample.
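The supervised setting above can be illustrated with the simplest possible classifier, a 1-nearest-neighbor rule: label a new point by the label of its closest training point. This is only a baseline sketch (real digit recognizers are far more sophisticated), with made-up 2-D points standing in for feature vectors:

```python
import math

def nearest_neighbor_classify(train, labels, x):
    """Classify point x by the label of its nearest training point
    (1-NN, a minimal supervised-learning baseline)."""
    dists = [math.dist(t, x) for t in train]
    return labels[dists.index(min(dists))]

# Hypothetical labeled training points in 2-D feature space.
train = [(0.0, 0.0), (10.0, 10.0)]
labels = ["zero", "ten"]
```

A query near (1, 1) is labeled "zero" and one near (9, 9) is labeled "ten"; for digit data the tuples would instead be high-dimensional pixel or feature vectors.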
Clustering and classification algorithms abound. On a particular data set, some algorithms perform better than others. This performance depends crucially on the data distribution, which can vary vastly depending on the application. Thus, to evaluate clusterings or to design a good classifier, a good understanding of the data is imperative.
Visualization is an effective tool for this task. A popular scheme for understanding high-dimensional data is to take projections onto 2-dimensional planes. The trick is to choose these two-dimensional projections so as to minimize the inevitable loss of information in projecting from high (10-1000) dimensions down to just two. One scheme for cluster or class visualization is given here.
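One standard choice of projection plane, offered here only as a point of comparison with the schemes named in the project, is the span of the top two principal components, which captures the largest possible variance among all 2-D linear projections:

```python
import numpy as np

def pca_project_2d(X):
    """Project the rows of X (one data point per row) onto the plane
    spanned by the top two principal components."""
    Xc = X - X.mean(axis=0)                         # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    return Xc @ Vt[:2].T                            # n x 2 coordinates
```

By construction the first coordinate of the result has at least as much variance as the second; schemes like projection pursuit or MDS instead optimize other criteria of "interestingness" or distance preservation.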
Goal of Project: Understand existing projection schemes such as Kohonen's self-organizing maps, projection pursuit, techniques in support vector machines (SVMs), and multidimensional scaling (MDS). Implement some of these schemes. Experience in graphics and visualization would be handy, but is not essential.
Most matrices that arise in real-life applications are large and sparse. Efficient algorithms for processing these matrices attempt to exploit the structure (position and pattern) of their nonzero elements. Thus, it is helpful to be able to see this structure visually. MATLAB has some basic commands for this, such as "spy". Recently, new software called MatView has been developed (by Jim Kohl at Sandia National Labs) for this purpose, and is available here.
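The idea behind "spy"-style displays is simply to render the nonzero pattern while ignoring the values. A text-mode analogue (a toy illustration, not a substitute for MATLAB's `spy` or MatView) makes the point:

```python
def ascii_spy(A):
    """Render the nonzero structure of a matrix as text, one character
    per entry: '*' for nonzero, '.' for zero."""
    return "\n".join(
        "".join("*" if entry != 0 else "." for entry in row)
        for row in A)

print(ascii_spy([[1, 0, 0], [0, 2, 0], [0, 0, 3]]))
```

For the diagonal matrix above this prints a diagonal line of `*` characters; on a matrix from graph partitioning or a term-document collection, such a plot reveals block and banded structure at a glance.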
Goal of Project: Experiment with various MATLAB and MatView techniques for visualizing large, sparse matrices. Use these tools to discover the structure of matrices arising from graph partitioning (see Project 3 above), handling of large text collections (see Project 1 above), etc.