CS 395T: Large-scale Data Mining
Homework 3
Clustering
The main goal of this homework is to
experiment with some clustering techniques.
Before you start the experiments:
- Read this paper to understand the spherical k-means algorithm.
- Read this paper to understand the first variation technique.
- Read this paper to understand k-means using Kullback-Leibler divergences with prior and first variation.
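As a concrete reference point while reading, the core iteration of spherical k-means can be sketched as follows. This is an illustrative NumPy version, not the gmeans implementation: it assumes each document is a dense, L2-normalized tf-idf row vector and initializes concept vectors from k randomly chosen documents.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means sketch: X is (n_docs, n_terms) with L2-normalized
    rows. Maximizes the sum of cosine similarities between each document
    and its cluster's unit-length concept vector."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()  # initial concepts
    labels = np.argmax(X @ C.T, axis=1)
    for _ in range(n_iter):
        for j in range(k):                       # recompute concept vectors
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                C[j] = s / np.linalg.norm(s)     # renormalize to unit length
        new_labels = np.argmax(X @ C.T, axis=1)  # nearest by cosine similarity
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    objective = float((X * C[labels]).sum())     # sum of cosine similarities
    return labels, objective
```

The objective here is the quantity gmeans reports for '-a s'; the real implementation works on the sparse CCS matrices produced by MC rather than dense arrays.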
After you've understood the papers:
- Download the MC program and compile it.
- Run MC on the Classic3 data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/classic3/)
with options l=0.2, u=15, t=tfn (make sure you understand these options).
MC will generate a sparse matrix in CCS format, stored in several files;
all the files should share the same prefix, classic3, in their names.
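CCS (compressed column storage) keeps only the nonzeros of the term-document matrix: a value array, a parallel row-index array, and a column-pointer array whose entry j marks where column j's nonzeros begin. The exact on-disk layout MC uses (which array lives in which file, 0- vs 1-based indexing) should be checked against MC's documentation; the sketch below only illustrates the CCS indexing scheme itself.

```python
import numpy as np

def ccs_to_dense(n_rows, n_cols, colptr, rowind, values):
    """Expand a CCS matrix to dense form: colptr[j]:colptr[j+1] delimits
    the nonzeros of column j, rowind gives their row indices, and values
    gives the numeric entries (0-based indices assumed)."""
    A = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        for p in range(colptr[j], colptr[j + 1]):
            A[rowind[p], j] = values[p]
    return A
```

For the Classic3 matrix each column is a document and each row a term, so column slices are cheap, which is exactly what the clustering codes need.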
- Choose 100 documents from each category (cisi, med, cran) of
classic3, and generate a matrix with MC for these 300
documents, with options: l=0.2, u=15, t=tfn.
- Run MC on the cmu.news20_cleaned data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/cmu.news20_cleaned/) with the same
options.
- Download the gmeans code and compile it. Read the README to understand
all the options.
- Run gmeans with option '-a s' (the spherical k-means algorithm)
on the Classic3 and the 300-document matrices, with and without first
variation ('-f'). A true label file for Classic3 is
/stage/projects1/inderjit/cs395t_lsdm/Data/Data.moreinfo/classic3/classic3_truelabels.3;
a true label file for the 300-document data set is
/stage/projects1/inderjit/cs395t_lsdm/Data/info.classic300/classic300_truelabel
- Run gmeans with option '-a k' (the information-theoretic clustering
algorithm) on the Classic3 and the 300-document matrices, with and
without the prior option '-L c', and with and without first variation.
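The information-theoretic variant can be pictured as k-means with KL divergence in place of distance: each document becomes a probability distribution over terms, and a document joins the cluster whose centroid distribution is closest in KL divergence. The sketch below is illustrative only; it uses plain cluster means as centroids, whereas the prior-smoothed centroids that '-L c' enables are described in the paper listed above.

```python
import numpy as np

def kl_assign(P, Q, eps=1e-12):
    """Assign each row-distribution in P (n_docs x n_terms) to the centroid
    in Q (k x n_terms) minimizing KL(p || q) = sum_t p_t log(p_t / q_t)."""
    logratio = np.log((P[:, None, :] + eps) / (Q[None, :, :] + eps))
    kl = np.sum(P[:, None, :] * logratio, axis=2)
    return np.argmin(kl, axis=1)

def kl_kmeans(P, k, n_iter=50, seed=0):
    """KL k-means sketch: centroids are the mean distributions of each
    cluster; initial centroids are k randomly chosen documents."""
    rng = np.random.default_rng(seed)
    Q = P[rng.choice(len(P), size=k, replace=False)]
    labels = kl_assign(P, Q)
    for _ in range(n_iter):
        Q = np.vstack([P[labels == j].mean(axis=0) if np.any(labels == j)
                       else Q[j] for j in range(k)])
        new_labels = kl_assign(P, Q)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```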
- Repeat the above experiments with the cmu.news20_cleaned matrix.
A true label file is /stage/projects1/inderjit/cs395t_lsdm/Data/info.cmu20/cmu.news20_cleaned_truelabels.20
- Download this Java cluster browser program to generate web pages
illustrating your clustering results. (See a sample browser on clusters
of 114,000 NSF award abstracts and a sample browser on clusters of UTCS
professors' web pages.)
- Code up the non-negative matrix factorization algorithm in Lee &
Seung's paper "Algorithms for non-negative matrix factorization". Use
the factor matrices to cluster (as described in class and this paper).
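A minimal sketch of Lee & Seung's multiplicative updates for the Frobenius-norm objective ||V - WH||_F^2, with documents as columns of V, is given below. The clustering step shown (assign document j to the row of H with the largest entry in column j) is one common way to read clusters off the factors; use whichever rule was described in class.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Lee & Seung multiplicative updates for min ||V - W H||_F^2,
    with V (m x n) nonnegative, W (m x k), H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, W held fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, H held fixed
    return W, H

def nmf_cluster(V, k, **kw):
    """Cluster the columns (documents) of V by their dominant factor in H."""
    W, H = nmf(V, k, **kw)
    return np.argmax(H, axis=0)
```

The eps terms guard against division by zero; the updates never change the sign of an entry, so W and H stay nonnegative throughout.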
Answer the following questions:
1. Report your clustering results on classic3, the 300-document set,
and cmu.news20_cleaned. For each clustering, submit the confusion
matrix (or a summary for the cmu.news20_cleaned data set) and the
objective function value.
2. Test the clustering programs on your own email (or another text
collection). How did they perform?
3. Are your clustering results good? If not, explain why.
4. In your opinion, which of the clustering techniques is the best? Why?
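For the confusion matrices in question 1, a small tally routine suffices once both labelings are read into integer arrays. The sketch below assumes 0-based integer labels; check the true-label files listed above for their exact format.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_true, n_pred):
    """M[i, j] = number of documents with true category i placed in cluster j."""
    M = np.zeros((n_true, n_pred), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        M[t, p] += 1
    return M
```

Since cluster numbering is arbitrary, a good clustering shows one dominant entry per row, possibly with rows and columns permuted.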
Due date: Nov. 17, 2003