CS 395T - Large-scale Data Mining
Homework 3
Clustering
    The main goal of this homework is to experiment with some clustering techniques.
    Before you start the experiments,
  - Read this paper to understand the spherical k-means algorithm.
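
  For concreteness, here is a minimal sketch of the batch spherical k-means
  iteration in Python (dense NumPy arrays, no empty-cluster handling, no
  convergence test); the real matrices are large and sparse, so treat this as
  an illustration of the algorithm rather than code to run as-is.

      import numpy as np

      def spherical_kmeans(X, k, n_iter=20, seed=0):
          """Cluster unit-length document vectors by maximizing the sum of
          cosine similarities to the cluster concept vectors."""
          rng = np.random.default_rng(seed)
          X = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes no all-zero rows
          labels = rng.integers(0, k, size=X.shape[0])      # random initial assignment
          for _ in range(n_iter):
              # Concept vectors: normalized sum of each cluster's documents.
              concepts = np.zeros((k, X.shape[1]))
              for j in range(k):
                  members = X[labels == j]
                  if len(members):
                      s = members.sum(axis=0)
                      concepts[j] = s / np.linalg.norm(s)
              # Reassign every document to its most similar concept vector.
              sims = X @ concepts.T                         # cosine similarities
              labels = sims.argmax(axis=1)
          objective = sims.max(axis=1).sum()                # sum of cosines
          return labels, objective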
  
  - Read this paper to understand the first variation technique.
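
  The first variation idea, roughly: after batch k-means converges, move a
  single document to a different cluster and accept the move only if it
  improves the objective; a chain of such moves can escape local optima that
  batch updates cannot. A rough sketch of one first-variation pass for the
  spherical objective (same dense-array assumptions as above; labels is a
  NumPy integer array):

      import numpy as np

      def first_variation_pass(X, labels, k):
          """The spherical k-means objective equals sum_j ||s_j||, where s_j
          is the unnormalized sum of the unit vectors in cluster j.  Moving
          document x from cluster a to cluster b changes the objective by
          (||s_a - x|| + ||s_b + x||) - (||s_a|| + ||s_b||).  Apply the
          single best improving move, if one exists."""
          sums = np.zeros((k, X.shape[1]))
          for j in range(k):
              sums[j] = X[labels == j].sum(axis=0)
          norms = np.linalg.norm(sums, axis=1)

          best_gain, best_move = 0.0, None
          for i, x in enumerate(X):
              a = labels[i]
              out = np.linalg.norm(sums[a] - x)        # cluster a without x
              for b in range(k):
                  if b == a:
                      continue
                  gain = (out + np.linalg.norm(sums[b] + x)) - (norms[a] + norms[b])
                  if gain > best_gain:
                      best_gain, best_move = gain, (i, b)

          if best_move is not None:                    # accept the best move
              i, b = best_move
              labels[i] = b
          return labels, best_gain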
  
  - Read this paper to understand k-means using Kullback-Leibler divergences with a prior and first variation.
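
  And a minimal sketch of k-means with Kullback-Leibler divergences, where
  each document is treated as a probability distribution over words
  (normalized term counts); the smoothing parameter alpha below merely stands
  in for the prior and is an assumption -- the actual prior used by gmeans'
  '-L' option may be different.

      import numpy as np

      def kl_kmeans(P, k, n_iter=20, alpha=0.01, seed=0):
          """Each row of P is a word distribution p(w|d).  Assign documents
          to the cluster whose smoothed centroid q_j minimizes KL(p || q_j);
          the objective is the sum of these divergences."""
          rng = np.random.default_rng(seed)
          n, m = P.shape
          labels = rng.integers(0, k, size=n)
          for _ in range(n_iter):
              # Smoothed cluster centroids (the smoothing plays the role of a prior).
              Q = np.zeros((k, m))
              for j in range(k):
                  members = P[labels == j]
                  mean = members.mean(axis=0) if len(members) else np.full(m, 1.0 / m)
                  Q[j] = (mean + alpha) / (mean + alpha).sum()
              # KL(p || q_j) = sum_w p_w log p_w - sum_w p_w log q_jw (with 0 log 0 = 0).
              plogp = (P * np.log(np.where(P > 0, P, 1.0))).sum(axis=1)
              divergences = plogp[:, None] - P @ np.log(Q).T
              labels = divergences.argmin(axis=1)
          objective = divergences.min(axis=1).sum()
          return labels, objective
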
After you've understood the papers,
  - Download the MC program and compile it.
  
    - Run MC on the Classic3 data set
      (/stage/projects1/inderjit/cs395t_lsdm/Data/classic3/)
      with options l=0.2, u=15, t=tfn (make sure you understand these options).
      MC will generate a sparse matrix in CCS (compressed column storage)
      format, stored in several files; all of the files share the same
      prefix, classic3, in their names.
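
  If you want to inspect the matrix outside of MC and gmeans, a CCS matrix is
  straightforward to load into, e.g., a SciPy sparse matrix. The file names in
  the sketch below (prefix_dim, prefix_col_ccs, prefix_row_ccs, prefix_tfn_nz)
  and the 0-based indexing are assumptions -- check the files MC actually
  wrote for your prefix and adjust.

      import numpy as np
      from scipy.sparse import csc_matrix

      def read_ccs(prefix):
          """Read a plain-text CCS matrix (words x documents).  File names
          and 0-based indexing are assumptions; adjust to match MC's output."""
          n_rows, n_cols, n_nonzeros = np.loadtxt(prefix + '_dim', dtype=int)[:3]
          col_ptr = np.loadtxt(prefix + '_col_ccs', dtype=int)   # n_cols + 1 column pointers
          row_ind = np.loadtxt(prefix + '_row_ccs', dtype=int)   # row index of each nonzero
          vals = np.loadtxt(prefix + '_tfn_nz')                  # nonzero values (tfn-weighted)
          return csc_matrix((vals, row_ind, col_ptr), shape=(n_rows, n_cols))

      # A = read_ccs('classic3')   # each column is one document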
 
    - Choose 100 documents from each category (cisi, med, cran) of
      classic3, and generate a matrix with MC for these 300 documents,
      with options l=0.2, u=15, t=tfn.
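
  One possible way to pick the 300 documents, assuming one file per document
  in per-category subdirectories named cisi, med, and cran (the directory
  layout here is a guess -- adjust the paths to match the actual data before
  running MC on the result):

      import os, random, shutil

      SRC = '/stage/projects1/inderjit/cs395t_lsdm/Data/classic3'
      DST = 'classic300'

      random.seed(0)
      for category in ('cisi', 'med', 'cran'):
          os.makedirs(os.path.join(DST, category), exist_ok=True)
          docs = sorted(os.listdir(os.path.join(SRC, category)))
          for name in random.sample(docs, 100):      # 100 documents per category
              shutil.copy(os.path.join(SRC, category, name),
                          os.path.join(DST, category, name))
      # Then run MC on the classic300 directory with l=0.2, u=15, t=tfn.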
    
    - Run MC on the cmu.news20_cleaned data set
      (/stage/projects1/inderjit/cs395t_lsdm/Data/cmu.news20_cleaned/)
      with the same options.
 
  - Download the gmeans code and compile it. Read the README to understand
    all the options.
 
    - Run gmeans with option '-a s' (the spherical k-means algorithm) on the
      Classic3 and the 300-document matrices, with and without first
      variation ('-f'). A true label file for Classic3 is
      /stage/projects1/inderjit/cs395t_lsdm/Data/Data.moreinfo/classic3/classic3_truelabels.3;
      a true label file for the 300-document data set is
      /stage/projects1/inderjit/cs395t_lsdm/Data/info.classic300/classic300_truelabel
    
 
    - Run gmeans with option '-a k' (the information-theoretic clustering
      algorithm) on the Classic3 and the 300-document matrices, with and
      without the prior option '-L c', and with and without first variation.
    
    - Repeat the above experiments with the cmu.news20_cleaned matrix.
      A true label file is
      /stage/projects1/inderjit/cs395t_lsdm/Data/info.cmu20/cmu.news20_cleaned_truelabels.20
 
 
  - Download this Java cluster browser program to generate web pages
    illustrating your clustering results. (See a sample browser on clusters
    of 114,000 NSF award abstracts and a sample browser on clusters of UTCS
    professors' web pages.)
  
  - Code up the non-negative matrix factorization algorithm in Lee &
    Seung's paper "Algorithms for non-negative matrix factorization". Use
    the factor matrices to cluster (as described in class and in this
    paper).
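
  A minimal sketch of Lee & Seung's multiplicative updates for the
  Frobenius-norm objective, followed by the simplest way to read cluster
  labels off the factors (document j goes to the largest entry in column j
  of H). The column normalization of W is one common choice and may not be
  exactly what was described in class.

      import numpy as np

      def nmf_cluster(A, k, n_iter=200, eps=1e-9, seed=0):
          """Factor a nonnegative word-by-document matrix A ~ W H and
          cluster documents by the dominant entry in each column of H."""
          rng = np.random.default_rng(seed)
          m, n = A.shape
          W = rng.random((m, k)) + eps
          H = rng.random((k, n)) + eps
          for _ in range(n_iter):
              H *= (W.T @ A) / (W.T @ W @ H + eps)    # multiplicative update for H
              W *= (A @ H.T) / (W @ (H @ H.T) + eps)  # multiplicative update for W
          # Move the scale of W's columns into H, then take the argmax per column.
          norms = np.linalg.norm(W, axis=0) + eps
          W, H = W / norms, H * norms[:, None]
          labels = H.argmax(axis=0)
          objective = np.linalg.norm(A - W @ H) ** 2  # squared Frobenius error
          return labels, W, H, objective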
 
    Answer the following questions: 
    1. Report your clustering results on classic3, the 300-document set,
and cmu.news20_cleaned. For each clustering, submit the confusion matrix
(or a summary for the cmu.news20_cleaned data set) and the objective
function value. (A small confusion-matrix sketch appears after the questions.)
    2. Test the clustering programs on your email (or another text
collection). How did the clustering programs perform?
    3. Are your clustering results good? If not, explain why. 
    4. In your opinion, which of the clustering techniques is the best?
Why?
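
  A confusion matrix simply counts, for each true category, how many of its
  documents end up in each computed cluster. A minimal sketch, assuming the
  label files contain one integer label per line (check the actual format):

      import numpy as np

      def confusion_matrix(true_labels, cluster_labels):
          """Rows = true categories, columns = computed clusters."""
          true_labels = np.asarray(true_labels)
          cluster_labels = np.asarray(cluster_labels)
          C = np.zeros((true_labels.max() + 1, cluster_labels.max() + 1), dtype=int)
          for t, c in zip(true_labels, cluster_labels):
              C[t, c] += 1
          return C

      # truth = np.loadtxt('classic3_truelabels.3', dtype=int)
      # found = np.loadtxt('my_cluster_assignments', dtype=int)
      # print(confusion_matrix(truth, found))
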
Due date: Nov. 17, 2003