CS 395T: Large-scale Data Mining
Homework 3
Clustering
The main goal of this homework is to
experiment with some clustering techniques.
Before you start the experiments:
- Read this paper to understand the spherical k-means algorithm.
- Read this paper to understand the first variation technique.
- Read this paper to understand k-means using Kullback-Leibler divergences with prior and first variation.
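As a concrete reference point while reading, the core iteration of spherical k-means can be sketched as follows. This is an illustrative NumPy version, not the gmeans implementation: it assumes each document is a dense, L2-normalized tf-idf row vector and initializes concept vectors from k randomly chosen documents.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means sketch: X is (n_docs, n_terms) with L2-normalized
    rows. Maximizes the sum of cosine similarities between each document
    and its cluster's unit-length concept vector."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()  # initial concepts
    labels = np.argmax(X @ C.T, axis=1)
    for _ in range(n_iter):
        for j in range(k):                       # recompute concept vectors
            members = X[labels == j]
            if len(members):
                s = members.sum(axis=0)
                C[j] = s / np.linalg.norm(s)     # renormalize to unit length
        new_labels = np.argmax(X @ C.T, axis=1)  # nearest by cosine similarity
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    objective = float((X * C[labels]).sum())     # sum of cosine similarities
    return labels, objective
```

The objective here is the quantity gmeans reports for '-a s'; the real implementation works on the sparse CCS matrices produced by MC rather than dense arrays.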
After you've understood the papers:
- Download the MC program and compile it.
- Run MC on the Classic3 data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/classic3/)
with options l=0.2, u=15, t=tfn (make sure you understand these options).
MC will generate a sparse matrix in CCS format, stored in several files;
all the files should share the same prefix, classic3, in their names.
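CCS (compressed column storage) keeps only the nonzeros of the term-document matrix: a value array, a parallel row-index array, and a column-pointer array whose entry j marks where column j's nonzeros begin. The exact on-disk layout MC uses (which array lives in which file, 0- vs 1-based indexing) should be checked against MC's documentation; the sketch below only illustrates the CCS indexing scheme itself.

```python
import numpy as np

def ccs_to_dense(n_rows, n_cols, colptr, rowind, values):
    """Expand a CCS matrix to dense form: colptr[j]:colptr[j+1] delimits
    the nonzeros of column j, rowind gives their row indices, and values
    gives the numeric entries (0-based indices assumed)."""
    A = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        for p in range(colptr[j], colptr[j + 1]):
            A[rowind[p], j] = values[p]
    return A
```

For the Classic3 matrix each column is a document and each row a term, so column slices are cheap, which is exactly what the clustering codes need.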
- Choose 100 documents from each category (cisi, med, cran) of
classic3, and generate a matrix with MC for these 300
documents, with options: l=0.2, u=15, t=tfn.
- Run MC on the cmu.news20_cleaned data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/cmu.news20_cleaned/) with the same
options.
- Download the gmeans code and compile it. Read the README to understand
all the options.
- Run gmeans with option '-a s' (the spherical k-means algorithm)
on the Classic3 and the 300-document matrices, with and without first
variation ('-f'). A true label file for Classic3 is
/stage/projects1/inderjit/cs395t_lsdm/Data/Data.moreinfo/classic3/classic3_truelabels.3;
a true label file for the 300-document data set is
/stage/projects1/inderjit/cs395t_lsdm/Data/info.classic300/classic300_truelabel
- Run gmeans with option '-a k' (the information-theoretic clustering
algorithm) on the Classic3 and the 300-document matrices, with and
without the prior option '-L c', and with and without first variation.
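The information-theoretic variant can be pictured as k-means with KL divergence in place of distance: each document becomes a probability distribution over terms, and a document joins the cluster whose centroid distribution is closest in KL divergence. The sketch below is illustrative only; it uses plain cluster means as centroids, whereas the prior-smoothed centroids that '-L c' enables are described in the paper listed above.

```python
import numpy as np

def kl_assign(P, Q, eps=1e-12):
    """Assign each row-distribution in P (n_docs x n_terms) to the centroid
    in Q (k x n_terms) minimizing KL(p || q) = sum_t p_t log(p_t / q_t)."""
    logratio = np.log((P[:, None, :] + eps) / (Q[None, :, :] + eps))
    kl = np.sum(P[:, None, :] * logratio, axis=2)
    return np.argmin(kl, axis=1)

def kl_kmeans(P, k, n_iter=50, seed=0):
    """KL k-means sketch: centroids are the mean distributions of each
    cluster; initial centroids are k randomly chosen documents."""
    rng = np.random.default_rng(seed)
    Q = P[rng.choice(len(P), size=k, replace=False)]
    labels = kl_assign(P, Q)
    for _ in range(n_iter):
        Q = np.vstack([P[labels == j].mean(axis=0) if np.any(labels == j)
                       else Q[j] for j in range(k)])
        new_labels = kl_assign(P, Q)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```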
- Repeat the above experiments with the cmu.news20_cleaned matrix.
A true label file is /stage/projects1/inderjit/cs395t_lsdm/Data/info.cmu20/cmu.news20_cleaned_truelabels.20
- Download this Java cluster browser program to generate web pages
illustrating your clustering results. (See a sample browser on clusters
of 114,000 NSF award abstracts and a sample browser on clusters of UTCS
professors' web pages.)
- Code up the non-negative matrix factorization algorithm in Lee &
Seung's paper "Algorithms for non-negative matrix factorization". Use
the factor matrices to cluster (as described in class and this paper).
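A minimal sketch of Lee & Seung's multiplicative updates for the Frobenius-norm objective ||V - WH||_F^2, with documents as columns of V, is given below. The clustering step shown (assign document j to the row of H with the largest entry in column j) is one common way to read clusters off the factors; use whichever rule was described in class.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Lee & Seung multiplicative updates for min ||V - W H||_F^2,
    with V (m x n) nonnegative, W (m x k), H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, W held fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, H held fixed
    return W, H

def nmf_cluster(V, k, **kw):
    """Cluster the columns (documents) of V by their dominant factor in H."""
    W, H = nmf(V, k, **kw)
    return np.argmax(H, axis=0)
```

The eps terms guard against division by zero; the updates never change the sign of an entry, so W and H stay nonnegative throughout.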
Answer the following questions:
1. Report your clustering results on classic3, the 300-document set,
and cmu.news20_cleaned. For each clustering, submit the confusion
matrix (or a summary for the cmu.news20_cleaned data set) and the
objective function value.
2. Test the clustering programs on your own email (or another text
collection). How did they perform?
3. Are your clustering results good? If not, explain why.
4. In your opinion, which of the clustering techniques is the best? Why?
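For the confusion matrices in question 1, a small tally routine suffices once both labelings are read into integer arrays. The sketch below assumes 0-based integer labels; check the true-label files listed above for their exact format.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_true, n_pred):
    """M[i, j] = number of documents with true category i placed in cluster j."""
    M = np.zeros((n_true, n_pred), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        M[t, p] += 1
    return M
```

Since cluster numbering is arbitrary, a good clustering shows one dominant entry per row, possibly with rows and columns permuted.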
Due date: Nov. 17, 2003