CS 395T - Large-scale Data Mining
Homework 2
Clustering
The main goal of this homework is to experiment with some clustering techniques.
- Read this paper to understand the spherical k-means algorithm.
- Download the spkmeans code and compile it.
- Download the data matrices here. Understand the CCS matrix format.
- Download the Java cluster browser program that generates a sequence of web pages illustrating your clustering results (see a sample browser).
- Download the Metis graph partitioning software and install it.
- Here is a C program that transforms a matrix in CCS format into the input format for Metis.
- Run Metis on the same matrix used for spkmeans (note that this corresponds to a bipartite graph between words and documents).
- Run Metis on the graph of documents, where the edge between two documents has weight equal to their cosine similarity.
- Write a hierarchical agglomerative clustering program in C/C++/Java and then run it on the same matrix.
- Download the spmeans code and compile it. Read its documentation.
- Run spmeans on the classic3 matrix.
- Run the various clustering techniques on the matrix of 300 documents.
- Run the various clustering techniques on cmu.news 20_cleaned.
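
To make the CCS format and the cosine-similarity document graph from the steps above more concrete, here is a minimal Python sketch. It is not part of the provided code. It assumes zero-based indices, one column per document, and three arrays with the hypothetical names col_ptr, row_idx and values; the actual data files may be laid out differently, so check them before relying on this.

```python
import math

def ccs_columns(col_ptr, row_idx, values):
    """Yield each column of a CCS matrix as a sparse {row: value} dict.

    Column j owns the entries values[col_ptr[j]:col_ptr[j + 1]], and
    row_idx gives the row of each of those entries.
    """
    for j in range(len(col_ptr) - 1):
        start, end = col_ptr[j], col_ptr[j + 1]
        yield {row_idx[k]: values[k] for k in range(start, end)}

def normalize(vec):
    """Scale a sparse vector to unit Euclidean length (empty vectors stay empty)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {i: v / norm for i, v in vec.items()} if norm > 0 else dict(vec)

def cosine(u, v):
    """Cosine similarity of two sparse unit vectors, i.e. their dot product."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(val * v.get(i, 0.0) for i, val in u.items())

def document_graph(col_ptr, row_idx, values):
    """Return {(a, b): weight} for every document pair with nonzero similarity."""
    docs = [normalize(c) for c in ccs_columns(col_ptr, row_idx, values)]
    edges = {}
    for a in range(len(docs)):
        for b in range(a + 1, len(docs)):
            w = cosine(docs[a], docs[b])
            if w > 0:
                edges[(a, b)] = w
    return edges

# Toy 3-word x 3-document matrix (made-up data):
#   doc0 = [1, 0, 1], doc1 = [1, 0, 0], doc2 = [0, 2, 0]
col_ptr = [0, 2, 3, 4]
row_idx = [0, 2, 0, 1]
values = [1.0, 1.0, 1.0, 2.0]
edges = document_graph(col_ptr, row_idx, values)
print(edges)
```

In this toy example only documents 0 and 1 share a word, so the graph has a single edge. The all-pairs loop is quadratic in the number of documents, which is fine for a small set but may need a smarter approach for the larger collections.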
Answer the following questions:
1. Report your clustering results using the various techniques on classic3, the 300-document set, and cmu.news 20_cleaned. For each clustering, submit the confusion matrix and objective function value (if available).
2. What is the number of clusters output by spmeans for each of the data sets? Is it 3 for the classic3 data?
3. Are your clustering results good? If not, explain why.
4. In your opinion, which of the clustering techniques is the best? Why?
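
For question 1, the following Python sketch shows one way to tabulate a confusion matrix and to evaluate an objective value. The function names and the toy labels are hypothetical. The objective assumes the usual spherical k-means criterion: the total cosine similarity between each unit-length document and the normalized centroid of its cluster, which equals the sum of the norms of the per-cluster vector sums. Other techniques optimize different objectives, so this value applies only where that criterion does.

```python
import math
from collections import Counter

def confusion_matrix(true_labels, cluster_ids):
    """Rows are clusters, columns are true classes, both in sorted order."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    counts = Counter(zip(cluster_ids, true_labels))
    return [[counts[(c, t)] for t in classes] for c in clusters]

def spherical_objective(unit_docs, cluster_ids):
    """Sum over clusters of the norm of the summed unit document vectors.

    Assumes every vector in unit_docs is dense and already has unit length.
    """
    sums = {}
    for doc, c in zip(unit_docs, cluster_ids):
        acc = sums.setdefault(c, [0.0] * len(doc))
        for i, x in enumerate(doc):
            acc[i] += x
    return sum(math.sqrt(sum(x * x for x in s)) for s in sums.values())

# Toy example with made-up class labels "A", "B", "C" and three clusters.
true_labels = ["C", "C", "B", "B", "A", "A"]
cluster_ids = [0, 0, 1, 1, 2, 1]
matrix = confusion_matrix(true_labels, cluster_ids)
for row in matrix:
    print(row)

# Three unit-length 2-d "documents" assigned to two clusters.
objective = spherical_objective([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0, 0, 1])
print(objective)
```

A clustering that matches the true classes gives a confusion matrix with one dominant entry per row and per column; off-pattern entries, such as the stray "A" document in cluster 1 above, show where classes were mixed.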