CS 395T - Large-scale Data Mining

Homework 2

Clustering

The main goal of this homework is to experiment with some clustering techniques.

Read this paper to understand the spherical k-means algorithm.
Download the spkmeans code and compile it.
Download the data matrices here. Understand the CCS matrix format.
Download the Java cluster browser program, which generates a sequence of web pages illustrating your clustering results (see a sample browser).
Download the Metis graph partitioning software and install it.
Here is a C program that transforms the matrix in CCS format to the input for Metis.
Run Metis on the same matrix used for spkmeans (note that this corresponds to a bipartite graph between words and documents).
Run Metis on the graph of documents, where an edge between two documents has weight equal to their cosine similarity.
Write a hierarchical agglomerative clustering program in C/C++/Java and then run it on the same matrix.
Download the spmeans code and compile it. Read its documentation.
Run spmeans on the classic3 matrix.
Run the various clustering techniques on the matrix of 300 documents.
Run the various clustering techniques on cmu.news 20_cleaned.
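
As a rough illustration of two of the steps above (understanding the CCS format, and weighting document-document edges by cosine similarity), here is a minimal Python sketch. It assumes the usual compressed column storage layout of three arrays (nonzero values, their row indices, and column start pointers) and assumes that rows are words and columns are documents. The array names `values`, `row_ind`, and `col_ptr` and the toy numbers are made up for illustration; the on-disk layout of the course data files may differ, so check them before relying on this.

```python
import math

# Toy 4-word x 3-document matrix in CCS form (all numbers are made up).
# Dense view (rows = words, columns = documents):
#   [[1, 0, 2],
#    [0, 3, 0],
#    [4, 0, 0],
#    [0, 5, 6]]
values = [1.0, 4.0, 3.0, 5.0, 2.0, 6.0]  # nonzeros, stored column by column
row_ind = [0, 2, 1, 3, 0, 3]             # row index of each nonzero
col_ptr = [0, 2, 4, 6]                   # column j is values[col_ptr[j]:col_ptr[j + 1]]


def column(j):
    """Return column j (one document) as a {row: value} dict."""
    start, end = col_ptr[j], col_ptr[j + 1]
    return dict(zip(row_ind[start:end], values[start:end]))


def cosine(i, j):
    """Cosine similarity between document columns i and j."""
    a, b = column(i), column(j)
    dot = sum(v * b[r] for r, v in a.items() if r in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Print the edge weight for every pair of documents.
    n_docs = len(col_ptr) - 1
    for i in range(n_docs):
        for j in range(i + 1, n_docs):
            print(i, j, round(cosine(i, j), 4))
```

Documents that share no words get similarity 0, so a document graph built this way can simply omit those edges.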

Answer the following questions:

    1. Report the clustering results of each technique on classic3, the 300-document set, and cmu.news 20_cleaned. For each clustering, submit the confusion matrix and the objective function value (if available).
    2. What is the number of clusters output by spmeans for each of the data sets? Is it 3 for the classic3 data?
    3. Are your clustering results good? If not, explain why.
    4. In your opinion, which of the clustering techniques is the best? Why?
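
For the confusion matrices requested in question 1, one possible way to tabulate them is sketched below in Python. It assumes you have, for each document, a true class label and the cluster id assigned by a clustering program. The function name `confusion_matrix` and the toy labels are hypothetical and are not taken from any of the course tools.

```python
from collections import Counter


def confusion_matrix(true_labels, cluster_ids):
    """Count documents per (cluster, class) pair.

    Returns (clusters, classes, matrix), where matrix[r][c] is the number
    of documents placed in clusters[r] whose true class is classes[c].
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("need one cluster id per document")
    counts = Counter(zip(cluster_ids, true_labels))
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(true_labels))
    matrix = [[counts[(k, c)] for c in classes] for k in clusters]
    return clusters, classes, matrix


if __name__ == "__main__":
    # Toy example with made-up labels: six documents, three classes.
    true_labels = ["A", "A", "B", "B", "C", "C"]
    cluster_ids = [0, 0, 1, 0, 2, 2]
    clusters, classes, matrix = confusion_matrix(true_labels, cluster_ids)
    print("cluster", *classes)
    for k, row in zip(clusters, matrix):
        print(k, *row)
```

A clustering that matches the true classes well yields a matrix with one dominant entry in each row and each column; rows with counts spread across several classes indicate mixed clusters.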