Gmeans 1.0 is a C++ program for clustering. At the heart of the program are k-means-type clustering algorithms with four different distance (similarity) measures, six initialization methods, and a powerful local search strategy called first variation (see the papers below for details). The program generates a one-way, hard clustering of a given dataset. For co-clustering software, click here.
- The four distance (similarity) measures:
- cosine: k-means with the cosine measure is also called spherical k-means. It is often used for text clustering, where each document is represented as a vector of word occurrences (vector space model). Each vector has to be L2-normalized, i.e. each vector represents a point on the unit sphere in a very high-dimensional space (usually thousands, or even tens of thousands, of dimensions). The cluster centroid is the L2-normalized sum of all the vectors in the cluster, so the centroid also lies on the unit sphere. The similarity between two vectors is their dot product, i.e. the cosine of the angle between them. (A small numerical sketch of the first three measures appears right after this list.)
- Euclidean distance: when people talk about k-means, they often refer to this algorithm, in which the distance between two vectors is the Euclidean distance between them and the cluster centroid is simply the mean of the vectors in the cluster.
- Kullback-Leibler divergence: each vector has to be L1-normalized, since it is treated as a probability distribution, and the distance between distributions is measured by the Kullback-Leibler divergence. The cluster centroid is the (weighted) mean of the vectors in the cluster.
- Prior: because the KL-divergence between two distributions can be infinite, we introduce a 'prior' to avoid zero probabilities. The prior has an effect on the clustering similar to that of the temperature parameter in deterministic annealing. '-L c #' is the option for the prior, where # is the prior value; the default is no prior. In the program, the prior is halved at each iteration. Typical starting values for the prior range from 10,000 to 1,000,000.
- Diametric distance: each vector has to be L2-normalized, i.e. each vector lies on the unit sphere. The cluster centroid is the dominant right singular vector of the cluster submatrix, i.e. the submatrix consisting only of the vectors in the cluster. The similarity measure is the squared dot product between a data vector and the centroid. This measure has been used for identifying anti-correlated gene clusters.
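To make the first three measures concrete, here is a small, self-contained C++ sketch (an illustration only, not Gmeans code; the toy vectors and the prior value 0.01 are made up) that computes the cosine similarity of two L2-normalized vectors, the Euclidean distance between them, and the KL-divergence between their prior-smoothed, L1-normalized versions:

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // L2-normalize a vector in place (assumes the vector is nonzero).
    void l2_normalize(std::vector<double>& v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : v) x /= norm;
    }

    // L1-normalize after adding a small prior so no entry is zero.
    std::vector<double> to_distribution(const std::vector<double>& v, double prior) {
        std::vector<double> p(v.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i) { p[i] = v[i] + prior; sum += p[i]; }
        for (double& x : p) x /= sum;
        return p;
    }

    int main() {
        std::vector<double> a = {3, 0, 1, 2}, b = {1, 1, 0, 4};  // toy word-count vectors

        // Cosine similarity: dot product of the L2-normalized vectors.
        std::vector<double> ua = a, ub = b;
        l2_normalize(ua); l2_normalize(ub);
        double cosine = 0.0;
        for (std::size_t i = 0; i < ua.size(); ++i) cosine += ua[i] * ub[i];

        // Euclidean distance between the raw vectors.
        double sq = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) sq += (a[i] - b[i]) * (a[i] - b[i]);
        double euclid = std::sqrt(sq);

        // KL-divergence D(p || q) between the prior-smoothed distributions.
        std::vector<double> p = to_distribution(a, 0.01), q = to_distribution(b, 0.01);
        double kl = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) kl += p[i] * std::log(p[i] / q[i]);

        std::printf("cosine = %.4f  euclidean = %.4f  KL = %.4f\n", cosine, euclid, kl);
        return 0;
    }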
In batch k-means-type algorithms, data points are moved based on their distances to the various clusters: a point is usually assigned to the closest cluster, which guarantees a monotonic change in the objective function value. However, sometimes moving a point to a farther cluster yields a better objective function value than moving it to a close one (examples can be found in the paper listed below). The first variation local search strategy refines a given clustering by incrementally moving data points between clusters and checking the change in the objective function value to see whether each move is good or bad (a minimal sketch of scoring such a move appears after the notes below). A chain of first variations can be performed together, in which temporarily bad moves are allowed as long as they are "locally" best; a subsequence of the first variations in the chain is then actually accepted if that subsequence leads to a better objective function value. The option is '-f #', where # is the chain length.
- It can substantially improve the clustering in high-dimensional space, especially when the clusters are small, in which case moving a single data item from one cluster to another changes the centroids a lot. This is exactly the situation where k-means often produces poor clusterings. For example, in one set of experiments we ran k-means 100 times with random initialization on 30 points drawn evenly from 3 clusters, and observed that k-means stopped right after the random initialization in every run. First variation may be useful, for example, at the late stage of divisive hierarchical clustering, when the clusters get smaller and smaller.
- However, the computation time may increase considerably due to first variation.
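A minimal sketch of scoring one such move (an illustration only, not the actual Gmeans code): for spherical k-means, the objective contributed by a cluster is simply the L2-norm of the sum of its unit vectors, so the gain from moving a point x from cluster A to cluster B can be computed from the two cluster sums alone.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // L2-norm of a vector.
    double norm(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return std::sqrt(s);
    }

    // Gain in the spherical k-means objective (the sum over clusters of the
    // norm of each cluster's vector sum) when point x is moved from cluster A
    // to cluster B.  sumA and sumB are the current vector sums of A and B.
    // A positive result means the move improves the objective.
    double first_variation_gain(const std::vector<double>& x,
                                const std::vector<double>& sumA,
                                const std::vector<double>& sumB) {
        std::vector<double> newA(sumA), newB(sumB);
        for (std::size_t i = 0; i < x.size(); ++i) {
            newA[i] -= x[i];  // remove x from A
            newB[i] += x[i];  // add x to B
        }
        return (norm(newA) + norm(newB)) - (norm(sumA) + norm(sumB));
    }

In a chain of length #, such moves are scored one after another even when an individual gain is negative, and, as described above, a subsequence of the chain is accepted only if it improves the overall objective.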
The program contains a split function that generates a singleton cluster each time from the data item farthest from its cluster centroid. The function is called when
- k-means generates empty clusters, or
- the user wants a sequence of clusterings, say with 5 to 13 clusters.
The program returns as many clusters as the user specifies on the command line with the '-c' option, so when k-means generates empty clusters, split fills each empty cluster with one data item, sequentially. For all four measures, generating singletons always gives a better objective function value if there is no penalty on the number of clusters, since the more complex the model, the better it fits the data. We therefore provide a penalty option '-w #' that linearly penalizes the number of clusters.
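A minimal sketch of the split step (an illustration only; the flat distance array and cluster-assignment array are assumptions, not the Gmeans data structures):

    #include <cstddef>
    #include <vector>

    // Given each data item's distance to its own cluster centroid, move the
    // farthest item into a designated empty cluster, making it a singleton.
    std::size_t split_into_singleton(const std::vector<double>& dist_to_own_centroid,
                                     std::vector<int>& cluster_of,
                                     int empty_cluster_id) {
        std::size_t farthest = 0;
        for (std::size_t i = 1; i < dist_to_own_centroid.size(); ++i)
            if (dist_to_own_centroid[i] > dist_to_own_centroid[farthest])
                farthest = i;
        cluster_of[farthest] = empty_cluster_id;  // the item becomes a singleton
        return farthest;
    }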
The main input is a data matrix, which may be stored in one or more files depending on the format of the matrix. Each input data item is represented as a column vector of the input matrix, i.e. as a vector of features. Thus the number of columns equals the number of data items to be clustered, and the number of rows equals the dimension of the feature space. The program accepts either a sparse or a dense matrix, selected with the '-F d/t/s' option; the default is the sparse format. Note: for a dense matrix, you can give either the matrix itself or its transpose.
- The input sparse matrix in CCS (compressed column storage) format is stored in 6 files, at least 4 of which are read by the program. See the sample input files: classic3_col_ccs, classic3_dim, classic3_docs, classic3_words, classic3_row_ccs, classic3_tfn_nz.
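For readers unfamiliar with CCS, the following sketch shows the idea on a tiny 3 x 4 matrix; the exact file layout used by Gmeans is defined by the sample files above, so the struct below is only an illustration of the storage scheme, not of the file format.

    #include <vector>

    // Compressed column storage (CCS) of the 3 x 4 matrix
    //     [ 1 0 0 2 ]
    //     [ 0 3 0 0 ]
    //     [ 4 0 5 6 ]
    // Column j owns the nonzeros value[col_ptr[j] .. col_ptr[j+1]-1].
    struct CcsMatrix {
        int n_rows, n_cols;
        std::vector<double> value;    // nonzero values, stored column by column
        std::vector<int>    row_idx;  // row index of each nonzero
        std::vector<int>    col_ptr;  // n_cols + 1 column pointers
    };

    CcsMatrix example() {
        return CcsMatrix{3, 4,
                         {1, 4, 3, 5, 2, 6},
                         {0, 2, 1, 2, 0, 2},
                         {0, 2, 3, 4, 6}};
    }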
The program exploits the sparsity of the data matrix and is therefore computationally efficient. In one of our case studies, we clustered about 114,000 NSF award abstracts in about 5 minutes on a Sun Ultra10 workstation; the clustering result is illustrated by a sequence of web pages generated by a Java program called "ClusterBrowser". In another case study, Gmeans was used to cluster all 34 UTCS faculty home pages in order to recover the 9 research categories specified here, and the clustering result is illustrated here.
- When
you want to cluster text documents, you may use this
software to generate the sparse matrix in CCS format from documents.
- When the input is a dense matrix, there are two options, '-F d' and '-F t', because sometimes each row of the matrix represents a data item, in which case the program has to transpose the matrix before it does any computation. In either case the two integers in the first row of the file storing the matrix give the row and column dimensions of the input matrix. Here is an example of a dense matrix in which each row represents a data item (gene expression).
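As a made-up illustration of the convention just described (only the "two integers in the first row" rule is taken from the text above; the whitespace-separated layout of the remaining rows is an assumption), a file holding a 3 x 4 dense matrix could look like:

    3 4
    0.12 0.80 0.05 0.33
    0.41 0.02 0.67 0.10
    0.25 0.55 0.09 0.71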
The output of the program is a file containing the cluster index of each data item. The number in the first row is the number of data items clustered. See a sample output.
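For instance, assuming one cluster index per line (see the linked sample output for the exact layout), clustering 5 data items into 2 clusters might produce a file like:

    5
    0
    1
    1
    0
    1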
- 6 initialization methods, selected with the '-i' option:
- random ID ('-i r'): randomly assign each data item a cluster index ranging from 0 to #-1, where # is the number of clusters specified by the user with the '-c #' option.
- random perturb ('-i p'): generate the centroid of the whole dataset and then perturb around it. The magnitude of the perturbation can be set using '-p #'. Default is 0.5.
- initializing from a file ('-i f file_name'): read the cluster index of each data item from a file that has the same format as the sample output. If the cluster index of some data item is negative, the corresponding data item will not be used to generate the initial cluster centroids.
- random seed ('-i c'): randomly pick # data items as the initial cluster centroids.
- farthest_0 ('-i s'): randomly pick 1 data item as the first cluster centroid; after that, repeatedly pick the data item 'farthest' from all the cluster centroids already picked, until all the cluster centroids have been picked.
- farthest_1 ('-i S'): differs from farthest_0 only in the way the first cluster centroid is picked: it is the data item 'farthest' from the centroid of the whole dataset; the rest is the same as farthest_0. farthest_1 is the default (a sketch of the farthest-first idea follows this list).
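A minimal sketch of the farthest-first idea behind the last two options (an illustration only; Gmeans' notion of 'farthest' depends on the chosen distance measure, while this sketch simply uses Euclidean distance and takes the index of the first seed as an argument):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    double euclidean(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Farthest-first seeding: start from 'first' and repeatedly add the point
    // whose distance to its nearest already-chosen seed is largest.
    std::vector<std::size_t> farthest_first(const std::vector<Point>& data,
                                            std::size_t first, std::size_t k) {
        std::vector<std::size_t> seeds = {first};
        std::vector<double> d(data.size());
        for (std::size_t i = 0; i < data.size(); ++i) d[i] = euclidean(data[i], data[first]);
        while (seeds.size() < k) {
            std::size_t best = 0;
            for (std::size_t i = 1; i < data.size(); ++i)
                if (d[i] > d[best]) best = i;
            seeds.push_back(best);
            for (std::size_t i = 0; i < data.size(); ++i)
                d[i] = std::min(d[i], euclidean(data[i], data[best]));
        }
        return seeds;
    }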
Some statistics are reported on the screen, including the objective function value; if the '-T' option is also given, you will see the confusion matrix, cluster purity, entropy, mutual information of the confusion matrix, micro-averaged precision, F-measure, computation time, etc. The '-T' option gives the file containing the true label of each data item; the format of this file is exactly the same as that of the output file. If the user wants to see how different the results of two runs of Gmeans are, he can give the output of one run to the '-T' option and the output of the other run as the initialization.
If the user just wants to evaluate a clustering result, the '-U clustering_result_file' option is used. The program then reads in the data matrix and this clustering_result_file (in the format described above) and computes the objective function value, as well as the other statistics if the true labels are known.
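As a quick illustration of two of these statistics (a sketch using the standard definitions, not Gmeans' own code; it assumes every cluster in the confusion matrix is non-empty), cluster purity and entropy can be computed from the confusion matrix as follows:

    #include <cmath>
    #include <vector>

    // confusion[c][t] = number of items of true class t assigned to cluster c.
    // Purity: fraction of items that fall in the majority class of their cluster.
    // Entropy: class-distribution entropy of each cluster, weighted by cluster size.
    void purity_and_entropy(const std::vector<std::vector<int>>& confusion,
                            double& purity, double& entropy) {
        double total = 0.0, majority = 0.0, weighted_entropy = 0.0;
        for (const auto& row : confusion) {
            double size = 0.0, largest = 0.0;
            for (int n : row) { size += n; if (n > largest) largest = n; }
            double h = 0.0;
            for (int n : row)
                if (n > 0) { double p = n / size; h -= p * std::log2(p); }
            total += size;
            majority += largest;
            weighted_entropy += size * h;
        }
        purity = majority / total;
        entropy = weighted_entropy / total;
    }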
- Concept Decompositions for Large Sparse Text Data using Clustering,
I. S. Dhillon and D. M. Modha,
Machine Learning, vol. 42:1, pages 143-175, January 2001.
Download: [ps,
pdf]
(An earlier version appears as IBM Research Report RJ 10147, July 8, 1999.)
-
Efficient Clustering of Very Large Document Collections,
I. S. Dhillon, J. Fan and Y. Guan,
Invited book chapter in Data Mining for Scientific and Engineering Applications, Kluwer, pages 357-381, 2001.
Download: [ps,
pdf,
HTML]
-
Iterative Clustering of High Dimensional Text Data Augmented by Local Search,
I. S. Dhillon, Y. Guan and J. Kogan,
Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi, Japan, December 9-12, 2002.
Download: [ps,
pdf]
(Also, appears as Technical Report, "Refining Clusters in High Dimensional
Text Data", The University of Texas at Austin, Department of Computer
Sciences. TR-02-03. January 2002. 18 pages.)
- Information-Theoretic Clustering of Sparse
Co-Occurrence Data,
I. S. Dhillon and Y. Guan,
Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, Florida, USA, November 19-22, 2003.
Download: [ps,
pdf]
(A longer version appears as UTCS Technical Report #TR-03-39, September 2003.
[Abstract & Download])
(Also, appears as "Clustering Large and Sparse Co-Occurrence Data", Workshop on Clustering High-Dimensional Data and its Applications
at The Third SIAM International Conference on Data Mining, May 2003.
Download: [ps,
pdf])
- Diametrical Clustering for identifying anti-correlated gene clusters,
I. S. Dhillon, E. M. Marcotte and U. Roshan,
Bioinformatics, vol. 19(13), pages 1612-1619, 2003.
Download: [ps,
pdf]
(Also, appears as UTCS Technical Report #TR-02-49, September 2002.
[Abstract & Download])
-
BibTeX entries:
@ARTICLE{dhillon:modha:mlj01,
AUTHOR = {Dhillon, I. S. and Modha, D.
S.},
TITLE = { Concept decompositions for
large sparse text data using clustering},
JOURNAL = {Machine Learning},
YEAR = {2001},
MONTH = {Jan},
VOLUME = {42},
NUMBER = {1},
PAGES = {143--175} }
@INCOLLECTION{dhillon:fan:guan00,
AUTHOR = {Dhillon, I. S. and Fan, J. and
Guan, Y.},
TITLE = {Efficient Clustering of Very
Large Document Collections},
BOOKTITLE = {Data Mining for Scientific
and Engineering Applications},
PUBLISHER = {Kluwer Academic
Publishers},
EDITOR = {R. Grossman, C. Kamath, V.
Kumar and R. Namburu},
YEAR = {2001},
PAGES = {357-381},
NOTE = {Invited book chapter}}
@INPROCEEDINGS{dhillon:guan:kogan:2002,
AUTHOR = {Dhillon, I. S. and Guan, Y.
and Kogan, J.},
TITLE = {Iterative Clustering of High
Dimensional Text Data Augmented by Local Search},
PAGES = {},
BOOKTITLE = {Proceedings of the 2002 IEEE
International Conference on Data Mining},
YEAR = {2002},
EDITOR = {},
PUBLISHER = {} }
@INPROCEEDINGS{dhillon:guan:2003a,
AUTHOR = {Dhillon, I. S. and Guan, Y.},
TITLE = {Clustering Large and Sparse
Co-occurrence Data},
PAGES = {},
BOOKTITLE = {Proceedings of the Workshop
on Clustering High-Dimensional Data and
its Applications at the Third
SIAM International Conference on Data Mining},
YEAR = {2003} }