Gmeans 1.0 is a C++ program for clustering. At the heart of the program are k-means-type clustering algorithms with four different distance (similarity) measures, six initialization methods, and a powerful local search strategy called first variation (see the papers below for details). The program generates a one-way, hard clustering of a given dataset. For co-clustering software, click here.
- The four distance (similarity) measures:
- cosine: k-means with the cosine measure is also called spherical k-means. It is often used for text clustering, where each document is represented as a vector of word occurrences (vector space model). Each vector has to be L2-normalized, i.e. each vector represents a point on the unit sphere in a very high-dimensional space (usually thousands, or even tens of thousands, of dimensions). The cluster centroid is the L2-normalized sum of all the vectors in the cluster, so the centroid also lies on the unit sphere. The similarity between two vectors is their dot product, i.e. the cosine of the angle between them. (A small numerical sketch of the first three measures appears right after this list.)
- Euclidean distance: when people talk about k-means, they often refer to this algorithm, in which the distance between two vectors is the Euclidean distance between them and the cluster centroid is simply the mean of the vectors in the cluster.
- Kullback-Leibler divergence: each vector has to be L1-normalized, since it is treated as a probability distribution, and the distance between distributions is measured by the Kullback-Leibler divergence. The cluster centroid is the (weighted) mean of the vectors in the cluster.
- Prior: because the KL-divergence between two distributions can be infinite, we introduce a 'prior' to avoid zero probabilities. The prior has an effect on the clustering similar to that of the temperature parameter in deterministic annealing. '-L c #' is the option for the prior, where # is the prior value; the default is no prior. In the program, the prior is halved at each iteration. Typical starting values for the prior range from 10,000 to 1,000,000.
- Diametric distance: each vector has to be L2-normalized, i.e. each vector lies on the unit sphere. The cluster centroid is the dominant right singular vector of the cluster submatrix, i.e. the submatrix consisting only of the vectors in the cluster. The similarity measure is the squared dot product between a data vector and the centroid. This measure has been used for identifying anti-correlated gene clusters.
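To make the first three measures concrete, here is a small, self-contained C++ sketch (an illustration only, not Gmeans code; the toy vectors and the prior value 0.01 are made up) that computes the cosine similarity of two L2-normalized vectors, the Euclidean distance between them, and the KL-divergence between their prior-smoothed, L1-normalized versions:

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // L2-normalize a vector in place (assumes the vector is nonzero).
    void l2_normalize(std::vector<double>& v) {
        double norm = 0.0;
        for (double x : v) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : v) x /= norm;
    }

    // L1-normalize after adding a small prior so no entry is zero.
    std::vector<double> to_distribution(const std::vector<double>& v, double prior) {
        std::vector<double> p(v.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i) { p[i] = v[i] + prior; sum += p[i]; }
        for (double& x : p) x /= sum;
        return p;
    }

    int main() {
        std::vector<double> a = {3, 0, 1, 2}, b = {1, 1, 0, 4};  // toy word-count vectors

        // Cosine similarity: dot product of the L2-normalized vectors.
        std::vector<double> ua = a, ub = b;
        l2_normalize(ua); l2_normalize(ub);
        double cosine = 0.0;
        for (std::size_t i = 0; i < ua.size(); ++i) cosine += ua[i] * ub[i];

        // Euclidean distance between the raw vectors.
        double sq = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) sq += (a[i] - b[i]) * (a[i] - b[i]);
        double euclid = std::sqrt(sq);

        // KL-divergence D(p || q) between the prior-smoothed distributions.
        std::vector<double> p = to_distribution(a, 0.01), q = to_distribution(b, 0.01);
        double kl = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) kl += p[i] * std::log(p[i] / q[i]);

        std::printf("cosine = %.4f  euclidean = %.4f  KL = %.4f\n", cosine, euclid, kl);
        return 0;
    }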
In batch k-means-type algorithms, data points are moved based on their distances to the various clusters: a point is usually assigned to the closest cluster, which guarantees a monotonic change in the objective function value. However, sometimes moving a point to a farther cluster yields a better objective function value than moving it to a close one (examples can be found in the paper listed below). The first variation local search strategy refines a given clustering by incrementally moving data points between clusters and checking the change in the objective function value to see whether each move is good or bad (a minimal sketch of scoring such a move appears after the notes below). A chain of first variations can be performed together, in which temporarily bad moves are allowed as long as they are "locally" best; a subsequence of the first variations in the chain is then actually accepted if that subsequence leads to a better objective function value. The option is '-f #', where # is the chain length.
- It can substantially improve the clustering in high-dimensional space, especially when the clusters are small, in which case moving a single data item from one cluster to another changes the centroids a lot. This is exactly the situation where k-means often produces poor clusterings. For example, in one set of experiments we ran k-means 100 times with random initialization on 30 points drawn evenly from 3 clusters, and observed that k-means stopped right after the random initialization in every run. First variation may be useful, for example, at the late stage of divisive hierarchical clustering, when the clusters get smaller and smaller.
- However, the computation time may increase considerably due to first variation.
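A minimal sketch of scoring one such move (an illustration only, not the actual Gmeans code): for spherical k-means, the objective contributed by a cluster is simply the L2-norm of the sum of its unit vectors, so the gain from moving a point x from cluster A to cluster B can be computed from the two cluster sums alone.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // L2-norm of a vector.
    double norm(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        return std::sqrt(s);
    }

    // Gain in the spherical k-means objective (the sum over clusters of the
    // norm of each cluster's vector sum) when point x is moved from cluster A
    // to cluster B.  sumA and sumB are the current vector sums of A and B.
    // A positive result means the move improves the objective.
    double first_variation_gain(const std::vector<double>& x,
                                const std::vector<double>& sumA,
                                const std::vector<double>& sumB) {
        std::vector<double> newA(sumA), newB(sumB);
        for (std::size_t i = 0; i < x.size(); ++i) {
            newA[i] -= x[i];  // remove x from A
            newB[i] += x[i];  // add x to B
        }
        return (norm(newA) + norm(newB)) - (norm(sumA) + norm(sumB));
    }

In a chain of length #, such moves are scored one after another even when an individual gain is negative, and, as described above, a subsequence of the chain is accepted only if it improves the overall objective.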
The program contains a split function that generates a singleton cluster each time from the data item farthest from its cluster centroid. The function is called when
- k-means generates empty clusters, or
- the user wants a sequence of clusterings, say with 5 to 13 clusters.
The program returns as many clusters as the user specifies on the command line with the '-c' option, so when k-means generates empty clusters, split fills each empty cluster with one data item, sequentially. For all four measures, generating singletons always gives a better objective function value if there is no penalty on the number of clusters, since the more complex the model, the better it fits the data. We therefore provide a penalty option '-w #' that linearly penalizes the number of clusters.
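A minimal sketch of the split step (an illustration only; the flat distance array and cluster-assignment array are assumptions, not the Gmeans data structures):

    #include <cstddef>
    #include <vector>

    // Given each data item's distance to its own cluster centroid, move the
    // farthest item into a designated empty cluster, making it a singleton.
    std::size_t split_into_singleton(const std::vector<double>& dist_to_own_centroid,
                                     std::vector<int>& cluster_of,
                                     int empty_cluster_id) {
        std::size_t farthest = 0;
        for (std::size_t i = 1; i < dist_to_own_centroid.size(); ++i)
            if (dist_to_own_centroid[i] > dist_to_own_centroid[farthest])
                farthest = i;
        cluster_of[farthest] = empty_cluster_id;  // the item becomes a singleton
        return farthest;
    }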
The main input is a data matrix, which may be stored in one or more files depending on the format of the matrix. Each input data item is represented as a column vector of the input matrix, i.e. as a vector of features. Thus the number of columns equals the number of data items to be clustered, and the number of rows equals the dimension of the feature space. The program accepts either a sparse or a dense matrix, selected with the '-F d/t/s' option; the default is the sparse format. Note: for a dense matrix, you can give either the matrix itself or its transpose.
- The input sparse matrix in CCS (compressed column storage) format is stored in 6 files, at least 4 of which are read by the program. See the sample input files: classic3_col_ccs, classic3_dim, classic3_docs, classic3_words, classic3_row_ccs, classic3_tfn_nz.
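For readers unfamiliar with CCS, the following sketch shows the idea on a tiny 3 x 4 matrix; the exact file layout used by Gmeans is defined by the sample files above, so the struct below is only an illustration of the storage scheme, not of the file format.

    #include <vector>

    // Compressed column storage (CCS) of the 3 x 4 matrix
    //     [ 1 0 0 2 ]
    //     [ 0 3 0 0 ]
    //     [ 4 0 5 6 ]
    // Column j owns the nonzeros value[col_ptr[j] .. col_ptr[j+1]-1].
    struct CcsMatrix {
        int n_rows, n_cols;
        std::vector<double> value;    // nonzero values, stored column by column
        std::vector<int>    row_idx;  // row index of each nonzero
        std::vector<int>    col_ptr;  // n_cols + 1 column pointers
    };

    CcsMatrix example() {
        return CcsMatrix{3, 4,
                         {1, 4, 3, 5, 2, 6},
                         {0, 2, 1, 2, 0, 2},
                         {0, 2, 3, 4, 6}};
    }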
The program exploits the sparsity of the data matrix and is therefore computationally efficient. In one of our case studies, we clustered about 114,000 NSF award abstracts in about 5 minutes on a Sun Ultra10 workstation; the clustering result is illustrated by a sequence of web pages generated by a Java program called "ClusterBrowser". In another case study, Gmeans was used to cluster all 34 UTCS faculty home pages in order to recover the 9 research categories specified here, and the clustering result is illustrated here.
- When
you want to cluster text documents, you may use this
software to generate the sparse matrix in CCS format from documents.
- When the input is a dense matrix, there are two options, '-F d' and '-F t', because sometimes each row of the matrix represents a data item, in which case the program has to transpose the matrix before it does any computation. In either case the two integers in the first row of the file storing the matrix give the row and column dimensions of the input matrix. Here is an example of a dense matrix in which each row represents a data item (gene expression).
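As a made-up illustration of the convention just described (only the "two integers in the first row" rule is taken from the text above; the whitespace-separated layout of the remaining rows is an assumption), a file holding a 3 x 4 dense matrix could look like:

    3 4
    0.12 0.80 0.05 0.33
    0.41 0.02 0.67 0.10
    0.25 0.55 0.09 0.71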
The output of the program is a file containing the cluster index of each data item. The number in the first row is the number of data items clustered. See a sample output.
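For instance, assuming one cluster index per line (see the linked sample output for the exact layout), clustering 5 data items into 2 clusters might produce a file like:

    5
    0
    1
    1
    0
    1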
- 6 initialization methods, selected with the '-i' option:
- random ID ('-i r'): randomly assign each data item a cluster index ranging from 0 to #-1, where # is the number of clusters specified by the user with the '-c #' option.
- random perturb ('-i p'): generate the centroid of the whole dataset and then perturb around it. The magnitude of the perturbation can be set using '-p #'. Default is 0.5.
- initializing from a file ('-i f file_name'): read the cluster index of each data item from a file that has the same format as the sample output. If the cluster index of some data item is negative, the corresponding data item will not be used to generate the initial cluster centroids.
- random seed ('-i c'): randomly pick # data items as the initial cluster centroids.
- farthest_0 ('-i s'): randomly pick 1 data item as the first cluster centroid; after that, repeatedly pick the data item 'farthest' from all the cluster centroids already picked, until all the cluster centroids have been picked.
- farthest_1 ('-i S'): differs from farthest_0 only in the way the first cluster centroid is picked: it is the data item 'farthest' from the centroid of the whole dataset; the rest is the same as farthest_0. farthest_1 is the default (a sketch of the farthest-first idea follows this list).
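A minimal sketch of the farthest-first idea behind the last two options (an illustration only; Gmeans' notion of 'farthest' depends on the chosen distance measure, while this sketch simply uses Euclidean distance and takes the index of the first seed as an argument):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    double euclidean(const Point& a, const Point& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Farthest-first seeding: start from 'first' and repeatedly add the point
    // whose distance to its nearest already-chosen seed is largest.
    std::vector<std::size_t> farthest_first(const std::vector<Point>& data,
                                            std::size_t first, std::size_t k) {
        std::vector<std::size_t> seeds = {first};
        std::vector<double> d(data.size());
        for (std::size_t i = 0; i < data.size(); ++i) d[i] = euclidean(data[i], data[first]);
        while (seeds.size() < k) {
            std::size_t best = 0;
            for (std::size_t i = 1; i < data.size(); ++i)
                if (d[i] > d[best]) best = i;
            seeds.push_back(best);
            for (std::size_t i = 0; i < data.size(); ++i)
                d[i] = std::min(d[i], euclidean(data[i], data[best]));
        }
        return seeds;
    }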
Some statistics are reported on the screen, including the objective function value; if the '-T' option is also given, you will see the confusion matrix, cluster purity, entropy, mutual information of the confusion matrix, micro-averaged precision, F-measure, computation time, etc. The '-T' option gives the file containing the true label of each data item; the format of this file is exactly the same as that of the output file. If the user wants to see how different the results of two runs of Gmeans are, he can give the output of one run to the '-T' option and the output of the other run as the initialization.
If the user just wants to evaluate a clustering result, the '-U clustering_result_file' option is used. The program then reads in the data matrix and this clustering_result_file (in the format described above) and computes the objective function value, as well as the other statistics if the true labels are known.
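As a quick illustration of two of these statistics (a sketch using the standard definitions, not Gmeans' own code; it assumes every cluster in the confusion matrix is non-empty), cluster purity and entropy can be computed from the confusion matrix as follows:

    #include <cmath>
    #include <vector>

    // confusion[c][t] = number of items of true class t assigned to cluster c.
    // Purity: fraction of items that fall in the majority class of their cluster.
    // Entropy: class-distribution entropy of each cluster, weighted by cluster size.
    void purity_and_entropy(const std::vector<std::vector<int>>& confusion,
                            double& purity, double& entropy) {
        double total = 0.0, majority = 0.0, weighted_entropy = 0.0;
        for (const auto& row : confusion) {
            double size = 0.0, largest = 0.0;
            for (int n : row) { size += n; if (n > largest) largest = n; }
            double h = 0.0;
            for (int n : row)
                if (n > 0) { double p = n / size; h -= p * std::log2(p); }
            total += size;
            majority += largest;
            weighted_entropy += size * h;
        }
        purity = majority / total;
        entropy = weighted_entropy / total;
    }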
- Concept Decompositions for Large Sparse Text Data using Clustering,
I. S. Dhillon and D. M. Modha,
Machine Learning, vol. 42:1, pages 143-175, January 2001.
Download: [ps,
pdf]
(An earlier version appears as IBM Research Report RJ 10147, July 8, 1999.)
-
Efficient Clustering of Very Large Document Collections,
I. S. Dhillon, J. Fan and Y. Guan,
Invited book chapter in Data Mining for Scientific and Engineering Applications, Kluwer, pages 357-381, 2001.
Download: [ps,
pdf,
HTML]
-
Iterative Clustering of High Dimensional Text Data Augmented by Local Search,
I. S. Dhillon, Y. Guan and J. Kogan,
Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi, Japan, December 9-12, 2002.
Download: [ps,
pdf]
(Also, appears as Technical Report, "Refining Clusters in High Dimensional
Text Data", The University of Texas at Austin, Department of Computer
Sciences. TR-02-03. January 2002. 18 pages.)
- Information-Theoretic Clustering of Sparse
Co-Occurrence Data,
I. S. Dhillon and Y. Guan,
Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, Florida, USA, November 19-22, 2003.
Download: [ps,
pdf]
(A longer version appears as UTCS Technical Report #TR-03-39, September 2003.
[Abstract & Download])
(Also, appears as "Clustering Large and Sparse Co-Occurrence Data", Workshop on Clustering High-Dimensional Data and its Applications
at The Third SIAM International Conference on Data Mining, May 2003.
Download: [ps,
pdf])
- Diametrical Clustering for identifying anti-correlated gene clusters,
I. S. Dhillon, E. M. Marcotte and U. Roshan,
Bioinformatics, vol. 19(13), pages 1612-1619, 2003.
Download: [ps,
pdf]
(Also, appears as UTCS Technical Report #TR-02-49, September 2002.
[Abstract & Download])
-
BibTeX entries:
@ARTICLE{dhillon:modha:mlj01,
AUTHOR = {Dhillon, I. S. and Modha, D.
S.},
TITLE = { Concept decompositions for
large sparse text data using clustering},
JOURNAL = {Machine Learning},
YEAR = {2001},
MONTH = {Jan},
VOLUME = {42},
NUMBER = {1},
PAGES = {143--175} }
@INCOLLECTION{dhillon:fan:guan00,
AUTHOR = {Dhillon, I. S. and Fan, J. and
Guan, Y.},
TITLE = {Efficient Clustering of Very
Large Document Collections},
BOOKTITLE = {Data Mining for Scientific
and Engineering Applications},
PUBLISHER = {Kluwer Academic
Publishers},
EDITOR = {R. Grossman, C. Kamath, V.
Kumar and R. Namburu},
YEAR = {2001},
PAGES = {357-381},
NOTE = {Invited book chapter}}
@INPROCEEDINGS{dhillon:guan:kogan:2002,
AUTHOR = {Dhillon, I. S. and Guan, Y.
and Kogan, J.},
TITLE = {Iterative Clustering of High
Dimensional Text Data Augmented by Local Search},
PAGES = {},
BOOKTITLE = {Proceedings of the 2002 IEEE
International Conference on Data Mining},
YEAR = {2002},
EDITOR = {},
PUBLISHER = {} }
@INPROCEEDINGS{dhillon:guan:2003a,
AUTHOR = {Dhillon, I. S. and Guan, Y.},
TITLE = {Clustering Large and Sparse
Co-occurrence Data},
PAGES = {},
BOOKTITLE = {Proceedings of the Workshop
on Clustering High-Dimensional Data and
its Applications at the Third
SIAM International Conference on Data Mining},
YEAR = {2003} }