Co-clustering Software (Version 1.0)

Announcement

A new version (Version 1.1) of the Co-cluster software is available.
Some of the datasets used in our experiments are also available.

The program

Co-cluster (Version 1.0) is a C++ program written by Hyuk Cho, Yuqiang Guan and Suvrit Sra. It implements three co-clustering algorithms: the information-theoretic co-clustering algorithm and two types of minimum sum-squared residue co-clustering algorithms (see the papers for details). In our implementation, all the algorithms have a ping-pong structure, i.e., a batch algorithm followed by a corresponding chain of first variations. Each algorithm also has five variations, which differ in the order in which the row and column centroids are updated.
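To make the objective behind the minimum sum-squared residue algorithms concrete, here is a minimal Python sketch. It is not taken from the Co-cluster source. It assumes the simplest residue, in which each matrix entry is compared with the mean of the co-cluster it falls in; the cited papers define the two residue types precisely. The function names and the toy matrix are illustrative only.

```python
def cocluster_means(A, row_labels, col_labels, k, l):
    """Mean of the entries in each of the k x l co-clusters (0.0 if empty)."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    return [
        [sums[I][J] / counts[I][J] if counts[I][J] else 0.0 for J in range(l)]
        for I in range(k)
    ]


def sum_squared_residue(A, row_labels, col_labels, k, l):
    """Sum over all entries of (entry - mean of its co-cluster)^2.

    Assumption: this is the simple 'entry minus co-cluster mean' residue,
    used here only to illustrate the kind of objective being minimized.
    """
    means = cocluster_means(A, row_labels, col_labels, k, l)
    total = 0.0
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            diff = value - means[row_labels[i]][col_labels[j]]
            total += diff * diff
    return total


# Toy 3 x 4 matrix with an obvious 2 x 2 block structure.
A = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [9.0, 9.0, 3.0, 3.0],
]
col_labels = [0, 0, 1, 1]

good = sum_squared_residue(A, [0, 0, 1], col_labels, 2, 2)  # rows grouped correctly
bad = sum_squared_residue(A, [0, 1, 1], col_labels, 2, 2)   # row 1 misassigned
print(good, bad)
```

A batch step or a first-variation move in such a scheme would be accepted only if it lowers this value; the misassigned row above gives a strictly larger residue than the correct grouping.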

Input

The input matrix to be co-clustered can be either a sparse matrix in CCS format or a dense matrix. For a sparse matrix, the sample input files look like: example1_col_ccs, example1_dim, example1_row_ccs, example1_txx_nz. For a dense matrix, the sample files look like: example1_dim, example1. The '-F' option controls the format of the input matrix.
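For readers unfamiliar with CCS (compressed column storage), the following Python sketch expands a CCS triple into a dense matrix. It works on in-memory lists rather than on the sample files. How the three arrays map onto the _col_ccs, _row_ccs and _txx_nz files, and the use of 0-based indices, are assumptions made for illustration; check the README for the actual file layout.

```python
def ccs_to_dense(n_rows, n_cols, col_ptr, row_idx, values):
    """Expand a CCS matrix into a dense list of rows.

    col_ptr has n_cols + 1 entries; the nonzeros of column j are stored at
    positions col_ptr[j] .. col_ptr[j + 1] - 1 of row_idx and values.
    Assumption: all indices are 0-based.
    """
    if len(col_ptr) != n_cols + 1:
        raise ValueError("col_ptr must have n_cols + 1 entries")
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for pos in range(col_ptr[j], col_ptr[j + 1]):
            dense[row_idx[pos]][j] = float(values[pos])
    return dense


# The 3 x 4 matrix
#   1 0 0 2
#   0 3 0 0
#   4 0 0 5
# stored column by column (column 2 is empty).
col_ptr = [0, 2, 3, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [1, 4, 3, 2, 5]

dense = ccs_to_dense(3, 4, col_ptr, row_idx, values)
for row in dense:
    print(row)
```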
The initial seeding file may be in one of two formats. The simple format has only two lines: the first line contains the cluster ID for each row of the input matrix, and the second line contains the cluster ID for each column. A more complicated format describes each co-cluster by giving the number of rows and columns in that co-cluster, together with the IDs of those rows and columns. The clustering output also has these two formats. The '-i' option controls initial seeding.
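As an illustration of the simple two-line format, the Python sketch below parses such a file and groups the rows and columns by co-cluster, which is the information the more complicated format carries. It assumes whitespace-separated integer cluster IDs; the exact delimiters and the on-disk layout of the complicated format are not specified here, so treat the helper names and the sample text as hypothetical.

```python
def parse_simple_seeding(text):
    """Parse the two-line format: row cluster IDs, then column cluster IDs.

    Assumption: IDs are integers separated by whitespace.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly two non-empty lines")
    row_labels = [int(token) for token in lines[0].split()]
    col_labels = [int(token) for token in lines[1].split()]
    return row_labels, col_labels


def to_coclusters(row_labels, col_labels):
    """Map each (row cluster, column cluster) pair to its row and column IDs."""
    rows_of = {}
    for i, label in enumerate(row_labels):
        rows_of.setdefault(label, []).append(i)
    cols_of = {}
    for j, label in enumerate(col_labels):
        cols_of.setdefault(label, []).append(j)
    return {
        (I, J): (rows_of[I], cols_of[J])
        for I in sorted(rows_of)
        for J in sorted(cols_of)
    }


sample = "0 0 1\n0 1 1 0\n"  # 3 rows, 4 columns, 2 x 2 co-clusters
row_labels, col_labels = parse_simple_seeding(sample)
coclusters = to_coclusters(row_labels, col_labels)
for key, (rows, cols) in coclusters.items():
    print(key, len(rows), len(cols), rows, cols)
```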
Sometimes a true label file exists for the columns, the rows, or both. The '-T' option controls this.

Output

The output file formats are the same as those of the initial seeding files, because we may sometimes want to initialize the clustering with the output of a previous run. The '-O' option controls this.

Other command options

'-a' chooses the co-clustering algorithm: information-theoretic co-clustering (default) or one of the minimum sum-squared residue co-clusterings
'-c' gives the number of column clusters and '-r' gives the number of row clusters
'-F' specifies the input matrix format
'-t' specifies the scaling method (CCS only)
'-T' specifies true label files
'-O' specifies output file names
'-l' gives the first variation chain length (default 0)
'-R' gives the number of random runs; this produces the average objective function value and the variance
'-p' selects among the prior options
'-e' gives the threshold for the batch loop and the first variation
'-d' sets the level of dump information
'-I' allows the rows of the input matrix to be negated
'-V' selects the variation of the chosen algorithm (see the '-a' option)
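
The summary reported for '-R' can be reproduced from a list of per-run objective values, as in the small Python sketch below. The values are made up for illustration, and the page does not say whether the program reports the population or the sample variance; the population variance is assumed here.

```python
import statistics


def summarize_runs(objective_values):
    """Return (mean, population variance) of per-run objective values.

    Assumption: population variance (divide by n); the program may instead
    use the sample variance (divide by n - 1).
    """
    mean = statistics.fmean(objective_values)
    variance = statistics.pvariance(objective_values)
    return mean, variance


# Hypothetical objective values from eight random runs.
runs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance = summarize_runs(runs)
print(mean, variance)
```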

Download

README of the code is here.
The code is released under the GNU General Public License (GPL).
The code has been compiled using gcc 3.0.3 in Solaris and Linux.
Bug reports and comments are always appreciated.

Please enter your e-mail address and press the left button below.

Then if you see the following message, the software has been sent to you via e-mail.
Otherwise, request the software directly from me.

" Thanks for downloading Cocluster software.
Good luck!
Hyuk Cho"

Citation

You are welcome to use the code under the terms of the GNU General Public License (GPL); however, please acknowledge its use with a citation:

Co-clustering of Human Cancer Microarrays using Minimum Sum-Squared Residue Co-clustering, H. Cho and I.S. Dhillon, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 5:3, pages 385-400, July 2008.
Download: [pdf]
Minimum Sum-Squared Residue Co-clustering of Gene Expression Data, H. Cho, I.S. Dhillon, Y. Guan and S. Sra, Proceedings of The Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.
Download: [ps, pdf]
Information Theoretic Clustering of Sparse Co-Occurrence Data, I.S. Dhillon and Y. Guan, Proceedings of The Third IEEE International Conference on Data Mining, pages 517-520, November 2003.
Download: [ps, pdf]
(A longer version appears as UTCS Technical Report #TR-03-39, September 2003. [Abstract & Download])
(Also appears as "Clustering Large and Sparse Co-Occurrence Data", Workshop on Clustering High-Dimensional Data and its Applications at The Third SIAM International Conference on Data Mining, May 2003. Download: [ps, pdf])
Information-Theoretic Co-clustering, I.S. Dhillon, S. Mallela and D.S. Modha, Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 89-98, August 2003.
Download: [ps, pdf]
(Also appears as UTCS Technical Report #TR-03-12, April 2003. [Abstract & Download])