Co-clustering Software (Version 1.0)
Announcement
The program
Co-cluster (Version 1.0) is a C++ program written by Hyuk Cho, Yuqiang
Guan and Suvrit Sra. It implements three co-clustering algorithms: the
information-theoretic co-clustering algorithm and two types of minimum
sum-squared residue co-clustering algorithms (see the papers listed
under Citation for details). In our implementation, all the algorithms
have a ping-pong structure, i.e., a batch algorithm followed by a
corresponding chain of first variations. Each algorithm also has five
variations, which differ in the order in which the row and column
centroids are updated.
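The ping-pong structure can be pictured with the following minimal sketch. It is not taken from the Co-cluster source: the names pingPong, BatchPass and FirstVariationChain are hypothetical, the objective is assumed to be minimized, and both callables are assumed never to increase it. The chainLength and epsilon parameters play the roles that the '-l' and '-e' options are described as having.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callable: performs one batch update of the co-clustering and
// returns the new objective value (assumed never larger than the input).
using BatchPass = std::function<double(double)>;

// Hypothetical callable: tries a chain of at most `chainLength` single
// row/column moves and returns the new objective value (assumed never
// larger than the input).
using FirstVariationChain = std::function<double(double, int)>;

// Alternates between the batch phase and a first-variation chain until
// neither improves the objective by more than epsilon.
double pingPong(double objective,
                const BatchPass& batchPass,
                const FirstVariationChain& firstVariation,
                int chainLength,
                double epsilon) {
    assert(epsilon >= 0.0);
    for (;;) {
        // Batch phase: repeat until a pass gains no more than epsilon.
        for (;;) {
            const double next = batchPass(objective);
            const double gain = objective - next;
            objective = next;
            if (gain <= epsilon) break;
        }
        // A chain length of 0 means no first variation is attempted.
        if (chainLength <= 0) break;
        // First-variation phase: one chain, then back to the batch phase
        // if the chain gained more than epsilon.
        const double next = firstVariation(objective, chainLength);
        const double gain = objective - next;
        objective = next;
        if (gain <= epsilon) break;
    }
    return objective;
}
```

In this sketch the batch phase can stall at a local minimum, and the first-variation chain is what gives it a chance to move on before the batch phase resumes.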
- The input matrix to be co-clustered can be either a sparse matrix
in CCS (compressed column storage) format or a dense matrix. For a
sparse matrix, the sample input files look like: example1_col_ccs,
example1_dim, example1_row_ccs, example1_txx_nz. For a dense matrix,
the sample files look like: example1_dim, example1. The '-F' option
controls the format of the input matrix.
- The initial seeding file may be in one of two formats. The simple
format has only two lines: the first line contains the cluster ID of
each row of the input matrix and the second line contains the cluster
ID of each column. A more complicated format describes each co-cluster
by giving the number of rows and columns in that co-cluster, and the
IDs of the rows and columns in the co-cluster. The clustering output
also has these two formats. The '-i' option controls initial seeding.
- Sometimes a true label file exists for the columns, the rows, or
both. The '-T' option specifies these files.
- The output file formats are the same as those of the initial seeding
files, because we may sometimes want to initialize the clustering with
the output of a previous run. The '-O' option controls this.
- '-a' chooses the co-clustering algorithm: information-theoretic
co-clustering (default) or one of the minimum sum-squared residue
co-clustering algorithms
- '-c' gives number of column clusters and '-r' gives number of
row clusters
- '-F' specifies input matrix format.
- '-t' specifies the scaling method (CCS input only)
- '-T' specifies true label files
- '-O' specifies output file names
- '-l' gives the first variation chain length (default 0)
- '-R' gives the number of random runs; this produces the average
objective function value and the variance
- '-p' selects the prior option
- '-e' gives the threshold for the batch loop and first variation
- '-d' sets the level of dump information
- '-I' allows rows of the input matrix to be negated
- '-V' selects the variation of the algorithm chosen with '-a'
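As an illustration of the simple seeding format described above (two lines, row cluster IDs then column cluster IDs), here is a minimal reader sketch. It is not part of Co-cluster: the names SimpleSeeding, parseIds and readSimpleSeeding are hypothetical, and the sketch assumes that the IDs on each line are integers separated by whitespace, which the description above does not state.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of the simple seeding format.
struct SimpleSeeding {
    std::vector<int> rowCluster;  // rowCluster[i] = cluster ID of row i
    std::vector<int> colCluster;  // colCluster[j] = cluster ID of column j
};

// Splits one line into integer IDs (assumption: whitespace-separated).
std::vector<int> parseIds(const std::string& line) {
    std::istringstream in(line);
    std::vector<int> ids;
    for (int id; in >> id;) {
        ids.push_back(id);
    }
    return ids;
}

// Reads the two lines of the simple format: row IDs first, column IDs second.
SimpleSeeding readSimpleSeeding(std::istream& in) {
    std::string rowLine;
    std::string colLine;
    if (!std::getline(in, rowLine) || !std::getline(in, colLine)) {
        throw std::runtime_error("expected two lines of cluster IDs");
    }
    return SimpleSeeding{parseIds(rowLine), parseIds(colLine)};
}
```

Because the output files use the same formats as the seeding files, a reader along these lines could in principle also load the result of a previous run, provided the whitespace assumption holds for the actual files.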
Download
Citation
You are welcome to use the code under the terms of the GNU General Public License (GPL); however, please acknowledge its use with a citation:
- Co-clustering of Human Cancer Microarrays using Minimum Sum-Squared Residue Co-clustering,
H. Cho and I.S. Dhillon,
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 5:3, pages 385-400, July 2008.
Download: [pdf]
- Minimum Sum-Squared Residue Co-clustering of Gene Expression Data,
H. Cho, I.S. Dhillon, Y. Guan and S. Sra,
Proceedings of The Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.
Download: [ps, pdf]
- Information Theoretic Clustering of Sparse Co-Occurrence Data,
I.S. Dhillon and Y. Guan,
Proceedings of The Third IEEE International Conference on Data Mining, pages 517-520, November 2003.
Download: [ps, pdf]
(A longer version appears as UTCS Technical Report #TR-03-39, September 2003. [Abstract & Download])
(Also appears as "Clustering Large and Sparse Co-Occurrence Data", Workshop on Clustering High-Dimensional Data and its Applications at The Third SIAM International Conference on Data Mining, May 2003. Download: [ps, pdf])
- Information-Theoretic Co-clustering,
I. S. Dhillon, S. Mallela, and D. S. Modha,
Proceedings of The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 89-98, August 2003.
Download: [ps, pdf]
(Also appears as UTCS Technical Report #TR-03-12, April 2003. [Abstract & Download])