MC is a C++ program that creates vector-space models from text documents for use in text mining applications. It provides an efficient multi-threaded implementation that can process very large document collections. For example, on a Sun Ultra10 workstation, MC took 1,189 seconds and used only 17.5 MB of main memory to process a sample collection of about 114,000 documents. More details on MC and its use in a fast clustering algorithm are available in this paper.
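To make the term "vector-space model" concrete, here is a minimal sketch of the general idea: each document becomes a sparse vector of term counts over a shared vocabulary. This is not MC's code or its algorithm. The names `VectorSpaceModel`, `tokenize`, and `buildModel` are hypothetical, the tokenizer is a deliberately naive assumption (lowercase alphabetic runs only), and the sketch is single-threaded.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; these are not MC's data structures.
struct VectorSpaceModel {
    std::vector<std::string> vocabulary;                // term id -> term
    std::vector<std::map<std::size_t, int>> documents;  // per document: term id -> count
};

// Assumed tokenizer: split text into lowercase runs of alphabetic characters.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Build sparse term-frequency vectors. Term ids are assigned in order of
// first appearance across the collection.
VectorSpaceModel buildModel(const std::vector<std::string>& texts) {
    VectorSpaceModel model;
    std::unordered_map<std::string, std::size_t> termIds;
    for (const std::string& text : texts) {
        std::map<std::size_t, int> counts;
        for (const std::string& token : tokenize(text)) {
            // emplace only inserts when the term has not been seen before.
            auto inserted = termIds.emplace(token, model.vocabulary.size());
            if (inserted.second) model.vocabulary.push_back(token);
            ++counts[inserted.first->second];
        }
        model.documents.push_back(counts);
    }
    return model;
}
```

Calling `buildModel` on a vector of document strings returns one sparse count map per document. Storing only the nonzero counts is what keeps the representation small when the vocabulary is large; practical tools usually add steps such as stop-word removal and term weighting, which are not shown here.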
The MC program:
The application does not:
MC was developed on the Sun Solaris operating system. It is known to compile on Linux platforms. Most UNIX systems should be compatible with MC.
The code is released under the GNU General Public License (GPL).
Dhillon, I. S. and Modha, D. S., "Concept Decompositions for Large Sparse Text Data using Clustering", Machine Learning, 42:1, pages 143-175, Jan, 2001.
Dhillon, I. S. and Fan, J. and Guan, Y., "Efficient Clustering of Very Large Document Collections", invited book chapter in Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.
Here are the BibTeX entries:
@ARTICLE{dhillon:modha:mlj01,
  AUTHOR = {Dhillon, I. S. and Modha, D. S.},
  TITLE = {Concept decompositions for large sparse text data using clustering},
  JOURNAL = {Machine Learning},
  YEAR = {2001},
  MONTH = {Jan},
  VOLUME = {42},
  NUMBER = {1},
  PAGES = {143--175}
}

@INCOLLECTION{dhillon:fan:guan00,
  AUTHOR = {Dhillon, I. S. and Fan, J. and Guan, Y.},
  TITLE = {Efficient Clustering of Very Large Document Collections},
  BOOKTITLE = {Data Mining for Scientific and Engineering Applications},
  PUBLISHER = {Kluwer Academic Publishers},
  EDITOR = {R. Grossman, C. Kamath, V. Kumar and R. Namburu},
  YEAR = {2001},
  PAGES = {},
  NOTE = {Invited book chapter}
}
The latest source code for the program can be downloaded from here.
Unfortunately, we do not have time to help users with all of their compilation and usage problems. Feel free to email us with questions or feedback, but please do not expect that we will always have time to help. Bug reports accompanied by fixes are most appreciated.