This collection contains text datasets in the Weka sparse ARFF format. The original text documents were tokenized and pre-processed using the MC code of James Fan, which removed stop-words and filtered out words with very high or very low frequencies. A Perl script then converted the resulting CCS files to sparse ARFF format and replaced HTML tokens with generic token IDs.
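
For readers unfamiliar with the sparse ARFF layout, each data row lists only the non-zero attributes as "index value" pairs inside curly braces, with attribute indices starting at 0. The following Python sketch shows one way such a row could be read. The function name parse_sparse_row is hypothetical, values are returned as strings, and the sketch assumes that values contain no commas (quoted values with embedded commas are not handled).

```python
def parse_sparse_row(line):
    """Parse one sparse ARFF data row such as '{0 2, 5 1.5, 9 classA}'.

    Returns a dict mapping the 0-based attribute index to its value
    (kept as a string). Attributes missing from the dict are zero.
    Assumption: no value contains a comma.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF row: %r" % line)
    entries = {}
    for pair in body[1:-1].split(","):
        pair = pair.strip()
        if not pair:
            continue  # an empty row '{}' has no pairs
        index, value = pair.split(None, 1)  # split on the first whitespace run
        entries[int(index)] = value.strip()
    return entries
```

Calling parse_sparse_row("{0 2, 5 1.5, 9 classA}") gives a dictionary with entries for attributes 0, 5 and 9 only; every other attribute in the row is implicitly zero.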

tfidf: Word frequencies in the datasets have been TF-IDF normalized.
frequency: Original word frequencies are provided.
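
The exact TF-IDF variant used to produce the tfidf files is not stated here. As an illustration only, the Python sketch below applies one common variant to raw counts: each count is multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word, and each document vector is then scaled to unit length. The function name tfidf and the choice of logarithm and length normalization are assumptions, so the output need not match the provided files.

```python
import math


def tfidf(docs):
    """Weight raw term counts with one common TF-IDF variant.

    docs: list of dicts mapping term -> raw count (one dict per document).
    Returns a list of dicts mapping term -> weight, each scaled to unit
    Euclidean length (documents whose weights are all zero stay unscaled).
    Assumption: weight = count * log(N / df), natural logarithm.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        row = {t: c * math.log(n_docs / df[t]) for t, c in doc.items()}
        norm = math.sqrt(sum(w * w for w in row.values()))
        if norm > 0:
            row = {t: w / norm for t, w in row.items()}
        weighted.append(row)
    return weighted
```

Under this variant, a word that occurs in every document receives weight zero, which is why the frequency files are the better starting point when a different weighting scheme is wanted.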

The following non-sparse ARFF files are not part of the WEKA data release:

protein: Protein dataset used by Xing et al.
