This collection contains text datasets in the Weka sparse ARFF format. The
original text documents were tokenized and pre-processed with the MC code of James Fan
to remove stop-words and to filter out words with very high or very low frequencies.
A Perl script was then used to convert the resulting CCS files to sparse ARFF format
and to replace HTML tokens with generic token IDs.
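For readers who want to load the files without Weka, the following is a minimal Python sketch of a sparse ARFF reader. It relies on the general Weka convention that each data row is written as {index value, index value, ...} with 0-based attribute indexes and omitted entries meaning zero. The function names parse_sparse_row and read_sparse_arff are hypothetical, and the sketch assumes attribute names and values contain no quoted spaces or commas, which may not hold for every file here.

```python
def parse_sparse_row(line):
    """Parse one sparse ARFF data row such as '{1 3, 7 2.5, 12 classA}'.

    Returns a dict mapping 0-based attribute index -> value (as a string).
    Assumption: values contain no commas or quoted spaces.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF row: %r" % line)
    row = {}
    for pair in body[1:-1].split(","):
        pair = pair.strip()
        if not pair:
            continue  # empty row '{}' means every attribute is zero
        index, value = pair.split(None, 1)
        row[int(index)] = value
    return row


def read_sparse_arff(lines):
    """Collect attribute names and sparse rows from an iterable of lines."""
    attributes, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            rows.append(parse_sparse_row(line))
        elif line.lower().startswith("@attribute"):
            # Assumption: unquoted attribute name in the second field
            attributes.append(line.split(None, 2)[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows
```

Calling read_sparse_arff(open(path)) on one of the files would return the attribute names and one dict per document; the path is whatever file you are loading.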
tfidf: Word frequencies in the datasets have been TF-IDF normalized.
frequency: Original word frequencies are provided.
protein: Protein dataset used by Xing et al.
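
To illustrate how the tfidf and frequency variants relate, here is a small Python sketch that reweights raw counts. The exact TF-IDF formula used to produce these files is not stated here, so the common form tf * log(N / df) is an assumption, as is the function name tfidf; values computed this way may not match the distributed files exactly.

```python
import math


def tfidf(docs):
    """Reweight raw term counts with tf * log(N / df).

    docs: list of dicts mapping term index -> raw count (one dict per document).
    Assumption: this is one common TF-IDF variant, not necessarily the one
    used to build the tfidf datasets.
    """
    n_docs = len(docs)
    df = {}  # document frequency: number of documents containing each term
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in doc.items()}
        for doc in docs
    ]
```

Under this formula a term that occurs in every document gets weight zero, which is why very frequent words contribute little after normalization.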