Making Weka Text-friendly
 
 
- Preprocess text by making wrapper calls to:
  - Mooney’s IR package: tokenization, Porter stemming, TF-IDF
  - McCallum’s BOW package: tokenization, stemming, TF-IDF, information-theoretic pruning, N-gram tokens, different smoothing algorithms
  - Fan’s MC toolkit: tokenization, TF-IDF, pruning, CCS format
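
To make the tokenize and TF-IDF steps concrete, here is a minimal plain-Java sketch. It does not call any of the three packages listed above: the class `TfIdfSketch` and its methods are hypothetical names, the tokenizer is a crude lowercase split with no stemming, and the weight is assumed to be tf * log(N / df), one common variant.

```java
import java.util.*;

class TfIdfSketch {
    // Lowercase and split on non-letter characters (crude tokenizer, no stemming).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // One sparse term -> weight map per document, weight = tf * log(N / df).
    static List<Map<String, Double>> tfidf(List<String> docs) {
        List<Map<String, Double>> vectors = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            Map<String, Double> tf = new HashMap<>();
            for (String tok : tokenize(doc)) tf.merge(tok, 1.0, Double::sum);
            // Each distinct term in this document adds one to its document frequency.
            for (String term : tf.keySet()) df.merge(term, 1, Integer::sum);
            vectors.add(tf);
        }
        int n = docs.size();
        for (Map<String, Double> tf : vectors) {
            tf.replaceAll((term, count) -> count * Math.log((double) n / df.get(term)));
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "Weka handles text", "text mining with Weka", "sparse vectors");
        System.out.println(tfidf(docs));
    }
}
```

A wrapper around a real package would replace `tokenize` and the weighting step with calls into that package and then copy the resulting weights into Weka instances.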
 
- Weka has no inverted index: fine when not doing IR, but KNN is inefficient without one
  - May want to integrate the VSR package of Mooney’s IR package with Weka
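
A rough sketch of why an inverted index helps KNN on text: only documents that share at least one term with the query are ever scored. This is plain Java and is not the VSR package's API; `InvertedIndexSketch`, `add`, `scores`, and `nearest` are hypothetical names, and similarity is assumed to be an unnormalized dot product.

```java
import java.util.*;

class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final List<Map<String, Double>> docs = new ArrayList<>();

    void add(Map<String, Double> doc) {
        int id = docs.size();
        docs.add(doc);
        for (String term : doc.keySet()) {
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
        }
    }

    // Accumulate dot products only for documents sharing a term with the query.
    Map<Integer, Double> scores(Map<String, Double> query) {
        Map<Integer, Double> acc = new HashMap<>();
        for (Map.Entry<String, Double> q : query.entrySet()) {
            List<Integer> ids = postings.getOrDefault(q.getKey(), Collections.emptyList());
            for (int id : ids) {
                acc.merge(id, q.getValue() * docs.get(id).get(q.getKey()), Double::sum);
            }
        }
        return acc;
    }

    // Ids of the k highest-scoring documents, best first.
    List<Integer> nearest(Map<String, Double> query, int k) {
        Map<Integer, Double> acc = scores(query);
        List<Integer> ids = new ArrayList<>(acc.keySet());
        ids.sort((a, b) -> Double.compare(acc.get(b), acc.get(a)));
        return ids.subList(0, Math.min(k, ids.size()));
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(Collections.singletonMap("text", 2.0));
        index.add(Collections.singletonMap("sparse", 1.0));
        System.out.println(index.nearest(Collections.singletonMap("text", 1.0), 1));
    }
}
```

Without the index, a KNN classifier has to compare the query against every training document, including the many that share no terms with it.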
 
- Probabilities currently underflow: calculations have to be done in log space
  - NaiveBayes, KNN, etc.: could provide two versions of each (sparse and dense)
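
The underflow point can be illustrated with a small sketch: multiplying many small per-token probabilities reaches 0.0 in double precision, while summing their logs stays finite. This is not Weka's NaiveBayes code; `LogSpaceSketch`, `logJoint`, and `normalize` are hypothetical names, and the log-sum-exp step is one standard way to recover posteriors from log scores.

```java
class LogSpaceSketch {
    // Sum of log-probabilities replaces the product of probabilities.
    static double logJoint(double logPrior, double[] logLikelihoods) {
        double sum = logPrior;
        for (double ll : logLikelihoods) sum += ll;
        return sum;
    }

    // Turn per-class log scores into posteriors using the log-sum-exp trick:
    // subtracting the maximum keeps every exponent at or below zero.
    static double[] normalize(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) max = Math.max(max, s);
        double total = 0.0;
        for (double s : logScores) total += Math.exp(s - max);
        double[] posterior = new double[logScores.length];
        for (int i = 0; i < posterior.length; i++) {
            posterior[i] = Math.exp(logScores[i] - max) / total;
        }
        return posterior;
    }

    public static void main(String[] args) {
        double product = 1.0;
        double[] logs = new double[1000];
        for (int i = 0; i < logs.length; i++) {
            product *= 1e-5;            // direct product of 1000 small probabilities
            logs[i] = Math.log(1e-5);   // same quantities in log space
        }
        System.out.println(product);                          // underflows to 0.0
        System.out.println(logJoint(Math.log(0.5), logs));    // large negative, but finite
    }
}
```

Long documents make this worse, since every additional token multiplies in another small factor.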
 
- Sparse vector format:
  - Weka’s SparseInstance
  - IR’s hashMapVector
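
The two candidates suggest two different layouts. The sketch below contrasts them on a dot product, assuming that an index-based sparse vector keeps parallel arrays of sorted attribute indices and values, and that a hash-based one maps tokens to weights. It uses neither class's real API; `SparseVectorSketch`, `dotSorted`, and `dotHashed` are hypothetical names.

```java
import java.util.*;

class SparseVectorSketch {
    // Index-based layout: merge two sorted index arrays, multiplying where they match.
    static double dotSorted(int[] ia, double[] va, int[] ib, double[] vb) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < ia.length && j < ib.length) {
            if (ia[i] == ib[j]) {
                sum += va[i++] * vb[j++];
            } else if (ia[i] < ib[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    // Hash-based layout: iterate the smaller map and look each token up in the larger.
    static double dotHashed(Map<String, Double> a, Map<String, Double> b) {
        if (a.size() > b.size()) return dotHashed(b, a);
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(dotSorted(
            new int[] {0, 3, 7}, new double[] {1, 2, 3},
            new int[] {3, 7, 9}, new double[] {4, 5, 6}));
    }
}
```

The index-based form needs a fixed attribute dictionary up front, while the hash-based form can accept unseen tokens as documents arrive; which trade-off suits Weka better is left open here.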