Making Weka Text-friendly
Preprocess text by making wrapper calls to:
- Mooney’s IR package: Tokenize, Porter Stemming, TFIDF
- McCallum’s BOW package: Tokenize, Stem, TFIDF, Information-theoretic pruning, N-gram tokens, different smoothing algorithms
- Fan’s MC toolkit: Tokenize, TFIDF, pruning, CCS (compressed column storage) output format
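A minimal sketch of the tokenize and TFIDF steps that all three packages provide. It is not the code of any of those packages: the function names `tokenize` and `tfidf` are hypothetical, the tokenizer is deliberately crude, stemming is omitted, and the weighting assumes the plain tf * log(N / df) form.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

docs = ["the cat sat", "the dog sat down", "the bird flew"]
vectors = tfidf(docs)
```

A term that occurs in every document (here "the") gets weight 0, which is the behaviour the pruning steps listed above typically exploit.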
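No inverted index in Weka: OK if not doing IR, but KNN is inefficient because each query must be compared against every training document
- May want to integrate the IR package's VSR (vector-space retrieval) component with Weka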
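A sketch of why an inverted index helps KNN: only documents that share at least one term with the query are ever scored. This is an illustration under assumptions, not the VSR package's implementation; `build_inverted_index` and `nearest_neighbors` are hypothetical names, and documents are assumed to be {term: weight} dicts compared by cosine similarity.

```python
import math
from collections import defaultdict

def build_inverted_index(vectors):
    """Map each term to a postings list of (doc_id, weight), skipping zero weights."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            if weight != 0:
                index[term].append((doc_id, weight))
    return index

def nearest_neighbors(query, index, norms, k):
    """Return ids of the k documents with the highest cosine similarity to query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        if q_weight == 0:
            continue
        # Only documents in this term's postings list are touched.
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += q_weight * weight
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    ranked = sorted(((s / (q_norm * norms[d]), d) for d, s in scores.items()),
                    reverse=True)
    return [d for _, d in ranked[:k]]

training = [{"cat": 1.0, "sat": 1.0}, {"dog": 1.0, "sat": 1.0}, {"bird": 2.0}]
norms = [math.sqrt(sum(w * w for w in v.values())) for v in training]
index = build_inverted_index(training)
neighbors = nearest_neighbors({"sat": 1.0, "dog": 1.0}, index, norms, k=2)
```

Document 2 shares no term with the query, so it is never visited; a scan over all instances would compute its similarity anyway.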
Probability underflow is currently a problem: products of many small probabilities must be computed as sums of logs
- NaiveBayes, KNN, etc.: can have 2 versions of each (sparse, dense)
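A small sketch of the underflow point above, using a Naive Bayes style score. The class names, priors, and word probabilities are made up for illustration, and `log_score` and `log_sum_exp` are hypothetical helper names rather than Weka methods.

```python
import math

def log_score(log_prior, log_likelihood, tokens):
    """Unnormalized log posterior: log P(c) + sum over tokens of log P(token | c)."""
    return log_prior + sum(log_likelihood[t] for t in tokens)

def log_sum_exp(values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

tokens = ["w"] * 2000                      # a long document of one repeated word
word_prob = {"A": 0.001, "B": 0.002}       # assumed P("w" | class)
prior = {"A": 0.5, "B": 0.5}

# Direct multiplication underflows to 0.0 for both classes.
direct = {}
for c in word_prob:
    p = prior[c]
    for _ in tokens:
        p *= word_prob[c]
    direct[c] = p

# The same quantities in log space stay finite and comparable.
logs = {c: log_score(math.log(prior[c]), {"w": math.log(word_prob[c])}, tokens)
        for c in word_prob}
best = max(logs, key=logs.get)
posterior_best = math.exp(logs[best] - log_sum_exp(list(logs.values())))
```

With direct products the two classes are indistinguishable (both 0.0), while the log scores still rank them; `log_sum_exp` is only needed when a normalized posterior is wanted.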
Sparse vector format:
- Weka’s SparseInstance
- IR’s HashMapVector
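A sketch of the sparse versus dense distinction, assuming a sparse vector is a mapping that stores only non-zero entries. It mirrors the general idea behind both formats listed above, not their actual APIs; all function names here are hypothetical.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(sparse, length):
    """Expand {index: value} back into a list of the given length."""
    dense = [0.0] * length
    for i, v in sparse.items():
        dense[i] = v
    return dense

def dot_dense(a, b):
    """Dot product that visits every position."""
    return sum(x * y for x, y in zip(a, b))

def dot_sparse(a, b):
    """Dot product that visits only the non-zero entries of the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

dense_a = [0.0, 0.0, 3.0, 0.0, 1.0, 0.0]
dense_b = [2.0, 0.0, 4.0, 0.0, 0.0, 0.0]
sparse_a, sparse_b = to_sparse(dense_a), to_sparse(dense_b)
```

For text, where a document uses a tiny fraction of the vocabulary, the sparse dot product does work proportional to the number of distinct terms in the document rather than to the vocabulary size, which is the motivation for keeping a sparse version of each classifier.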