A basic text categorization framework is available in
/u/mooney/cs343-code/ir/classifiers/
. See the Javadoc
documentation for this code. Right now, ir.classifiers has Naive
Bayes and an incomplete Perceptron
classifier in it. Both of them extend the abstract class Classifier.
The NaiveBayes classifier performs text categorization using the Naive Bayes method, and does Laplace smoothing. It also stores all probabilities internally in log values to prevent underflow problems.
The Perceptron classifier performs text categorization using a separate PerceptronUnit for each category. The category with the highest confidence value returned by its PerceptronUnit is chosen when text documents are classified.
The Yahoo-science document collection in
/u/mooney/ir-code/corpora/yahoo-science/
contains 897 pages: about
300 random samples each from the Yahoo indices for biology, physics, and chemistry. Right now, the
code is set up to categorize the yahoo-science collection into 3 categories -
bio, phys, chem. The NaiveBayes and Perceptron classes both have
train() methods that take in a vector of training documents as Example
objects. They also have a test() method that takes in a test Example and
categorizes it as a bio, phys or chem document.
In vector-space representation a text document is stored as a sparse vector. The size of the vector equals the total number of words in the vocabulary over all documents; each token in the vocabulary is associated with a particular component of the vector. For a particular document, values of the components of its vector correspond to term frequencies - the number of occurrences of each token in this document. Since documents rarely include all words from the vocabulary, the vector is very sparse: most components are 0, corresponding to words that do not occur in the particular document.
Because storing the full vector consisting mostly of zeros for each document would be very inefficient, a document is represented inside the Example object by a HashMapVector. This class is a part of ir.vsr package, you will have to utilize its methods for this assignment. It represents a document vector as a hashtable, where keys are the tokens that occur in a document, and values store the term frequencies.
The CVLearningCurve class generates the learning curves with k-fold (default k=10) cross validation for a classifier. The classifier object is passed as an argument to the CVLearningCurve constructor. TestNaiveBayes and TestPerceptron classes create a NaiveBayes classifier object and a Perceptron classifier object respectively, and run 10-fold cross-validation on the corresponding classifier on the yahoo-science dataset.
The output of CVLearningCurve is two ".gplot" files that gnuplot can use to generate the k-fold learning curve (plot) in postscript. Gnuplot is located in /lusr/gnu/bin/ - you may need to add this to your PATH environment variable (or use the full path when executing the program). For example, running TestNaiveBayes writes "NaiveBayes.gplot" and "NaiveBayesTrain.gplot" files to the current working directory that contain learning curves for the test and training data, respectively. To run gnuplot on a script file, just execute "gnuplot filename.gplot > filename.ps". You can view the ".ps" file with "gv filename.ps" (Ghostview). See a sample trace of running the NaiveBayes classifier on the yahoo-science document collection, and the learning curve produced for the testing data.
The Perceptron classifier is only partially implemented and your task is to complete it. As mentioned above, the Perceptron classifier utilizes multiple PerceptronUnits, one per category. The real-valued output of the classify() method of each PerceptronUnit should be the amount by which the net input exceeds the threshold. This corresponds to the confidence in the prediction of being positive, and is negative if the class is negative. Your task is to fill in classify() and the trainCategory() methods for the ir.classifiers.PerceptronUnit class. As stated in the declaration of the trainCategory method, it trains the perceptron to only fire for examples in the given category, treating examples from all other categories as negative examples. Implement the Perceptron training method based on the pseudocode presented in class. The "-debug" flag to TestPerceptron allows you to insert conditional print statements to aid debugging (see the use of "-debug" in NaiveBayes). Be sure to document your code well.
First, copy the latest version of the ir.* packages from /u/mooney/cs343-code/ir/, the files have changed since Project 1.
Hint 1: You don't need to peruse all the code in ir.* packages. The
classes that you should be familiar with for this assignment are
PerceptronUnit,
Perceptron,
Example,
and HashMapVector.
Hint 2: Remember that each Example
object in the training set stores the values of its features in a HashMapVector.
One can iterate through the Map.Entry's
in the hashmap's entrySet,
where the key of each Map.Entry is the token, and the value is its frequency in
the document represented by a Weight
object. Thus, inputs to the perceptron are integer values, which is a natural
extension of the binary-input perceptron presented in class.
Hint 3: The maximum number of training epochs parameter is currently
set to 100 in PerceptronUnit class. A full run of TestPerceptron with the
correct solution to this project takes approximately 15 minutes on my PC.
You may consider reducing the number of default points or the number of
folds in CVLearningCurve while debugging; be sure, however, to change it back for
your actual data collection runs once you are done debugging.
If Java gives an out-of-memory error, try executing the "java" command
with the "-Xmx256m" option. It allocates more memory to the
virtual machine, preventing out-of-memory errors.
Next, train and test both NaiveBayes and Perceptron on the yahoo-science data by running TestNaiveBayes and TestPerceptron as shown above. You will obtain learning curves for both classifiers for training and testing data. As mentioned above, the curves can be plotted using the gnuplot utility by executing "gnuplot filename.gplot > filename.ps" and then viewing the Postscript file filename.ps by executing "gv filename.ps".
Combine these four curves into two comparitive plots manually: one for training
data, one for testing data. To do this for the training curves, change the line
in PerceptronTrain.gplot file from
plot 'PerceptronTrain.data' title "PerceptronTrain"
to
plot 'PerceptronTrain.data' title "PerceptronTrain", 'NaiveBayesTrain.data'
title "NaiveBayesTrain"
Create the combined plot for testing data
analogously. Name these plots "accuracy-train.ps" and "accuracy-test.ps".
These graphs will allow you to clearly see the relative performance of the two
sytems with respect to their ability to fit training data and to accurately predict
the category of previously unseen documents.
Write a 1-2 page report on your results, containing insightful discussion of the comparative performance of the two algorithms. Discuss the differences in test and training accuracy given different numbers of training exmaples (i.e., at different points on the learning curve). Also discuss the relative training time and test time of the two methods (this is presented at the end of the traces). Hint: Perceptron's test accuracy should be reasonably comparable to NaiveBayes and its training accuracy should be uniformly better, if not, your implementation is buggy.
In submitting your solution, follow these instructions on
using the turnin utility.
For this project, submit a directory that contains: