Like all of the programming assignments in the course, this one will use the
Weka package of machine learning software in Java. See the information
available at http://www.cs.waikato.ac.nz/~ml/weka/. The local version of Weka
used in the class is available at /u/mooney/cs391L-code/weka/; see the README
file in this directory. A tutorial for Weka is available at
/u/mooney/cs391L-code/weka/Tutorial.pdf, and the Javadoc web-based
documentation is available at /u/mooney/cs391L-code/weka/doc/packages.html.
Since the current Weka does not include a version-space algorithm, I have
written one that is available in
/u/mooney/cs391L-code/weka/weka/classifiers/vspace/. The class
ConjunctiveVersionSpace is an implementation for learning conjunctive, nominal
feature descriptions. The code is commented and follows the basic conventions
of a Weka classifier. Some simple datasets for testing this code are in the
"figure" data files in /u/mooney/cs391L-code/weka/data/. The trace at
http://www.cs.utexas.edu/users/mooney/cs391L/hw1/trace.txt shows running this
system on the "figure" data files. Note that the flag "-P" causes
ConjunctiveVersionSpace to produce an incremental trace of processing each
example (see the Weka documentation for an explanation of the other options).
See the comments in VersionSpace for an explanation of two other important
options: -R, which reorders instances for training, and -C, which determines
how test instances are classified when the version space has not converged.
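For example, an invocation along the following lines should reproduce the kind
of incremental trace shown in trace.txt; -t is Weka's standard flag naming the
training file, but the exact file and options used for the trace may differ:

    java weka.classifiers.vspace.ConjunctiveVersionSpace -t /u/mooney/cs391L-code/weka/data/figure-all-red.arff -P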
Due to the restriction to conjunctive binary concepts and its computational
demands, applying the system to large, realistic data sets is difficult. By
simplifying a version of the classic soybean disease diagnosis data, I have
created a dataset, soybean-reduced, available in
/u/mooney/cs391L-code/weka/data/, that is suitable for
ConjunctiveVersionSpace. This data contains 68 instances described with 32
features (symptoms) in 4 classes (diseases). Since VersionSpace only handles
binary concept learning, the Weka meta-learner MultiClassClassifier (in the
weka.classifiers.meta package) must be used to learn one binary concept per
category. MultiClassClassifier requires a DistributionClassifier as a base
classifier; therefore, VersionSpace is a DistributionClassifier that returns a
distribution (a probability for each class). The trace.txt file contains a
trace of running on this data set. The "-R PN" option is used to have
VersionSpace reorder the positive examples first, which greatly improves
training time for reasons that are easily explained.
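As a sketch (not necessarily the exact command used to produce trace.txt),
wrapping the learner this way follows Weka's usual conventions, where -W names
the base classifier and options after "--" are passed through to it:

    java weka.classifiers.meta.MultiClassClassifier -t /u/mooney/cs391L-code/weka/data/soybean-reduced.arff \
        -W weka.classifiers.vspace.ConjunctiveVersionSpace -- -R PN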
This assignment concerns active selection of training examples. Using only unlabeled instances, the best instances to have labeled and used as training examples are those that come closest to "splitting" the version space into two equal halves. This is the same as picking the instances that cause the most "disagreement" among the hypotheses in the version space regarding their categorization, or equivalently those whose categorization is least "certain" given the current version space. Given a set of unlabeled instances, the best one to use as the next training example can be selected based on such criteria. This is called sample selection.
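To make the criterion concrete, here is a minimal Java sketch of the scoring
idea; the Hypothesis type and its covers() test are hypothetical stand-ins for
code you will write, not part of the provided classes:

    import java.util.List;
    import weka.core.Instance;

    public class SelectionScore {

        // Score = |fraction of hypotheses covering x - 1/2|: 0 means the
        // instance splits the version space perfectly in half, 0.5 means
        // every hypothesis agrees on it, so its label teaches us nothing.
        static double score(Instance x, List<Hypothesis> versionSpace) {
            int positive = 0;
            for (Hypothesis h : versionSpace) {
                if (h.covers(x)) positive++;   // h would label x positive
            }
            double fraction = (double) positive / versionSpace.size();
            return Math.abs(fraction - 0.5);
        }

        interface Hypothesis {                 // hypothetical interface
            boolean covers(Instance x);
        }
    }

The unlabeled instance with the lowest score is the one the current hypotheses
disagree about most.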
Write sample-selection code for the basic version-space system. You should
create a specialized class ActiveConjunctiveVersionSpace that extends
ConjunctiveVersionSpace. Given a data set that contains a complete set of
instances (all the examples in the instance space) for a specific concept
(such as figure-all-red.arff and figure-all-red-circle.arff in
/u/mooney/cs391L-code/weka/data/), your system should incrementally pick the
best training example and use it to update the version space. Obviously, it
cannot use the class label of an example when making this selection, since the
label must remain unavailable to the learner until it has chosen a specific
example and then obtains its label from the teacher. It should repeatedly
select training examples until it converges to a single hypothesis, measuring
the number of training examples needed until the concept is uniquely and
correctly identified. Note that this requires altering the current training
process as implemented in the methods buildClassifier and train. It will also
require writing a method that enumerates all the hypotheses in the version
space given the S and G sets. This in turn may require adding one or more
additional methods to ConjunctiveGeneralization.
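As a rough sketch, one active-learning trial inside
ActiveConjunctiveVersionSpace might look like the following (Instance and
Instances are the weka.core types); converged(), enumerateVersionSpace(), and
updateVersionSpace() are hypothetical names for code you will write, and
score() is the split score sketched above:

    int runActiveTrial(Instances data) throws Exception {
        Instances pool = new Instances(data);      // the full instance space
        int used = 0;
        while (!converged()) {
            int best = -1;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int i = 0; i < pool.numInstances(); i++) {
                // Selection looks only at attribute values, never the class label.
                double s = score(pool.instance(i), enumerateVersionSpace());
                if (s < bestScore) { bestScore = s; best = i; }
            }
            Instance chosen = pool.instance(best); // only now is the label revealed
            updateVersionSpace(chosen);            // candidate-elimination update
            pool.delete(best);                     // never select the same example twice
            used++;
        }
        return used;                               // examples needed to converge
    }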
The main method for ActiveConjunctiveVersionSpace should run repeated random
experiments of active learning in which ties in the example-selection
criterion are broken randomly (this can easily be achieved by randomly
shuffling the complete set of examples before each trial). It should report
the average number of training examples required before convergence. The first
command-line argument should be the ARFF input file name; the second should be
the number of random trials to average over. See the trace file of my sample
solution, active-trace.txt. Your system should produce similar output when
given the flag "-P" for a detailed trace, including all the hypotheses in the
enumerated version space (together with its size) and the score for the best
instance, which is the absolute difference between the fraction of hypotheses
that instance splits off and the optimal split of 1/2 (so the best possible
score is 0 and the worst possible is 0.5).
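For example, if 3 of 8 enumerated hypotheses classify a candidate instance as
positive, its score is |3/8 - 1/2| = 0.125; an instance on which the
hypotheses split 7 to 1 scores |7/8 - 1/2| = 0.375 and would be a worse
choice.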
Since checking the whole version space may be intractable for larger problems,
also allow the option of using just the union of the S and G sets to
approximately represent the range of consistent hypotheses. Also allow the
option of randomly selecting training examples for comparison as a baseline.
The command line for ActiveConjunctiveVersionSpace should take as its first
two arguments the input data file and the number of random trials to run,
followed by an option "-S" that takes the value "VS" for using the entire
version space, "SG" for using just the S and G sets, or "R" for random
selection, and an optional "-P" flag for producing a detailed trace, as shown
in active-trace.txt.
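For instance, assuming the class lives in the vspace package described above,
a detailed two-trial run with full version-space selection would look
something like:

    java weka.classifiers.vspace.ActiveConjunctiveVersionSpace figure-all-red.arff 2 -S VS -P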
Compare active selection of examples to random selection in terms of the
average number of training examples needed to converge to a single hypothesis
for a given concept. Average results over a thousand random trials. Test the
systems using the simple figure data in figure-all-red.arff and
figure-all-red-circle.arff. See active-trace.txt for a small sample run with a
detailed trace of just 2 trials. Also compare selecting examples based on just
the S and G sets to selecting them based on enumerating and testing the entire
version space.
Include a summary of your results and a written discussion of the relative
performance of random selection, S-G selection, and complete version-space
selection with respect to the number of training examples and the
computational time needed to converge on a correct definition. Discuss and
explain any differences in the number of examples needed to learn the two
concepts "circle" and "red circle". Follow the instructions for submitting
homework solutions. Include in your electronically submitted directory a
detailed (-P) trace called hw1-solution-trace showing two random trials for
each of the three selection methods learning both of the test concepts.