Like all of the programming assignments in the course, this one will use the
Weka package of machine learning software in Java. See the information
available at http://www.cs.waikato.ac.nz/~ml/weka/. The local version of Weka
used in the class is available at /u/mooney/cs391L-code/weka/; see the README
file in this directory. A tutorial for Weka is available at
/u/mooney/cs391L-code/weka/Tutorial.pdf, and the Javadoc web-based
documentation is available at /u/mooney/cs391L-code/weka/doc/packages.html.
Since the current Weka does not include a version-space algorithm, I have
written one that is available in
/u/mooney/cs391L-code/weka/weka/classifiers/vspace/. The class
ConjunctiveVersionSpace is an implementation for learning conjunctive, nominal
feature descriptions. The code is commented and follows the basic conventions
of a Weka classifier. Some simple datasets for testing this code are in the
"figure" data files in /u/mooney/cs391L-code/weka/data/. The trace at
http://www.cs.utexas.edu/users/mooney/cs391L/hw1/trace.txt shows running this
system on the "figure" data files. Note that the flag "-P" causes
ConjunctiveVersionSpace to produce an incremental trace of processing each
example (see the Weka documentation for an explanation of the other options).
See the comments in VersionSpace for an explanation of two other important
options: -R, which reorders instances for training, and -C, which determines
how test instances are classified when the version space has not converged.
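For example, an invocation along the following lines should reproduce the kind
of incremental trace shown in trace.txt; -t is Weka's standard flag naming the
training file, but the exact file and options used for the trace may differ:

    java weka.classifiers.vspace.ConjunctiveVersionSpace -t /u/mooney/cs391L-code/weka/data/figure-all-red.arff -P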
Due to the restriction to conjunctive binary concepts and its computational
demands, applying the system to large, realistic data sets is difficult. By
simplifying a version of the classic soybean disease diagnosis data, I have
created a dataset, soybean-reduced, available in
/u/mooney/cs391L-code/weka/data/, that is suitable for
ConjunctiveVersionSpace. This data contains 68 instances described with 32
features (symptoms) in 4 classes (diseases). Since VersionSpace only handles
binary concept learning, the Weka meta-learner MultiClassClassifier (in the
weka.classifiers.meta package) must be used to learn one binary concept per
category. MultiClassClassifier requires a DistributionClassifier as a base
classifier; therefore, VersionSpace is a DistributionClassifier that returns a
distribution (a probability for each class). The trace.txt file contains a
trace of running on this data set. The "-R PN" option is used to have
VersionSpace reorder the positive examples first, which greatly improves
training time for reasons that are easily explained.
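As a sketch (not necessarily the exact command used to produce trace.txt),
wrapping the learner this way follows Weka's usual conventions, where -W names
the base classifier and options after "--" are passed through to it:

    java weka.classifiers.meta.MultiClassClassifier -t /u/mooney/cs391L-code/weka/data/soybean-reduced.arff \
        -W weka.classifiers.vspace.ConjunctiveVersionSpace -- -R PN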
This assignment concerns active selection of training examples. Using only unlabeled instances, the best instances to have labeled and used as training examples are those that come closest to "splitting" the version space into two equal halves. This is the same as picking the instances that cause the most "disagreement" among the hypotheses in the version space regarding their categorization, or equivalently those whose categorization is least "certain" given the current version space. Given a set of unlabeled instances, the best one to use as the next training example can be selected based on such criteria. This is called sample selection.
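To make the criterion concrete, here is a minimal Java sketch of the scoring
idea; the Hypothesis type and its covers() test are hypothetical stand-ins for
code you will write, not part of the provided classes:

    import java.util.List;
    import weka.core.Instance;

    public class SelectionScore {

        // Score = |fraction of hypotheses covering x - 1/2|: 0 means the
        // instance splits the version space perfectly in half, 0.5 means
        // every hypothesis agrees on it, so its label teaches us nothing.
        static double score(Instance x, List<Hypothesis> versionSpace) {
            int positive = 0;
            for (Hypothesis h : versionSpace) {
                if (h.covers(x)) positive++;   // h would label x positive
            }
            double fraction = (double) positive / versionSpace.size();
            return Math.abs(fraction - 0.5);
        }

        interface Hypothesis {                 // hypothetical interface
            boolean covers(Instance x);
        }
    }

The unlabeled instance with the lowest score is the one the current hypotheses
disagree about most.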
Write sample-selection code for the basic version-space system. You should
create a specialized class ActiveConjunctiveVersionSpace that extends
ConjunctiveVersionSpace. Given a data set that contains a complete set of
instances (all the examples in the instance space) for a specific concept
(such as figure-all-red.arff and figure-all-red-circle.arff in
/u/mooney/cs391L-code/weka/data/), your system should incrementally pick the
best training example and use it to update the version space. Obviously, it
cannot use the class label of an example when making this selection, since the
label must remain unavailable to the learner until it has chosen a specific
example and then obtains its label from the teacher. It should repeatedly
select training examples until it converges to a single hypothesis, measuring
the number of training examples needed until the concept is uniquely and
correctly identified. Note that this requires altering the current training
process as implemented in the methods buildClassifier and train. It will also
require writing a method that enumerates all the hypotheses in the version
space given the S and G sets. This in turn may require adding one or more
additional methods to ConjunctiveGeneralization.
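As a rough sketch, one active-learning trial inside
ActiveConjunctiveVersionSpace might look like the following (Instance and
Instances are the weka.core types); converged(), enumerateVersionSpace(), and
updateVersionSpace() are hypothetical names for code you will write, and
score() is the split score sketched above:

    int runActiveTrial(Instances data) throws Exception {
        Instances pool = new Instances(data);      // the full instance space
        int used = 0;
        while (!converged()) {
            int best = -1;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int i = 0; i < pool.numInstances(); i++) {
                // Selection looks only at attribute values, never the class label.
                double s = score(pool.instance(i), enumerateVersionSpace());
                if (s < bestScore) { bestScore = s; best = i; }
            }
            Instance chosen = pool.instance(best); // only now is the label revealed
            updateVersionSpace(chosen);            // candidate-elimination update
            pool.delete(best);                     // never select the same example twice
            used++;
        }
        return used;                               // examples needed to converge
    }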
The main method for ActiveConjunctiveVersionSpace should run repeated random
experiments of active learning in which ties in the example-selection
criterion are broken randomly (this can easily be achieved by randomly
shuffling the complete set of examples before each trial). It should report
the average number of training examples required before convergence. The first
command-line argument should be the ARFF input file name; the second should be
the number of random trials to average over. See the trace file of my sample
solution, active-trace.txt. Your system should produce similar output when
given the flag "-P" for a detailed trace, including all the hypotheses in the
enumerated version space (together with its size) and the score for the best
instance, which is the absolute difference between the fraction of hypotheses
that instance splits off and the optimal split of 1/2 (so the best possible
score is 0 and the worst possible is 0.5).
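For example, if 3 of 8 enumerated hypotheses classify a candidate instance as
positive, its score is |3/8 - 1/2| = 0.125; an instance on which the
hypotheses split 7 to 1 scores |7/8 - 1/2| = 0.375 and would be a worse
choice.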
Since checking the whole version space may be intractable for larger problems,
also allow the option of using just the union of the S and G sets to
approximately represent the range of consistent hypotheses. Also allow the
option of randomly selecting training examples for comparison as a baseline.
The command line for ActiveConjunctiveVersionSpace should take as its first
two arguments the input data file and the number of random trials to run,
followed by an option "-S" that takes the value "VS" for using the entire
version space, "SG" for using just the S and G sets, or "R" for random
selection, and an optional "-P" flag for producing a detailed trace, as shown
in active-trace.txt.
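For instance, assuming the class lives in the vspace package described above,
a detailed two-trial run with full version-space selection would look
something like:

    java weka.classifiers.vspace.ActiveConjunctiveVersionSpace figure-all-red.arff 2 -S VS -P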
Compare active selection of examples to random selection in terms of the
average number of training examples needed to converge to a single hypothesis
for a given concept. Average results over a thousand random trials. Test the
systems using the simple figure data in figure-all-red.arff and
figure-all-red-circle.arff. See active-trace.txt for a small sample run with a
detailed trace of just 2 trials. Also compare selecting examples based on just
the S and G sets to selecting them based on enumerating and testing the entire
version space.
Include a summary of your results and a written discussion of the relative
performance of random selection, S-G selection, and complete version-space
selection with respect to the number of training examples and the
computational time needed to converge on a correct definition. Discuss and
explain any differences in the number of examples needed to learn the two
concepts "circle" and "red circle". Follow the instructions for submitting
homework solutions. Include in your electronically submitted directory a
detailed (-P) trace called hw1-solution-trace showing two random trials for
each of the three selection methods learning both of the test concepts.