weka.classifiers.bayes
Class SemiSupEM

java.lang.Object
  extended by weka.classifiers.Classifier
      extended by weka.classifiers.DistributionClassifier
          extended by weka.classifiers.bayes.SemiSupEM
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, SemiSupClassifier, java.io.Serializable

public class SemiSupEM
extends DistributionClassifier
implements SemiSupClassifier, OptionHandler

Semi-supervised learner that uses EM: the model is initialized from the labeled data, and EM iterations are then run on the unlabeled data to improve it. See: Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3), pp. 103-134, 2000. The base classifier must be a SoftClassifier, that is, one that accepts training data carrying a soft class distribution (SoftClassifiedInstances) rather than a hard class assignment. Sample soft classifiers are NaiveBayesSimpleSoft and NaiveBayesSimpleSparseSoft.

See Also:
Serialized Form

Field Summary
protected  SoftClassifiedInstances m_AllInstances
          Complete set of labeled and unlabeled instances for EM
protected  SoftClassifier m_Classifier
          Base classifier that supports soft classified instances
protected  Instances m_LabeledInstances
          Hard Labeled data
protected  double m_Lambda
          Weight of unlabeled examples during EM training versus labeled examples (see Nigam et al.)
protected  int m_max_iterations
          Maximum number of iterations to perform
protected  double[] m_MaxArray
          The maximum values for numeric attributes.
protected  double[] m_MinArray
          The minimum values for numeric attributes.
protected static double m_minLogLikelihoodIncr
           
protected  java.util.Random m_Random
          random numbers and seed
protected  int m_rseed
           
protected  boolean m_seedUnseenClasses
          Create a soft-labeled seed for unseen classes
protected  Instances m_UnlabeledData
          Original set of unlabeled Instances
protected  SoftClassifiedInstances m_UnlabeledInstances
          Soft labeled version of unlabeled data
protected  boolean m_verbose
          Verbose?
 
Constructor Summary
SemiSupEM()
          Simple constructor, must set options using command line or GUI
 
Method Summary
 void buildClassifier(Instances data)
          Generates the classifier.
protected  java.lang.String classDistributionString(SoftClassifiedInstance inst)
           
 java.lang.String classifierTipText()
           
protected  double distance(Instance first, Instance second)
          Calculates the distance between two instances
 double[] distributionForInstance(Instance instance)
          Calculates the class membership probabilities for the given test instance.
protected  double eStep()
           
protected  Instance farthestInstance(Instances candidateInsts, Instances insts)
          Return the instance in candidateInsts that is farthest from any instance in insts
 SoftClassifier getClassifier()
          Get the base classifier
 boolean getDebug()
          Get debug mode
 double getLambda()
           
 int getMaxIterations()
          Get the maximum number of iterations
 java.lang.String[] getOptions()
          Gets the current settings of EM.
 int getSeed()
          Get the random number seed
 boolean getSeedUnseenClasses()
           
 java.lang.String globalInfo()
          Returns a string describing this classifier
protected  void initModel()
          Initialize model using appropriate set of data
protected  void iterate()
          Run EM iterations until likelihood stops increasing significantly or max iterations exhausted
 java.lang.String lambdaTipText()
           
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
 double logSum(double[] logProbs)
          Sums probabilities given as logarithms, using a special method for summing in log space
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxIterationsTipText()
          Returns the tip text for this property
protected  double minimumDistance(Instance inst, Instances insts)
          Return the distance from inst to the closest instance in insts
protected  void mStep()
           
protected  double norm(double x, int i)
          Normalizes a given value of a numeric attribute.
protected  void resetOptions()
          Reset to default options
 java.lang.String seedTipText()
          Returns the tip text for this property
 java.lang.String seedUnseenClassesTipText()
           
 void setClassifier(SoftClassifier newClassifier)
          Set the base classifier.
 void setDebug(boolean v)
          Set debug mode - verbose output
 void setLambda(double v)
           
 void setMaxIterations(int i)
          Set the maximum number of iterations to perform
protected  void setMinMax(Instances insts)
          Compute and store the minimum and maximum values for each numeric attribute
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setSeed(int s)
          Set the random number seed
 void setSeedUnseenClasses(boolean v)
           
 void setUnlabeled(Instances unlabeled)
          Provide unlabeled data to the classifier.
protected  void softLabelClasses(SoftClassifiedInstance inst, java.util.List classes)
          Soft label inst as being equally likely to be in any of the given classes
protected  java.util.ArrayList unseenClasses(Instances insts)
          Return a list of class values for which there are no instances in insts
protected  void updateMinMax(Instance instance)
          Updates the minimum and maximum values for all the attributes based on a new instance.
protected  void weightInstances(Instances insts, double weight)
          Weight all given instances with the given weight
 
Methods inherited from class weka.classifiers.DistributionClassifier
calculateEntropy, calculateLabeledInstanceMargin, calculateMargin, classifyInstance
 
Methods inherited from class weka.classifiers.Classifier
forName, makeCopies
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_UnlabeledData

protected Instances m_UnlabeledData
Original set of unlabeled Instances


m_UnlabeledInstances

protected SoftClassifiedInstances m_UnlabeledInstances
Soft labeled version of unlabeled data


m_LabeledInstances

protected Instances m_LabeledInstances
Hard Labeled data


m_AllInstances

protected SoftClassifiedInstances m_AllInstances
Complete set of labeled and unlabeled instances for EM


m_Classifier

protected SoftClassifier m_Classifier
Base classifier that supports soft classified instances


m_Lambda

protected double m_Lambda
Weight of unlabeled examples during EM training versus labeled examples (see Nigam et al.)


m_Random

protected java.util.Random m_Random
random numbers and seed


m_rseed

protected int m_rseed

m_max_iterations

protected int m_max_iterations
Maximum number of iterations to perform


m_seedUnseenClasses

protected boolean m_seedUnseenClasses
Create a soft-labeled seed for unseen classes


m_verbose

protected boolean m_verbose
Verbose?


m_minLogLikelihoodIncr

protected static double m_minLogLikelihoodIncr

m_MinArray

protected double[] m_MinArray
The minimum values for numeric attributes.


m_MaxArray

protected double[] m_MaxArray
The maximum values for numeric attributes.

Constructor Detail

SemiSupEM

public SemiSupEM()
Simple constructor, must set options using command line or GUI

Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this classifier

Returns:
a description of the classifier suitable for displaying in the explorer/experimenter GUI

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Valid options are:

-V
Verbose.

-I
Terminate after this many iterations if EM has not converged.

-S
Specify random number seed.

-M
Set the minimum allowable standard deviation for normal density calculation.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options.

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

resetOptions

protected void resetOptions()
Reset to default options


seedTipText

public java.lang.String seedTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setSeed

public void setSeed(int s)
Set the random number seed

Parameters:
s - the seed

getSeed

public int getSeed()
Get the random number seed

Returns:
the seed

maxIterationsTipText

public java.lang.String maxIterationsTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Set the maximum number of iterations to perform

Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Get the maximum number of iterations

Returns:
the number of iterations

setDebug

public void setDebug(boolean v)
Set debug mode - verbose output

Parameters:
v - true for verbose output

getDebug

public boolean getDebug()
Get debug mode

Returns:
true if debug mode is set

setSeedUnseenClasses

public void setSeedUnseenClasses(boolean v)

getSeedUnseenClasses

public boolean getSeedUnseenClasses()

seedUnseenClassesTipText

public java.lang.String seedUnseenClassesTipText()

setLambda

public void setLambda(double v)

getLambda

public double getLambda()

lambdaTipText

public java.lang.String lambdaTipText()

setClassifier

public void setClassifier(SoftClassifier newClassifier)
Set the base classifier.

Parameters:
newClassifier - the Classifier to use.

getClassifier

public SoftClassifier getClassifier()
Get the base classifier

Returns:
the base classifier

classifierTipText

public java.lang.String classifierTipText()

getOptions

public java.lang.String[] getOptions()
Gets the current settings of EM.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

setUnlabeled

public void setUnlabeled(Instances unlabeled)
Provide unlabeled data to the classifier.

Specified by:
setUnlabeled in interface SemiSupClassifier

buildClassifier

public void buildClassifier(Instances data)
                     throws java.lang.Exception
Generates the classifier.

Specified by:
buildClassifier in class Classifier
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the classifier has not been generated successfully

weightInstances

protected void weightInstances(Instances insts,
                               double weight)
Weight all given instances with the given weight
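
To illustrate why per-instance weights matter here, the following is a minimal sketch of how a weight such as m_Lambda could scale the contribution of unlabeled instances to a parameter estimate, in the spirit of the weighting discussed by Nigam et al. The class LambdaWeightSketch, its method names, the use of plain arrays in place of Instances, and the Laplace-smoothed prior are all assumptions made for illustration; they are not taken from the SemiSupEM source.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class LambdaWeightSketch {

    // Give every instance the same weight, as weightInstances is described as doing.
    static double[] uniformWeights(int numInstances, double weight) {
        double[] weights = new double[numInstances];
        Arrays.fill(weights, weight);
        return weights;
    }

    // Weighted, Laplace-smoothed class prior (assumed form):
    //   P(c) = (1 + sum_i w_i * P(c | x_i)) / (numClasses + sum_i w_i)
    // classDists must be non-empty and each row must sum to 1.
    static double[] classPriors(double[][] classDists, double[] weights) {
        int numClasses = classDists[0].length;
        double[] priors = new double[numClasses];
        double totalWeight = 0.0;
        for (int i = 0; i < classDists.length; i++) {
            for (int c = 0; c < numClasses; c++) {
                priors[c] += weights[i] * classDists[i][c];
            }
            totalWeight += weights[i];
        }
        for (int c = 0; c < numClasses; c++) {
            priors[c] = (1.0 + priors[c]) / (numClasses + totalWeight);
        }
        return priors;
    }
}
```

In this sketch, labeled instances would carry weight 1 and unlabeled instances a weight below 1, so that a large unlabeled set does not dominate the estimate.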


initModel

protected void initModel()
                  throws java.lang.Exception
Initialize model using appropriate set of data

Throws:
java.lang.Exception

unseenClasses

protected java.util.ArrayList unseenClasses(Instances insts)
Return a list of class values for which there are no instances in insts
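
The idea can be sketched as follows, assuming class values are represented as integer indices in the range 0 to numClasses - 1. The class UnseenClassesSketch and the array-based signature are hypothetical; the actual method operates on Instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class UnseenClassesSketch {

    // labels[i] is the class index of instance i (assumed representation).
    // Returns, in ascending order, every class index that no instance carries.
    static List<Integer> unseenClasses(int[] labels, int numClasses) {
        boolean[] seen = new boolean[numClasses];
        for (int label : labels) {
            seen[label] = true;
        }
        List<Integer> unseen = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            if (!seen[c]) {
                unseen.add(c);
            }
        }
        return unseen;
    }
}
```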


farthestInstance

protected Instance farthestInstance(Instances candidateInsts,
                                    Instances insts)
Return the instance in candidateInsts that is farthest from any instance in insts


minimumDistance

protected double minimumDistance(Instance inst,
                                 Instances insts)
Return the distance from inst to the closest instance in insts
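
One reading of farthestInstance and minimumDistance together is a max-min selection: pick the candidate whose nearest neighbour in insts is as far away as possible. The sketch below follows that reading. It assumes instances are plain double arrays and uses unnormalized Euclidean distance; the class FarthestInstanceSketch and these choices are illustrative assumptions, not the SemiSupEM implementation.

```java
// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class FarthestInstanceSketch {

    // Plain Euclidean distance between two equal-length vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Distance from inst to its closest member of insts
    // (positive infinity if insts is empty).
    static double minimumDistance(double[] inst, double[][] insts) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] other : insts) {
            min = Math.min(min, euclidean(inst, other));
        }
        return min;
    }

    // Candidate whose minimum distance to insts is largest (null if no candidates).
    static double[] farthestInstance(double[][] candidates, double[][] insts) {
        double[] best = null;
        double bestDistance = Double.NEGATIVE_INFINITY;
        for (double[] candidate : candidates) {
            double d = minimumDistance(candidate, insts);
            if (d > bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```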


softLabelClasses

protected void softLabelClasses(SoftClassifiedInstance inst,
                                java.util.List classes)
                         throws java.lang.Exception
Soft label inst as being equally likely to be in any of the given classes

Throws:
java.lang.Exception
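
A soft label of this kind amounts to a uniform distribution over the listed classes and zero elsewhere. The sketch below shows that construction, assuming classes are distinct integer indices and the soft label is a double array of length numClasses; SoftLabelSketch and uniformOver are hypothetical names, and the real method writes into a SoftClassifiedInstance instead.

```java
import java.util.List;

// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class SoftLabelSketch {

    // Returns a class distribution with probability 1/k on each of the
    // k listed (distinct) class indices and 0 on all other classes.
    static double[] uniformOver(List<Integer> classes, int numClasses) {
        if (classes.isEmpty()) {
            throw new IllegalArgumentException("classes must not be empty");
        }
        double[] dist = new double[numClasses];
        double p = 1.0 / classes.size();
        for (int c : classes) {
            dist[c] = p;
        }
        return dist;
    }
}
```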

iterate

protected void iterate()
                throws java.lang.Exception
Run EM iterations until likelihood stops increasing significantly or max iterations exhausted

Throws:
java.lang.Exception
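
The stopping rule described above can be sketched as a loop that alternates E and M steps. The sketch assumes that the E-step returns the current log-likelihood (eStep returns a double, but its meaning is not documented here) and that a threshold such as m_minLogLikelihoodIncr is the smallest increase that still counts as progress. EmLoopSketch and its parameters are hypothetical stand-ins for the protected methods of this class.

```java
import java.util.function.DoubleSupplier;

// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class EmLoopSketch {

    // Runs E/M steps until the log-likelihood gain drops below
    // minLogLikelihoodIncr or maxIterations E-steps have been performed.
    // Returns the number of E-steps performed.
    static int iterate(DoubleSupplier eStep, Runnable mStep,
                       int maxIterations, double minLogLikelihoodIncr) {
        double previous = Double.NEGATIVE_INFINITY;
        int iterations = 0;
        while (iterations < maxIterations) {
            // Assumption: the E-step re-estimates soft labels and
            // returns the log-likelihood under the current model.
            double logLikelihood = eStep.getAsDouble();
            iterations++;
            if (logLikelihood - previous < minLogLikelihoodIncr) {
                break; // no significant improvement: treat as converged
            }
            mStep.run(); // rebuild the model from the new soft labels
            previous = logLikelihood;
        }
        return iterations;
    }
}
```

The first pass always proceeds to the M-step in this sketch, because the gain over the initial value of negative infinity is unbounded.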

eStep

protected double eStep()
                throws java.lang.Exception
Throws:
java.lang.Exception

logSum

public double logSum(double[] logProbs)
Sums probabilities given as logarithms, using a special method for summing in log space
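
The documentation does not say which method is used. A common technique for this task is the log-sum-exp trick, sketched below under the assumption that the result is the logarithm of the summed probabilities; LogSumSketch is a hypothetical class and may differ from the actual implementation.

```java
// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class LogSumSketch {

    // Computes log(sum_i exp(logProbs[i])) without underflow by
    // factoring out the largest term before exponentiating.
    static double logSum(double[] logProbs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : logProbs) {
            max = Math.max(max, lp);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            // Empty input or all probabilities zero: the sum is zero.
            return Double.NEGATIVE_INFINITY;
        }
        double sum = 0.0;
        for (double lp : logProbs) {
            sum += Math.exp(lp - max); // every term is at most 1
        }
        return max + Math.log(sum);
    }
}
```

Summing exp(-1000) twice directly would underflow to zero, whereas this form returns a finite value close to -1000 + log 2.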


classDistributionString

protected java.lang.String classDistributionString(SoftClassifiedInstance inst)

mStep

protected void mStep()
              throws java.lang.Exception
Throws:
java.lang.Exception

distributionForInstance

public double[] distributionForInstance(Instance instance)
                                 throws java.lang.Exception
Calculates the class membership probabilities for the given test instance.

Specified by:
distributionForInstance in class DistributionClassifier
Parameters:
instance - the instance to be classified
Returns:
predicted class probability distribution
Throws:
java.lang.Exception - if distribution can't be computed

distance

protected double distance(Instance first,
                          Instance second)
Calculates the distance between two instances

Parameters:
first - the first instance
second - the second instance
Returns:
the distance between the two given instances
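
The distance measure itself is not specified here. Given the stored per-attribute minimum and maximum values, one plausible form is Euclidean distance over range-normalized numeric attributes, sketched below. The class DistanceSketch, the array-based signature, and the restriction to numeric attributes are assumptions for illustration.

```java
// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class DistanceSketch {

    // Euclidean distance after scaling each attribute to [0, 1] using
    // the given per-attribute ranges. Attributes with a zero range
    // contribute nothing to the distance.
    static double distance(double[] first, double[] second,
                           double[] min, double[] max) {
        double sum = 0.0;
        for (int i = 0; i < first.length; i++) {
            double range = max[i] - min[i];
            if (range == 0.0) {
                continue; // constant attribute: no information
            }
            double diff = (first[i] - min[i]) / range
                        - (second[i] - min[i]) / range;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```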

norm

protected double norm(double x,
                      int i)
Normalizes a given value of a numeric attribute.

Parameters:
x - the value to be normalized
i - the attribute's index

setMinMax

protected void setMinMax(Instances insts)
Compute and store the minimum and maximum values for each numeric attribute


updateMinMax

protected void updateMinMax(Instance instance)
Updates the minimum and maximum values for all the attributes based on a new instance.

Parameters:
instance - the new instance
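
How setMinMax, updateMinMax and norm might fit together can be sketched as follows: track a running minimum and maximum per attribute, then scale values into the range [0, 1]. The class MinMaxSketch, the use of NaN to mark "no value seen yet" and missing values, and the choice to return 0 for a zero range are assumptions for illustration rather than documented behaviour.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of Weka or SemiSupEM.
final class MinMaxSketch {
    private final double[] min;
    private final double[] max;

    MinMaxSketch(int numAttributes) {
        min = new double[numAttributes];
        max = new double[numAttributes];
        Arrays.fill(min, Double.NaN); // NaN marks "no value seen yet"
        Arrays.fill(max, Double.NaN);
    }

    // Widen the stored ranges to cover one new instance.
    void updateMinMax(double[] instance) {
        for (int i = 0; i < instance.length; i++) {
            double v = instance[i];
            if (Double.isNaN(v)) {
                continue; // treat NaN as a missing value
            }
            if (Double.isNaN(min[i])) {
                min[i] = v;
                max[i] = v;
            } else {
                min[i] = Math.min(min[i], v);
                max[i] = Math.max(max[i], v);
            }
        }
    }

    // Recompute the ranges from scratch over a whole data set.
    void setMinMax(double[][] instances) {
        Arrays.fill(min, Double.NaN);
        Arrays.fill(max, Double.NaN);
        for (double[] instance : instances) {
            updateMinMax(instance);
        }
    }

    // Scale x for attribute i into [0, 1]; 0 if the range is unknown or empty.
    double norm(double x, int i) {
        if (Double.isNaN(min[i]) || min[i] == max[i]) {
            return 0.0;
        }
        return (x - min[i]) / (max[i] - min[i]);
    }
}
```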

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - the options