Specifies the name of the distance metric class that should be used
- See Also:
Clusterer, OptionHandler, Serialized Form
Method Summary |
int |
assignClusterToInstance(Instance instance)
Classifies the instance using the current clustering |
int[] |
bestInstancesForActiveLearning(int numActive)
Returns the indices of the best numActive instances for active learning |
InstancePair[] |
bestPairsForActiveLearning(int numActive)
Returns the list of best pairs for active learning |
void |
buildClusterer(Instances data)
Generates a clusterer. |
void |
buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
Instances totalTrainWithLabels,
int startingIndexOfTest)
Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds |
void |
buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
int numClusters)
Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds |
void |
buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
int numClusters,
int startingIndexOfTest)
Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds |
void |
buildClusterer(Instances data,
int num_clusters)
Cluster given instances to form the specified number of clusters. |
protected void |
calculateObjectiveFunction()
calculates objective function |
int |
clusterInstance(Instance instance)
Checks if instance has to be normalized and classifies the
instance using the current clustering |
static java.lang.String |
concatStringArray(java.lang.String[] strings)
A little helper to create a single String from an array of Strings |
protected void |
findBestAssignments()
E-step of the KMeans clustering algorithm -- find best cluster assignments |
SelectedTag |
getAlgorithm()
Get the KMeans algorithm type. |
java.util.ArrayList |
getClusters()
Computes the final clusters from the cluster assignments, for external access |
double |
getConcentration()
Return the concentration |
double |
getDefaultPerturb()
Get default perturbation value |
double |
getExtraPhase1RunFraction()
Return the fraction of extra phase1 runs |
java.util.ArrayList |
getIndexClusters()
Computes the clusters from the cluster assignments, for external access |
Instances |
getInstances()
Return training instances |
Metric |
getMetric()
Get the distance metric |
protected java.lang.String |
getMetricSpec()
Gets the metric specification string, which contains the class name of
the metric and any options to the metric |
int |
getNumClusters()
Return the number of clusters |
double |
getObjFunConvergenceDifference()
Get the minimum value of the objective function difference required for convergence |
java.lang.String[] |
getOptions()
Gets the current option settings for the OptionHandler. |
int |
getRandomSeed()
Return the random number seed |
boolean |
getSeedable()
Returns whether seeding is turned on |
SelectedTag |
getSeedingMethod()
Get the seeding method used. |
Clusterer |
getThisClusterer()
We always want to implement SemiSupClusterer from a class extending Clusterer. |
boolean |
getVerbose()
get the verbosity level of the clusterer |
protected void |
initializeClusterer()
Initializes the cluster centroids - initial M step |
java.util.Enumeration |
listOptions()
Returns an enumeration of all the available options. |
static void |
main(java.lang.String[] args)
Main method for testing this class. |
protected double[] |
meanOrMode(Instances insts)
Fast version of meanOrMode - streamlined from Instances.meanOrMode for efficiency
Does not check for missing attributes, assumes numeric attributes, assumes Sparse instances |
java.lang.String |
metricName()
Get the distance metric name |
void |
normalize(Instance inst)
Normalizes Instance or SparseInstance |
protected void |
normalizeByWeight(Instance inst)
This function divides every attribute value in an instance by
the instance weight -- useful to find the mean of a cluster in
Euclidean space |
void |
normalizeInstance(Instance inst)
Normalizes the values of a normal Instance |
void |
normalizeSparseInstance(Instance inst)
Normalizes the values of a SparseInstance |
int |
numberOfClusters()
A duplicate function to conform to Clusterer abstract class. |
double |
objectiveFunction()
returns objective function |
int[] |
oldBestInstancesForActiveLearning(int numActive)
|
void |
printClusters()
Prints clusters |
void |
printIndexClusters()
Outputs the current clustering |
void |
resetClusterer()
Reset all values that have been learned |
boolean |
seedable()
We can have clusterers that don't utilize seeding |
void |
seedClusterer(java.util.HashMap seedHash)
Read the seeds from a hashtable, where every key is an instance and every value is
the cluster assignment of that instance |
void |
setAlgorithm(SelectedTag algo)
Set the KMeans algorithm. |
void |
setConcentration(double w)
Set the concentration |
void |
setDefaultPerturb(double p)
Set default perturbation value |
void |
setExtraPhase1RunFraction(double w)
Set the fraction of extra phase1 runs |
void |
setInstances(Instances instances)
Sets training instances |
void |
setMetric(Metric m)
Set the distance metric |
void |
setMetricName(java.lang.String metricName)
Set the name of the distance metric |
void |
setNumClusters(int n)
Set the number of clusters to generate |
void |
setObjFunConvergenceDifference(double objFunConvergenceDifference)
Set the minimum value of the objective function difference required for convergence |
void |
setOptions(java.lang.String[] options)
Parses a given list of options. |
void |
setRandomSeed(int s)
Set the random number seed |
void |
setSeedable(boolean seedable)
Turn seeding on and off |
void |
setSeedHash(java.util.HashMap seedhash)
Set the m_SeedHash |
void |
setSeedingMethod(SelectedTag seedingMethod)
Set the seeding method. |
void |
setVerbose(boolean verbose)
set the verbosity level of the clusterer |
protected Instance |
sumInstances(Instance inst1,
Instance inst2)
Finds sum of instances (handles sparse and non-sparse) |
java.lang.String |
toString()
return a string describing this clusterer |
void |
trainClusterer(Instances instances)
Train the clusterer using specified parameters |
protected void |
updateClusterCentroids()
M-step of the KMeans clustering algorithm -- updates cluster centroids |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
m_FinalClusters
protected java.util.ArrayList m_FinalClusters
- holds the clusters
m_IndexClusters
protected java.util.ArrayList m_IndexClusters
- holds the instance indices in the clusters
m_SeedHash
protected java.util.HashMap m_SeedHash
- holds the ([seed instance] -> [clusterLabel of seed instance]) mapping
m_metric
protected Metric m_metric
- distance Metric
m_metricBuilt
protected boolean m_metricBuilt
- has the metric been constructed? A fix for multiple buildClusterer calls
m_StartingIndexOfTest
protected int m_StartingIndexOfTest
- starting index of test data in unlabeledData if transductive clustering
isSparseInstance
protected boolean isSparseInstance
- indicates whether instances are sparse
m_objFunDecreasing
protected boolean m_objFunDecreasing
- Is the objective function decreasing (true) or increasing (false)? Depends on the type
of metric used: increasing for a similarity-based metric, decreasing for a distance-based one
m_metricName
protected java.lang.String m_metricName
- Name of metric
m_skipHash
protected java.util.HashSet m_skipHash
- Points that are to be skipped in the clustering process
because they are collapsed to zero
m_currIdx
protected int m_currIdx
- Index of the current element in the E-step
m_Iterations
protected int m_Iterations
- keep track of the number of iterations completed before convergence
SEEDING_CONSTRAINED
public static final int SEEDING_CONSTRAINED
- See Also:
- Constant Field Values
SEEDING_SEEDED
public static final int SEEDING_SEEDED
- See Also:
- Constant Field Values
TAGS_SEEDING
public static final Tag[] TAGS_SEEDING
m_SeedingMethod
protected int m_SeedingMethod
- seeding method, by default seeded
ALGORITHM_SIMPLE
public static final int ALGORITHM_SIMPLE
- Define possible algorithms
- See Also:
- Constant Field Values
ALGORITHM_SPHERICAL
public static final int ALGORITHM_SPHERICAL
- See Also:
- Constant Field Values
TAGS_ALGORITHM
public static final Tag[] TAGS_ALGORITHM
m_Algorithm
protected int m_Algorithm
- algorithm, by default spherical
m_ObjFunConvergenceDifference
protected double m_ObjFunConvergenceDifference
- min difference of objective function values for convergence
m_Objective
protected double m_Objective
- value of objective function
m_Verbose
protected boolean m_Verbose
- Verbose?
m_TotalTrainWithLabels
protected Instances m_TotalTrainWithLabels
- training instances with labels
m_Instances
protected Instances m_Instances
- training instances
m_NumClusters
protected int m_NumClusters
- number of clusters to generate, default is 3
m_FastMode
protected boolean m_FastMode
- m_FastMode = true => fast computation of meanOrMode in centroid calculation, useful for high-D data sets
m_FastMode = false => usual computation of meanOrMode in centroid calculation
m_ClusterCentroids
protected Instances m_ClusterCentroids
- holds the cluster centroids
m_GlobalCentroid
protected Instance m_GlobalCentroid
- holds the global centroid
m_DefaultPerturb
protected double m_DefaultPerturb
- holds the default perturbation value for randomPerturbInit
m_Concentration
protected double m_Concentration
- weight of the concentration
m_ExtraPhase1RunFraction
protected double m_ExtraPhase1RunFraction
- fraction of extra phase1 runs
m_ClusterAssignments
protected int[] m_ClusterAssignments
- temporary variable holding cluster assignments while iterating
m_randomSeed
protected int m_randomSeed
- holds the random Seed, useful for randomPerturbInit
m_Seedable
protected boolean m_Seedable
- semisupervision
SeededKMeans
public SeededKMeans()
SeededKMeans
public SeededKMeans(Metric metric)
objectiveFunction
public double objectiveFunction()
- returns objective function
- Specified by:
objectiveFunction
in interface SemiSupClusterer
getThisClusterer
public Clusterer getThisClusterer()
- We always want to implement SemiSupClusterer from a class extending Clusterer.
We want to be able to return the underlying parent class.
- Specified by:
getThisClusterer
in interface SemiSupClusterer
- Returns:
- parent Clusterer class
buildClusterer
public void buildClusterer(Instances data,
int num_clusters)
throws java.lang.Exception
- Cluster given instances to form the specified number of clusters.
- Parameters:
data
- instances to be clustered
num_clusters
- number of clusters to create
- Throws:
java.lang.Exception
- if something goes wrong.
buildClusterer
public void buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
int numClusters,
int startingIndexOfTest)
throws java.lang.Exception
- Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds
- Specified by:
buildClusterer
in interface SemiSupClusterer
- Parameters:
labeledData
- labeled instances to be used as seeds
unlabeledData
- unlabeled instances
classIndex
- attribute index in labeledData which holds class info
numClusters
- number of clusters
startingIndexOfTest
- from where test data starts in unlabeledData, useful if clustering is transductive
- Throws:
java.lang.Exception
- if something goes wrong.
buildClusterer
public void buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
Instances totalTrainWithLabels,
int startingIndexOfTest)
throws java.lang.Exception
- Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds
- Parameters:
labeledData
- labeled instances to be used as seeds
unlabeledData
- unlabeled instances
classIndex
- attribute index in labeledData which holds class info
startingIndexOfTest
- from where test data starts in unlabeledData, useful if clustering is transductive
- Throws:
java.lang.Exception
- if something goes wrong.
buildClusterer
public void buildClusterer(Instances labeledData,
Instances unlabeledData,
int classIndex,
int numClusters)
throws java.lang.Exception
- Clusters unlabeledData and labeledData (with labels removed),
using labeledData as seeds
- Parameters:
labeledData
- labeled instances to be used as seeds
unlabeledData
- unlabeled instances
classIndex
- attribute index in labeledData which holds class info
numClusters
- number of clusters
- Throws:
java.lang.Exception
- if something goes wrong.
resetClusterer
public void resetClusterer()
throws java.lang.Exception
- Reset all values that have been learned
- Specified by:
resetClusterer
in interface SemiSupClusterer
- Throws:
java.lang.Exception
seedable
public boolean seedable()
- We can have clusterers that don't utilize seeding
initializeClusterer
protected void initializeClusterer()
- Initializes the cluster centroids - initial M step
findBestAssignments
protected void findBestAssignments()
throws java.lang.Exception
- E-step of the KMeans clustering algorithm -- find best cluster assignments
- Throws:
java.lang.Exception
updateClusterCentroids
protected void updateClusterCentroids()
- M-step of the KMeans clustering algorithm -- updates cluster centroids
calculateObjectiveFunction
protected void calculateObjectiveFunction()
throws java.lang.Exception
- calculates objective function
- Throws:
java.lang.Exception
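The E-step, M-step, and objective computation described above can be illustrated with a small stand-alone sketch. It is not the SeededKMeans source: the class name KMeansStepSketch, the use of plain double arrays instead of Instances, the squared Euclidean distance, and the handling of empty clusters are all assumptions made for illustration.

```java
/** Hypothetical stand-alone sketch of one k-means iteration; not the SeededKMeans source. */
class KMeansStepSketch {

    /** Squared Euclidean distance between two dense points (assumed metric). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** E-step: assign every point to its nearest centroid. */
    static int[] findBestAssignments(double[][] points, double[][] centroids) {
        int[] assignments = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (squaredDistance(points[i], centroids[c])
                        < squaredDistance(points[i], centroids[best])) {
                    best = c;
                }
            }
            assignments[i] = best;
        }
        return assignments;
    }

    /** M-step: recompute each centroid as the mean of its assigned points. */
    static double[][] updateClusterCentroids(double[][] points, int[] assignments,
                                             double[][] oldCentroids) {
        int k = oldCentroids.length;
        int dim = oldCentroids[0].length;
        double[][] centroids = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignments[i]]++;
            for (int j = 0; j < dim; j++) {
                centroids[assignments[i]][j] += points[i][j];
            }
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                // Empty cluster: keep the previous centroid (a choice made for this sketch).
                centroids[c] = oldCentroids[c].clone();
            } else {
                for (int j = 0; j < dim; j++) {
                    centroids[c][j] /= counts[c];
                }
            }
        }
        return centroids;
    }

    /** Objective: total squared distance of all points to their assigned centroids. */
    static double calculateObjectiveFunction(double[][] points, int[] assignments,
                                             double[][] centroids) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += squaredDistance(points[i], centroids[assignments[i]]);
        }
        return total;
    }
}
```

A full run would alternate findBestAssignments and updateClusterCentroids, recomputing the objective after each pass until it stops changing by more than the configured convergence difference.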
buildClusterer
public void buildClusterer(Instances data)
throws java.lang.Exception
- Generates a clusterer. Instances in data have to be
either all sparse or all non-sparse
- Specified by:
buildClusterer
in interface SemiSupClusterer
- Specified by:
buildClusterer
in class Clusterer
- Parameters:
data
- set of instances serving as training data
- Throws:
java.lang.Exception
- if the clusterer has not been
generated successfully
bestPairsForActiveLearning
public InstancePair[] bestPairsForActiveLearning(int numActive)
throws java.lang.Exception
- Description copied from interface:
ActiveLearningClusterer
- Returns the list of best pairs for active learning
- Specified by:
bestPairsForActiveLearning
in interface ActiveLearningClusterer
- Throws:
java.lang.Exception
bestInstancesForActiveLearning
public int[] bestInstancesForActiveLearning(int numActive)
throws java.lang.Exception
- Returns the indices of the best numActive instances for active learning
- Specified by:
bestInstancesForActiveLearning
in interface ActiveLearningClusterer
- Throws:
java.lang.Exception
sumInstances
protected Instance sumInstances(Instance inst1,
Instance inst2)
throws java.lang.Exception
- Finds sum of instances (handles sparse and non-sparse)
- Throws:
java.lang.Exception
normalizeByWeight
protected void normalizeByWeight(Instance inst)
- This function divides every attribute value in an instance by
the instance weight -- useful to find the mean of a cluster in
Euclidean space
- Parameters:
inst
- Instance passed in for normalization (destructive update)
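A minimal sketch of how summing instances and then dividing by the weight yields a cluster mean. The WeightedPointSketch class is hypothetical and stands in for an Instance; the assumption that summing two instances also adds their weights is made for this illustration and is not confirmed by the documentation above.

```java
/** Hypothetical weighted point; stands in for a WEKA Instance in this sketch. */
class WeightedPointSketch {
    final double[] values;
    double weight;

    WeightedPointSketch(double[] values, double weight) {
        this.values = values;
        this.weight = weight;
    }

    /** Sum of two points: attribute values and weights are both added (assumption). */
    static WeightedPointSketch sum(WeightedPointSketch a, WeightedPointSketch b) {
        double[] total = new double[a.values.length];
        for (int i = 0; i < total.length; i++) {
            total[i] = a.values[i] + b.values[i];
        }
        return new WeightedPointSketch(total, a.weight + b.weight);
    }

    /** Destructive update: divide every attribute value by the point's weight. */
    static void normalizeByWeight(WeightedPointSketch p) {
        for (int i = 0; i < p.values.length; i++) {
            p.values[i] /= p.weight;
        }
    }
}
```

Summing all members of a cluster this way and calling normalizeByWeight once on the result gives the arithmetic mean when every member has weight 1.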
oldBestInstancesForActiveLearning
public int[] oldBestInstancesForActiveLearning(int numActive)
throws java.lang.Exception
- Throws:
java.lang.Exception
clusterInstance
public int clusterInstance(Instance instance)
throws java.lang.Exception
- Checks if instance has to be normalized and classifies the
instance using the current clustering
- Specified by:
clusterInstance
in class Clusterer
- Parameters:
instance
- the instance to be assigned to a cluster
- Returns:
- the number of the assigned cluster as an integer
if the class is enumerated, otherwise the predicted value
- Throws:
java.lang.Exception
- if instance could not be classified
successfully
assignClusterToInstance
public int assignClusterToInstance(Instance instance)
throws java.lang.Exception
- Classifies the instance using the current clustering
- Parameters:
instance
- the instance to be assigned to a cluster
- Returns:
- the number of the assigned cluster as an integer
if the class is enumerated, otherwise the predicted value
- Throws:
java.lang.Exception
- if instance could not be classified
successfully
getNumClusters
public int getNumClusters()
- Return the number of clusters
- Specified by:
getNumClusters
in interface SemiSupClusterer
numberOfClusters
public int numberOfClusters()
- A duplicate function to conform to Clusterer abstract class.
- Specified by:
numberOfClusters
in class Clusterer
- Returns:
- the number of clusters generated for a training dataset.
getExtraPhase1RunFraction
public double getExtraPhase1RunFraction()
- Return the fraction of extra phase1 runs
setExtraPhase1RunFraction
public void setExtraPhase1RunFraction(double w)
- Set the fraction of extra phase1 runs
getConcentration
public double getConcentration()
- Return the concentration
setConcentration
public void setConcentration(double w)
- Set the concentration
setSeedHash
public void setSeedHash(java.util.HashMap seedhash)
- Set the m_SeedHash
setRandomSeed
public void setRandomSeed(int s)
- Set the random number seed
- Parameters:
s
- the seed
getRandomSeed
public int getRandomSeed()
- Return the random number seed
setObjFunConvergenceDifference
public void setObjFunConvergenceDifference(double objFunConvergenceDifference)
- Set the minimum value of the objective function difference required for convergence
- Parameters:
objFunConvergenceDifference
- the minimum value of the objective function difference required for convergence
getObjFunConvergenceDifference
public double getObjFunConvergenceDifference()
- Get the minimum value of the objective function difference required for convergence
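The convergence threshold interacts with the direction of the objective function (see m_objFunDecreasing). The following is a sketch of one plausible stopping test, not the class's actual logic; the class name ConvergenceSketch, the method converged, and its parameters are hypothetical.

```java
/** Hypothetical convergence test; the real stopping rule may differ. */
class ConvergenceSketch {

    /**
     * Returns true when the objective improved by less than minDifference.
     * "Improved" means it went down for a distance-based objective
     * (objFunDecreasing == true) and up for a similarity-based one.
     * A move in the wrong direction also counts as converged in this sketch.
     */
    static boolean converged(double oldObjective, double newObjective,
                             double minDifference, boolean objFunDecreasing) {
        double improvement = objFunDecreasing
                ? oldObjective - newObjective
                : newObjective - oldObjective;
        return improvement < minDifference;
    }
}
```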
setInstances
public void setInstances(Instances instances)
- Sets training instances
getInstances
public Instances getInstances()
- Return training instances
- Specified by:
getInstances
in interface SemiSupClusterer
- Returns:
- Instances used for clustering, or null
setNumClusters
public void setNumClusters(int n)
- Set the number of clusters to generate
- Specified by:
setNumClusters
in interface SemiSupClusterer
- Parameters:
n
- the number of clusters to generate
setMetric
public void setMetric(Metric m)
- Set the distance metric
- Specified by:
setMetric
in interface SemiSupClusterer
getMetric
public Metric getMetric()
- Get the distance metric
metricName
public java.lang.String metricName()
- Get the distance metric name
setSeedingMethod
public void setSeedingMethod(SelectedTag seedingMethod)
- Set the seeding method. Values other than
SEEDING_CONSTRAINED, or SEEDING_SEEDED will be ignored
- Parameters:
seedingMethod
- the seeding method to use
getSeedingMethod
public SelectedTag getSeedingMethod()
- Get the seeding method used.
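The documentation above does not spell out how the two seeding methods differ. In the usual description of seeded versus constrained k-means, seeds only initialize the centroids in the seeded variant, while the constrained variant also keeps each seed in its given cluster during every E-step. The sketch below illustrates that distinction under this assumption; SeedingSketch, its Seeding enum, and the index-to-label map are hypothetical and are not part of this class.

```java
import java.util.Map;

/** Hypothetical illustration of seeded vs. constrained assignment. */
class SeedingSketch {
    enum Seeding { SEEDED, CONSTRAINED }

    /**
     * E-step over dense points. seedLabels maps a point index to its seed cluster.
     * Under CONSTRAINED, seed points keep their label; under SEEDED they are
     * reassigned to the nearest centroid like any other point.
     */
    static int[] assign(double[][] points, double[][] centroids,
                        Map<Integer, Integer> seedLabels, Seeding method) {
        int[] assignments = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            Integer seedLabel = seedLabels.get(i);
            if (method == Seeding.CONSTRAINED && seedLabel != null) {
                assignments[i] = seedLabel;
                continue;
            }
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0.0;
                for (int j = 0; j < points[i].length; j++) {
                    double d = points[i][j] - centroids[c][j];
                    dist += d * d;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            assignments[i] = best;
        }
        return assignments;
    }
}
```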
setAlgorithm
public void setAlgorithm(SelectedTag algo)
- Set the KMeans algorithm. Values other than
ALGORITHM_SIMPLE or ALGORITHM_SPHERICAL will be ignored
- Parameters:
algo
- algorithm type
getAlgorithm
public SelectedTag getAlgorithm()
- Get the KMeans algorithm type. Will be one of
ALGORITHM_SIMPLE or ALGORITHM_SPHERICAL
setMetricName
public void setMetricName(java.lang.String metricName)
- Set the name of the distance metric
setDefaultPerturb
public void setDefaultPerturb(double p)
- Set default perturbation value
- Parameters:
p
- perturbation fraction
getDefaultPerturb
public double getDefaultPerturb()
- Get default perturbation value
- Returns:
- perturbation fraction
setSeedable
public void setSeedable(boolean seedable)
- Turn seeding on and off
- Parameters:
seedable
- should seeding be done?
getSeedable
public boolean getSeedable()
- Returns whether seeding is turned on
seedClusterer
public void seedClusterer(java.util.HashMap seedHash)
- Read the seeds from a hashtable, where every key is an instance and every value is
the cluster assignment of that instance
- Specified by:
seedClusterer
in interface SemiSupClusterer
- Parameters:
seedHash
- HashMap of seeding parameters
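As an illustration of the seed mapping, the sketch below turns an (instance -> cluster label) map into initial centroids by averaging the seeds of each cluster. This is an assumption about how the seeds might be used, in line with the "initial M step" wording for initializeClusterer; the class SeedHashSketch and the List<Double> stand-in for an instance are hypothetical.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: derive initial centroids from a seed map. */
class SeedHashSketch {

    /** Averages the seed points of each cluster; keys are points, values are cluster labels. */
    static double[][] centroidsFromSeeds(Map<List<Double>, Integer> seedHash,
                                         int numClusters, int dim) {
        double[][] centroids = new double[numClusters][dim];
        int[] counts = new int[numClusters];
        for (Map.Entry<List<Double>, Integer> entry : seedHash.entrySet()) {
            int cluster = entry.getValue();
            counts[cluster]++;
            for (int j = 0; j < dim; j++) {
                centroids[cluster][j] += entry.getKey().get(j);
            }
        }
        for (int c = 0; c < numClusters; c++) {
            // Clusters without seeds stay at the origin in this sketch; random
            // perturbation of a global centroid is one alternative it omits.
            if (counts[c] > 0) {
                for (int j = 0; j < dim; j++) {
                    centroids[c][j] /= counts[c];
                }
            }
        }
        return centroids;
    }
}
```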
getIndexClusters
public java.util.ArrayList getIndexClusters()
throws java.lang.Exception
- Computes the clusters from the cluster assignments, for external access
- Throws:
java.lang.Exception
- if clusters could not be computed successfully
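Computing index clusters from an assignment array can be sketched as follows. The class IndexClustersSketch and the generic list types are illustrative assumptions; the documented method returns a raw java.util.ArrayList whose element layout is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: group instance indices by their assigned cluster. */
class IndexClustersSketch {

    /** Returns one list per cluster, each holding the indices of its members. */
    static List<List<Integer>> indexClusters(int[] assignments, int numClusters) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int c = 0; c < numClusters; c++) {
            clusters.add(new ArrayList<>());
        }
        for (int i = 0; i < assignments.length; i++) {
            clusters.get(assignments[i]).add(i);
        }
        return clusters;
    }
}
```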
printIndexClusters
public void printIndexClusters()
throws java.lang.Exception
- Outputs the current clustering
- Throws:
java.lang.Exception
- if something goes wrong
printClusters
public void printClusters()
throws java.lang.Exception
- Prints clusters
- Throws:
java.lang.Exception
getClusters
public java.util.ArrayList getClusters()
throws java.lang.Exception
- Computes the final clusters from the cluster assignments, for external access
- Specified by:
getClusters
in interface SemiSupClusterer
- Throws:
java.lang.Exception
- if clusters could not be computed successfully
listOptions
public java.util.Enumeration listOptions()
- Description copied from interface:
OptionHandler
- Returns an enumeration of all the available options.
- Specified by:
listOptions
in interface OptionHandler
- Returns:
- an enumeration of all available options.
getMetricSpec
protected java.lang.String getMetricSpec()
- Gets the metric specification string, which contains the class name of
the metric and any options to the metric
- Returns:
- the metric specification string.
getOptions
public java.lang.String[] getOptions()
- Description copied from interface:
OptionHandler
- Gets the current option settings for the OptionHandler.
- Specified by:
getOptions
in interface OptionHandler
- Returns:
- the list of current option settings as an array of strings
setOptions
public void setOptions(java.lang.String[] options)
throws java.lang.Exception
- Parses a given list of options.
- Specified by:
setOptions
in interface OptionHandler
- Parameters:
options
- the list of options as an array of strings
- Throws:
java.lang.Exception
- if an option is not supported
concatStringArray
public static java.lang.String concatStringArray(java.lang.String[] strings)
- A little helper to create a single String from an array of Strings
- Parameters:
strings
- an array of strings
toString
public java.lang.String toString()
- return a string describing this clusterer
- Returns:
- a description of the clusterer as a string
setVerbose
public void setVerbose(boolean verbose)
- set the verbosity level of the clusterer
- Specified by:
setVerbose
in interface SemiSupClusterer
- Parameters:
verbose
- messages on (true) or off (false)
getVerbose
public boolean getVerbose()
- get the verbosity level of the clusterer
- Returns:
- messages on (true) or off (false)
trainClusterer
public void trainClusterer(Instances instances)
throws java.lang.Exception
- Train the clusterer using specified parameters
- Specified by:
trainClusterer
in interface SemiSupClusterer
- Parameters:
instances
- Instances to be used for training
- Throws:
java.lang.Exception
normalize
public void normalize(Instance inst)
throws java.lang.Exception
- Normalizes Instance or SparseInstance
- Parameters:
inst
- Instance to be normalized
- Throws:
java.lang.Exception
normalizeInstance
public void normalizeInstance(Instance inst)
throws java.lang.Exception
- Normalizes the values of a normal Instance
- Parameters:
inst
- Instance to be normalized
- Throws:
java.lang.Exception
normalizeSparseInstance
public void normalizeSparseInstance(Instance inst)
throws java.lang.Exception
- Normalizes the values of a SparseInstance
- Parameters:
inst
- SparseInstance to be normalized
- Throws:
java.lang.Exception
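The documentation does not state which norm the normalize methods use. A common choice for spherical k-means is scaling each vector to unit Euclidean length, and the sketch below assumes that. NormalizeSketch and the parallel-array sparse representation are hypothetical stand-ins for Instance and SparseInstance.

```java
/** Hypothetical sketch of unit-length normalization for dense and sparse vectors. */
class NormalizeSketch {

    /** Scales a dense vector in place to unit Euclidean length; zero vectors are left alone. */
    static void normalizeDense(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return; // nothing to scale; avoids division by zero
        }
        for (int i = 0; i < values.length; i++) {
            values[i] /= norm;
        }
    }

    /**
     * Sparse variant: only the stored (non-zero) values are touched.
     * Entries that are not stored are zero and do not contribute to the norm,
     * so the indices array is not needed for the computation itself.
     */
    static void normalizeSparse(int[] indices, double[] storedValues) {
        normalizeDense(storedValues);
    }
}
```

Both variants produce the same vector; the sparse one simply skips the zero entries, which matters for high-dimensional data.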
meanOrMode
protected double[] meanOrMode(Instances insts)
- Fast version of meanOrMode - streamlined from Instances.meanOrMode for efficiency
Does not check for missing attributes, assumes numeric attributes, assumes Sparse instances
main
public static void main(java.lang.String[] args)
- Main method for testing this class.