weka.clusterers
Class HAC

java.lang.Object
  extended by weka.clusterers.Clusterer
      extended by weka.clusterers.HAC
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, SemiSupClusterer, java.io.Serializable

public class HAC
extends Clusterer
implements SemiSupClusterer, OptionHandler

See Also:
Serialized Form

Field Summary
static int COMPLETE_LINK
           
static int GROUP_AVERAGE
           
protected  java.util.HashMap m_checksumHash
          A 'checksum hash' where indices are hashed to the sum of their attribute values
protected  double[] m_checksumPerturb
           
protected  int[] m_clusterAssignments
          temporary variable holding cluster assignments
protected  int m_clusterID
          ID of current cluster
protected  java.util.ArrayList m_clusters
          holds the clusters
protected  double[][] m_distanceMatrix
          distance matrix
protected  java.lang.String m_dotFileName
          Dot file name for dumping graph for tree visualization
protected  java.io.PrintWriter m_dotWriter
          Writer for the dot file used to dump the graph for tree visualization
protected  java.util.HashMap m_instancesHash
          instance hash
protected  boolean m_isDistanceBased
          Is the metric (and hence the algorithm) relying on similarities or distances?
protected  int m_linkingType
          Default linking method
protected  double m_mergeThreshold
          The threshold distance beyond which no clusters are merged (except for one - TODO)
protected  Metric m_metric
          metric used to calculate similarity/distance
protected  boolean m_metricBuilt
          Has the metric been constructed? A fix for multiple buildClusterer calls
protected  java.lang.String m_metricName
           
protected  int m_numClusters
          Number of clusters
protected  int m_numCurrentClusters
          Number of clusters in the process
protected  int m_numSeededClusters
          Number of seeded clusters
protected  java.util.Random m_randomGen
           
protected  int m_randomSeed
          holds the random Seed, useful for random selection initialization
protected  java.util.HashMap m_reverseInstancesHash
          reverse instance hash
protected  boolean m_seedable
          seeding
protected  java.util.HashMap m_SeedHash
          holds the ([seed instance] -> [clusterLabel of seed instance]) mapping
protected  int m_StartingIndexOfTest
          starting index of test data in unlabeledData if transductive clustering
protected  boolean m_verbose
          verbose?
static int SINGLE_LINK
          cluster similarity type
static Tag[] TAGS_LINKING
           
 
Constructor Summary
HAC()
          empty constructor, required to call using Class.forName
HAC(Metric metric)
           
 
Method Summary
 void buildClusterer(Instances data)
          Cluster given instances.
 void buildClusterer(Instances labeledData, Instances unlabeledData, int classIndex, int numClusters)
          Clusters unlabeledData and labeledData (with labels removed), using labeledData as seeds
 void buildClusterer(Instances labeledData, Instances unlabeledData, int classIndex, int numClusters, int startingIndexOfTest)
          Clusters unlabeledData and labeledData (with labels removed), using labeledData as seeds
 void buildClusterer(Instances data, int num_clusters)
          Cluster given instances to form the specified number of clusters.
protected  void checkClusters()
           
protected  void cluster()
          Internal method that produces the actual clusters
protected  double clusterDistance(Cluster cluster1, Cluster cluster2)
          internal method that returns the distance between two clusters
 int clusterInstance(Instance instance)
          Clusters an instance.
protected  void createDistanceMatrix()
          Fill the distance matrix with values using the metric
protected  double distance(Instance instance, Cluster cluster)
          internal method that returns the distance between an instance and a cluster
protected  Instances filterInstanceDescriptions(Instances instances)
          If some of the attributes start with "__", form a separate Instances set with descriptions and filter them out of the argument dataset.
 java.util.ArrayList getClusters()
          Computes the final clusters from the cluster assignments, for external access
 Instances getInstances()
          Return training instances
 java.util.ArrayList getIntClusters()
          Computes the clusters from the cluster assignments
 SelectedTag getLinkingType()
          Get the linking type
 double getMergeThreshold()
          Get the merge threshold
 Metric getMetric()
          Get the distance metric
 int getNumClusters()
          Return the number of clusters
 java.lang.String[] getOptions()
          Gets the current settings of Greedy Agglomerative Clustering
 int getRandomSeed()
          Return the random number seed
 boolean getSeedable()
          Is seeding turned on?
 java.util.HashMap getSeedHash()
          returns the SeedHash
 Clusterer getThisClusterer()
          We always want to implement SemiSupClusterer from a class extending Clusterer.
 boolean getVerbose()
          get the verbosity level of the clusterer
protected  void hashInstances(Instances data)
          Create the hashtable from given Instances; keys are numeric indices, values are actual Instances
protected  void initClusterAssignments()
          Update the clusterAssignments for all points in two clusters that are about to be merged
protected  void initConstraints()
          Internal method that initializes distances between seed clusters to POSITIVE_INFINITY
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options
static void main(java.lang.String[] argv)
           
protected  boolean matchInstance(Instance instance1, Instance instance2)
          Internal method: check if two instances match on their attribute values
protected  Cluster mergeClusters(int cluster1Idx, int cluster2Idx)
          Internal method to merge two clusters and update distances
protected  double mergeStep()
          Internal method that finds the two most similar clusters and merges them
 java.lang.String metricName()
          Get the distance metric name
 int numberOfClusters()
          A duplicate function to conform to Clusterer abstract class.
 double objectiveFunction()
          returns objective function, needed for compatibility with SemiSupClusterer
 void printCluster(int i)
          Outputs the specified cluster
 void printClusters()
          Outputs the current clustering
 void printIntClusters()
          Outputs the current clustering
static int[] randomSubset(int numIdxs, int maxIdx)
          get an array of random indices out of n possible values.
 void resetClusterer()
          Reset all values that have been learned
 void seedClusterer(java.util.HashMap SeedHash)
          Read the seeds from a hashtable, where every key is an instance and every value is: a FastVector of Doubles: [(Double) probInCluster0 ...
 void setInstances(Instances instances)
          Sets training instances
 void setLinkingType(SelectedTag linkingType)
          Set the type of clustering
 void setMergeThreshold(double threshold)
          Set the merge threshold
 void setMetric(Metric m)
          Set the distance metric
 void setNumClusters(int n)
          Set the number of clusters to generate
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setRandomSeed(int s)
          Set the random number seed
 void setSeedable(boolean seedable)
          Turn seeding on and off
 void setSeedHash(java.util.HashMap seedhash)
          Set the m_SeedHash
 void setVerbose(boolean verbose)
          set the verbosity level of the clusterer
 void trainClusterer(Instances instances)
          Train the clusterer using specified parameters
protected  void unhashClusters()
          assuming m_clusters contains the clusters of indices, convert it to clusters containing actual instances
 
Methods inherited from class weka.clusterers.Clusterer
forName, makeCopies
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_numClusters

protected int m_numClusters
Number of clusters


m_numCurrentClusters

protected int m_numCurrentClusters
Number of clusters in the process


m_clusterID

protected int m_clusterID
ID of current cluster


m_numSeededClusters

protected int m_numSeededClusters
Number of seeded clusters


m_dotFileName

protected java.lang.String m_dotFileName
Dot file name for dumping graph for tree visualization


m_dotWriter

protected java.io.PrintWriter m_dotWriter
Writer for the dot file used to dump the graph for tree visualization


m_clusters

protected java.util.ArrayList m_clusters
holds the clusters


m_clusterAssignments

protected int[] m_clusterAssignments
temporary variable holding cluster assignments


m_distanceMatrix

protected double[][] m_distanceMatrix
distance matrix


SINGLE_LINK

public static final int SINGLE_LINK
cluster similarity type

See Also:
Constant Field Values

COMPLETE_LINK

public static final int COMPLETE_LINK
See Also:
Constant Field Values

GROUP_AVERAGE

public static final int GROUP_AVERAGE
See Also:
Constant Field Values

TAGS_LINKING

public static final Tag[] TAGS_LINKING

m_linkingType

protected int m_linkingType
Default linking method


m_StartingIndexOfTest

protected int m_StartingIndexOfTest
starting index of test data in unlabeledData if transductive clustering


m_seedable

protected boolean m_seedable
seeding


m_SeedHash

protected java.util.HashMap m_SeedHash
holds the ([seed instance] -> [clusterLabel of seed instance]) mapping


m_checksumHash

protected java.util.HashMap m_checksumHash
A 'checksum hash' where indices are hashed to the sum of their attribute values
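
A minimal sketch of how such a checksum hash could speed up instance lookup. It assumes the hash is keyed by the checksum and that instances are plain double arrays; the class ChecksumHashSketch and its methods are hypothetical and are not taken from HAC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the HAC implementation. */
final class ChecksumHashSketch {
    /** Sum of attribute values, used as a cheap lookup key. */
    static double checksum(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum;
    }

    /** Maps each checksum to the indices of the rows that produce it. */
    static Map<Double, List<Integer>> build(double[][] data) {
        Map<Double, List<Integer>> hash = new HashMap<>();
        for (int i = 0; i < data.length; i++) {
            hash.computeIfAbsent(checksum(data[i]), k -> new ArrayList<>()).add(i);
        }
        return hash;
    }

    /** Returns the index of the row equal to query, or -1 if there is none. */
    static int find(double[][] data, Map<Double, List<Integer>> hash, double[] query) {
        List<Integer> candidates = hash.get(checksum(query));
        if (candidates == null) return -1;
        for (int idx : candidates) {
            // Different rows can share a checksum, so confirm attribute by attribute.
            if (Arrays.equals(data[idx], query)) return idx;
        }
        return -1;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2}, {2, 1}, {3, 4}};
        Map<Double, List<Integer>> hash = build(data);
        System.out.println(find(data, hash, new double[] {2, 1}));
    }
}
```

Rows {1, 2} and {2, 1} share the checksum 3.0, which is why the lookup still compares attribute values before accepting a candidate.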


m_checksumPerturb

protected double[] m_checksumPerturb

m_randomSeed

protected int m_randomSeed
holds the random Seed, useful for random selection initialization


m_randomGen

protected java.util.Random m_randomGen

m_instancesHash

protected java.util.HashMap m_instancesHash
instance hash


m_reverseInstancesHash

protected java.util.HashMap m_reverseInstancesHash
reverse instance hash


m_mergeThreshold

protected double m_mergeThreshold
The threshold distance beyond which no clusters are merged (except for one - TODO)


m_verbose

protected boolean m_verbose
verbose?


m_metric

protected Metric m_metric
metric used to calculate similarity/distance


m_metricName

protected java.lang.String m_metricName

m_isDistanceBased

protected boolean m_isDistanceBased
Is the metric (and hence the algorithm) relying on similarities or distances?


m_metricBuilt

protected boolean m_metricBuilt
Has the metric been constructed? A fix for multiple buildClusterer calls

Constructor Detail

HAC

public HAC()
empty constructor, required to call using Class.forName


HAC

public HAC(Metric metric)
Method Detail

setInstances

public void setInstances(Instances instances)
Sets training instances


getInstances

public Instances getInstances()
Return training instances

Specified by:
getInstances in interface SemiSupClusterer
Returns:
Instances used for clustering, or null

setNumClusters

public void setNumClusters(int n)
Set the number of clusters to generate

Specified by:
setNumClusters in interface SemiSupClusterer
Parameters:
n - the number of clusters to generate

setMergeThreshold

public void setMergeThreshold(double threshold)
Set the merge threshold


getMergeThreshold

public double getMergeThreshold()
Get the merge threshold


setMetric

public void setMetric(Metric m)
Set the distance metric

Specified by:
setMetric in interface SemiSupClusterer

getMetric

public Metric getMetric()
Get the distance metric


metricName

public java.lang.String metricName()
Get the distance metric name


getThisClusterer

public Clusterer getThisClusterer()
We always want to implement SemiSupClusterer from a class extending Clusterer. We want to be able to return the underlying parent class.

Specified by:
getThisClusterer in interface SemiSupClusterer
Returns:
parent Clusterer class

buildClusterer

public void buildClusterer(Instances data,
                           int num_clusters)
                    throws java.lang.Exception
Cluster given instances to form the specified number of clusters.

Parameters:
data - instances to be clustered
num_clusters - number of clusters to create
Throws:
java.lang.Exception - if something goes wrong.

buildClusterer

public void buildClusterer(Instances labeledData,
                           Instances unlabeledData,
                           int classIndex,
                           int numClusters)
                    throws java.lang.Exception
Clusters unlabeledData and labeledData (with labels removed), using labeledData as seeds

Parameters:
labeledData - labeled instances to be used as seeds
unlabeledData - unlabeled instances
classIndex - attribute index in labeledData which holds class info
numClusters - number of clusters
Throws:
java.lang.Exception - if something goes wrong.

buildClusterer

public void buildClusterer(Instances labeledData,
                           Instances unlabeledData,
                           int classIndex,
                           int numClusters,
                           int startingIndexOfTest)
                    throws java.lang.Exception
Clusters unlabeledData and labeledData (with labels removed), using labeledData as seeds

Specified by:
buildClusterer in interface SemiSupClusterer
Parameters:
labeledData - labeled instances to be used as seeds
unlabeledData - unlabeled instances
classIndex - attribute index in labeledData which holds class info
numClusters - number of clusters
startingIndexOfTest - from where test data starts in unlabeledData, useful if clustering is transductive
Throws:
java.lang.Exception - if something goes wrong.

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Cluster given instances. If no threshold or number of clusters is set, clustering proceeds until two clusters are left.

Specified by:
buildClusterer in interface SemiSupClusterer
Specified by:
buildClusterer in class Clusterer
Parameters:
data - instances to be clustered
Throws:
java.lang.Exception - if something goes wrong.

filterInstanceDescriptions

protected Instances filterInstanceDescriptions(Instances instances)
                                        throws java.lang.Exception
If some of the attributes start with "__", form a separate Instances set with descriptions and filter them out of the argument dataset. Return the original dataset without the filtered out attributes

Throws:
java.lang.Exception
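
The "__" prefix rule can be illustrated on attribute names alone. This is a sketch under the assumption that only the name prefix decides which attributes are descriptions; DescriptionFilterSketch and indicesWithPrefix are hypothetical names, and no Weka classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the HAC implementation. */
final class DescriptionFilterSketch {
    /**
     * Returns the indices of attributes whose name starts with "__" when
     * descriptions is true, or of all remaining attributes when it is false.
     */
    static List<Integer> indicesWithPrefix(String[] names, boolean descriptions) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (names[i].startsWith("__") == descriptions) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"__id", "x", "y", "__note"};
        System.out.println("descriptions: " + indicesWithPrefix(names, true));
        System.out.println("kept:         " + indicesWithPrefix(names, false));
    }
}
```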

resetClusterer

public void resetClusterer()
                    throws java.lang.Exception
Reset all values that have been learned

Specified by:
resetClusterer in interface SemiSupClusterer
Throws:
java.lang.Exception

setSeedHash

public void setSeedHash(java.util.HashMap seedhash)
Set the m_SeedHash


setRandomSeed

public void setRandomSeed(int s)
Set the random number seed

Parameters:
s - the seed

getRandomSeed

public int getRandomSeed()
Return the random number seed


setSeedable

public void setSeedable(boolean seedable)
Turn seeding on and off

Parameters:
seedable - should seeding be done?

getSeedable

public boolean getSeedable()
Is seeding turned on?


seedClusterer

public void seedClusterer(java.util.HashMap SeedHash)
Read the seeds from a hashtable, where every key is an instance and every value is: a FastVector of Doubles: [(Double) probInCluster0 ... (Double) probInClusterN]

Specified by:
seedClusterer in interface SemiSupClusterer
Parameters:
SeedHash - HashMap of seeding parameters
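
One plausible way to turn the per-cluster probabilities into a single cluster label per seed is to take the most probable cluster. The sketch below assumes that reading; it uses java.util.List in place of FastVector, and SeedLabelSketch, argMax and toLabels are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the HAC implementation. */
final class SeedLabelSketch {
    /** Index of the largest probability; ties go to the lowest index. */
    static int argMax(List<Double> probs) {
        int best = 0;
        for (int c = 1; c < probs.size(); c++) {
            if (probs.get(c) > probs.get(best)) best = c;
        }
        return best;
    }

    /** Maps every seed to the cluster with the highest probability. */
    static <K> Map<K, Integer> toLabels(Map<K, List<Double>> seeds) {
        Map<K, Integer> labels = new HashMap<>();
        for (Map.Entry<K, List<Double>> e : seeds.entrySet()) {
            labels.put(e.getKey(), argMax(e.getValue()));
        }
        return labels;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> seeds = new HashMap<>();
        seeds.put("a", Arrays.asList(0.1, 0.9));
        seeds.put("b", Arrays.asList(1.0, 0.0));
        System.out.println(toLabels(seeds));
    }
}
```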

getSeedHash

public java.util.HashMap getSeedHash()
returns the SeedHash

Returns:
seeds hash

hashInstances

protected void hashInstances(Instances data)
Create the hashtable from given Instances; keys are numeric indices, values are actual Instances

Parameters:
data - Instances

unhashClusters

protected void unhashClusters()
                       throws java.lang.Exception
assuming m_clusters contains the clusters of indices, convert it to clusters containing actual instances

Throws:
java.lang.Exception
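
The conversion from index clusters to instance clusters amounts to a lookup in the index-to-instance hash. A minimal sketch, assuming the hash maps each index to exactly one instance; UnhashSketch and unhash are hypothetical names and the instance type is left generic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the HAC implementation. */
final class UnhashSketch {
    /** Replaces every index in every cluster by the instance stored under it. */
    static <T> List<List<T>> unhash(List<List<Integer>> indexClusters, Map<Integer, T> byIndex) {
        List<List<T>> out = new ArrayList<>();
        for (List<Integer> cluster : indexClusters) {
            List<T> members = new ArrayList<>();
            for (Integer idx : cluster) {
                T instance = byIndex.get(idx);
                if (instance == null) {
                    throw new IllegalStateException("no instance for index " + idx);
                }
                members.add(instance);
            }
            out.add(members);
        }
        return out;
    }
}
```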

createDistanceMatrix

protected void createDistanceMatrix()
                             throws java.lang.Exception
Fill the distance matrix with values using the metric

Throws:
java.lang.Exception

setLinkingType

public void setLinkingType(SelectedTag linkingType)
Set the type of clustering


getLinkingType

public SelectedTag getLinkingType()
Get the linking type


initConstraints

protected void initConstraints()
Internal method that initializes distances between seed clusters to POSITIVE_INFINITY
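
Setting a distance to POSITIVE_INFINITY keeps a pair from ever being chosen as the closest pair, so the two are never merged directly. The sketch below assumes that each point carries a seed label (or -1 if unseeded) and that only seeds with different labels are pushed apart; SeedConstraintSketch and its signature are hypothetical.

```java
/** Hypothetical illustration; not the HAC implementation. */
final class SeedConstraintSketch {
    /**
     * Sets the distance between seeds that carry different labels to infinity,
     * in place. A label of -1 marks an unseeded point, which is left untouched.
     */
    static void initConstraints(double[][] dist, int[] seedLabel) {
        for (int i = 0; i < dist.length; i++) {
            for (int j = i + 1; j < dist.length; j++) {
                boolean bothSeeded = seedLabel[i] >= 0 && seedLabel[j] >= 0;
                if (bothSeeded && seedLabel[i] != seedLabel[j]) {
                    dist[i][j] = Double.POSITIVE_INFINITY;
                    dist[j][i] = Double.POSITIVE_INFINITY;
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] dist = {{0, 1, 2}, {1, 0, 3}, {2, 3, 0}};
        initConstraints(dist, new int[] {0, 1, -1});
        System.out.println(java.util.Arrays.deepToString(dist));
    }
}
```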


cluster

protected void cluster()
                throws java.lang.Exception
Internal method that produces the actual clusters

Throws:
java.lang.Exception

mergeStep

protected double mergeStep()
                    throws java.lang.Exception
Internal method that finds the two most similar clusters and merges them

Throws:
java.lang.Exception
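
The repeated merge step can be sketched on a plain distance matrix. The example assumes a distance-based metric (smaller means closer) and a single-link update after each merge; the parameters numClusters and mergeThreshold mirror setNumClusters and setMergeThreshold, but MergeLoopSketch itself is hypothetical and does not reproduce the HAC internals.

```java
import java.util.Arrays;

/** Hypothetical illustration; not the HAC implementation. */
final class MergeLoopSketch {
    /**
     * Repeatedly merges the closest pair of clusters until numClusters remain
     * or the closest pair is farther apart than mergeThreshold.
     * Returns one cluster label per point (the index of a representative point).
     */
    static int[] cluster(double[][] pointDist, int numClusters, double mergeThreshold) {
        int n = pointDist.length;
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) d[i] = pointDist[i].clone();
        int[] assign = new int[n];
        for (int i = 0; i < n; i++) assign[i] = i;   // every point starts alone
        boolean[] alive = new boolean[n];
        Arrays.fill(alive, true);
        int current = n;

        while (current > numClusters) {
            // Merge step, part 1: find the closest pair of live clusters.
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                if (!alive[i]) continue;
                for (int j = i + 1; j < n; j++) {
                    if (alive[j] && d[i][j] < best) { best = d[i][j]; bi = i; bj = j; }
                }
            }
            if (bi < 0 || best > mergeThreshold) break;   // nothing mergeable is left

            // Merge step, part 2: fold bj into bi with a single-link update.
            for (int k = 0; k < n; k++) {
                if (alive[k] && k != bi && k != bj) {
                    double m = Math.min(d[bi][k], d[bj][k]);
                    d[bi][k] = m;
                    d[k][bi] = m;
                }
            }
            alive[bj] = false;
            for (int p = 0; p < n; p++) if (assign[p] == bj) assign[p] = bi;
            current--;
        }
        return assign;
    }

    public static void main(String[] args) {
        // Pairwise distances of the points 0, 1, 4 and 6 on a line.
        double[][] d = {{0, 1, 4, 6}, {1, 0, 3, 5}, {4, 3, 0, 2}, {6, 5, 2, 0}};
        System.out.println(Arrays.toString(cluster(d, 2, Double.POSITIVE_INFINITY)));
    }
}
```

Pairs whose distance is POSITIVE_INFINITY are never selected, so a matrix prepared with infinite distances between seed clusters keeps those clusters apart.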

getIntClusters

public java.util.ArrayList getIntClusters()
                                   throws java.lang.Exception
Computes the clusters from the cluster assignments

Throws:
java.lang.Exception - if clusters could not be computed successfully

getClusters

public java.util.ArrayList getClusters()
                                throws java.lang.Exception
Computes the final clusters from the cluster assignments, for external access

Specified by:
getClusters in interface SemiSupClusterer
Throws:
java.lang.Exception - if clusters could not be computed successfully

clusterDistance

protected double clusterDistance(Cluster cluster1,
                                 Cluster cluster2)
internal method that returns the distance between two clusters
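
How the three linking types differ can be shown on a precomputed point distance matrix. This is a sketch using the textbook definitions: the numeric values of the constants are assumptions, group average is taken over cross-cluster pairs only, and clusters are lists of point indices rather than Cluster objects. LinkageSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the HAC implementation. */
final class LinkageSketch {
    // Stand-ins for the HAC constants; the numeric values are assumptions.
    static final int SINGLE_LINK = 0;
    static final int COMPLETE_LINK = 1;
    static final int GROUP_AVERAGE = 2;

    /** Distance between two clusters of point indices under the given linking type. */
    static double clusterDistance(double[][] dist, List<Integer> a, List<Integer> b, int linking) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                double d = dist[i][j];
                min = Math.min(min, d);
                max = Math.max(max, d);
                sum += d;
            }
        }
        switch (linking) {
            case SINGLE_LINK:   return min;                            // closest pair
            case COMPLETE_LINK: return max;                            // farthest pair
            case GROUP_AVERAGE: return sum / (a.size() * b.size());    // mean over cross pairs
            default: throw new IllegalArgumentException("unknown linking type: " + linking);
        }
    }

    public static void main(String[] args) {
        // Pairwise distances of the points 0, 1, 4 and 6 on a line.
        double[][] d = {{0, 1, 4, 6}, {1, 0, 3, 5}, {4, 3, 0, 2}, {6, 5, 2, 0}};
        List<Integer> left = Arrays.asList(0, 1);
        List<Integer> right = Arrays.asList(2, 3);
        System.out.println(clusterDistance(d, left, right, SINGLE_LINK));
        System.out.println(clusterDistance(d, left, right, COMPLETE_LINK));
        System.out.println(clusterDistance(d, left, right, GROUP_AVERAGE));
    }
}
```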


checkClusters

protected void checkClusters()

mergeClusters

protected Cluster mergeClusters(int cluster1Idx,
                                int cluster2Idx)
                         throws java.lang.Exception
Internal method to merge two clusters and update distances

Throws:
java.lang.Exception

initClusterAssignments

protected void initClusterAssignments()
Update the clusterAssignments for all points in two clusters that are about to be merged


printClusters

public void printClusters()
                   throws java.lang.Exception
Outputs the current clustering

Throws:
java.lang.Exception - if something goes wrong

printCluster

public void printCluster(int i)
                  throws java.lang.Exception
Outputs the specified cluster

Throws:
java.lang.Exception - if something goes wrong

printIntClusters

public void printIntClusters()
                      throws java.lang.Exception
Outputs the current clustering

Throws:
java.lang.Exception - if something goes wrong

clusterInstance

public int clusterInstance(Instance instance)
                    throws java.lang.Exception
Clusters an instance.

Specified by:
clusterInstance in class Clusterer
Parameters:
instance - the instance to cluster.
Returns:
the number of the assigned cluster as an integer if the class is enumerated, otherwise the predicted value
Throws:
java.lang.Exception - if something goes wrong.

matchInstance

protected boolean matchInstance(Instance instance1,
                                Instance instance2)
Internal method: check if two instances match on their attribute values


distance

protected double distance(Instance instance,
                          Cluster cluster)
                   throws java.lang.Exception
internal method that returns the distance between an instance and a cluster

Throws:
java.lang.Exception

setVerbose

public void setVerbose(boolean verbose)
set the verbosity level of the clusterer

Specified by:
setVerbose in interface SemiSupClusterer
Parameters:
verbose - messages on (true) or off (false)

getVerbose

public boolean getVerbose()
get the verbosity level of the clusterer

Returns:
verbose messages on (true) or off (false)

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options. Valid options are:

-A <0-100>
Acuity.

-C <0-100>
Cutoff.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of Greedy Agglomerative Clustering

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions()

trainClusterer

public void trainClusterer(Instances instances)
                    throws java.lang.Exception
Train the clusterer using specified parameters

Specified by:
trainClusterer in interface SemiSupClusterer
Parameters:
instances - Instances to be used for training
Throws:
java.lang.Exception

objectiveFunction

public double objectiveFunction()
returns objective function, needed for compatibility with SemiSupClusterer

Specified by:
objectiveFunction in interface SemiSupClusterer

getNumClusters

public int getNumClusters()
Return the number of clusters

Specified by:
getNumClusters in interface SemiSupClusterer

numberOfClusters

public int numberOfClusters()
A duplicate function to conform to Clusterer abstract class.

Specified by:
numberOfClusters in class Clusterer
Returns:
the number of clusters generated for a training dataset.

randomSubset

public static int[] randomSubset(int numIdxs,
                                 int maxIdx)
get an array of random indices out of n possible values. If the number of requested indices is larger than maxIdx, returns maxIdx permuted values

Parameters:
maxIdx - the maximum index of the set
numIdxs - number of indexes to return
Returns:
an array of indexes
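
The described behaviour matches a partial Fisher-Yates shuffle. The sketch below assumes that valid indices run from 0 to maxIdx - 1 and takes an explicit java.util.Random for reproducibility; that extra parameter and the class RandomSubsetSketch are hypothetical and differ from the documented signature.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical illustration; not the HAC implementation. */
final class RandomSubsetSketch {
    /**
     * Returns min(numIdxs, maxIdx) distinct indices drawn uniformly from
     * 0 .. maxIdx - 1. Asking for more than maxIdx yields a full permutation.
     */
    static int[] randomSubset(int numIdxs, int maxIdx, Random rng) {
        int[] pool = new int[maxIdx];
        for (int i = 0; i < maxIdx; i++) pool[i] = i;
        int k = Math.min(numIdxs, maxIdx);
        // Partial Fisher-Yates: fix one random element per position, k times.
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(maxIdx - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        return Arrays.copyOf(pool, k);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(randomSubset(3, 10, new Random(42))));
    }
}
```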

main

public static void main(java.lang.String[] argv)