weka.clusterers
Class XMeans

java.lang.Object
  extended by weka.clusterers.Clusterer
      extended by weka.clusterers.XMeans
All Implemented Interfaces:
java.lang.Cloneable, OptionHandler, java.io.Serializable

public class XMeans
extends Clusterer
implements OptionHandler

XMeans clustering class. X-Means is K-Means extended by an Improve-Structure part. In this part of the algorithm, an attempt is made to split each center within its region. The decision between the children of a center and the center itself is made by comparing the BIC values of the two structures. See also D. Pelleg and A. Moore's paper 'X-means: Extending K-means with Efficient Estimation of the Number of Clusters'.
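
The split decision described above can be illustrated with a small stand-alone sketch. It is not the Weka implementation: the class BicSplitSketch, its methods, the restriction to one-dimensional points and the exact form of the BIC (identical spherical Gaussians with hard assignments, BIC = log-likelihood - (p / 2) * ln(R)) are all assumptions made for illustration.

```java
/** Hypothetical helper, not part of Weka: BIC of a hard 1-D clustering. */
class BicSplitSketch {

    /**
     * BIC = log-likelihood - (p / 2) * ln(R) for R one-dimensional points,
     * modelled as K Gaussians that share a single variance.
     * Assumes there are more points than centers.
     */
    static double bic(double[] points, double[] centers) {
        int r = points.length;
        int k = centers.length;
        int[] counts = new int[k];
        double sse = 0.0;
        for (double x : points) {
            // Hard assignment: each point belongs to its nearest center.
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) {
                    best = c;
                }
            }
            counts[best]++;
            sse += (x - centers[best]) * (x - centers[best]);
        }
        // Pooled variance estimate; the floor avoids ln(0) for degenerate data.
        double variance = Math.max(sse / (r - k), 1e-12);
        double logLik = -0.5 * r * Math.log(2 * Math.PI * variance)
                - sse / (2 * variance);
        for (int n : counts) {
            if (n > 0) {
                logLik += n * Math.log((double) n / r); // mixing-weight term
            }
        }
        // Free parameters: K-1 weights, K one-dimensional centers, 1 variance.
        int freeParams = (k - 1) + k + 1;
        return logLik - 0.5 * freeParams * Math.log(r);
    }

    /** Keeps the children only if their structure has the higher BIC. */
    static boolean shouldSplit(double[] points, double parent, double[] children) {
        return bic(points, children) > bic(points, new double[] {parent});
    }
}
```

With two well-separated groups of points the two-center structure scores higher and the split is kept; with evenly spread points the penalty for the extra parameters outweighs the gain in likelihood and the parent is kept.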

Valid options are:

-I
Maximum number of iterations in the overall loop (default = 1).

-M
Maximum number of iterations in the kMeans loop in
the Improve-Parameter part (default = 1000).

-J
Maximum number of iterations in the kMeans loop for the split
centroids in the Improve-Structure part (default = 1000).

-L
Specify the number of clusters to start with.

-H
Specify the maximal number of clusters.

-B
Distance value between true and false of binary attributes and
"same" and "different" of nominal attributes (default = 1.0).

-K
KDTrees class and its options (can only use the same distance function as XMeans).

-C
If none of the children are better, percentage of the best splits
to be taken.

-D
Distance function class to be used (default = Euclidean distance).

-N
Input starting cluster centers from file (ARFF-format).

-O
Output cluster centers to file (ARFF-format).

-S
Specify random number seed.

-U
Set debug level.

-Y
Used for debugging: Input random vectors from file.
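
The option list above implies an array of alternating flags and values, as passed to setOptions. The following stand-alone sketch only illustrates that layout and the documented defaults; the class XMeansOptionsSketch and its parse method are hypothetical, do not call Weka, and assume that every flag is followed by exactly one value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an option handler; it does not call Weka. */
class XMeansOptionsSketch {

    /** Reads "-flag value" pairs on top of the defaults stated in the option list. */
    static Map<String, String> parse(String[] options) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("-I", "1");     // iterations of the overall loop
        settings.put("-M", "1000");  // kMeans iterations, Improve-Parameter part
        settings.put("-J", "1000");  // kMeans iterations, Improve-Structure part
        settings.put("-B", "1.0");   // distance value for binary/nominal attributes
        if (options.length % 2 != 0) {
            throw new IllegalArgumentException("every flag needs a value");
        }
        for (int i = 0; i < options.length; i += 2) {
            if (!options[i].startsWith("-")) {
                throw new IllegalArgumentException("not a flag: " + options[i]);
            }
            settings.put(options[i], options[i + 1]); // later values override defaults
        }
        return settings;
    }
}
```

For example, parse(new String[] {"-L", "2", "-H", "10", "-I", "5"}) would start with 2 clusters, allow at most 10, run the overall loop 5 times and leave -M, -J and -B at their defaults.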

Major TODO: make the BIC score replaceable by other scores.

See Also:
Clusterer, OptionHandler, Serialized Form

Field Summary
static int D_CONVCHCLOSER
           
static int D_CURR
           
static int D_FOLLOWSPLIT
           
static int D_GENERAL
           
static int D_ITERCOUNT
           
static int D_KDTREE
           
static int D_METH_MISUSE
           
static int D_PRINTCENTERS
           
static int D_RANDOMVEKTOR
           
 boolean m_CurrDebugFlag
           
static int R_HIGH
           
static int R_LOW
          Index in ranges for LOW and HIGH and WIDTH
static int R_WIDTH
           
 
Constructor Summary
XMeans()
           
 
Method Summary
 java.lang.String binValueTipText()
          Returns the tip text for this property
 void buildClusterer(Instances data)
          Generates the X-Means clusterer.
 boolean checkForNominalAttributes(Instances data)
          Checks for nominal attributes in the dataset.
 int clusterInstance(Instance instance)
          Assigns a given instance to a cluster.
 double getBinValue()
          Gets the distance value between true and false of binary attributes and "same" and "different" of nominal attributes.
 double getCutOffFactor()
          Gets the cutoff factor.
 int getDebugLevel()
          Gets the debug level.
 DistanceFunction getDistanceF()
          Gets the distance function.
protected  java.lang.String getDistanceFSpec()
          Gets the distance function specification string, which contains the class name of the distance function class and any options to it
 java.lang.String getInputCenterFile()
          Gets the name of the file to read the list of centers from.
 KDTree getKDTree()
          Gets the KDTree class.
protected  java.lang.String getKDTreeSpec()
          Gets the KDTree specification string, which contains the class name of the KDTree class and any options to the KDTree
 int getMaxIterations()
          Gets the maximum number of iterations.
 int getMaxKMeans()
          Gets the maximum number of iterations in KMeans.
 int getMaxKMeansForChildren()
          Gets the maximum number of KMeans iterations performed on the child centers.
 int getMaxNumClusters()
          Gets the maximum number of clusters to generate.
 int getMinNumClusters()
          Gets the minimum number of clusters to generate.
 Instance getNextDebugVektorsInstance(Instances model)
          Reads an instance from the debug vectors file.
 java.lang.String[] getOptions()
          Gets the current settings of XMeans.
 java.lang.String getOutputCenterFile()
          Gets the name of the file to write the list of centers to.
 int getSeed()
          Gets the random number seed.
 java.lang.String globalInfo()
          Returns a string describing this clusterer
 void initDebugVektorsInput()
          Initialises the debug vector input.
static double[][] initializeRanges(Instances instances, int[] instList)
          Function should be in the Instances class!! Initializes the minimum and maximum values based on all instances.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 java.lang.String maxNumClustersTipText()
          Returns the tip text for this property
 java.lang.String minNumClustersTipText()
          Returns the tip text for this property
 int numberOfClusters()
          Returns the number of clusters.
static void printRanges(Instances model, double[][] ranges)
          Function should be in the Instances class!! Prints a range.
 java.lang.String seedTipText()
          Returns the tip text for this property.
 void setBinValue(double value)
          Sets the distance value between true and false of binary attributes and "same" and "different" of nominal attributes.
 void setCutOffFactor(double i)
          Sets a new cutoff factor.
 void setDebugLevel(int d)
          Sets the debug level.
 void setDebugVektorsFile(java.lang.String fileName)
          Sets the name of the file in which the random vectors are stored.
 void setDistanceF(DistanceFunction distanceF)
          Sets the distance function.
 void setInputCenterFile(java.lang.String fileName)
          Sets the name of the file to read the list of centers from.
 void setKDTree(KDTree k)
          Sets the KDTree class.
 void setMaxIterations(int i)
          Sets the maximum number of iterations to perform.
 void setMaxKMeans(int i)
          Sets the maximum number of iterations to perform in KMeans.
 void setMaxKMeansForChildren(int i)
          Sets the maximum number of KMeans iterations performed on the child centers.
 void setMaxNumClusters(int n)
          Sets the maximum number of clusters to generate.
 void setMinNumClusters(int n)
          Sets the minimum number of clusters to generate.
 void setOptions(java.lang.String[] options)
          Parses a given list of options.
 void setOutputCenterFile(java.lang.String fileName)
          Sets the name of the file to write the list of centers to.
 void setSeed(int s)
          Sets the random number seed.
 java.lang.String toString()
          Return a string describing this clusterer.
static void updateRanges(Instance instance, int numAtt, double[][] ranges)
          Function should be in the Instances class!! Updates the minimum and maximum and width values for all the attributes based on a new instance.
static void updateRangesFirst(Instance instance, int numAtt, double[][] ranges)
          Function should be in the Instances class!! Used to initialize the ranges.
 
Methods inherited from class weka.clusterers.Clusterer
forName, makeCopies
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

R_LOW

public static int R_LOW
Index in ranges for LOW and HIGH and WIDTH


R_HIGH

public static int R_HIGH

R_WIDTH

public static int R_WIDTH

D_PRINTCENTERS

public static int D_PRINTCENTERS

D_FOLLOWSPLIT

public static int D_FOLLOWSPLIT

D_CONVCHCLOSER

public static int D_CONVCHCLOSER

D_RANDOMVEKTOR

public static int D_RANDOMVEKTOR

D_KDTREE

public static int D_KDTREE

D_ITERCOUNT

public static int D_ITERCOUNT

D_METH_MISUSE

public static int D_METH_MISUSE

D_CURR

public static int D_CURR

D_GENERAL

public static int D_GENERAL

m_CurrDebugFlag

public boolean m_CurrDebugFlag
Constructor Detail

XMeans

public XMeans()
Method Detail

globalInfo

public java.lang.String globalInfo()
Returns a string describing this clusterer

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

initializeRanges

public static double[][] initializeRanges(Instances instances,
                                          int[] instList)
Function should be in the Instances class!! Initializes the minimum and maximum values based on all instances.

Parameters:
instList - list of indexes

printRanges

public static void printRanges(Instances model,
                               double[][] ranges)
Function should be in the Instances class!! Prints a range.

Parameters:
ranges - the ranges to print

updateRangesFirst

public static void updateRangesFirst(Instance instance,
                                     int numAtt,
                                     double[][] ranges)
Function should be in the Instances class!! Used to initialize the ranges. To save time, the values of the first instance are used: low and high are set to the values of the first instance and width is set to zero.

Parameters:
instance - the new instance
numAtt - number of attributes in the model

updateRanges

public static void updateRanges(Instance instance,
                                int numAtt,
                                double[][] ranges)
Function should be in the Instances class!! Updates the minimum and maximum and width values for all the attributes based on a new instance.

Parameters:
instance - the new instance
numAtt - number of attributes in the model
ranges - low, high and width values for all attributes
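
The three range functions above can be sketched without Weka as follows. The class RangesSketch is hypothetical; plain double arrays stand in for Instance objects, the index values 0, 1 and 2 for R_LOW, R_HIGH and R_WIDTH are assumed (this page names the fields but not their values), the layout ranges[attribute][index] is likewise an assumption, and missing values are ignored.

```java
/** Hypothetical stand-alone version of the range bookkeeping; not Weka code. */
class RangesSketch {
    // Assumed index values; the field summary does not state them.
    static final int R_LOW = 0;
    static final int R_HIGH = 1;
    static final int R_WIDTH = 2;

    /** Sets low and high to the first instance's values and width to zero. */
    static void updateRangesFirst(double[] instance, double[][] ranges) {
        for (int j = 0; j < instance.length; j++) {
            ranges[j][R_LOW] = instance[j];
            ranges[j][R_HIGH] = instance[j];
            ranges[j][R_WIDTH] = 0.0;
        }
    }

    /** Widens low/high where needed and recomputes width for each attribute. */
    static void updateRanges(double[] instance, double[][] ranges) {
        for (int j = 0; j < instance.length; j++) {
            if (instance[j] < ranges[j][R_LOW]) {
                ranges[j][R_LOW] = instance[j];
            }
            if (instance[j] > ranges[j][R_HIGH]) {
                ranges[j][R_HIGH] = instance[j];
            }
            ranges[j][R_WIDTH] = ranges[j][R_HIGH] - ranges[j][R_LOW];
        }
    }

    /** Builds the ranges from all instances; expects at least one instance. */
    static double[][] initializeRanges(double[][] instances) {
        double[][] ranges = new double[instances[0].length][3];
        updateRangesFirst(instances[0], ranges);
        for (int i = 1; i < instances.length; i++) {
            updateRanges(instances[i], ranges);
        }
        return ranges;
    }
}
```

Seeding the ranges from the first instance avoids initialising low and high with sentinel values such as positive and negative infinity.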

buildClusterer

public void buildClusterer(Instances data)
                    throws java.lang.Exception
Generates the X-Means clusterer.

Specified by:
buildClusterer in class Clusterer
Parameters:
data - set of instances serving as training data
Throws:
java.lang.Exception - if the clusterer has not been generated successfully

checkForNominalAttributes

public boolean checkForNominalAttributes(Instances data)
Checks for nominal attributes in the dataset. Class attribute is ignored.

Parameters:
data - the dataset to check
Returns:
false if no nominal attributes are present

clusterInstance

public int clusterInstance(Instance instance)
                    throws java.lang.Exception
Assigns a given instance to a cluster.

Specified by:
clusterInstance in class Clusterer
Parameters:
instance - the instance to be assigned to a cluster
Returns:
the number of the assigned cluster as an integer
Throws:
java.lang.Exception - if the instance could not be assigned to a cluster successfully

numberOfClusters

public int numberOfClusters()
Returns the number of clusters.

Specified by:
numberOfClusters in class Clusterer
Returns:
the number of clusters generated for a training dataset.

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Returns:
an enumeration of all the available options

minNumClustersTipText

public java.lang.String minNumClustersTipText()
Returns the tip text for this property

Returns:
tip text for this property

maxNumClustersTipText

public java.lang.String maxNumClustersTipText()
Returns the tip text for this property

Returns:
tip text for this property

setMaxIterations

public void setMaxIterations(int i)
                      throws java.lang.Exception
Sets the maximum number of iterations to perform.

Parameters:
i - the number of iterations
Throws:
java.lang.Exception - if i is less than 1

getMaxIterations

public int getMaxIterations()
Gets the maximum number of iterations.

Returns:
the number of iterations

setMaxKMeans

public void setMaxKMeans(int i)
Sets the maximum number of iterations to perform in KMeans.

Parameters:
i - the number of iterations

getMaxKMeans

public int getMaxKMeans()
Gets the maximum number of iterations in KMeans.

Returns:
the number of iterations

setMaxKMeansForChildren

public void setMaxKMeansForChildren(int i)
                             throws java.lang.Exception
Sets the maximum number of KMeans iterations performed on the child centers.

Parameters:
i - the number of iterations
Throws:
java.lang.Exception

getMaxKMeansForChildren

public int getMaxKMeansForChildren()
Gets the maximum number of KMeans iterations performed on the child centers.

Returns:
the number of iterations

setCutOffFactor

public void setCutOffFactor(double i)
                     throws java.lang.Exception
Sets a new cutoff factor.

Parameters:
i - the new cutoff factor
Throws:
java.lang.Exception
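
One possible reading of the cutoff factor (option -C: "If none of the children are better, percentage of the best splits to be taken") is sketched below. The class CutOffSketch, the bicGain input and the rounding by truncation are assumptions for illustration, and the factor is assumed to lie between 0 and 1; the actual Weka behaviour may differ.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration of the cutoff factor; not the Weka implementation. */
class CutOffSketch {

    /**
     * bicGain[i] is assumed to be BIC(children of center i) - BIC(center i).
     * Returns the indices of the centers to split, best gain first.
     */
    static int[] centersToSplit(double[] bicGain, double cutOffFactor) {
        Integer[] order = new Integer[bicGain.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort center indices by descending gain.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -bicGain[i]));

        int winners = 0;
        for (double gain : bicGain) {
            if (gain > 0) {
                winners++;
            }
        }
        // If no child structure wins, fall back to a fraction of the best splits.
        int take = winners > 0 ? winners : (int) (bicGain.length * cutOffFactor);

        int[] result = new int[take];
        for (int i = 0; i < take; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

With four centers, no positive gain and a cutoff factor of 0.5, the two least-bad splits are taken, so the structure can still change when no split wins on its BIC value.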

getCutOffFactor

public double getCutOffFactor()
Gets the cutoff factor.

Returns:
the cutoff factor

setMinNumClusters

public void setMinNumClusters(int n)
Sets the minimum number of clusters to generate.

Parameters:
n - the minimum number of clusters to generate

setMaxNumClusters

public void setMaxNumClusters(int n)
Sets the maximum number of clusters to generate.

Parameters:
n - the maximum number of clusters to generate

binValueTipText

public java.lang.String binValueTipText()
Returns the tip text for this property

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getBinValue

public double getBinValue()
Gets the distance value between true and false of binary attributes and "same" and "different" of nominal attributes.

Returns:
the distance value used for binary and nominal attributes

setBinValue

public void setBinValue(double value)
Sets the distance value between true and false of binary attributes and "same" and "different" of nominal attributes.
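
How such a value could enter a distance computation is sketched below. The class MixedDistanceSketch is hypothetical and does not use Weka's DistanceFunction. It assumes that nominal values are encoded as numeric indices, that a mismatch contributes binValue as its per-attribute difference, and that the differences are combined in Euclidean fashion.

```java
/** Hypothetical mixed-attribute distance; not Weka's DistanceFunction. */
class MixedDistanceSketch {

    /**
     * Euclidean-style distance in which a nominal (or binary) attribute
     * contributes 0 when both values are the same and binValue otherwise.
     */
    static double distance(double[] a, double[] b, boolean[] nominal, double binValue) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff;
            if (nominal[j]) {
                diff = (a[j] == b[j]) ? 0.0 : binValue; // "same" vs "different"
            } else {
                diff = a[j] - b[j];                     // ordinary numeric difference
            }
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

A larger binValue gives nominal mismatches more weight relative to the numeric attributes.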


setDistanceF

public void setDistanceF(DistanceFunction distanceF)
Sets the distance function.

Parameters:
distanceF - the distance function with all options set

getDistanceF

public DistanceFunction getDistanceF()
Gets the distance function.

Returns:
the distance function

getDistanceFSpec

protected java.lang.String getDistanceFSpec()
Gets the distance function specification string, which contains the class name of the distance function class and any options to it

Returns:
the distance function specification string

setDebugVektorsFile

public void setDebugVektorsFile(java.lang.String fileName)
Sets the name of the file in which the random vectors are stored. Used for debugging only.

Parameters:
fileName - name of the file to read the random vectors from

initDebugVektorsInput

public void initDebugVektorsInput()
                           throws java.lang.Exception
Initialises the debug vector input.

Throws:
java.lang.Exception

getNextDebugVektorsInstance

public Instance getNextDebugVektorsInstance(Instances model)
                                     throws java.lang.Exception
Reads an instance from the debug vectors file.

Parameters:
model - the data model for the instance
Throws:
java.lang.Exception

setInputCenterFile

public void setInputCenterFile(java.lang.String fileName)
Sets the name of the file to read the list of centers from.

Parameters:
fileName - file name of file to read centers from

setOutputCenterFile

public void setOutputCenterFile(java.lang.String fileName)
Sets the name of the file to write the list of centers to.

Parameters:
fileName - file to write centers to
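
For orientation, a center file in ARFF format could look like the text produced by the sketch below. The class CenterArffSketch, the relation name and the attribute names are hypothetical. Only numeric attributes are covered, and names containing spaces would need quoting, which is not handled here.

```java
/** Hypothetical ARFF writer for numeric cluster centers; not Weka code. */
class CenterArffSketch {

    /** Renders the centers as ARFF text with one numeric attribute per column. */
    static String toArff(String relation, String[] attributeNames, double[][] centers) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : attributeNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("\n@data\n");
        for (double[] center : centers) {
            for (int j = 0; j < center.length; j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(center[j]);
            }
            sb.append('\n'); // one center per data row
        }
        return sb.toString();
    }
}
```

Each data row holds one center, with values in the order of the declared attributes.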

getInputCenterFile

public java.lang.String getInputCenterFile()
Gets the name of the file to read the list of centers from.

Returns:
filename of the file to read the centers from

getOutputCenterFile

public java.lang.String getOutputCenterFile()
Gets the name of the file to write the list of centers to.

Returns:
filename of the file to write centers to

setKDTree

public void setKDTree(KDTree k)
Sets the KDTree class.

Parameters:
k - a KDTree object with all options set

getKDTree

public KDTree getKDTree()
Gets the KDTree class.

Returns:
the KDTree object

getKDTreeSpec

protected java.lang.String getKDTreeSpec()
Gets the KDTree specification string, which contains the class name of the KDTree class and any options to the KDTree

Returns:
the KDTree string.

setDebugLevel

public void setDebugLevel(int d)
Sets the debug level. A debug level of 0 means no output.

Parameters:
d - the debug level

getDebugLevel

public int getDebugLevel()
Gets the debug level.

Returns:
debug level

getMinNumClusters

public int getMinNumClusters()
Gets the minimum number of clusters to generate.

Returns:
the minimum number of clusters to generate

getMaxNumClusters

public int getMaxNumClusters()
Gets the maximum number of clusters to generate.

Returns:
the maximum number of clusters to generate

seedTipText

public java.lang.String seedTipText()
Returns the tip text for this property.

Returns:
tip text for this property

setSeed

public void setSeed(int s)
Sets the random number seed.

Parameters:
s - the seed

getSeed

public int getSeed()
Gets the random number seed.

Returns:
the seed

setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options.

Specified by:
setOptions in interface OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported

getOptions

public java.lang.String[] getOptions()
Gets the current settings of XMeans.

Specified by:
getOptions in interface OptionHandler
Returns:
an array of strings suitable for passing to setOptions

toString

public java.lang.String toString()
Return a string describing this clusterer.

Returns:
a description of the clusterer as a string

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain options