The goal of this problem is to train a classifier for the Iris dataset. The dataset can be found here. The file 'iris_matrix' contains all 150 instances in the Iris dataset. The first four columns of this file correspond to four attributes of the iris plant (sepal length, sepal width, petal length, and petal width). The last column denotes the species, which is encoded as 0, 1, or 2 (see the file iris_classes for the mapping from class id to species name). The dataset has been separated for you into two sets: iris_train and iris_test. Use iris_train to train the classifier and iris_test to test it.
The classifier that you are to implement should be trained (on the training set) by computing the mean of all instances within each of the three classes. Each unseen instance, i.e. each instance in the test set, should be classified by computing its distance (or similarity) to each of the class means and assigning it to the class whose mean has the smallest distance (or largest similarity). The measures you should try are (1) Euclidean distance, (2) cosine similarity, and (3) correlation. As mentioned, your classifier should be trained on the iris_train dataset and evaluated on the iris_test dataset. Print the three mean vectors and report the confusion matrix (see page 149 of the textbook) for each of the distance (similarity) measures. Finally, report the accuracy for each measure.
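The classifier described above can be sketched in a few lines. The following is a minimal illustration in Python using only the standard library, not a reference solution. All function names are hypothetical, the confusion matrix is assumed to use rows for the true class and columns for the predicted class (check this against the textbook's convention), and loading iris_train and iris_test is left out because the file format is not specified here. The rows in the demo at the bottom are made-up toy values, not data from the Iris files.

```python
import math


def class_means(X, y):
    """Return {label: mean vector} computed from training rows X and labels y."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def euclidean(a, b):
    """Euclidean distance (smaller means closer)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cosine(a, b):
    """Cosine similarity (larger means closer); undefined for a zero vector."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors.

    Undefined (division by zero) if either vector has all-equal entries.
    """
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([p - mean_a for p in a], [q - mean_b for q in b])


def classify(x, means, measure="euclidean"):
    """Assign x to the class with the nearest (or most similar) mean."""
    if measure == "euclidean":
        return min(means, key=lambda c: euclidean(x, means[c]))
    similarity = cosine if measure == "cosine" else correlation
    return max(means, key=lambda c: similarity(x, means[c]))


def confusion_matrix(y_true, y_pred, n_classes=3):
    """Assumed layout: cm[true class][predicted class] = count."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy(cm):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


if __name__ == "__main__":
    # Toy rows for illustration only; replace with the rows of iris_train / iris_test.
    X_train = [[1, 2, 3, 4], [3, 2, 1, 4], [10, 0, 0, 1], [12, 0, 0, 1], [0, 9, 0, 2], [0, 11, 0, 2]]
    y_train = [0, 0, 1, 1, 2, 2]
    X_test, y_test = [[2, 2, 2, 5], [11, 1, 0, 1], [1, 10, 0, 2]], [0, 1, 2]

    means = class_means(X_train, y_train)
    print("class means:", means)
    for measure in ("euclidean", "cosine", "correlation"):
        predictions = [classify(x, means, measure) for x in X_test]
        cm = confusion_matrix(y_test, predictions)
        print(measure, cm, accuracy(cm))
```

Note that Euclidean distance is minimized while the two similarity measures are maximized, which is why classify switches between min and max.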
The goal of this problem is to visualize the Iris dataset. The first technique you will implement is Principal Components Analysis (PCA). This method does not look at the class labels. Complete the following steps to visualize the Iris data using PCA. The easiest way to implement this problem is to use MATLAB. You are free to use a different language, but this is discouraged, as you will be computing eigenvectors.
(5 points) The next method you should test is called "Class-Preserving Projections" (CPP). Unlike PCA, this method uses the class labels to choose the projection. CPP projects onto the 2-dimensional plane that passes through the means of the classes. For the Iris dataset, there are three classes. Thus, CPP will project the data onto the 2-dimensional plane that is determined by the three class means. To do this, follow these steps:
(10 points)