CS378 Homework #2

Due: In class, Thursday, October 5
Show work for all problems. The coding problems may be implemented in the language of your choice. Print out all code written and attach to your homework solutions. Code should be clearly written and well-commented to receive full credit.

Questions

(5 points) Question 16, page 143 from the textbook.
(10 points) Classification:
The goal of this problem is to train a classifier for the Iris dataset. The dataset can be found here. The file 'iris_matrix' contains all 150 instances in the iris dataset. The first four columns of this file correspond to various attributes of the iris plant (sepal length, sepal width, petal length, and petal width). The last column denotes the species type, which is encoded as either 0, 1, or 2 (see the file iris_classes for the mapping from class id to species name). The dataset has been conveniently separated for you into two different sets: iris_train and iris_test. The first dataset should be used to train the classifier; the latter to test.

The classifier that you are to implement should be trained (on the training set) by computing the mean of all instances within each of the three classes. Unseen instances, i.e. instances in the test set, should be classified by computing the distance (or similarity) from each instance to each of the class means and assigning the instance to the class whose mean is at the smallest distance (or has the largest similarity). The measures you should try are (1) Euclidean distance, (2) cosine similarity, and (3) correlation. As mentioned, your classifier should be trained on the iris_train dataset and evaluated on the iris_test dataset. Print the three mean vectors and report the confusion matrix (see page 149 of the textbook) for each of the distance (similarity) measures. Finally, report the accuracy for each measure.
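The nearest-mean procedure described above can be sketched roughly as follows. This is only an illustrative outline in Python (the assignment allows any language), not a reference solution. The function names and the tiny two-class data set are made up for the example and are not the Iris data, and the sketch assumes class labels are the integers 0, 1, ... so they can index the confusion matrix directly.

```python
import math

def class_means(rows, labels):
    """Return {label: mean vector} computed over the training rows."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    # Pearson correlation equals the cosine similarity of the mean-centered vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def classify(x, means, measure, larger_is_better):
    # Distances pick the smallest score; similarities pick the largest.
    scores = {label: measure(x, m) for label, m in means.items()}
    pick = max if larger_is_better else min
    return pick(scores, key=scores.get)

def confusion_matrix(true_labels, predicted, n_classes):
    # Rows are the actual class, columns are the predicted class.
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

# Made-up toy data (NOT the Iris files): four attributes, two classes.
train_X = [[1.0, 2.0, 1.0, 0.5], [1.2, 2.2, 0.8, 0.4],
           [5.0, 1.0, 4.0, 3.0], [5.2, 0.8, 4.2, 3.2]]
train_y = [0, 0, 1, 1]
test_X = [[1.1, 2.1, 0.9, 0.5], [5.1, 0.9, 4.1, 3.1]]
test_y = [0, 1]

means = class_means(train_X, train_y)
measures = {
    "euclidean": (euclidean, False),
    "cosine": (cosine, True),
    "correlation": (correlation, True),
}
results = {}
for name, (measure, larger_is_better) in measures.items():
    predicted = [classify(x, means, measure, larger_is_better) for x in test_X]
    cm = confusion_matrix(test_y, predicted, n_classes=2)
    results[name] = (cm, accuracy(cm))
    print(name, cm, accuracy(cm))
```

For the actual problem, the toy lists would be replaced by the rows of iris_train and iris_test, with n_classes set to 3.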
(15 points) Data Visualization:
The goal of this problem is to visualize the Iris dataset. The first technique you will implement is Principal Components Analysis (PCA). This method does not look at class values. Complete the following steps to visualize the Iris data using PCA. The easiest way to implement this problem is to use matlab. You are free to use a different language, but this is discouraged, as you will be computing eigenvectors.
(5 points)
1. Load the iris_matrix file into matlab. You can use the load command to do this (X = load('iris_matrix');).
2. Compute the 4x4 covariance matrix of the dataset. Make sure to omit the species type (the last column in the dataset) from this. Note that the (i,j)-th entry of the covariance matrix equals the covariance between the i-th and j-th attributes.
3. Compute the eigenvectors and eigenvalues of this covariance matrix (you can compute this using the 'eig' command - for usage, try 'help eig' at the matlab prompt).
4. Project the iris data onto the plane defined by the two eigenvectors corresponding to the 2 dominant (i.e. largest) eigenvalues. You will notice that these two eigenvectors are orthonormal: they have an L2 norm of 1 and their inner product is 0. To do this projection, you will need to take the inner product between each instance in the dataset and each of the two eigenvectors. Thus, each instance in this projected space will be represented by a vector of length two.
5. Plot this projection using matlab's "plot" command. The iris_matrix dataset is ordered so that the first 50 instances (rows) are in class one, instances 51-100 are in the second class, and instances 101-150 are in the third class. If w1 is a vector containing the iris data projected onto the first eigenvector (i.e. component), and w2 is a vector containing the iris data projected on the second eigenvector, you can plot using the following command:
  plot(w1(1:50), w2(1:50), '.', w1(51:100), w2(51:100), 'o', w1(101:150), w2(101:150), 'x')
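For readers working outside matlab, steps 2 through 4 above might be sketched as below in standard-library Python. Two assumptions should be noted. First, the sketch swaps in power iteration with deflation in place of a library eigen-solver such as 'eig'; it is a simple substitute that works for a symmetric covariance matrix with distinct leading eigenvalues, not the method the assignment prescribes. Second, the data is a made-up six-point set in three dimensions, not iris_matrix. All function names are hypothetical, and plotting is omitted.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def covariance_matrix(rows):
    """Sample covariance (divides by n - 1, the same default as matlab's cov)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpairs(C, k, iterations=1000):
    """Approximate the k largest eigenpairs of a symmetric matrix C.

    Uses power iteration plus deflation. Assumes the start vector is not
    orthogonal to the eigenvectors sought and that C has rank >= k.
    """
    d = len(C)
    C = [row[:] for row in C]  # work on a copy; deflation modifies it
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(d)]  # arbitrary non-symmetric start
        for _ in range(iterations):
            w = [dot(row, v) for row in C]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in C])  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: subtract the found component so the next pass finds the next one.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs

# Made-up toy data (NOT the Iris data): most variance lies along the first axis.
rows = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0],
        [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
        [0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]

C = covariance_matrix(rows)
(lam1, v1), (lam2, v2) = top_eigenpairs(C, 2)

# Step 4: inner product of each instance with each of the two eigenvectors.
w1 = [dot(r, v1) for r in rows]
w2 = [dot(r, v2) for r in rows]
print(lam1, lam2)
print(list(zip(w1, w2)))
```

The lists w1 and w2 play the same role as the vectors w1 and w2 in the plot command above.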
The next method you should test is called "Class-Preserving Projections" (CPP). Unlike PCA, this method uses the class labels to project. CPP projects onto the 2-dimensional plane that passes through the means of each of the classes. For the Iris dataset, there are three classes. Thus, CPP will project the data onto the 2-dimensional plane that is determined by the three class means. To do this, follow these steps:
(10 points)
1. Compute the means of the three classes (you can use the code you wrote for the previous question).
2. Find a basis (b1, b2) for this plane.
3. Transform this basis into an orthonormal basis (see definition above).
4. Project the iris data onto each of the two orthonormal basis vectors and plot.
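One possible way to carry out steps 2 through 4 is sketched below in standard-library Python. It assumes the basis is formed from differences of the class means (b1 = m2 - m1, b2 = m3 - m1), which is only one of several valid choices, and uses Gram-Schmidt to orthonormalize it. It also measures coordinates relative to m1; projecting the raw instances instead only shifts every point by the same constant. The means and the sample instance are made-up values, not Iris statistics, and the function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sub(x, y):
    return [a - b for a, b in zip(x, y)]

def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def orthonormal_plane_basis(m1, m2, m3):
    """Orthonormal basis (q1, q2) for the plane through the three means.

    Assumes the means are not collinear; otherwise the residual below
    is the zero vector and cannot be normalized.
    """
    b1 = sub(m2, m1)
    b2 = sub(m3, m1)
    q1 = normalize(b1)
    # Gram-Schmidt: remove from b2 its component along q1, then rescale.
    along = dot(b2, q1)
    q2 = normalize([b - along * q for b, q in zip(b2, q1)])
    return q1, q2

def project(rows, origin, q1, q2):
    # 2-D coordinates of each instance, measured from `origin` within the plane.
    return [(dot(sub(r, origin), q1), dot(sub(r, origin), q2)) for r in rows]

# Made-up class means (NOT Iris values), four attributes each.
m1 = [0.0, 0.0, 0.0, 0.0]
m2 = [2.0, 0.0, 0.0, 0.0]
m3 = [1.0, 3.0, 0.0, 0.0]

q1, q2 = orthonormal_plane_basis(m1, m2, m3)
points = project([m1, m2, m3, [1.0, 1.0, 5.0, 5.0]], m1, q1, q2)
print(q1, q2)
print(points)
```

Because the plane contains all three means, the projected means keep their pairwise distances exactly, while instances off the plane, such as the last point here, lose the component orthogonal to it.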