The goal of this assignment is exposure to GPU programming. You will solve the same problem you solved in Lab 1, k-means, using CUDA. Recall from Lab 1 the following background about k-means:
K-Means is a machine-learning algorithm most commonly used for unsupervised learning. Suppose you have a data set where each data point has a set of features, but you don't have labels for them, so training a classifier to bin the data into classes cannot rely on supervised algorithms (e.g. Support Vector Machines, which learn hypothesis functions to predict labels given features).
One of the most straightforward things we can do with unlabeled data is to look for groups of data in our dataset which are similar: clusters. K-Means is a "clustering" algorithm. K-Means stores k centroids that define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid. K-Means finds these centroids by alternately assigning data points to clusters based on a current version of the centroids, and then re-computing the centroids based on the current assignment of data points to clusters. The behavior of the algorithm can be visualized as follows:
[Figure: (1) initial input; (2) choose three random centers; (3) map each point to its nearest centroid; (4) each new centroid is the mean of all points mapping to it, and iterations move the centroids toward their final positions; (5) centroids stop moving, and each point is labeled with its nearest centroid.]
In the clustering problem, we are given a training set x(1), ..., x(m) and want to group the data into cohesive "clusters." We are given feature vectors for each data point x(i), encoded as floating-point vectors in D-dimensional space, but we have no labels y(i). Our goal is to predict k centroids and a label c(i) for each data point.
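For reference, the standard objective that k-means minimizes (not stated explicitly in this handout, but implied by the procedure above) is the within-cluster sum of squared distances between each point and the centroid it is assigned to:

$$ J(c, \mu) = \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2} $$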
Here is some pseudo-code implementing k-means:
kmeans(dataSet, k) {
    // initialize centroids randomly
    numFeatures = dataSet.getNumFeatures();
    centroids = randomCentroids(numFeatures, k);

    // book-keeping
    iterations = 0;
    oldCentroids = null;

    // core algorithm
    while (!done) {
        oldCentroids = centroids;
        iterations++;

        // labels is a mapping from each point in the dataset
        // to the nearest (euclidean distance) centroid
        labels = findNearestCentroids(dataSet, centroids);

        // the new centroids are the average of all the points
        // that map to each centroid
        centroids = averageLabeledCentroids(dataSet, labels, k);

        done = iterations > MAX_ITERS || converged(centroids, oldCentroids);
    }
    return centroids;
}
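On the GPU, the findNearestCentroids step maps naturally onto one thread per data point. The kernel below is only a sketch of one way to do it; the row-major layout and the parameter names are assumptions, not a required structure for your solution:

#include <cfloat>   // FLT_MAX

// One thread computes the label for one point.
// points is n x dims and centroids is k x dims, both row-major.
__global__ void findNearestCentroids(const float *points, const float *centroids,
                                     int *labels, int n, int k, int dims) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dims; ++d) {
            float diff = points[i * dims + d] - centroids[c * dims + d];
            dist += diff * diff;   // squared Euclidean distance suffices for the argmin
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = c;
        }
    }
    labels[i] = best;
}

A launch such as findNearestCentroids<<<(n + 255) / 256, 256>>>(d_points, d_centroids, d_labels, n, k, dims) covers all n points with one thread each (d_points and friends being device pointers you would allocate with cudaMalloc).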
From lab 2, you should already have a single-threaded CPU-based program that accepts command-line parameters to specify the following:
--clusters: integer-valued k -- the number of clusters to find
--threshold: floating-point-valued threshold for convergence
--iterations: integer-valued maximum number of iterations to run k-means
--workers: integer-valued number of threads
--input: string-valued path to an input file

Your program's output should include:
The cudaEvent_t API is key to doing this accurately. A good overview of how to use it to measure performance can be found here.
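As a rough illustration (the kernel name and launch parameters below are placeholders), the usual pattern is to bracket the work with two events, synchronize on the second, and query the elapsed time:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// launch the work to be timed, e.g. the kernels for one k-means iteration
findNearestCentroids<<<numBlocks, threadsPerBlock>>>(d_points, d_centroids, d_labels, n, k, dims);
cudaEventRecord(stop);

cudaEventSynchronize(stop);                      // wait until the stop event has completed
float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);   // milliseconds between the two events

cudaEventDestroy(start);
cudaEventDestroy(stop);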
Using the random-n2048-d16-c16.txt, random-n16384-d24-c16.txt, and random-n65536-d32-c16.txt sample inputs, --iterations 20, and --threshold 0.0000001, use your implementation to find 16 centroids.
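For example, an invocation for the largest input might look like the following (the executable name kmeans is just a placeholder; use whatever your Makefile produces):

./kmeans --clusters 16 --iterations 20 --threshold 0.0000001 --input random-n65536-d32-c16.txt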
Create a graph of scalability against input size for your solution using the sample inputs above. Please normalize your measurements against the single-threaded solution from Step 1.
Use CUDA shared memory to implement private partial aggregations per thread group.
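One common way to structure this (a sketch under assumptions, not a prescribed design; the accumulator layout and names are hypothetical) is to let each block build private partial sums and counts in shared memory and then flush them to global accumulators with a handful of atomics per block:

// Each block accumulates private partial sums/counts in shared memory,
// then adds them to the global accumulators with atomics.
// sums is k x dims and counts has length k; both are zeroed before launch.
__global__ void accumulateClusters(const float *points, const int *labels,
                                   float *sums, int *counts, int n, int k, int dims) {
    extern __shared__ float shmem[];            // (k * dims + k) floats, sized at launch
    float *localSums = shmem;                   // k * dims partial sums
    float *localCounts = shmem + k * dims;      // k partial counts

    // cooperatively zero the block-private accumulators
    for (int j = threadIdx.x; j < k * dims + k; j += blockDim.x)
        shmem[j] = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = labels[i];
        for (int d = 0; d < dims; ++d)
            atomicAdd(&localSums[c * dims + d], points[i * dims + d]);
        atomicAdd(&localCounts[c], 1.0f);
    }
    __syncthreads();

    // one global atomic per (cluster, dimension) per block instead of per point
    for (int j = threadIdx.x; j < k * dims; j += blockDim.x)
        atomicAdd(&sums[j], localSums[j]);
    for (int c = threadIdx.x; c < k; c += blockDim.x)
        atomicAdd(&counts[c], (int)localCounts[c]);
}

The kernel would be launched with the dynamic shared-memory size set to (k * dims + k) * sizeof(float); dividing each sum by its count on the host (or in a small follow-up kernel) yields the new centroids.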
Create a similar graph of scalability versus input size for your optimized solution. In this case, include bars for the fastest multi-threaded solution you measured in lab 2, labeled to indicate the number of threads and the combination of locking primitives used to implement it.
You should use Canvas to submit, along with your code, Makefiles, and measurement scripts, a brief writeup with the scalability graphs requested above. Since the goal is to compare the performance of different implementations, it is fine to include all measurements on the same graph, as long as they are well-labeled. Either way, be sure that your writeup includes sufficient text to enable us to understand which graphs are which.
Your writeup should additionally answer the following questions. In cases where we ask you to explain performance behavior, it is fine to speculate, but be clear about whether your observations are empirical or speculative.
Handy links: