The goal of this assignment is to familiarize yourself with basic shared-memory synchronization concepts and primitives, and to get some experience predicting, measuring, and understanding the performance of concurrent programs. In this lab you will write a program that parallelizes a seemingly basic task: incrementing counters. The task is algorithmically quite simple, and the synchronization required to preserve correctness is not intended to be a major challenge. In contrast, understanding (and working around) the performance subtleties introduced by practical matters such as code structure, concurrency management primitives, and the hardware itself can be non-trivial. At the end of this lab you should have some familiarity with concurrency primitives, and some awareness of performance considerations that will come up repeatedly in later sections of the course.
Specifically, your task is to write a program in which the main thread creates (forks) a parameterizable number of worker threads, and waits for them all to complete (join). The core task of each worker thread is to execute something like the pseudo-code below, where the counter variable is global and shared across all threads, while my_increment_count is local and tracks the number of times each individual worker increments the variable:
worker_thread() {
    int my_increment_count = 0;
    while(counter < MAX_COUNTER) {
        counter++;
        my_increment_count++;
    }
}
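For reference, a minimal sketch of the fork/join structure with pthreads might look like the following. This is illustrative only: NUM_WORKERS is a placeholder we've assumed for the parsed number of workers, and the worker body is elided.

#include <pthread.h>

#define NUM_WORKERS 4   /* placeholder: in your program this comes from the command line */

void* worker_thread(void* arg) {
    /* per-worker increment loop goes here */
    return nullptr;
}

int main() {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], nullptr, worker_thread, nullptr);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], nullptr);   /* wait for each worker to finish */
    return 0;
}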
We will assume the following correctness property: the sum of all increment operations performed by each worker must equal the value of the counter itself (no lost updates). It may not surprise you to hear that some synchronization is required to preserve that condition. In this lab, we'll learn how to use thread and locking APIs to do this, look at some different ways of preserving correctness, measure their performance, and do some thinking about those measurements.
We recommend you do this lab using C/C++ and pthreads. However, it is not a hard requirement--if you wish to use another language, it is fine as long as it supports or provides access to thread management and locking APIs similar to those exported by pthreads: you'll need support for creating and waiting for threads, creating/destroying and using locks, and for using hardware-supported atomic instructions. Using language-level synchronization support (e.g. synchronized or atomic keywords) is not acceptable--it can obviously preserve correctness, but it sidesteps the point of the lab. You can meet these requirements in almost any language, but it is worth talking to me or the TA to be sure if you're going to use something other than C/C++.
Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. Spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile time investment since it will come up over and over in this lab and for the rest of the course.
In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:
--maxcounter: integer-valued target value for the shared counter
--workers: integer-valued number of threads
It is not critical that you actually parse "--maxcounter" and "--workers"; it's fine if you want to write your program to be invoked, for example, as:
myprogram 10000 4
where the positions of "10000" and "4" by convention mean maxcounter and workers, respectively. However, there are many tools to make command line parsing easy and it is worthwhile to learn to do it well, so some additional tips and pointers about it are in the hints section of this document.
Your program will fork "workers" worker threads, which will collectively increment the shared counter; each worker will save the number of increment operations it performs somewhere accessible to the main thread (e.g. a global array of per-worker counts) before returning. The main thread will wait on all the workers and report the final value of the counter along with the sum and individual values of the local counters before exiting. Note that you expressly will NOT actually attempt to synchronize the counter variable yet! The results may not preserve the stated correctness condition.
For your writeup, perform the following (please answer all the questions but keep in mind many of them represent food for thought and opportunities to speculate--some of the questions may not have definitive or easily-obtained answers):
Using the time utility, time and graph the runtime of your program with a fixed maxcounter as you vary the number of worker threads, and examine how evenly the increments are distributed across workers (load imbalance).
Next, we'll add some synchronization to the counter increment to ensure that we have no lost updates. If you're doing this lab with pthreads, this should involve using pthread mutexes and spinlocks, along with the calls to initialize and destroy them. In each case, verify that you no longer have lost updates before proceeding. For your writeup, include the following experiments and graphs. If you're comfortable doing so, you can merge the data from different experiments into a single graph--but it is fine to graph things separately.
Next, implement a version of the increment using atomic_compare_exchange_strong; you'll need to slightly restructure things relative to the pseudo-code above. In particular, since compare-and-exchange operations can fail, the need to use an explicit lock goes away, but your logic to perform the actual increment will have to handle the failure cases of CAS.
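For example, a CAS-based increment loop might look roughly like the sketch below, using C++'s <atomic> (the C11 <stdatomic.h> equivalent is nearly identical); MAX_COUNTER is again a placeholder for the parsed target value.

#include <atomic>

std::atomic<int> counter(0);    /* the shared counter is now an atomic */

void* worker_thread(void* arg) {
    int my_increment_count = 0;
    int expected = counter.load();
    while (expected < MAX_COUNTER) {
        /* try to swing counter from expected to expected+1 */
        if (std::atomic_compare_exchange_strong(&counter, &expected, expected + 1)) {
            my_increment_count++;
            expected = counter.load();   /* succeeded: start the next attempt from a fresh value */
        }
        /* on failure, expected has been updated to counter's current value, so we simply retry */
    }
    return nullptr;
}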
Repeat the scalability and load imbalance experiments above.
Do scalability, absolute performance, or load imbalance change? Why/why not?
In this step, we will begin addressing load imbalance.
There are many potential sources of load imbalance, and we'll start by
considering that our program currently has no way to control how threads
are distributed across processors. If you're using pthreads on Linux, start by reading the
documentation for pthread_setaffinity_np and get_nprocs_conf.
If you are using other languages or platforms, rest assured similar APIs
exist and can be easily found. You will use these functions to try to control the
distribution of threads across the physical cores on your machine.
You can choose to use one of the mutex, spinlock, or atomic versions above,
or better yet, compare them all. The differences are quite dramatic.
Use pthread_setaffinity_np to pin your worker threads in this way, and repeat the scalability and
load-balance experiments from the previous section. What changes and why?
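A sketch of the pinning logic under those assumptions (glibc/Linux, error handling omitted, and pin_to_core is our own illustrative helper):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* pthread_setaffinity_np is a GNU extension */
#endif
#include <pthread.h>
#include <sched.h>
#include <sys/sysinfo.h>

/* Pin the given thread to core (worker_index mod number-of-cores). */
void pin_to_core(pthread_t thread, int worker_index) {
    int ncores = get_nprocs_conf();       /* number of processors configured on this machine */
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(worker_index % ncores, &cpuset);
    pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
}

You can call something like this from the main thread right after pthread_create, or have each worker pin itself using pthread_self().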
The most fundamental reason performance generally degrades with
increasing parallelism for a shared counter is that every access
involves an update, which on modern architectures with deep cache
hierarchies causes the cache line containing the shared counter to
bounce around from cache to cache, dramatically decreasing the benefit
of caching. However, if most threads don't actually change
the counter value (i.e. the data are predominantly read-shared
rather than write-shared), cache coherence traffic can be significantly
reduced, enabling the additional parallelism to yield performance
benefits. In this section, we will change the worker function
such that a parameterizable fraction of its operations are reads and
the remaining ones increment the counter. To do this, modify your program
to accept an additional command line parameter specifying the fraction
of operations that should be reads (or writes), and change the worker's
thread proc to use this parameter along with rand()
to conditionally make an update. Since this introduces some non-determinism
into the program, and we prefer to preserve the goal of splitting fixed
work across different numbers of workers, the termination condition must
change as well, such that each worker performs a fixed fraction of the
number of operations specified by the --maxcounter parameter.
Consequently, the final value of the counter will no longer be
a fixed target, but the correctness condition (no lost updates)
remains the same. A revised version of the pseudo-code above is below:
/* PSEUDO-CODE--don't just copy/paste! dWriteProbability is type double in the
   range [0..1] specifying the fraction of operations that are updates. */
bool operation_is_a_write() {
    if((static_cast<double>(rand()) / static_cast<double>(RAND_MAX)) < dWriteProbability)
        return true;
    return false;
}

worker_thread() {
    int my_increment_count = 0;
    int my_operations = 0;
    int my_operation_count = maxcounter/num_workers;
    while(my_operations < my_operation_count) {
        int curval = counter;          /* the read operation */
        if(operation_is_a_write()) {
            counter++;
            my_increment_count++;
        }
        my_operations++;
    }
}
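One practical caveat (our observation, not a requirement of the lab): in glibc, rand() guards its hidden state with an internal lock, so calling it heavily from many threads can itself become a serialization point and muddy your measurements. A sketch of a per-thread alternative using POSIX rand_r, assuming the same dWriteProbability global as above:

#include <cstdlib>

extern double dWriteProbability;   /* the write fraction from the pseudo-code above */

/* Each worker keeps its own seed (e.g. derived from its worker index), so
   there is no contention on rand()'s shared internal state. */
bool operation_is_a_write(unsigned int* seed) {
    double r = static_cast<double>(rand_r(seed)) / static_cast<double>(RAND_MAX);
    return r < dWriteProbability;
}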
In the last section you observed that reducing the ratio of
writes to shared data can have a big impact on scalability.
In this section you will implement and use your own locking
primitive to take advantage of this by relaxing your mutual exclusion
condition to an invariant called 'single-writer-multiple-readers',
meaning the critical section can admit either a single writer OR
multiple readers. Consequently, multiple readers need
not be serialized, which should increase scalability. Based on the
low-level locking primitive of your choice from above (spinlocks,
mutexes, or atomics), you will implement a reader-writer locking API
that replaces simple lock()/unlock()
with the following:
read_lock(): waits until no writers are in the critical section and acquires a read lock. If other readers already hold the lock, no wait should be required. Calls to write_lock() must wait until no readers hold the lock.
write_lock(): acquires exclusive access to the critical section; it waits until no readers or writers hold the lock, and no readers or writers are admitted until the current writer releases the lock.
upgrade_lock(): allows the holder of a read lock to upgrade that lock to a write lock.
unlock(): releases the read or write lock held by the caller.
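By way of illustration only, here is one possible sketch of such a lock built on C++ atomics (your implementation can be based on whichever primitive you chose; the class name and state encoding are ours). The entire lock state lives in one integer: 0 means free, -1 means one writer, and n > 0 means n readers.

#include <atomic>

/* Spinning reader-writer lock sketch. No fairness or blocking, and
   upgrade_lock() assumes only one reader tries to upgrade at a time
   (two simultaneous upgraders would wait on each other forever--a case
   your real implementation should think about). */
class rwlock {
    std::atomic<int> state{0};    /* 0 = free, -1 = writer, n > 0 = n readers */
public:
    void read_lock() {
        while (true) {
            int cur = state.load();
            if (cur >= 0 && state.compare_exchange_weak(cur, cur + 1))
                return;                       /* joined the readers */
        }
    }
    void write_lock() {
        int expected = 0;
        while (!state.compare_exchange_weak(expected, -1))
            expected = 0;                     /* wait until the lock is completely free */
    }
    void upgrade_lock() {
        int expected = 1;                     /* caller already holds one read lock */
        while (!state.compare_exchange_weak(expected, -1))
            expected = 1;                     /* wait for the other readers to drain */
    }
    void unlock() {
        if (state.load() == -1) state.store(0);   /* releasing the write lock */
        else state.fetch_add(-1);                 /* releasing one read lock */
    }
};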
Similar to the previous section, we will change the worker function
such that a parameterizable fraction of its operations are reads and
the remaining ones increment the counter. To do this, use your program's
command-line parameter specifying the fraction of operations that should be
reads (or writes), along with rand(), to read from a shared array under a
read lock and conditionally make an update.
The termination and correctness conditions from the previous section remain in place: each worker performs a fixed fraction of the number of operations specified by the --maxcounter parameter, and there must be no lost updates.
A revised version of the pseudo-code above is below:
const int BUFSIZE=1024;
int sharedArray[BUFSIZE]; // init to zeros

int read_index() {
    return (static_cast<double>(rand()) / static_cast<double>(RAND_MAX))*(BUFSIZE-1);
}

bool operation_is_a_write() {
    if((static_cast<double>(rand()) / static_cast<double>(RAND_MAX)) < dWriteProbability)
        return true;
    return false;
}

worker_thread() {
    int my_increment_count = 0;
    int my_operations = 0;
    int my_operation_count = maxcounter/num_workers;
    while(my_operations < my_operation_count) {
        read_lock();
        int readval = sharedArray[read_index()];   /* the read operation */
        if(operation_is_a_write()) {
            upgrade_lock();
            counter++;
            my_increment_count++;
        }
        unlock();
        my_operations++;
    }
}
Turn in your solution using the Canvas turn-in tools.
Note that we will be checking code in your solutions for plagiarism using Moss.
Dealing with command-line options in C/C++ can be painful if you've
not had much experience with it before. For this lab, it is not
critical that you actually use the long form option flags (or any
flags at all). For example, if you want to just assume that the argument at position
1 is always maxcounter and position 2 is always workers (matching the invocation
example above), that is fine. However, getting used to dealing with command line options
in a principled way is something that will serve you well, and we
strongly encourage you to consider getopt or boost/program_options. They will take
a little extra time to learn now, but will save you a lot of time in the future.
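As a sketch of the getopt route (using getopt_long, with hypothetical short options 'm' and 'w'; adjust to taste):

#include <getopt.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    long maxcounter = 0, workers = 1;
    static struct option long_opts[] = {
        {"maxcounter", required_argument, nullptr, 'm'},
        {"workers",    required_argument, nullptr, 'w'},
        {nullptr, 0, nullptr, 0}
    };
    int c;
    while ((c = getopt_long(argc, argv, "m:w:", long_opts, nullptr)) != -1) {
        switch (c) {
        case 'm': maxcounter = atol(optarg); break;
        case 'w': workers = atol(optarg); break;
        default:
            fprintf(stderr, "usage: %s --maxcounter N --workers N\n", argv[0]);
            return 1;
        }
    }
    printf("maxcounter=%ld workers=%ld\n", maxcounter, workers);
    return 0;
}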
When you measure performance, be sure you've turned debugging flags off and enabled optimizations in the compiler. If you're using gcc and pthreads, it's simplest to just turn off "-g" and turn on "-O3". In fact, the "-g" option just controls debugging symbols, which do not fundamentally impact performance: for the curious, some deeper consideration of gcc, debug build configurations, and performance can be found here.
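For example, an optimized build with g++ might look like the following (the source and binary names are just placeholders):
g++ -O3 -pthread -o counterlab counterlab.cpp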
It is very much worth spending some time creating scripts to run your program
and tailoring it to generate output (such as CSV) that can easily be imported
into graphing programs like R. For projects of this scale, I often just collect
CSV output and import it into Excel. When engaged in longer term
empirical efforts, a greater level of automation is often highly
desirable, and I tend to prefer using bash scripts to collect
CSV, along with Rscript to automatically use R's read.csv and the ggplot2 package,
which has functions like ggsave to automatically save graphs as PDF files. The includegraphics
LaTeX macro used in the write up template works with PDF files too!
Please report how much time you spent on the lab.