The goal of this assignment is to familiarize yourself with basic shared-memory synchronization concepts and primitives, and to get some experience predicting, measuring, and understanding the performance of concurrent programs. In this lab you will write a program that parallelizes a seemingly basic task: incrementing counters. The task is algorithmically quite simple, and the synchronization required to preserve correctness is not intended to be a major challenge. In contrast, understanding (and working around) the performance subtleties introduced by practical matters such as code structure, concurrency management primitives, and the hardware itself can be non-trivial. At the end of this lab you should have some familiarity with concurrency primitives, and some awareness of performance considerations that will come up repeatedly in later sections of the course.
Specifically, your task is to write a program in which the main thread creates (forks) a parameterizable number of worker threads, and waits for them all to complete (join). The core task of each worker thread is to execute something like the pseudo-code below, where the counter variable is global and shared across all threads, while my_increment_count is local and tracks the number of times each individual worker increments the variable:
worker_thread() {
    int my_increment_count = 0;
    while(counter < MAX_COUNTER) {
        counter++;
        my_increment_count++;
    }
}
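For reference, a minimal sketch of the fork/join structure with pthreads might look like the following. This is illustrative only: NUM_WORKERS is a placeholder we've assumed for the parsed number of workers, and the worker body is elided.

#include <pthread.h>

#define NUM_WORKERS 4   /* placeholder: in your program this comes from the command line */

void* worker_thread(void* arg) {
    /* per-worker increment loop goes here */
    return nullptr;
}

int main() {
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], nullptr, worker_thread, nullptr);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], nullptr);   /* wait for each worker to finish */
    return 0;
}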
We will assume the following correctness property: the sum of all increment operations performed by each worker must equal the value of the counter itself (no lost updates). It may not surprise you to hear that some synchronization is required to preserve that condition. In this lab, we'll learn how to use thread and locking APIs to do this, look at some different ways of preserving correctness, measure their performance, and do some thinking about those measurements.
We recommend you do this lab using C/C++ and pthreads. However, it is not a hard requirement--if you wish to use another language, it is fine as long as it supports or provides access to thread management and locking APIs similar to those exported by pthreads: you'll need support for creating and waiting for threads, creating/destroying and using locks, and for using hardware-supported atomic instructions. Using language-level synchronization support (e.g. synchronized or atomic keywords) is not acceptable--it can obviously preserve correctness, but it sidesteps the point of the lab. You can meet these requirements in almost any language, but it is worth talking to me or the TA to be sure if you're going to use something other than C/C++.
Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. Spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile time investment since it will come up over and over in this lab and for the rest of the course.
In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:
--maxcounter: integer-valued target value for the shared counter
--workers: integer-valued number of threads
It is not critical that you actually parse "--maxcounter" and "--workers"; it's fine if you want to write your program to be invoked, for example, as:
myprogram 10000 4
where the positions of "10000" and "4" by convention mean maxcounter and workers, respectively. However, there are many tools to make command line parsing easy and it is worthwhile to learn to do it well, so some additional tips and pointers about it are in the hints section of this document.
Your program will fork "workers" worker threads, which will collectively increment the shared counter; each worker will save the number of increment operations it performs somewhere accessible to the main thread (e.g. a global array of per-worker counts) before returning. The main thread will wait on all the workers and report the final value of the counter along with the sum and individual values of the local counters before exiting. Note that you expressly will NOT actually attempt to synchronize the counter variable yet! The results may not preserve the stated correctness condition.
For your writeup, perform the following (please answer all the questions but keep in mind many of them represent food for thought and opportunities to speculate--some of the questions may not have definitive or easily-obtained answers):
Using the time utility, time and graph the runtime of your program with a fixed maxcounter as you vary the number of worker threads, and examine how evenly the increments are distributed across workers (load imbalance).
Next, we'll add some synchronization to the counter increment to ensure that we have no lost updates. If you're doing this lab with pthreads, this should involve using pthread mutexes and spinlocks, along with the calls to initialize and destroy them. In each case, verify that you no longer have lost updates before proceeding. For your writeup, include the following experiments and graphs. If you're comfortable doing so, you can merge the data from different experiments into a single graph--but it is fine to graph things separately.
Next, implement a version of the increment using atomic_compare_exchange_strong; you'll need to slightly restructure things relative to the pseudo-code above. In particular, since compare-and-exchange operations can fail, the need to use an explicit lock goes away, but your logic to perform the actual increment will have to handle the failure cases of CAS.
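For example, a CAS-based increment loop might look roughly like the sketch below, using C++'s <atomic> (the C11 <stdatomic.h> equivalent is nearly identical); MAX_COUNTER is again a placeholder for the parsed target value.

#include <atomic>

std::atomic<int> counter(0);    /* the shared counter is now an atomic */

void* worker_thread(void* arg) {
    int my_increment_count = 0;
    int expected = counter.load();
    while (expected < MAX_COUNTER) {
        /* try to swing counter from expected to expected+1 */
        if (std::atomic_compare_exchange_strong(&counter, &expected, expected + 1)) {
            my_increment_count++;
            expected = counter.load();   /* succeeded: start the next attempt from a fresh value */
        }
        /* on failure, expected has been updated to counter's current value, so we simply retry */
    }
    return nullptr;
}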
Repeat the scalability and load imbalance experiments above.
Do scalability, absolute performance, or load imbalance change? Why/why not?
In this step, we will begin addressing load imbalance.
There are many potential sources of load imbalance, and we'll start by
considering that our program currently has no way to control how threads
are distributed across processors. If you're using pthreads on Linux, start by reading the
documentation for pthread_setaffinity_np and get_nprocs_conf.
If you are using other languages or platforms, rest assured similar APIs
exist and can be easily found. You will use these functions to try to control the
distribution of threads across the physical cores on your machine.
You can choose to use one of the mutex, spinlock, or atomic versions above,
or better yet, compare them all. The differences are quite dramatic.
Use pthread_setaffinity_np to pin your worker threads in this way, and repeat the scalability and
load-balance experiments from the previous section. What changes and why?
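A sketch of the pinning logic under those assumptions (glibc/Linux, error handling omitted, and pin_to_core is our own illustrative helper):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* pthread_setaffinity_np is a GNU extension */
#endif
#include <pthread.h>
#include <sched.h>
#include <sys/sysinfo.h>

/* Pin the given thread to core (worker_index mod number-of-cores). */
void pin_to_core(pthread_t thread, int worker_index) {
    int ncores = get_nprocs_conf();       /* number of processors configured on this machine */
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(worker_index % ncores, &cpuset);
    pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
}

You can call something like this from the main thread right after pthread_create, or have each worker pin itself using pthread_self().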
The most fundamental reason performance generally degrades with
increasing parallelism for a shared counter is that every access
involves an update, which on modern architectures with deep cache
hierarchies causes the cache line containing the shared counter to
bounce around from cache to cache, dramatically decreasing the benefit
of caching. However, if most threads don't actually change
the counter value (i.e. the data are predominantly read-shared
rather than write-shared), cache coherence traffic can be significantly
reduced, enabling the additional parallelism to yield performance
benefits. In this section, we will change the worker function
such that a parameterizable fraction of its operations are reads and
the remaining ones increment the counter. To do this, modify your program
to accept an additional command line parameter specifying the fraction
of operations that should be reads (or writes), and change the worker's
thread proc to use this parameter along with rand()
to conditionally make an update. Since this introduces some non-determinism
into the program, and we prefer to preserve the goal of splitting fixed
work across different numbers of workers, the termination condition must
change as well, such that each worker performs a fixed fraction of the
number of operations specified by the --maxcounter parameter.
Consequently, the final value of the counter will no longer be
a fixed target, but the correctness condition (no lost updates)
remains the same. A revised version of the pseudo-code above is below:
/* PSEUDO-CODE--don't just copy/paste! dWriteProbability is type double in the
   range [0..1] specifying the fraction of operations that are updates. */
bool operation_is_a_write() {
    if((static_cast<double>(rand()) / static_cast<double>(RAND_MAX)) < dWriteProbability)
        return true;
    return false;
}

worker_thread() {
    int my_increment_count = 0;
    int my_operations = 0;
    int my_operation_count = maxcounter/num_workers;
    while(my_operations < my_operation_count) {
        int curval = counter;          /* the read operation */
        if(operation_is_a_write()) {
            counter++;
            my_increment_count++;
        }
        my_operations++;
    }
}
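One practical caveat (our observation, not a requirement of the lab): in glibc, rand() guards its hidden state with an internal lock, so calling it heavily from many threads can itself become a serialization point and muddy your measurements. A sketch of a per-thread alternative using POSIX rand_r, assuming the same dWriteProbability global as above:

#include <cstdlib>

extern double dWriteProbability;   /* the write fraction from the pseudo-code above */

/* Each worker keeps its own seed (e.g. derived from its worker index), so
   there is no contention on rand()'s shared internal state. */
bool operation_is_a_write(unsigned int* seed) {
    double r = static_cast<double>(rand_r(seed)) / static_cast<double>(RAND_MAX);
    return r < dWriteProbability;
}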
In the last section you observed that reducing the ratio of
writes to shared data can have a big impact on scalability.
In this section you will implement and use your own locking
primitive to take advantage of this by relaxing your mutual exclusion
condition to an invariant called 'single-writer-multiple-readers',
meaning the critical section can admit either a single writer OR
multiple readers. Consequently, multiple readers need
not be serialized, which should increase scalability. Based on the
low-level locking primitive of your choice from above (spinlocks,
mutexes, or atomics), you will implement a reader-writer locking API
that replaces simple lock()/unlock()
with the following:
read_lock(): waits until no writers are in the critical section and acquires a read lock. If other readers already hold the lock, no wait should be required. Calls to write_lock() must wait until no readers hold the lock.
write_lock(): acquires exclusive access to the critical section; it waits until no readers or writers hold the lock, and no readers or writers are admitted until the current writer releases the lock.
upgrade_lock(): allows the holder of a read lock to upgrade that lock to a write lock.
unlock(): releases the read or write lock held by the caller.
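By way of illustration only, here is one possible sketch of such a lock built on C++ atomics (your implementation can be based on whichever primitive you chose; the class name and state encoding are ours). The entire lock state lives in one integer: 0 means free, -1 means one writer, and n > 0 means n readers.

#include <atomic>

/* Spinning reader-writer lock sketch. No fairness or blocking, and
   upgrade_lock() assumes only one reader tries to upgrade at a time
   (two simultaneous upgraders would wait on each other forever--a case
   your real implementation should think about). */
class rwlock {
    std::atomic<int> state{0};    /* 0 = free, -1 = writer, n > 0 = n readers */
public:
    void read_lock() {
        while (true) {
            int cur = state.load();
            if (cur >= 0 && state.compare_exchange_weak(cur, cur + 1))
                return;                       /* joined the readers */
        }
    }
    void write_lock() {
        int expected = 0;
        while (!state.compare_exchange_weak(expected, -1))
            expected = 0;                     /* wait until the lock is completely free */
    }
    void upgrade_lock() {
        int expected = 1;                     /* caller already holds one read lock */
        while (!state.compare_exchange_weak(expected, -1))
            expected = 1;                     /* wait for the other readers to drain */
    }
    void unlock() {
        if (state.load() == -1) state.store(0);   /* releasing the write lock */
        else state.fetch_add(-1);                 /* releasing one read lock */
    }
};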
Similar to the previous section, we will change the worker function
such that a parameterizable fraction of its operations are reads and
the remaining ones increment the counter. To do this, use your program's
command-line parameter specifying the fraction of operations that should be
reads (or writes), along with rand(), to read from a shared array under a
read lock and conditionally make an update.
The termination and correctness conditions from the previous section remain in place: each worker performs a fixed fraction of the number of operations specified by the --maxcounter parameter, and there must be no lost updates.
A revised version of the pseudo-code above is below:
const int BUFSIZE=1024;
int sharedArray[BUFSIZE]; // init to zeros

int read_index() {
    return (static_cast<double>(rand()) / static_cast<double>(RAND_MAX))*(BUFSIZE-1);
}

bool operation_is_a_write() {
    if((static_cast<double>(rand()) / static_cast<double>(RAND_MAX)) < dWriteProbability)
        return true;
    return false;
}

worker_thread() {
    int my_increment_count = 0;
    int my_operations = 0;
    int my_operation_count = maxcounter/num_workers;
    while(my_operations < my_operation_count) {
        read_lock();
        int readval = sharedArray[read_index()];   /* the read operation */
        if(operation_is_a_write()) {
            upgrade_lock();
            counter++;
            my_increment_count++;
        }
        unlock();
        my_operations++;
    }
}
Turn in your solution using the Canvas turn-in tools.
Note that we will be checking code in your solutions for plagiarism using Moss.
Dealing with command-line options in C/C++ can be painful if you've
not had much experience with it before. For this lab, it is not
critical that you actually use the long form option flags (or any
flags at all). For example, if you want to just assume that the argument at position
1 is always maxcounter and position 2 is always workers (matching the invocation
example above), that is fine. However, getting used to dealing with command line options
in a principled way is something that will serve you well, and we
strongly encourage you to consider getopt or boost/program_options. They will take
a little extra time to learn now, but will save you a lot of time in the future.
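As a sketch of the getopt route (using getopt_long, with hypothetical short options 'm' and 'w'; adjust to taste):

#include <getopt.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    long maxcounter = 0, workers = 1;
    static struct option long_opts[] = {
        {"maxcounter", required_argument, nullptr, 'm'},
        {"workers",    required_argument, nullptr, 'w'},
        {nullptr, 0, nullptr, 0}
    };
    int c;
    while ((c = getopt_long(argc, argv, "m:w:", long_opts, nullptr)) != -1) {
        switch (c) {
        case 'm': maxcounter = atol(optarg); break;
        case 'w': workers = atol(optarg); break;
        default:
            fprintf(stderr, "usage: %s --maxcounter N --workers N\n", argv[0]);
            return 1;
        }
    }
    printf("maxcounter=%ld workers=%ld\n", maxcounter, workers);
    return 0;
}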
When you measure performance, be sure you've turned debugging flags off and enabled optimizations in the compiler. If you're using gcc and pthreads, it's simplest to just turn off "-g" and turn on "-O3". In fact, the "-g" option just controls debugging symbols, which do not fundamentally impact performance: for the curious, some deeper consideration of gcc, debug build configurations, and performance can be found here.
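For example, an optimized build with g++ might look like the following (the source and binary names are just placeholders):
g++ -O3 -pthread -o counterlab counterlab.cpp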
It is very much worth spending some time creating scripts to run your program
and tailoring it to generate output (such as CSV) that can easily be imported
into graphing programs like R. For projects of this scale, I often just collect
CSV output and import it into Excel. When engaged in longer term
empirical efforts, a greater level of automation is often highly
desirable, and I tend to prefer using bash scripts to collect
CSV, along with Rscript to automatically use R's read.csv and the ggplot2 package,
which has functions like ggsave to automatically save graphs as PDF files. The includegraphics
LaTeX macro used in the write up template works with PDF files too!
Please report how much time you spent on the lab.