The goal of this assignment is to familiarize yourself with basic shared-memory synchronization concepts and primitives, and to get some experience predicting, measuring, and understanding the performance of concurrent programs. In this lab you will write a program that parallelizes a seemingly basic task: incrementing counters. The task is algorithmically quite simple, and the synchronization required to preserve correctness is not intended to be a major challenge. In contrast, understanding (and working around) the performance subtleties introduced by practical matters such as code structure, concurrency management primitives, and the hardware itself can be non-trivial. At the end of this lab you should have some familiarity with concurrency primitives, and some awareness of performance considerations that will come up repeatedly in later sections of the course.
Specifically, your task is to write a program in which the main thread creates (forks) a parameterizable number of worker threads, and waits for them all to complete (join). The core task of each worker thread is to execute something like the pseudo-code below, where the counter variable is global and shared across all threads, while my_increment_count is local, and tracks the number of times each individual worker increments the variable:
worker_thread() {
    int my_increment_count = 0;
    while(counter < MAX_COUNTER) {
        counter++;
        my_increment_count++;
    }
}
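For concreteness, here is a minimal sketch of what the surrounding fork/join structure might look like with pthreads. The constants MAX_COUNTER and NUM_WORKERS stand in for the command-line parameters described later, and this version is deliberately left unsynchronized, so it can lose updates:

```cpp
#include <pthread.h>
#include <cstddef>

// Placeholder constants -- in the real lab these come from the command line.
enum { MAX_COUNTER = 100000, NUM_WORKERS = 4 };

static volatile long counter = 0;            // shared, deliberately unsynchronized
static long increment_counts[NUM_WORKERS];   // per-worker tallies, read by main after join

static void *worker_thread(void *arg) {
    long id = (long)arg;
    long my_increment_count = 0;
    while (counter < MAX_COUNTER) {
        counter++;                 // data race: increments can be lost here
        my_increment_count++;
    }
    increment_counts[id] = my_increment_count;
    return NULL;
}

// Fork all workers, join them all, and return the sum of their local counts.
long run_unsynchronized(void) {
    pthread_t tids[NUM_WORKERS];
    counter = 0;
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tids[i], NULL, worker_thread, (void *)i);
    long total = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += increment_counts[i];
    }
    return total;
}
```

Because updates can be lost, the sum of the local counts can exceed the final counter value, which is exactly the violation the correctness condition below rules out.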
We will assume the following correctness property: the sum of all increment operations performed by each worker must equal the value of the counter itself (no lost updates). It may not surprise you to hear that some synchronization is required to preserve that condition. In this lab, we'll learn how to use thread and locking APIs to do this, look at some different ways of preserving correctness, measure their performance, and do some thinking about those measurements.
You should do this lab using C/C++ and pthreads (or std::thread). Using language-level synchronization support (e.g. synchronized or atomic keywords) is not acceptable: it can obviously preserve correctness, but it sidesteps the point of the lab. While most languages provide sufficient support to do this lab well, we insist on C/C++ because some subsequent labs also use C/C++, so increasing your fluency with it will help later.
Deliverables will be detailed below, but the focus is on a writeup that provides performance measurements as graphs, and answers (perhaps speculatively) a number of questions. Spending some time setting yourself up to quickly and easily collect and visualize performance data is a worthwhile time investment since it will come up over and over in this lab and for the rest of the course.
In step 1 of the lab, you will write a program that accepts command-line parameters to specify the following:
--maxcounter: integer-valued target value for the shared counter
--workers: integer-valued number of threads
If you write code to actually parse command line options like "--maxcounter" and "--workers" yourself, you are doing things the hard way. It will work, but your time is better invested in learning to use library support/tools that already make this easy. My personal favorites are getopt and boost::program_options. However, there are many tools that make command line parsing easy, and it is worthwhile to learn to do it well, so some additional tips and pointers about it are in the hints section of this document.
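As one illustration, a getopt_long-based parser for the two options above might look like the sketch below (the variable names and error handling are just one possible choice):

```cpp
#include <getopt.h>
#include <cstdio>
#include <cstdlib>

static long max_counter = 0;
static int  num_workers = 0;

// Parse --maxcounter and --workers (short forms -m and -w also accepted).
void parse_args(int argc, char **argv) {
    static struct option long_opts[] = {
        { "maxcounter", required_argument, NULL, 'm' },
        { "workers",    required_argument, NULL, 'w' },
        { NULL, 0, NULL, 0 }
    };
    optind = 1;  // reset getopt state so this can be called more than once
    int c;
    while ((c = getopt_long(argc, argv, "m:w:", long_opts, NULL)) != -1) {
        switch (c) {
        case 'm': max_counter = atol(optarg); break;
        case 'w': num_workers = atoi(optarg); break;
        default:
            fprintf(stderr, "usage: %s --maxcounter N --workers N\n", argv[0]);
            exit(2);
        }
    }
}
```

boost::program_options gives you the same result with automatic help text and type checking, at the cost of a library dependency.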
Your program will fork "workers" number of worker threads, which will collectively increment the shared counter; each worker will save the number of increment operations it performs somewhere accessible to the main thread (e.g. a global array of per-worker counts) before returning. The main thread will wait on all the workers and report the final value of the counter and the sum and values of the local counters before exiting. Note that you expressly will NOT actually attempt to synchronize the counter variable yet! The results may not preserve the stated correctness conditions.
For your writeup, perform the following (please answer all the questions but keep in mind many of them represent food for thought and opportunities to speculate--some of the questions may not have definitive or easily-obtained answers):
Using the time utility, time and graph the runtime of your program for a fixed maxcounter while varying the number of workers. Then privatize the counter (e.g. by using the __thread attribute on the counter) and repeat the measurements (including similar graphs in your report) from step one using this privatized version of the counter. What does it tell you about the primary source of scalability loss in your implementation?
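A sketch of what the privatization might look like, using the GCC __thread attribute (the constants are again placeholders for the real command-line parameters; note each worker now counts all the way to the target entirely on its own):

```cpp
#include <pthread.h>
#include <cstddef>

// Placeholder constants standing in for the command-line parameters.
enum { PRIV_MAX = 100000, PRIV_WORKERS = 4 };

// __thread (a GCC/glibc extension; C++11 thread_local is equivalent) gives
// every worker its own zero-initialized copy, so there is no sharing at all.
static __thread long private_counter;
static long priv_counts[PRIV_WORKERS];

static void *private_worker(void *arg) {
    long id = (long)arg;
    long my_increment_count = 0;
    while (private_counter < PRIV_MAX) {
        private_counter++;       // no other thread can see this variable
        my_increment_count++;
    }
    priv_counts[id] = my_increment_count;
    return NULL;
}

long run_privatized(void) {
    pthread_t tids[PRIV_WORKERS];
    for (long i = 0; i < PRIV_WORKERS; i++)
        pthread_create(&tids[i], NULL, private_worker, (void *)i);
    long total = 0;
    for (int i = 0; i < PRIV_WORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += priv_counts[i];
    }
    return total;
}
```

With no sharing, each worker performs exactly PRIV_MAX increments, so the total work is deterministic (and this version does not satisfy the original fixed-target semantics; it exists only to isolate the cost of sharing).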
Next, we'll add some synchronization to the counter increment to ensure that we have no lost updates. If you're doing this lab with pthreads, this should involve using pthread mutexes and spinlocks, and calls to initialize and destroy them. In each case, verify that you no longer have lost updates before proceeding. For your write-up, include the following experiments and graphs. If you're comfortable doing so, you can merge the data from different experiments into a single graph--but it is fine to graph things separately.
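A mutex-protected version of the worker might look like the sketch below. Note that the bound check and the increment must happen together under the lock; checking outside the lock reintroduces a race. A spinlock version has the same shape, with pthread_spinlock_t, pthread_spin_init, pthread_spin_lock, and pthread_spin_unlock swapped in:

```cpp
#include <pthread.h>
#include <cstddef>

// Placeholder constants standing in for the command-line parameters.
enum { SYNC_MAX = 100000, SYNC_WORKERS = 4 };

static long sync_counter = 0;
static long sync_counts[SYNC_WORKERS];
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *locked_worker(void *arg) {
    long id = (long)arg;
    long my_increment_count = 0;
    for (;;) {
        pthread_mutex_lock(&counter_lock);
        if (sync_counter >= SYNC_MAX) {      // check and increment as one atomic step
            pthread_mutex_unlock(&counter_lock);
            break;
        }
        sync_counter++;
        my_increment_count++;
        pthread_mutex_unlock(&counter_lock);
    }
    sync_counts[id] = my_increment_count;
    return NULL;
}

long run_locked(void) {
    pthread_t tids[SYNC_WORKERS];
    sync_counter = 0;
    for (long i = 0; i < SYNC_WORKERS; i++)
        pthread_create(&tids[i], NULL, locked_worker, (void *)i);
    long total = 0;
    for (int i = 0; i < SYNC_WORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += sync_counts[i];
    }
    return total;
}
```

With the lock in place, the sum of the per-worker counts must exactly equal the final counter value, which is an easy thing to assert in your program as a correctness check.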
Next, implement a lock-free version of the increment using atomic_compare_exchange_strong; you'll need to slightly restructure things relative to the pseudo-code above. In particular, since compare-and-exchange operations can fail, the need to use an explicit lock goes away, but your logic to perform the actual increment will have to handle the failure cases for CAS. Repeat the scalability and load imbalance experiments above. Do scalability, absolute performance, or load imbalance change? Why/why not?
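One possible shape for the CAS-based worker, using std::atomic's compare_exchange_strong (the member-function spelling of the atomic_compare_exchange_strong operation named above):

```cpp
#include <pthread.h>
#include <atomic>
#include <cstddef>

// Placeholder constants standing in for the command-line parameters.
enum { CAS_MAX = 100000, CAS_WORKERS = 4 };

static std::atomic<long> cas_counter(0);
static long cas_counts[CAS_WORKERS];

static void *cas_worker(void *arg) {
    long id = (long)arg;
    long my_increment_count = 0;
    for (;;) {
        long cur = cas_counter.load();
        if (cur >= CAS_MAX)
            break;
        // The exchange succeeds only if no other thread changed the counter
        // between our load and this call; on failure we simply retry.
        if (cas_counter.compare_exchange_strong(cur, cur + 1))
            my_increment_count++;
    }
    cas_counts[id] = my_increment_count;
    return NULL;
}

long run_cas(void) {
    pthread_t tids[CAS_WORKERS];
    for (long i = 0; i < CAS_WORKERS; i++)
        pthread_create(&tids[i], NULL, cas_worker, (void *)i);
    long total = 0;
    for (int i = 0; i < CAS_WORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += cas_counts[i];
    }
    return total;
}
```

Each successful CAS bumps the counter by exactly one from a value below the target, so lost updates are impossible even though no lock is held.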
In this step, we will take some steps toward addressing load imbalance. There are many potential sources of load imbalance, and we'll start by considering that our program currently has no way to control how threads are distributed across processors. If you're using pthreads on Linux, start by reading the documentation for pthread_setaffinity_np and get_nprocs_conf. If you are using other languages or platforms, rest assured similar APIs exist and can be easily found. You will use these functions to try to control the distribution of threads across the physical cores on your machine. You can choose to use one of the mutex, spinlock, or atomic versions above, or better yet, compare them all: the differences are quite dramatic. Use pthread_setaffinity_np to pin your worker threads in this way, and repeat the scalability and load-balance experiments from the previous section. What changes and why?
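A small helper for the pinning step might look like this (Linux/glibc-specific; the round-robin assignment over configured cores is just one reasonable policy):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for cpu_set_t and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>        // cpu_set_t, CPU_ZERO, CPU_SET
#include <sys/sysinfo.h>  // get_nprocs_conf

// Pin thread `tid` to a single core, chosen round-robin by worker id.
// Returns 0 on success, an errno value on failure.
int pin_to_core(pthread_t tid, int worker_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(worker_id % get_nprocs_conf(), &set);
    return pthread_setaffinity_np(tid, sizeof(cpu_set_t), &set);
}
```

A typical use is to call pin_to_core(tids[i], i) right after each pthread_create, so worker i always runs on the same core for the duration of the experiment.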
The most fundamental reason performance generally degrades with
increasing parallelism for a shared counter is that every access
involves an update, which on modern architectures with deep cache
hierarchies causes the cache line containing the shared counter to
bounce around from cache to cache, dramatically decreasing the benefit
of caching. However, if most threads don't actually change
the counter value (i.e. the data are predominantly read-shared
rather than write-shared), cache coherence traffic can be significantly
reduced, enabling the additional parallelism to yield performance
benefits. In this final section, we will change the worker function
such that a parameterizable fraction of its operations are reads and
the remaining ones increment the counter. To do this, modify your program
to accept an additional command line parameter specifying the fraction
of operations that should be reads (or writes), and change the worker's
thread proc to use this parameter along with rand()
to conditionally make an update. Since this introduces some non-determinism
into the program, and we prefer to preserve the goal of splitting fixed
work across different numbers of workers, the termination condition must
change as well, such that each worker performs a fixed fraction of the
number of operations specified by the --maxcounter parameter.
Consequently, the final value of the counter will no longer be
a fixed target, but the correctness condition (no lost updates)
remains the same. A revised version of the pseudo-code above is below:
/* PSEUDO-CODE--don't just copy/paste! Note UNIX rand() is not thread-safe:
   we recommend you look at boost/random/mersenne_twister.hpp.
   dWriteProbability is type double in the range [0..1] specifying the
   fraction of operations that are updates. */
bool operation_is_a_write() {
    if((static_cast<double>(random_number()) / static_cast<double>(RANDOM_NUMBER_MAX)) < dWriteProbability)
        return true;
    return false;
}

worker_thread() {
    int my_increment_count = 0;
    int my_operations = 0;
    int my_operation_count = maxcounter/num_workers;
    while(my_operations < my_operation_count) {
        int curval = counter;   /* the read operation */
        if(operation_is_a_write()) {
            counter++;
            my_increment_count++;
        }
        my_operations++;
    }
}
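If you'd rather not pull in boost, one thread-safe alternative is a per-thread generator from C++'s <random> header. Here is a sketch of operation_is_a_write in that style; std::mt19937 stands in for the boost mersenne_twister suggested above, and the seeding choice is arbitrary:

```cpp
#include <random>

// Returns true with probability dWriteProbability (expected in [0,1]).
// One generator per thread avoids the shared hidden state that makes
// plain rand() unsafe to call concurrently.
bool operation_is_a_write(double dWriteProbability) {
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(gen) < dWriteProbability;
}
```

rand_r() with a per-thread seed is a lighter-weight (if lower-quality) option in plain C.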
Turn in your solution using the Canvas turn-in tools. Note that we will be checking code in your solutions for plagiarism using Canvas' tools as well as Moss.
Dealing with command-line options in C/C++ can be painful if you've not had much experience with it before. For this lab, it is not critical that you actually use the long form option flags (or any flags at all). However, getting used to dealing with command line options in a principled way is something that will serve you well, and we strongly encourage you to consider getopt or boost::program_options. They will take a little extra time to learn now, but will save you a lot of time in the future.
When you measure performance, be sure you've turned debugging flags off and enabled optimizations in the compiler. If you're using gcc and pthreads, it's simplest to just turn off "-g" and turn on "-O3". In fact, the "-g" option just controls debugging symbols, which do not fundamentally impact performance: for the curious, some deeper consideration of gcc, debug build configurations, and performance can be found here.
It is very much worth spending some time creating scripts to run your program and tailoring it to generate output (such as CSV) that can easily be imported into graphing programs like R. For projects of this scale, I often just collect CSV output and import it into Excel. When engaged in longer term empirical efforts, a greater level of automation is often highly desirable, and I tend to prefer using bash scripts to collect CSV, along with Rscript to automatically use R's read.csv and the ggplot2 package, which has functions like ggsave to automatically save graphs as PDF files. The includegraphics LaTeX macro used in the write-up template works with PDF files too!
Please report how much time you spent on the lab.