CS 377P: Programming for Performance
Assignment 5: Shared-memory parallel programming
Due date: April 9th, 2019, 11:59 PM
You can work independently or in groups of two.
Late submission policy: Submissions may be at most 2 days late, with a 10%
penalty for each day.
In this assignment, you will implement parallel programs to compute an approximation to pi using the numerical integration program discussed in class. You will implement several variations of this program to understand the factors that affect performance in shared-memory programs. Read the entire assignment before starting work, since you will be incrementally changing your code in each section of the assignment, and it will be useful to see the overall structure of what you are being asked to do.
Numerical integration to compute an estimate for pi:
- A sequential program for performing the numerical integration is available here. It is an adaptation of the code I showed you in class. The code includes some header files that you will need in the rest of the assignment. Read this code and run it. It prints the estimate for pi and the running time in nanoseconds. (A sketch of this style of integration loop follows this part's deliverables.)
What to turn in:
- Use your knowledge of basic calculus to explain briefly
why this code provides an estimate for pi.
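For reference, a midpoint-rule loop of the kind the provided code implements looks roughly like the sketch below; the step count and all names are illustrative placeholders, not taken from the provided file.

#include <cstdio>

// Illustrative sketch only: a midpoint-rule approximation of
// pi = integral from 0 to 1 of 4/(1+x*x) dx, similar in spirit to the
// provided sequential code. Timing code is omitted here.
int main() {
  const long NUM_STEPS = 100000000;
  const double step = 1.0 / (double)NUM_STEPS;
  double sum = 0.0;
  for (long i = 0; i < NUM_STEPS; i++) {
    double x = (i + 0.5) * step;        // midpoint of the i-th interval
    sum += 4.0 / (1.0 + x * x);         // height of the integrand there
  }
  double pi = step * sum;               // interval width times sum of heights
  printf("pi = %.15f\n", pi);
  return 0;
}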
- In this part of the assignment, you will see why atomic updates are needed. Modify the sequential code as follows to compute the estimate for pi in parallel using pthreads. Your code should create some number of threads and divide the responsibility for performing the numerical integration among these threads. You can use the round-robin assignment of points from the code I showed you in class. Whenever a thread computes a value, it should add it directly to the global variable pi without any synchronization. (A sketch of this unsynchronized version follows this part's deliverables.)
What to turn in:
- Find the running times (of only computing pi) for one,
two, four and eight threads and plot the running times and speedups you
observe. What value is computed by your code when it is run on 8 threads?
Why would you expect that this value is not an accurate estimate of pi?
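One possible shape for this part is sketched below, assuming a round-robin division of points; the thread count, step count, and all names are placeholders rather than required structure, and timing of the pi computation is omitted.

#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative sketch only: each thread takes points round-robin and adds
// its contribution directly to the shared variable pi with no
// synchronization, which is the data race this part asks you to observe.
const long NUM_STEPS = 100000000;
int num_threads = 4;
double step;
double pi = 0.0;                          // shared, updated without locking

void* worker(void* arg) {
  long tid = (long)arg;
  for (long i = tid; i < NUM_STEPS; i += num_threads) {  // round-robin points
    double x = (i + 0.5) * step;
    pi += step * 4.0 / (1.0 + x * x);     // unsynchronized read-modify-write
  }
  return NULL;
}

int main(int argc, char* argv[]) {
  if (argc > 1) num_threads = atoi(argv[1]);
  step = 1.0 / (double)NUM_STEPS;
  std::vector<pthread_t> threads(num_threads);
  for (long t = 0; t < num_threads; t++)
    pthread_create(&threads[t], NULL, worker, (void*)t);
  for (long t = 0; t < num_threads; t++)
    pthread_join(threads[t], NULL);
  printf("pi = %.15f\n", pi);             // timing code omitted in this sketch
  return 0;
}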
- In this part of the assignment, you will study the effect of true-sharing on performance. Modify the code from the previous part to use a pthread mutex to ensure that pi is updated atomically. (A sketch of the locked update follows this part's deliverables.)
What to turn in:
- Find the running times (of only computing pi) for one,
two, four and eight threads and plot the running times and speedups you
observe. What value of pi is computed by your code when it is run on
8 threads?
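A sketch of the change for this part is shown below, under the same assumed structure as the previous sketch; only the new mutex and the worker are shown, and the driver in main() is assumed unchanged.

#include <pthread.h>

// Illustrative fragment only (names are placeholders): the worker from the
// previous sketch, changed so that every update of the shared variable pi
// is protected by a pthread mutex.
long NUM_STEPS = 100000000;
int num_threads = 4;
double step = 1.0 / 100000000.0;
double pi = 0.0;
pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void* arg) {
  long tid = (long)arg;
  for (long i = tid; i < NUM_STEPS; i += num_threads) {
    double x = (i + 0.5) * step;
    double contrib = step * 4.0 / (1.0 + x * x);
    pthread_mutex_lock(&pi_lock);   // all threads serialize on this lock
    pi += contrib;                  // true-sharing: all threads write the same location
    pthread_mutex_unlock(&pi_lock);
  }
  return NULL;
}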
- You can avoid the mutex in the previous part by using atomic operations to add contributions from threads to the global variable pi. C++ provides a rich set of atomic operations for this purpose. Here is one way to use them in your numerical integration program. The code below creates an object pi that contains a field of type double on which atomic operations can be performed. This field is initialized to 0, and its value can be read using the method load(). The routine add_to_pi atomically adds the value passed to it to this field. You should read the definition of compare_exchange_weak to make sure you understand how it works: if the field still holds the value in current, it is replaced with current + bar and the call succeeds; otherwise the call fails and reloads the field's latest value into current. The while loop retries until the update succeeds. Use this approach to implement the numerical integration routine in a lock-free manner.
What to turn in:
- As before, find the running times (of only computing
pi) for one, two, four and eight threads and plot the running times and
speedups you observe. Do you see any improvements in running times
compared to the previous part in which you used mutexes? How about
speedups? Explain your answers briefly. What value of pi is computed
by your code when it is run on 8 threads?
#include <atomic>

std::atomic<double> pi{0.0};

// Atomically add bar to pi with a compare-and-swap retry loop.
void add_to_pi(double bar) {
  auto current = pi.load();
  // On failure, compare_exchange_weak stores the latest value of pi into
  // current, so the next iteration retries with the fresh value.
  while (!pi.compare_exchange_weak(current, current + bar))
    ;
}
- In this part of the assignment, you will study the effect of false-sharing on performance. Create a global array sum and have each thread t add its contribution directly into sum[t]. At the end, thread 0 can add the values in this array to produce the estimate for pi. (A sketch of this version follows this part's deliverables.)
What to turn in:
- Find the running times (of only computing pi) for one,
two, four and eight threads, and plot the running times and speedups you
observe. What value of pi is computed by your code when it is run on 8
threads?
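One way the per-thread array might be laid out is sketched below, again under the same assumed structure as the earlier sketches; only the new globals and the worker are shown, and the driver is assumed unchanged except that thread 0 (or main, after the workers finish) adds up the array to produce pi.

#include <pthread.h>

// Illustrative fragment only (names and the array size are placeholders):
// each thread accumulates into its own slot of a global array, so there is
// no data race, but adjacent slots sit on the same cache line, which is the
// false-sharing effect this part asks you to study.
const int MAX_THREADS = 8;
long NUM_STEPS = 100000000;
int num_threads = 4;
double step = 1.0 / 100000000.0;
double sum[MAX_THREADS];              // sum[t] is written only by thread t

void* worker(void* arg) {
  long tid = (long)arg;
  sum[tid] = 0.0;
  for (long i = tid; i < NUM_STEPS; i += num_threads) {
    double x = (i + 0.5) * step;
    sum[tid] += step * 4.0 / (1.0 + x * x);  // neighboring slots share a cache line
  }
  return NULL;
}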
- The code in the previous part used pthread_join to wait for the worker threads. Replace this with barriers and run your code again. (A sketch of a barrier-based version follows this part's deliverables.)
What to turn in:
- Find the running times
(of only computing pi) for one, two, four and eight threads, and plot the
running times and speedups you observe. What value of pi is computed by
your code when it is run on 8 threads?
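One possible barrier-based arrangement is sketched below, under the same assumptions as the earlier sketches; other arrangements are possible, as long as the waiting previously done with pthread_join is now done with a barrier.

#include <pthread.h>

// Illustrative fragment only (names are placeholders): every worker waits at
// a pthread barrier after finishing its loop, and thread 0 combines the
// partial sums once the barrier guarantees that all of them are complete.
// main() is assumed to call pthread_barrier_init(&barrier, NULL, num_threads)
// before creating the workers.
const int MAX_THREADS = 8;
long NUM_STEPS = 100000000;
int num_threads = 4;
double step = 1.0 / 100000000.0;
double sum[MAX_THREADS];
double pi = 0.0;
pthread_barrier_t barrier;

void* worker(void* arg) {
  long tid = (long)arg;
  sum[tid] = 0.0;
  for (long i = tid; i < NUM_STEPS; i += num_threads) {
    double x = (i + 0.5) * step;
    sum[tid] += step * 4.0 / (1.0 + x * x);
  }
  pthread_barrier_wait(&barrier);     // wait until every thread has finished its loop
  if (tid == 0) {                     // thread 0 combines the per-thread partial sums
    for (int t = 0; t < num_threads; t++)
      pi += sum[t];
  }
  return NULL;
}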
- Write a short summary of your results in the previous parts, using phrases like “atomic operations,” “true-sharing,” and “false-sharing” in your explanation.
Notes:
- When you compute speedup for numerical integration, the numerator should be the running time of the serial code I gave you, and the denominator should be the running time of the parallel code on however many threads you used (the formula is stated below). The speedup will be different for different numbers of threads. Note that the running time of the serial code will differ from the running time of your parallel code on one thread, because the parallel code pays synchronization overhead even when it runs on a single thread.
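Stated as a formula, where T_serial is the running time of the provided serial code and T_parallel(p) is the running time of your parallel code on p threads:

  speedup(p) = T_serial / T_parallel(p)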