CS380P: Parallel Systems

Lab #1: Prefix Sum and Barriers

The goals of this assignment are to:

Learn about parallel prefix sum algorithms
Compare their performance using different synchronization primitives
Gain experience implementing your own barrier primitive
Understand the performance tradeoffs made by your barrier implementation

In this assignment, you will implement a work-efficient parallel prefix sum algorithm using the pthreads library and a synchronization barrier you will write yourself. We encourage you to think carefully about the various corner cases that exist when implementing this algorithm, and challenge you to think about the most performant way to synchronize a parallel prefix sum depending on its input size.

Background

Parallel prefix sum has come up a number of times so far in the course, and will come up again. It is a fundamental building block for many parallel algorithms. There are good materials on it in Dr. Lin's textbook, and the Wikipedia article on prefix sum includes reasonable pseudocode. Note that you will be implementing a work-efficient parallel prefix sum.

Step 1: Sequential Implementation

The first step with any parallelization is to develop a sequential version to serve as a baseline for both performance and correctness. It is generally easier to get the algorithm right without parallelism, and you need a sequential version anyway to measure against, so you can tell whether your parallel implementation is actually faster. The command line parameters we would like your implementation to support are listed in the "Inputs/Outputs" section below. Your program should run the sequential version when the number of threads specified on the command line is 0 (-n 0). Note that using a single pthread (-n 1) is not the same, as it involves all the overheads of thread creation and teardown that would not be present in a truly sequential implementation.

Be sure your sequential implementation is correct before proceeding to parallelization!
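
As a reference point for the baseline described above, here is a minimal sketch of a sequential scan. It assumes an inclusive prefix sum (output k is the sum of inputs 0 through k) and uses long for the elements; neither choice is fixed by this handout, and the function name prefix_sum_seq is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive prefix sum: out[k] = in[0] + in[1] + ... + in[k].
 * in and out may point to the same array. */
void prefix_sum_seq(const long *in, long *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    long running = 0;
    for (size_t k = 0; k < n; k++) {
        running += in[k];
        out[k] = running;
    }
}
```

Under the inclusive assumption, the input 3 1 4 1 5 produces 3 4 8 9 14. A routine like this is also handy for checking the parallel versions against a known-good answer.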

Step 2: Parallel Implementation

Recall that prefix sum requires a barrier. In this step, you will write a work-efficient parallel prefix sum using pthread barriers. For each of the provided input sets, graph the speedup of your parallel implementation over a sequential prefix sum implementation as a function of the number of worker threads used. Vary the thread count from 2 to 32 in increments of 2. Then, explain the trends in the graph. Why do these occur? If your speedup is not ideal, what overheads are causing your implementation to underperform an "ideal" speedup ratio?
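
To make the role of the barrier concrete, the following is one possible sketch of an in-place, two-phase (up-sweep, then down-sweep) inclusive scan using pthread_barrier_t. It is a sketch under stated assumptions, not the expected solution: the names scan_arg_t, scan_worker, and prefix_sum_parallel are hypothetical, the elements are assumed to be long, and the cyclic assignment of indices to threads was chosen for brevity rather than cache behavior.

```c
#define _POSIX_C_SOURCE 200809L  /* expose pthread_barrier_t under strict -std=c11 */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    long *x;                     /* array scanned in place */
    size_t n;                    /* number of elements */
    size_t tid;                  /* this worker's index, 0..nthreads-1 */
    size_t nthreads;
    pthread_barrier_t *barrier;  /* shared by all workers */
} scan_arg_t;

static void *scan_worker(void *p)
{
    scan_arg_t *a = p;
    long *x = a->x;
    size_t n = a->n, stride;

    /* Up-sweep: build partial sums over blocks of size 2, 4, 8, ... */
    for (stride = 1; stride < n; stride *= 2) {
        size_t step = 2 * stride;
        for (size_t i = step - 1 + a->tid * step; i < n; i += a->nthreads * step)
            x[i] += x[i - stride];
        pthread_barrier_wait(a->barrier);  /* finish this level before the next reads it */
    }

    /* Down-sweep: add each completed prefix into the block to its right. */
    for (stride /= 2; stride >= 1; stride /= 2) {
        size_t step = 2 * stride;
        for (size_t i = step - 1 + a->tid * step; i + stride < n; i += a->nthreads * step)
            x[i + stride] += x[i];
        pthread_barrier_wait(a->barrier);
    }
    return NULL;
}

/* Scans x[0..n-1] in place with nthreads workers; returns 0 on success. */
int prefix_sum_parallel(long *x, size_t n, size_t nthreads)
{
    assert(nthreads > 0 && (n == 0 || x != NULL));
    pthread_t *threads = malloc(nthreads * sizeof *threads);
    scan_arg_t *args = malloc(nthreads * sizeof *args);
    pthread_barrier_t barrier;
    if (!threads || !args ||
        pthread_barrier_init(&barrier, NULL, (unsigned)nthreads) != 0) {
        free(threads);
        free(args);
        return -1;
    }
    for (size_t t = 0; t < nthreads; t++) {
        args[t] = (scan_arg_t){ x, n, t, nthreads, &barrier };
        if (pthread_create(&threads[t], NULL, scan_worker, &args[t]) != 0)
            abort();  /* sketch only: a partial team would deadlock at the barrier */
    }
    for (size_t t = 0; t < nthreads; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&barrier);
    free(threads);
    free(args);
    return 0;
}
```

Every thread executes the same number of levels, so each thread makes the same number of barrier calls even when it has no work at a level. The bounds checks in both loops are what let this sketch handle array lengths that are not powers of two, one of the corner cases worth testing.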

Step 3: Barrier Implementation

In this step you will build your own re-entrant barrier. Recall from lecture that we considered a number of implementation strategies and techniques. We recommend you base your barrier on pthread spinlocks, but encourage you to use other techniques we discussed in this course. Regardless of your technique, answer the following questions: How does your implementation differ from pthread barriers, and how is it similar? In what scenario(s) would each implementation perform better than the other? What are the pathological cases for each? Use your barrier implementation to implement the same work-efficient parallel prefix sum. Repeat the measurements from part 2, graph them, and explain the trends in the graph. Why do these occur? What overheads cause your implementation to underperform an "ideal" speedup ratio?
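
One of several ways to build a reusable barrier on top of pthread spinlocks is the sense-reversing design sketched below. This is an illustration, not the required approach: the names spin_barrier_t, spin_barrier_init, spin_barrier_wait, spin_barrier_destroy, and spin_barrier_demo are hypothetical, and the use of a C11 atomic_int for the sense flag is an assumption about the toolchain.

```c
#define _POSIX_C_SOURCE 200809L  /* expose pthread_spinlock_t under strict -std=c11 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    pthread_spinlock_t lock;  /* protects count */
    int count;                /* threads still to arrive in this phase */
    int nthreads;             /* participants per phase */
    atomic_int sense;         /* flips each time the barrier opens */
} spin_barrier_t;

int spin_barrier_init(spin_barrier_t *b, int nthreads)
{
    assert(b != NULL && nthreads > 0);
    b->count = nthreads;
    b->nthreads = nthreads;
    atomic_init(&b->sense, 0);
    return pthread_spin_init(&b->lock, PTHREAD_PROCESS_PRIVATE);
}

void spin_barrier_wait(spin_barrier_t *b)
{
    pthread_spin_lock(&b->lock);
    /* The sense only flips under the lock, so this is the value the
     * barrier will switch to when the current phase completes. */
    int release_sense = !atomic_load(&b->sense);
    if (--b->count == 0) {
        b->count = b->nthreads;                  /* re-arm for the next phase */
        atomic_store(&b->sense, release_sense);  /* open the barrier */
        pthread_spin_unlock(&b->lock);
    } else {
        pthread_spin_unlock(&b->lock);
        while (atomic_load(&b->sense) != release_sense)
            ;                                    /* busy-wait for the last arrival */
    }
}

int spin_barrier_destroy(spin_barrier_t *b)
{
    return pthread_spin_destroy(&b->lock);
}

/* Demo: each thread publishes the round number, then checks after the
 * barrier that every thread has published that same round. */
typedef struct {
    spin_barrier_t *b;
    int *slots;
    int tid, nthreads, rounds, errors;
} demo_arg_t;

static void *demo_worker(void *p)
{
    demo_arg_t *a = p;
    for (int r = 1; r <= a->rounds; r++) {
        a->slots[a->tid] = r;
        spin_barrier_wait(a->b);  /* everyone has written round r */
        for (int t = 0; t < a->nthreads; t++)
            if (a->slots[t] != r)
                a->errors++;
        spin_barrier_wait(a->b);  /* everyone has finished checking */
    }
    return NULL;
}

/* Returns the number of mismatches observed (0 if the barrier held), or -1. */
int spin_barrier_demo(int nthreads, int rounds)
{
    enum { MAX_THREADS = 64 };
    pthread_t th[MAX_THREADS];
    demo_arg_t args[MAX_THREADS];
    int slots[MAX_THREADS] = {0};
    spin_barrier_t b;
    int errors = 0;
    if (nthreads < 1 || nthreads > MAX_THREADS || spin_barrier_init(&b, nthreads) != 0)
        return -1;
    for (int t = 0; t < nthreads; t++) {
        args[t] = (demo_arg_t){ &b, slots, t, nthreads, rounds, 0 };
        if (pthread_create(&th[t], NULL, demo_worker, &args[t]) != 0)
            abort();  /* sketch only: a partial team would spin forever */
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(th[t], NULL);
        errors += args[t].errors;
    }
    spin_barrier_destroy(&b);
    return errors;
}
```

Because waiters never sleep, a design like this tends to behave very differently from pthread barriers once there are more threads than available cores, which is a useful case to include in your measurements.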

How do the results from part 2 and part 3 compare? Are they in line with your expectations? Suggest some workload scenarios which would make each implementation perform worse than the other.

Framework

Inputs/Outputs

Your program should accept the following four command line parameters, with types indicated in parentheses, and the meaning of each parameter described in angle brackets:

-n <number of threads> (int) (0 means sequential, not 1 pthread!)
-i <absolute path to input file> (char *)
-o <absolute path to output file> (char *)
-s <use your barrier implementation, else default to pthreads barrier> (Optional)
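
The four parameters above could be parsed with POSIX getopt along the following lines. This is a sketch: the struct options_t and the function parse_options are hypothetical names, -s is assumed to take no argument, and defaulting to 0 threads when -n is absent is an assumption not stated in this handout.

```c
#define _POSIX_C_SOURCE 200809L  /* expose getopt under strict -std=c11 */
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    int n_threads;         /* -n: 0 selects the sequential implementation */
    const char *in_path;   /* -i */
    const char *out_path;  /* -o */
    bool use_own_barrier;  /* -s: present means use your own barrier */
} options_t;

/* Returns 0 on success, -1 on a malformed or incomplete command line. */
int parse_options(int argc, char *argv[], options_t *opts)
{
    assert(argv != NULL && opts != NULL);
    int c;
    opts->n_threads = 0;  /* assumed default when -n is omitted */
    opts->in_path = NULL;
    opts->out_path = NULL;
    opts->use_own_barrier = false;
    optind = 1;  /* restart scanning in case getopt was used before */

    while ((c = getopt(argc, argv, "n:i:o:s")) != -1) {
        switch (c) {
        case 'n': {
            char *end;
            long v = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || v < 0 || v > INT_MAX)
                return -1;
            opts->n_threads = (int)v;
            break;
        }
        case 'i': opts->in_path = optarg; break;
        case 'o': opts->out_path = optarg; break;
        case 's': opts->use_own_barrier = true; break;
        default:  return -1;  /* unknown flag or missing argument */
        }
    }
    return (opts->in_path && opts->out_path) ? 0 : -1;
}
```

With this sketch, an invocation such as pfxsum -n 4 -i /tmp/in.txt -o /tmp/out.txt -s yields n_threads == 4 with use_own_barrier set; the paths here are placeholders.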

The first line of the input file will contain a single integer i denoting the number of integers in the remainder of the input file. The following i lines will each contain a single integer, representing one integer in the input array. Your program should read in these i integers, compute their prefix sums, and print the sums to the specified output file, with a single integer on each line. Note: there should be i lines in your output file.
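
The file format described above can be read and written with standard C I/O, as in this sketch. The helper names read_input and write_output are hypothetical, and storing values as long is an assumption (the handout only says "integer").

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the count line and then that many integers. Returns a malloc'd
 * array (caller frees) and stores the count in *n, or NULL on error. */
long *read_input(const char *path, size_t *n)
{
    assert(path != NULL && n != NULL);
    FILE *f = fopen(path, "r");
    if (!f)
        return NULL;
    size_t count;
    long *data = NULL;
    if (fscanf(f, "%zu", &count) == 1) {
        data = malloc((count ? count : 1) * sizeof *data);
        for (size_t k = 0; data && k < count; k++) {
            if (fscanf(f, "%ld", &data[k]) != 1) {  /* truncated or malformed file */
                free(data);
                data = NULL;
            }
        }
        if (data)
            *n = count;
    }
    fclose(f);
    return data;
}

/* Writes one value per line, n lines in total. Returns 0 on success. */
int write_output(const char *path, const long *sums, size_t n)
{
    assert(path != NULL && (n == 0 || sums != NULL));
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    for (size_t k = 0; k < n; k++)
        fprintf(f, "%ld\n", sums[k]);
    return fclose(f) == 0 ? 0 : -1;
}
```

Note that the output file has no count line: it contains exactly as many lines as there were input integers.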

Your source code should include a Makefile which produces your executable by simply running make in the top-level directory of your submission. The executable should be named "pfxsum" and must accept the above arguments exactly as they are described. It is critically important you follow the build conventions and command line requirements exactly, as we will test and measure your code using automated tools that expect your executable and command line options to be specifically as described above.

By default, your solution should print the execution time in seconds to stdout and nothing else. It is fine to produce other output on stdout as long as it is disabled by default, either at compile time (using macros) or at runtime (using an extra/optional command-line flag).
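
One way to meet this requirement is to time the computation with clock_gettime and guard any extra output behind a compile-time macro, as sketched here. The names elapsed_seconds and time_and_report, the DEBUG macro, and the "%f" output format are all assumptions for illustration; the handout does not specify which portion of the program to time or how many digits to print.

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std=c11 */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Difference end - start, in seconds. */
double elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Runs work(arg), then prints only the elapsed seconds to stdout. */
double time_and_report(void (*work)(void *), void *arg)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = elapsed_seconds(&t0, &t1);
#ifdef DEBUG
    /* Compiled out unless built with -DDEBUG, so default output stays clean. */
    fprintf(stderr, "debug: timed region took %f s\n", secs);
#endif
    printf("%f\n", secs);
    return secs;
}
```

CLOCK_MONOTONIC is used in this sketch because it is not affected by wall-clock adjustments during the run.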

Deliverables

When you submit your solution, you should include the following:

A written report of the assignment in either plain text or PDF format. Please explain your approach, present performance results, discuss any insights gained, etc.
Your source code, including a Makefile to compile it into an executable.

To submit your solution, simply change the project status to "completed" in Codio. To ensure your writeup is included, please include it in PDF form in your /home/codio/workspace directory. You'll want to make sure it's there before changing project status to "completed".

Notes

Make sure that your implementations for parts 2 and 3 output a correct prefix sum for any number of threads.
The auto-grader for this lab depends on your solution printing out the execution time in seconds (and nothing else!) to stdout. If your solution produces other output or debugging information, please either compile it out by default using pre-processor macros, or add your own command-line options that leave it disabled by default.

Please report how much time you spent on the lab.