CS 377P: Programming for Performance

Assignment 6: Parallel Bellman-Ford algorithm

Due date: Tuesday, December 3rd, 2024, 10:00PM CST

Partners: You can do this assignment with another student in the course. If you do, make sure you put both names on your report. Only one student needs to submit on Canvas.

Late submission policy: Submissions may be at most 2 days late. There is a cumulative 10% penalty for each day after the due date.

Clarifications: Clarifications to the assignment will be posted on Piazza.

In this assignment, you will implement both sequential and parallel versions of the Bellman-Ford algorithm in C++. You may use classes from the C++ STL and boost libraries if you wish. You may also use either pthreads or C++ threads.

Recall that the Bellman-Ford algorithm solves the single-source shortest path problem. It is a topology-driven algorithm: it makes repeated sweeps over all the nodes of the graph and terminates when no node label changes during a sweep. In each sweep, it visits every node of the graph, and at each node it applies a push-style relaxation operator to update the labels of neighboring nodes.
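The sweep structure described above can be sketched as follows. This is a minimal illustration, not a reference solution: the Edge and Graph types, the integer label type, and the function name bellman_ford are assumptions made for the example, and the loop assumes the graph has no negative-weight cycles (otherwise it would not terminate).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical adjacency-list representation (names are illustrative only).
struct Edge {
    std::size_t dst;      // destination node (0-based)
    std::int64_t weight;  // edge weight
};
using Graph = std::vector<std::vector<Edge>>;

constexpr std::int64_t kInf = std::numeric_limits<std::int64_t>::max();

// Topology-driven, push-style Bellman-Ford.
// Assumes there are no negative-weight cycles.
std::vector<std::int64_t> bellman_ford(const Graph& g, std::size_t source) {
    assert(source < g.size());
    std::vector<std::int64_t> dist(g.size(), kInf);
    dist[source] = 0;

    bool changed = true;
    while (changed) {  // one loop iteration per sweep
        changed = false;
        for (std::size_t u = 0; u < g.size(); ++u) {
            if (dist[u] == kInf) continue;  // not reached yet: nothing to push
            for (const Edge& e : g[u]) {
                std::int64_t candidate = dist[u] + e.weight;
                if (candidate < dist[e.dst]) {  // relaxation lowers the label
                    dist[e.dst] = candidate;
                    changed = true;
                }
            }
        }
    }
    return dist;
}
```

Nodes that are never reached keep the label kInf in this sketch; how unreachable nodes should be reported is not specified by the assignment.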

One way to assign work to threads is to divide the nodes of the graph evenly among the threads. This gives good load balance for uniform-degree graphs but not for power-law graphs; still, it is a reasonable starting point. Each thread should relax the edges of all nodes assigned to it. Use a barrier to ensure that all threads have finished relaxation before deciding whether to perform another sweep: if any thread updated any node label during the sweep, sweep over the graph again; otherwise, terminate.
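As an illustration of the uniform node division, the helper below computes the contiguous block of nodes owned by one thread. The name node_range and the choice of contiguous blocks are assumptions for this sketch; other divisions (for example, round-robin) would also satisfy the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open range [begin, end) of node ids owned by thread `tid`.
// When num_nodes is not divisible by num_threads, the first
// (num_nodes % num_threads) threads each receive one extra node.
std::pair<std::size_t, std::size_t> node_range(std::size_t num_nodes,
                                               std::size_t num_threads,
                                               std::size_t tid) {
    assert(num_threads > 0 && tid < num_threads);
    std::size_t base = num_nodes / num_threads;
    std::size_t extra = num_nodes % num_threads;
    std::size_t begin = tid * base + std::min(tid, extra);
    std::size_t end = begin + base + (tid < extra ? 1 : 0);
    return {begin, end};
}
```

With 10 nodes and 4 threads, for example, the ranges are [0,3), [3,6), [6,8), and [8,10). Each thread would sweep only its own range and then wait at the barrier before the termination check.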

Coding

Part 1 (Sequential BF): Implement a sequential program for Bellman-Ford and measure its running time on the input graphs given to you (see below). These times will be the baseline for computing parallel speedups.

Part 2 (Parallel BF): Implement a parallel program for Bellman-Ford, using these different forms of synchronization (i.e. different versions of the same algorithm):

Mutex on the graph: Before relaxing any edge, acquire a lock on the graph. Release it after relaxation. This is coarse-grain locking.
Mutex on each node: Before relaxing any edge, acquire a lock on the destination node. Release it after relaxation. This is fine-grain locking.
Spin-lock on each node: For each edge to be relaxed, try to acquire a lock on the destination node. If the attempt succeeds, relax the edge and release the lock. Otherwise, keep retrying until the lock is acquired.
Compare and swap: To relax an edge, perform an atomic update on the destination node using std::atomic::compare_exchange_weak() from the C++11 standard atomics library (see "Compare-and-swap").
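The spin-lock variant could look roughly like the sketch below. The SpinLock type and the function relax_with_spinlock are hypothetical names, and building the lock on std::atomic_flag is one possible choice rather than a requirement of the assignment. The sketch takes the source label by value; in a real parallel sweep, reading that label would also need to be synchronized.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-node spin lock built on std::atomic_flag.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Relaxes one edge into node v while holding v's spin lock.
// dist_u is the source label, w the edge weight.
// Returns true if v's label was lowered.
bool relax_with_spinlock(SpinLock& lock_v, std::int64_t& dist_v,
                         std::int64_t dist_u, std::int64_t w) {
    while (!lock_v.try_lock()) {
        // Busy-wait: keep retrying until the lock is acquired.
    }
    bool updated = false;
    if (dist_u + w < dist_v) {
        dist_v = dist_u + w;
        updated = true;
    }
    lock_v.unlock();
    return updated;
}
```

The mutex-per-node variant has the same shape, with a std::mutex (or pthread mutex) in place of SpinLock and a blocking lock() call in place of the retry loop.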

The main complexity in a parallel Bellman-Ford implementation is ensuring that updates to node labels are done atomically. Each of the above versions handles these updates in a different way. For each of these implementations, you will measure its running time and compute its speedup over the serial version (see "What to turn in").

Part 3 Extra Credit (Edge distribution): Assigning equal numbers of nodes to threads will result in poor load balancing for power-law graphs. You can get better load balancing for these graphs by assigning equal numbers of edges to threads, although this complicates the algorithm a little. Implement this approach, using only the compare-and-swap approach for the relaxations.
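One possible way to approximate an equal-edge division is sketched below. It assumes the graph is stored in CSR form with a hypothetical row_start array of size (number of nodes + 1), where row_start[u] is the index of node u's first edge. The sketch only moves the boundaries between whole nodes, so a single very high-degree node is still handled by one thread; splitting one node's edges across threads would need a different scheme.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns num_threads + 1 node boundaries; thread t owns nodes
// [bounds[t], bounds[t + 1]). Boundaries are chosen so that each thread's
// nodes hold roughly num_edges / num_threads edges.
std::vector<std::size_t> edge_balanced_bounds(
        const std::vector<std::size_t>& row_start, std::size_t num_threads) {
    assert(!row_start.empty() && num_threads > 0);
    std::size_t num_nodes = row_start.size() - 1;
    std::size_t num_edges = row_start.back();

    std::vector<std::size_t> bounds(num_threads + 1, 0);
    for (std::size_t t = 1; t < num_threads; ++t) {
        // Ideal edge index at which thread t should start.
        std::size_t target = t * num_edges / num_threads;
        // First node whose edge list starts at or after that index.
        auto it = std::lower_bound(row_start.begin(), row_start.end(), target);
        bounds[t] = static_cast<std::size_t>(it - row_start.begin());
    }
    bounds[num_threads] = num_nodes;
    return bounds;
}
```

For node degrees 6, 1, 1, 2 (row_start = {0, 6, 7, 8, 10}) and 2 threads, the bounds are {0, 1, 4}: the first thread gets node 0 (6 edges) and the second gets nodes 1 to 3 (4 edges), whereas an equal-node split would give 7 and 3 edges.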

Input graphs

Input graphs: Use rmat15 and road-NY in DIMACS format, which you used in Assignment 4.

Source nodes (DIMACS node numbers): These are the nodes with the highest degree.

rmat15: node 1
road-NY: node 140961

Output of your program: Your program should output, as a .txt file, one line for each node, specifying the number of the node and the label of that node.

What to turn in

Upload the following to Canvas (only one partner per group needs to submit):

(.tar / .tar.gz archive) The following:
- Your code for parallel and sequential BF
- Your code for Part 3 (if you decide to do it)
- A Makefile
- A readme that describes how to compile and run the program
- 2 .txt files, containing the SSSP values for both graphs (Tip: The outputs should match regardless of which version of your code you run.)
(.pdf) A report that includes results from the following experiments:
1. Find the running time of your serial code for each input graph.
2. Find the running times and speedups for 1, 2, 4, 8, and 16 threads for rmat15 and road-NY (the baseline for speedup is the time for your serial code) using the four ways of implementing atomic updates discussed above. Plot these results. You may use two separate plots for the running times on the two input graphs, since their sizes and therefore their running times will be very different, but use a single plot for the speedups. Based on these experiments, what is the best way to implement the atomic updates for Bellman-Ford?
3. Do you observe good speedups for rmat15? How about road-NY?
4. (Part 3 Extra Credit) Repeat 2 and 3 with your Part 3 implementation.
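For the measurements above, speedup on p threads is the serial running time divided by the parallel running time on p threads. A small timing helper along the following lines may be convenient; the names time_seconds and speedup are illustrative, and using std::chrono::steady_clock is one reasonable choice, not a requirement.

```cpp
#include <chrono>

// Runs f() once and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of a parallel run relative to the serial baseline.
double speedup(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}
```

Timing only the shortest-path computation, and not graph loading or output, keeps the serial and parallel numbers comparable.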

Compare-and-swap

Figure 3 shows a way to use the C++11 atomic compare-and-swap to achieve the same functionality as a mutex. var is the variable to be synchronized; its value is of type double, and it is declared as std::atomic<double>. The function call compare_exchange_weak(old, new, ..) on var compares its value with old. If they are equal, it sets the value of var to new and returns true. Otherwise, it copies the value of var into old and returns false. All this is done atomically. Note that the weak form is allowed to fail spuriously, that is, to return false even when the values are equal, so it is normally called in a loop.
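Applied to edge relaxation, the compare-and-swap pattern amounts to an atomic "lower the label if the candidate is smaller" update. The sketch below is an illustration under that reading, not a reproduction of Figure 3; the function name atomic_min and the parameter names are assumptions.

```cpp
#include <atomic>

// Atomically lowers var to candidate if candidate is smaller.
// Returns true if this call changed var.
bool atomic_min(std::atomic<double>& var, double candidate) {
    double old_val = var.load();
    while (candidate < old_val) {
        // Succeeds only if var still holds old_val; on failure (including a
        // spurious one) old_val is refreshed with var's current value and the
        // loop condition is re-checked.
        if (var.compare_exchange_weak(old_val, candidate)) {
            return true;
        }
    }
    return false;  // var was already less than or equal to candidate
}
```

To relax an edge from u to v with weight w, a thread would call atomic_min on v's label with the candidate label(u) + w, and record that the sweep changed something whenever the call returns true.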