CS 377P: Programming for Performance

Assignment 4: Graph algorithms

Due date: Monday, October 28th, 2024, 10:00pm CDT

You can do this assignment with another student in the course. If you do, make sure you put both names on your report. Only one student needs to submit on Canvas.

Late submission policy: Submissions can be at most 2 days late. There will be a 10% penalty for each day after the due date (cumulative).

Clarifications to the assignment will be posted on Piazza.

Description

In this assignment, you will implement a sequential program in C++ for the page-rank problem. In later assignments, you will implement parallel algorithms for page-rank and other graph problems. Read the entire assignment before you start coding. You may use library routines from the STL and Boost libraries.

Graph formats

We will provide three files with the following graphs: (i) power-law graph rmat15, (ii) road network road-NY (New York road network) and (iii) the Wikipedia graph discussed in lecture. Graphs will be given to you in DIMACS format, which is described at the end of this assignment.

 

Links: rmat15.dimacs road-NY.dimacs wiki.dimacs

 

Coding

  1.  I/O routines for graphs:  These routines will be important for debugging your programs so make sure they are working before starting the rest of the assignment.
  2. Page-rank algorithm: Write a push-style page-rank algorithm that operates on a graph stored in CSR format in memory, using the following specifications.
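As a rough illustration of what "push-style" means, the sketch below has every node push a share of its current rank along its out-edges in each round. It is not the required specification: the damping factor, the convergence test, the iteration cap, the choice to skip nodes with no out-edges, and the names CsrGraph and pageRankPush are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR view; nodes are 1-based and index 0 is unused.
struct CsrGraph {
    int numNodes;
    std::vector<std::size_t> rowStart;  // size numNodes + 2; out-edges of u are [rowStart[u], rowStart[u + 1])
    std::vector<int> dst;               // destination of each edge
};

// Push-style page-rank sketch. All parameters are assumptions of this example.
std::vector<double> pageRankPush(const CsrGraph& g, double damping,
                                 double tolerance, int maxIters) {
    const int n = g.numNodes;
    assert(n > 0);
    std::vector<double> rank(n + 1, 1.0 / n);
    std::vector<double> next(n + 1, 0.0);
    rank[0] = 0.0;

    for (int iter = 0; iter < maxIters; ++iter) {
        // Every node starts the round with the same base term.
        std::fill(next.begin(), next.end(), (1.0 - damping) / n);
        next[0] = 0.0;

        // Each node u pushes an equal share of its rank to its out-neighbors.
        for (int u = 1; u <= n; ++u) {
            const std::size_t begin = g.rowStart[u];
            const std::size_t end = g.rowStart[u + 1];
            if (begin == end) continue;  // assumption: nodes without out-edges push nothing
            const double share = damping * rank[u] / static_cast<double>(end - begin);
            for (std::size_t e = begin; e < end; ++e) {
                next[g.dst[e]] += share;
            }
        }

        // Stop once no rank moved by more than the tolerance.
        double change = 0.0;
        for (int v = 1; v <= n; ++v) {
            change = std::max(change, std::fabs(next[v] - rank[v]));
        }
        rank.swap(next);
        if (change < tolerance) break;
    }
    return rank;
}
```

A call such as pageRankPush(g, 0.85, 1e-10, 1000) returns a vector indexed by node ID. If the graph has nodes without out-edges, the ranks produced by this sketch will not sum to one, so they may need to be scaled afterwards.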

Experiments

Node degree histograms


Page-rank

Submission

1.      (90 points) Submit (on Canvas), as a .tar / .tar.gz archive, your code and all the items listed in the experiments above. Inside the archive, also include a makefile so that the code can be compiled with make [PARAMETER]. Describe how to compile and run the program in a README.txt. Separately from the archive, submit a .pdf with the experimental results.

 

2.      (10 points) In lecture, I mentioned that the page-rank algorithm computes the solution to a system of linear equations in which the unknowns are the page-ranks of each node and in which there is one equation for each node that defines the page-rank of that node in terms of the page-ranks of its in-neighbors. Demonstrate this with the graph from Wikipedia used in lecture, as follows.

 

a.      Write down the system of linear equations for the example.

b.      Using MATLAB or any other system, compute the solution to this system of equations.

c.      Does your solution match the page-ranks shown in the diagram (you may need to scale all your computed page-ranks so their sum is one)?

 

Turn in the answers to each of these questions.
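For reference, one common damped formulation of these equations is shown below. The damping factor d, the node count N, and the notation in(v) and outdeg(u) are assumptions of this note and are not fixed by the text above; use whatever convention was given in lecture.

```latex
\[
  PR(v) \;=\; \frac{1-d}{N} \;+\; d \sum_{u \in \mathrm{in}(v)} \frac{PR(u)}{\mathrm{outdeg}(u)},
  \qquad v = 1, \dots, N
\]
% Collected over all nodes, with M_{vu} = 1/\mathrm{outdeg}(u) when there is an edge u -> v and 0 otherwise:
\[
  (I - dM)\,\mathbf{x} \;=\; \frac{1-d}{N}\,\mathbf{1}
\]
```

This form says nothing about nodes with no out-edges; they need a separate convention, and the choice affects the computed values.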

DIMACS format for graphs

One popular format for representing directed graphs as text files is the DIMACS format (undirected graphs are represented as a directed graph by representing each undirected edge as two directed edges). Files are assumed to be well-formed and internally consistent so it is not necessary to do any error checking.  A line in a file must be one of the following.

c This is an example of a comment line.
p FORMAT NODES EDGES

The lower-case character p signifies that this is the problem line. The FORMAT field should contain a mnemonic for the problem such as sssp. The NODES field contains an integer value specifying n, the number of nodes in the graph. The EDGES field contains an integer value specifying m, the number of edges in the graph.

a s d w

The lower-case character "a" signifies that this is an edge descriptor line (the "a" stands for arc). The fields s, d, and w give the source node, the destination node, and the weight of the edge. Edges may occur in any order in the file. For graphs with unweighted edges, we will use an arbitrary edge weight like 1.
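The three line types above can be read with a small loop like the following sketch. It assumes that the fields on an "a" line are the source, destination, and weight (as in the rmat example later in this section) and that weights are integers; the names Edge, EdgeList, and readDimacs are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration; integer weights are an assumption.
struct Edge {
    int src;
    int dst;
    int weight;
};

struct EdgeList {
    int numNodes = 0;
    long long numEdges = 0;  // m as declared on the problem line
    std::vector<Edge> edges;
};

// Reads a well-formed DIMACS stream: 'c' comment lines, one 'p' line, 'a' edge lines.
EdgeList readDimacs(std::istream& in) {
    EdgeList g;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        char kind = 'c';  // blank lines are treated like comments
        fields >> kind;
        if (kind == 'p') {
            std::string format;  // mnemonic such as sssp; not used further here
            fields >> format >> g.numNodes >> g.numEdges;
            g.edges.reserve(static_cast<std::size_t>(g.numEdges));
        } else if (kind == 'a') {
            Edge e{0, 0, 0};
            fields >> e.src >> e.dst >> e.weight;
            // Node IDs are 1-based in DIMACS files.
            assert(e.src >= 1 && e.src <= g.numNodes);
            assert(e.dst >= 1 && e.dst <= g.numNodes);
            g.edges.push_back(e);
        }
        // Comment lines are skipped.
    }
    return g;
}
```

Because the function takes a std::istream, it can be fed from a std::ifstream for the real files or from a std::istringstream when testing on a tiny hand-written graph.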

Edges for rmat graphs: Special care is needed when reading in rmat graphs. Because of the generator used for rmat graphs, the files for some rmat graphs may have multiple edges between the same pair of nodes, violating the DIMACS spec. When building the CSR representation in memory, keep only the edge with the largest weight. For example, if you find edges (s d 1) and (s d 4) from source s to destination d, keep only the edge with weight 4. In principle, you can keep the smallest weight edge or follow some other rule, but I want everyone to follow the same rule to make grading easier.
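One way to apply the largest-weight rule is to sort the edge list so that duplicates of a (source, destination) pair sit next to each other with the heaviest one first, and then drop all but the first of each group. The sketch below does this with std::sort and std::unique; the Edge type and the function name are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical edge type for illustration.
struct Edge {
    int src;
    int dst;
    int weight;
};

// Keeps only the largest-weight edge for every (src, dst) pair.
// As a side effect the edges end up sorted by source, then destination.
void keepHeaviestParallelEdge(std::vector<Edge>& edges) {
    // Order by (src, dst), with the heaviest edge first inside each pair.
    std::sort(edges.begin(), edges.end(), [](const Edge& a, const Edge& b) {
        if (a.src != b.src) return a.src < b.src;
        if (a.dst != b.dst) return a.dst < b.dst;
        return a.weight > b.weight;
    });

    auto samePair = [](const Edge& a, const Edge& b) {
        return a.src == b.src && a.dst == b.dst;
    };

    // std::unique keeps the first element of each run of equal pairs,
    // which is the heaviest one after the sort above.
    edges.erase(std::unique(edges.begin(), edges.end(), samePair), edges.end());

    // Sanity check: no two neighboring edges share a (src, dst) pair.
    assert(std::adjacent_find(edges.begin(), edges.end(), samePair) == edges.end());
}
```

Since the result is already sorted by source node, the same pass can double as the sorting step when building the CSR arrays.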


Hints for constructing CSR format graphs from DIMACS files

Nodes are numbered starting from 1 in DIMACS format, but C++ arrays start at 0. To keep things simple and to make grading easier, leave position 0 of your node arrays unused so that node i is stored at index i.

 

To construct CSR representation of graphs, you can use the following steps:

  1. First construct the coordinate representation (COO) of the graph from the DIMACS file. You may find std::vector to be helpful.
  2. Sort edges in COO by the source node ID. You may find std::sort() in STL to be helpful.
  3. Construct the CSR representation from the information in this sorted COO representation.
  4. You will need an array to represent node labels. This array is not usually shown in presentations of CSR format such as the one in my slides.
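Steps 2 to 4 can be sketched as follows, starting from an edge list that has already been read into a std::vector. The layout is one possible choice rather than a required one: the names CsrGraph, rowStart, and buildCsr are made up for this example, the label array is assumed to hold one double per node, and index 0 is left unused so that node i lives at index i.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration.
struct Edge {
    int src;
    int dst;
    int weight;
};

struct CsrGraph {
    int numNodes = 0;
    std::vector<std::size_t> rowStart;  // size numNodes + 2; out-edges of v are [rowStart[v], rowStart[v + 1])
    std::vector<int> dst;               // destination of each edge
    std::vector<int> weight;            // weight of each edge
    std::vector<double> label;          // one label per node; index 0 unused
};

CsrGraph buildCsr(int numNodes, std::vector<Edge> edges) {
    // Step 2: sort the COO edges by source (ties broken by destination).
    std::sort(edges.begin(), edges.end(), [](const Edge& a, const Edge& b) {
        return a.src != b.src ? a.src < b.src : a.dst < b.dst;
    });

    CsrGraph g;
    g.numNodes = numNodes;
    g.rowStart.assign(static_cast<std::size_t>(numNodes) + 2, 0);
    g.label.assign(static_cast<std::size_t>(numNodes) + 1, 0.0);  // step 4
    g.dst.reserve(edges.size());
    g.weight.reserve(edges.size());

    // Step 3a: count the out-degree of node v into rowStart[v + 1].
    for (const Edge& e : edges) {
        assert(e.src >= 1 && e.src <= numNodes);
        ++g.rowStart[e.src + 1];
    }
    // Step 3b: a running sum turns the counts into starting offsets.
    for (int v = 1; v <= numNodes + 1; ++v) {
        g.rowStart[v] += g.rowStart[v - 1];
    }
    // Step 3c: the edges are sorted by source, so copying them in order
    // places each edge inside its source node's range.
    for (const Edge& e : edges) {
        g.dst.push_back(e.dst);
        g.weight.push_back(e.weight);
    }
    return g;
}
```

With this layout, the out-degree of node v is rowStart[v + 1] - rowStart[v], and rowStart[numNodes + 1] equals the total number of edges.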