CS 377P: Programming for Performance

Assignment 4: Profiling Page-rank with Intel(R) VTune(TM)

Due: April 3rd, 2018, 11:59 PM

You can work in groups of two. Each group should do the assignment independently.
Late policy: Submissions can be at most 1 day late, with a 10% penalty.

In class, we described algorithms to compute page-rank values for a given graph. In this assignment, we will use VTune to analyze the performance of page-rank computation with variants on the following dimensions:
Array of structs (AoS):

struct node_data_t {
  double pr[2];
  int out_degree;
};

class graph {
  ...
  node_data_t nodes[NUM_NODES];
  ...
};

Struct of arrays (SoA):

class graph {
  ...
  double pr[2][NUM_NODES];
  int out_degree[NUM_NODES];
  ...
};
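
For reference, here is a minimal sketch of one pull-style iteration over a CSR graph with the AoS node layout. It is not the handout's code: the CSR field names (row_start, in_edge_src), the damping factor d, and the exact update formula are assumptions for illustration only.

// Sketch only: one pull-style page-rank iteration with AoS node data.
// row_start  : CSR offsets into in_edge_src, size n+1 (assumed names)
// in_edge_src: source node of each incoming edge (assumed names)
// d          : damping factor, e.g. 0.85
// The sketch assumes every node has out_degree > 0.
void pull_iteration(node_data_t *nodes, const int *row_start,
                    const int *in_edge_src, int n, int cur, int next, double d) {
  for (int v = 0; v < n; ++v) {
    double sum = 0.0;
    for (int e = row_start[v]; e < row_start[v + 1]; ++e) {
      int u = in_edge_src[e];                        // an in-neighbor of v
      sum += nodes[u].pr[cur] / nodes[u].out_degree; // pull u's contribution
    }
    nodes[v].pr[next] = (1.0 - d) / n + d * sum;
  }
}

With the SoA layout, the inner loop would read pr[cur][u] and out_degree[u] instead, which changes the memory-access pattern without changing the arithmetic.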

We will give you three graphs to run with the two algorithms and the two node data representations, so there are 12 variants in total, and those variants have different performance characteristics. Through this assignment you will learn the steps to analyze the performance of the code at the instruction level, at the micro-architectural level, and with cycle-related metrics. After going through these steps, you should be able to pinpoint the micro-architectural bottleneck of a piece of code (e.g., the branch prediction unit or the caches), see why the code on a given input leads to that bottleneck, and possibly figure out a way to improve the code.

Understanding input characteristics

(a) (10 points) Add a function to the code given to you that derives the out-degree histogram. Add another function that derives the in-degree histogram. Draw the out-degree and in-degree histograms for the rmat22, transposed rmat22 (rmat22_t), and roadUSA graphs. Explain the histograms briefly.
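
A minimal sketch of such a function for the out-degree histogram, assuming the CSR offset array is called row_start and the node count is n (illustrative names, not the handout's); the in-degree histogram can be built the same way from the transposed graph:

#include <cstdio>
#include <map>

// Sketch only: count how many nodes have each out-degree.
// row_start is the assumed CSR offset array of size n+1.
std::map<int, int> out_degree_histogram(const int *row_start, int n) {
  std::map<int, int> hist;                  // degree -> number of nodes
  for (int v = 0; v < n; ++v) {
    int deg = row_start[v + 1] - row_start[v];
    if (deg > 0)                            // 0-degree nodes may be discarded (TA comment (2))
      ++hist[deg];
  }
  return hist;
}

// Dump "degree count" pairs, suitable for plotting on log-log axes.
void print_histogram(const std::map<int, int> &hist) {
  for (const auto &kv : hist)
    std::printf("%d %d\n", kv.first, kv.second);
}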

Collect performance numbers

Run VTune on the orcrists for the pull-style and push-style algorithms on the three graphs from part (a), with node data represented in AoS. Run until convergence using a threshold of 10^-20. Repeat the previous runs, but this time with node data represented in SoA. During the runs, collect for the main loops (excluding graph construction, initialization, scaling, and output) (1) the runtime, by instrumenting the code (a timing sketch follows the table below), and (2) the estimated counts of the following events, collected by VTune:


Event name                        | Meaning of the event                                                                      | Sample after value (SA)
CPU_CLK_UNHALTED.REF_P            | # unhalted CPU reference cycles                                                           | 133000
CPU_CLK_UNHALTED.THREAD           | # unhalted CPU thread cycles                                                              | 2000000
FP_COMP_OPS_EXE.X87               | # X87 floating-point instructions                                                         | 2000000
FP_COMP_OPS_EXE.MMX               | # MMX floating-point instructions                                                         | 2000000
FP_COMP_OPS_EXE.SSE_FP            | # SSE floating-point instructions                                                         | 2000000
FP_COMP_OPS_EXE.SSE2_INTEGER      | # SSE2 floating-point instructions for typecasts from integers to floating-point numbers | 2000000
INST_RETIRED.ANY                  | # retired instructions                                                                    | 2000000
MEM_INST_RETIRED.LOADS            | # load instructions                                                                       | 2000000
MEM_INST_RETIRED.STORES           | # store instructions                                                                      | 2000000
MEM_LOAD_RETIRED.L1D_HIT          | # loads and stores that hit in the L1D cache                                              | 2000000
MEM_LOAD_RETIRED.L2_HIT           | # loads and stores that hit in the L2 cache                                               | 200000
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT | # loads and stores that hit in the LLC                                                    | 200000
MEM_LOAD_RETIRED.LLC_MISS         | # loads and stores that miss the LLC                                                      | 200000
MEM_LOAD_RETIRED.DTLB_MISS        | # loads and stores that miss the DTLB                                                     | 200000
BR_MISP_EXEC.ANY                  | # mis-predicted branches                                                                  | 20000
BR_INST_EXEC.ANY                  | # branches                                                                                | 20000
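
For the runtime in (1), one way to instrument the code is to time only the main loop with std::chrono, as in the following sketch; run_pagerank_main_loop is a placeholder for whatever your main loop looks like:

#include <chrono>
#include <cstdio>

void run_pagerank_main_loop();             // placeholder: your page-rank main loop

// Sketch only: time the page-rank main loop, excluding graph construction,
// initialization, scaling, and output.
void timed_run() {
  auto start = std::chrono::steady_clock::now();
  run_pagerank_main_loop();                // iterate until the residual drops below 10^-20
  auto stop = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(stop - start).count();
  std::printf("main loop time: %f s\n", seconds);
}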

Now we are ready to analyze the performance of various page-rank computations. From now on, all the numbers are for the main loops in page-rank computation. You will be asked to generate "plots" below - these are bar graphs, one for each of the three input graphs. In each bar graph, there are four bars: pull-AoS, pull-SoA, push-AoS, and push-SoA.

Analysis at instruction level

Let us first analyze at instruction level.

(b) (5 points) Plot the runtimes of the 12 variants. Plot the number of retired instructions for the 12 variants. Do the two plots resemble each other?

(c) (15 points) Plot the numbers of branches, loads, stores, and all floating-point operations for the 12 variants. Explain the numbers you get using the characteristics of the input graphs, the pull-style or push-style algorithm, the CSR graph representation, and node data in SoA or AoS. Can you explain the runtime using one of the instruction types above? (A push-style sketch, for contrast with the pull-style sketch earlier, is given below.)
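
To help reason about the instruction mix, here is a push-style counterpart to the earlier pull-style sketch, where each node scatters its contribution to its out-neighbors. Again the CSR names (row_start, edge_dst) are assumptions, not the handout's code:

// Sketch only: one push-style page-rank iteration with AoS node data.
// row_start/edge_dst give each node's out-edges (assumed names).
// The sketch assumes every node has out_degree > 0.
void push_iteration(node_data_t *nodes, const int *row_start,
                    const int *edge_dst, int n, int cur, int next, double d) {
  for (int v = 0; v < n; ++v)
    nodes[v].pr[next] = (1.0 - d) / n;            // reset the next values
  for (int u = 0; u < n; ++u) {
    double contrib = d * nodes[u].pr[cur] / nodes[u].out_degree;
    for (int e = row_start[u]; e < row_start[u + 1]; ++e)
      nodes[edge_dst[e]].pr[next] += contrib;     // push u's contribution to each out-neighbor
  }
}

In this formulation, pull does one store per node and a load per in-edge, while push does a read-modify-write per out-edge; differences like this should show up in the load and store counts.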

Analysis at micro-architectural level

Now let us dive into micro-architectural level.

(d) (10 points) Compute and plot the branch mis-prediction rate (# mis-predicted branches / # branches) for the 12 variants. Explain the numbers you get. Can you explain the runtime using branch mis-prediction rate?

(e) (15 points) Compute and plot the cache miss rates (# cache misses / # cache accesses) of L1D, L2, and LLC for the 12 variants. Explain the numbers you get. Can you explain the runtime using cache miss rates?

(f) (20 points) Compute and plot Misses Per Kilo Instructions (MPKI, # cache misses / # retired instructions * 1000) of each level of the data cache for the 12 variants. Explain the numbers you get. What does the MPKI of the page-rank computation suggest to you? (Hint: vectorized MMM without L1 tiling, which has an abundance of independent floating-point computations, can have 45 L1D MPKI on the orcrists.) Combined with the cache miss rates, where in the micro-architecture would you expect the bottleneck of the page-rank computation to be?
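
One plausible way to combine the collected events into the miss rates for (e) and the MPKI for (f) is sketched below. Treating every retired load as hitting in exactly one of L1D, L2, LLC, or memory is an assumption about how the MEM_LOAD_RETIRED events nest; verify it against your own understanding of the memory hierarchy before relying on it:

#include <cstdio>

// Sketch only: derive L1D/L2/LLC miss rates and MPKI from VTune's estimated
// event counts, assuming each load retires with exactly one of
// L1D_HIT, L2_HIT, LLC_UNSHARED_HIT, or LLC_MISS.
void cache_metrics(double l1d_hit, double l2_hit, double llc_hit,
                   double llc_miss, double inst_retired) {
  double l1d_miss = l2_hit + llc_hit + llc_miss;
  double l2_miss  = llc_hit + llc_miss;
  double l1d_accesses = l1d_hit + l1d_miss;
  double l2_accesses  = l1d_miss;          // loads that missed L1D reach L2
  double llc_accesses = l2_miss;           // loads that missed L2 reach the LLC

  std::printf("L1D miss rate: %f  MPKI: %f\n",
              l1d_miss / l1d_accesses, l1d_miss / inst_retired * 1000.0);
  std::printf("L2  miss rate: %f  MPKI: %f\n",
              l2_miss / l2_accesses, l2_miss / inst_retired * 1000.0);
  std::printf("LLC miss rate: %f  MPKI: %f\n",
              llc_miss / llc_accesses, llc_miss / inst_retired * 1000.0);
}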

Analysis with cycle-related metrics

Let us now verify the previous analysis with cycle-related metrics.

(g) (5 points) Compute and plot Cycles Per Instruction (CPI, # unhalted CPU thread cycles / # retired instructions). Does that match the runtime trend?

(h) (20 points) To see where the cycles go during the computation, we can compute a penalty for each component based on some machine parameters and the collected performance numbers. Compute and plot for the 12 variants the penalty of the following components using the indicated formulas:

Component              | Penalty formula
Branch mis-predictions | 15 * # mis-predicted branches / # unhalted CPU thread cycles
DTLB misses            | 7 * # DTLB misses / # unhalted CPU thread cycles
L2 hits                | 12 * # L2 hits / # unhalted CPU thread cycles
LLC hits               | 41 * # LLC hits / # unhalted CPU thread cycles
LLC misses             | 200 * # LLC misses / # unhalted CPU thread cycles

Explain the penalty numbers you get. Why are some penalty numbers greater than 1? Why do the penalties of all components for a given page-rank computation not sum to one? On which component does the page-rank computation spend most of its time? Does that concur with the analysis at the micro-architectural level? (A sketch of the penalty arithmetic is given below.)
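
As a check on the arithmetic, here is a minimal sketch of the penalty computation. The fixed costs (15, 7, 12, 41, and 200 cycles) come from the table above; everything else is illustrative:

#include <cstdio>

// Sketch only: fraction of unhalted thread cycles attributed to each component,
// using the penalty formulas from the table above.
void penalties(double br_misp, double dtlb_miss, double l2_hit,
               double llc_hit, double llc_miss, double thread_cycles) {
  std::printf("branch mis-prediction penalty: %f\n", 15.0  * br_misp   / thread_cycles);
  std::printf("DTLB miss penalty:             %f\n", 7.0   * dtlb_miss / thread_cycles);
  std::printf("L2 hit penalty:                %f\n", 12.0  * l2_hit    / thread_cycles);
  std::printf("LLC hit penalty:               %f\n", 41.0  * llc_hit   / thread_cycles);
  std::printf("LLC miss penalty:              %f\n", 200.0 * llc_miss  / thread_cycles);
}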

Submission

Submit to Canvas a .tar.gz containing all your code and a report in PDF format. In the report, clearly state all your teammates and include all the figures and explanations. Include a Makefile so that your program can be compiled by running make [PARAMETERS]. Include a README.txt that explains how to compile your code, how to run your code, and what the output will be.

Comments from TA

(1) Please use the code provided by the TA. The tarball contains the following:

(2) To derive the histograms, you can use std::map (or std::unordered_map and then sort by degree). Use a log scale for both axes in your histograms. You may discard 0-degree nodes from your histograms.

(3) You can use the following VTune command for collecting all numbers (except for runtime) on the orcrists:
amplxe-cl -collect-with runsa -knob event-config=\
CPU_CLK_UNHALTED.REF_P:sa=133000,\
CPU_CLK_UNHALTED.THREAD:sa=2000000,\
FP_COMP_OPS_EXE.X87:sa=2000000,\
FP_COMP_OPS_EXE.MMX:sa=2000000,\
FP_COMP_OPS_EXE.SSE_FP:sa=2000000,\
FP_COMP_OPS_EXE.SSE2_INTEGER:sa=2000000,\
INST_RETIRED.ANY:sa=2000000,\
MEM_INST_RETIRED.LOADS:sa=2000000,\
MEM_INST_RETIRED.STORES:sa=2000000,\
MEM_LOAD_RETIRED.L1D_HIT:sa=2000000,\
MEM_LOAD_RETIRED.L2_HIT:sa=200000,\
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT:sa=200000,\
MEM_LOAD_RETIRED.LLC_MISS:sa=200000,\
MEM_LOAD_RETIRED.DTLB_MISS:sa=200000,\
BR_MISP_EXEC.ANY:sa=20000,\
BR_INST_EXEC.ANY:sa=20000 \
-start-paused -analyze-system -app-working-dir <working_dir> -- <command_line_for_pagerank>

Change <working_dir> and <command_line_for_pagerank> to fit your needs when necessary.

(4) We recommend visualizing VTune results on your local machine. To do so, (i) register at https://software.intel.com/en-us/qualify-for-free-software/student, download Intel Parallel Studio, and install it on your local machine; (ii) download your VTune results from the orcrists; and (iii) open them with your local VTune.

(5) We only care about the performance numbers for the main loop. You can use the top-down view in VTune to get to the main loop. See the following image for an example.

[Figure: example of VTune's top-down view]