CS 378: Programming for Performance

Assignment 2: Cache Measurement

Due date: October 1

You can do this assignment alone or with someone else from class. Each group can have a maximum of two students. Each group should turn in one submission.

1. Miss ratio measurement in simulator (50 points)

Consider the six permutations of the three loops (IJK, IKJ, JIK, JKI, KIJ, KJI) in the matrix multiplication pseudocode shown below:

for I = 1, N
   for J = 1, N
     for K = 1, N
       C[I,J] = C[I,J] + A[I,K]*B[K,J]

(a) Write C code for implementing these six versions of matrix multiplication. All three arrays should contain doubles. Use the code shown in the implementation notes below to allocate storage for the arrays.
(b) Using Dinero, determine the miss ratios of your programs for different problem sizes. Make sure you read the measurement protocol given at the end of this assignment before doing your experiments. Warning: for some of these permutations, each matrix multiplication can take an hour or more. Model only an L1 data cache, with the same parameters as the CPU on lonestar.
(c) Plot the miss ratio as a function of N for all six permutations. Report the cache parameters used for Dinero.
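
As a minimal sketch of what two of the six loop orders might look like, assume each matrix is a flat row-major array of N*N doubles. The `idx` helper and the names `mm_ijk` and `mm_ikj` are illustrative and are not part of any code provided with this assignment. Indices are 0-based here, unlike the 1-based pseudocode above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (i, j) in a flat row-major n x n array.
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t n) {
  return i * n + j;
}

// I-J-K order: the innermost loop walks along a row of A and down a column of B.
void mm_ijk(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, std::size_t n) {
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        C[idx(i, j, n)] += A[idx(i, k, n)] * B[idx(k, j, n)];
}

// I-K-J order: same statement, but the innermost loop walks along rows of B and C.
void mm_ikj(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, std::size_t n) {
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < n; ++j)
        C[idx(i, j, n)] += A[idx(i, k, n)] * B[idx(k, j, n)];
}
```

Both functions compute the same product; only the order in which memory is touched differs, which is what the miss ratio plots are meant to expose.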

2. Miss ratio measurement on real hardware (50 points)

In part 1, you wrote six variations of matrix-matrix multiply, generated address traces, and ran those traces through a cache simulator to plot miss ratios as a function of N. Your results from that part should validate the model presented in class. All models are approximations, however. In this part, you will perform the same experiment on real hardware. Real hardware has more going on than either the model or the cache simulator accounts for, so we expect some divergence from the behaviour the model predicts. The implementation notes give code for using PAPI, along with some critical information you need to collect valid numbers.

(a) Instrument your matrix-matrix multiply implementations using PAPI to measure:
- PAPI_LST_INS (Total Load Store Instructions)
- PAPI_FP_INS (Total Floating Point Instructions)
- PAPI_L1_DCM (Data L1 miss)
- PAPI_L2_TCM (Total L2 miss)
You are doing coarse measurements of a block of code, not individual instructions, so you can simply read the counters before and after the entire matrix-matrix multiplication.
(b) Using lonestar at TACC, collect these measurements for each matrix size as before. Since the values you obtain depend heavily on the machine you use, you must use lonestar for the numbers you report. To avoid interference from other processes, submit your runs to the job scheduler.
(c) Plot the miss rates for L1 and L2 for each of the matrix sizes.
(d) (optional) Can you explain the discrepancy between the model and your collected data?
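
One plausible way to turn the four counter values into miss rates is sketched below. The `Counters` struct and the function names are hypothetical. The choice of denominators is an assumption: L1 misses are taken per load/store instruction, and the "local" L2 rate approximates L2 accesses by L1 data misses, even though PAPI_L2_TCM also counts non-data misses. State whichever definition you use in your report.

```cpp
#include <cassert>

// Hypothetical container for the values read from PAPI for one run.
struct Counters {
  long long lst_ins;  // PAPI_LST_INS
  long long l1_dcm;   // PAPI_L1_DCM
  long long l2_tcm;   // PAPI_L2_TCM
};

// L1 data miss rate: L1 data misses per load/store instruction.
double l1_miss_rate(const Counters& c) {
  assert(c.lst_ins > 0);
  return static_cast<double>(c.l1_dcm) / static_cast<double>(c.lst_ins);
}

// Local L2 miss rate (assumption: L2 accesses are approximated by L1 data misses).
double l2_local_miss_rate(const Counters& c) {
  if (c.l1_dcm == 0) return 0.0;  // nothing reached L2
  return static_cast<double>(c.l2_tcm) / static_cast<double>(c.l1_dcm);
}

// Global L2 miss rate: L2 misses per load/store instruction.
double l2_global_miss_rate(const Counters& c) {
  assert(c.lst_ins > 0);
  return static_cast<double>(c.l2_tcm) / static_cast<double>(c.lst_ins);
}
```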

Implementation notes:

Dinero is available from: http://pages.cs.wisc.edu/~markhill/DineroIV/ . You should read the man pages that come with it. It is easiest to use the subroutine interface. Example code is:

extern "C" {
#include "d4-7/d4.h"
}

#include <cstdlib>   // abort
#include <iostream>  // std::cerr, std::cout

void doread(void* addr, d4cache* Cache) {
  d4memref R;
  R.address = (d4addr)addr;
  R.size = sizeof(double);
  R.accesstype = D4XREAD;
  d4ref(Cache, R);
}
void dowrite(void* addr, d4cache* Cache) {
  d4memref R;
  R.address = (d4addr)addr;
  R.size = sizeof(double);
  R.accesstype = D4XWRITE;
  d4ref(Cache, R);
}

void matmult_ijk(Matrix& A, Matrix& B, Matrix& C, d4cache* Cache) {
  unsigned N = A.size();
  for (unsigned i = 0; i < N; ++i) {
    doread(&A(i,i), Cache);
    doread(&B(i,i), Cache);
    dowrite(&C(i,i), Cache);
  }
}

int main(int argc, char** argv) {

  d4cache* Mem;
  d4cache* L1;
  Mem = d4new(0);
  L1 = d4new(Mem);
  L1->name = "L1";
  L1->lg2blocksize = 8;
  L1->lg2subblocksize = 6;
  L1->lg2size = 20;
  L1->assoc = 2;
  L1->replacementf = d4rep_lru;
  L1->prefetchf = d4prefetch_none;
  L1->wallocf = d4walloc_always;
  L1->wbackf = d4wback_always;
  L1->name_replacement = L1->name_prefetch = L1->name_walloc = L1->name_wback = "L1";
  int r;
  if (0 != (r = d4setup())) {
    std::cerr << "Failed\n";
    abort();
  }

  Matrix A(10), B(10), C(10);
  matmult_ijk(A, B, C, L1);

  std::cout << L1->miss[D4XREAD]
    + L1->miss[D4XWRITE]
    + L1->miss[D4XINSTRN]
    + L1->miss[D4XMISC]
    + L1->miss[D4XREAD+D4PREFETCH]
    + L1->miss[D4XWRITE+D4PREFETCH]
    + L1->miss[D4XINSTRN+D4PREFETCH]
    + L1->miss[D4XMISC+D4PREFETCH]
    << " of "
    << L1->fetch[D4XREAD]
    + L1->fetch[D4XWRITE]
    + L1->fetch[D4XINSTRN]
    + L1->fetch[D4XMISC]
    + L1->fetch[D4XREAD+D4PREFETCH]
    + L1->fetch[D4XWRITE+D4PREFETCH]
    + L1->fetch[D4XINSTRN+D4PREFETCH]
    + L1->fetch[D4XMISC+D4PREFETCH]
    << "\n";

  return 0;
}

The sample Dinero code above does not compute a matrix multiply; it only shows how to feed an address trace into Dinero. You will have to write the matrix multiply code.

This code does not model the correct cache parameters. The man page distributed with the Dinero source explains the cache parameters and the API calls that are available. The man page can also be found with Google.

Be sure to start with a clean cache for each measurement.

You can check /proc/cpuinfo to find out the cpu lonestar uses. From that you can find out the cache parameters you should use.
Read http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide#running to learn how to submit jobs to lonestar.
There isn't an architectural way to flush a cache from userspace, but if you reflect on how a cache works, you should see that you can do it anyway. Allocate an array several times larger than the cache you are flushing. Walk through the array, writing to each location. You should be able to convince yourself this will flush the cache.
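
The flush described above could be sketched as follows. The function name `flush_cache` and the default factor of 4 are assumptions made for illustration; pick a buffer size based on the largest cache you need to clear. The `volatile` pointer and the returned checksum are only there to keep the compiler from discarding the writes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Writes every byte of a buffer `factor` times larger than the cache,
// so that data cached earlier is pushed out.
// cache_bytes: size of the cache being flushed (supplied by the caller).
// Returns a checksum of the written bytes.
unsigned long flush_cache(std::size_t cache_bytes, std::size_t factor = 4) {
  assert(cache_bytes > 0 && factor > 1);
  std::vector<unsigned char> buf(cache_bytes * factor);
  volatile unsigned char* p = buf.data();
  unsigned long sum = 0;
  for (std::size_t i = 0; i < buf.size(); ++i) {
    p[i] = static_cast<unsigned char>(i);  // write to each location
    sum += p[i];
  }
  return sum;
}
```

For example, `flush_cache(12u << 20)` would walk a 48 MB buffer, which is several times the 12 MB L3 size reported by /proc/cpuinfo.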
For real hardware your protocol should look like:
```
allocate A,B,C
initialize A,B,C
flush cache
do 1 matrix matrix multiply
```
You cannot do more than one matrix-matrix multiply for a given allocation or you will not get valid numbers. It is fine to put the protocol above in a loop, but you must reallocate A, B, and C each time. This will randomize the starting address of each matrix, which is related to the optional question of why the model and the real machine diverge.
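
A rough sketch of the protocol placed in a loop is given below. Everything here is illustrative: `run_trials`, the `Mat` alias, and the two callbacks are hypothetical names. The `flush` callback stands for whatever cache flush you implement, and `measured_multiply` stands for one matrix multiply wrapped in your counter reads. The point of the sketch is only that A, B, and C are allocated inside the loop body, so every trial gets a fresh allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Mat = std::vector<double>;  // flat row-major n x n matrix (assumption)

// Runs `trials` independent measurements for problem size n.
std::vector<long long> run_trials(
    std::size_t n, int trials,
    const std::function<void()>& flush,
    const std::function<long long(const Mat&, const Mat&, Mat&)>& measured_multiply) {
  assert(n > 0 && trials > 0);
  std::vector<long long> results;
  for (int t = 0; t < trials; ++t) {
    // allocate and initialize A, B, C (new storage every trial)
    Mat A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
    flush();                                          // flush cache
    results.push_back(measured_multiply(A, B, C));    // exactly one multiply
  }  // A, B, C are released here, before the next trial allocates again
  return results;
}
```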

PAPI:

To program with PAPI on lonestar, first load the module:

module load papi

To see which papi counters are available on a host, do:

papi_avail

Example code to use papi follows:

#include <papi.h>
#include <cstdio>    // printf
#include <cstdlib>   // exit
#include <iostream>  // std::cout

void handle_error (int retval)
{
  printf("PAPI error %d: %s\n", retval, PAPI_strerror(retval));
  exit(1);
}
void init_papi() {
  int retval = PAPI_library_init(PAPI_VER_CURRENT);
  if (retval != PAPI_VER_CURRENT && retval > 0) {
    printf("PAPI library version mismatch!\n");
    exit(1);
  }
  if (retval < 0) handle_error(retval);

  std::cout << "PAPI Version Number: MAJOR: " << PAPI_VERSION_MAJOR(retval)
            << " MINOR: " << PAPI_VERSION_MINOR(retval)
            << " REVISION: " << PAPI_VERSION_REVISION(retval) << "\n";
}
int begin_papi(int Event) {
  int EventSet = PAPI_NULL;
  int rv;
  /* Create the Event Set */
  if ((rv = PAPI_create_eventset(&EventSet)) != PAPI_OK)
    handle_error(rv);
  if ((rv = PAPI_add_event(EventSet, Event)) != PAPI_OK)
    handle_error(rv);
  /* Start counting events in the Event Set */
  if ((rv = PAPI_start(EventSet)) != PAPI_OK)
    handle_error(rv);
  return EventSet;
}
long_long end_papi(int EventSet) {
  long_long retval;
  int rv;

  /* get the values */
  if ((rv = PAPI_stop(EventSet, &retval)) != PAPI_OK)
    handle_error(rv);

  /* Remove all events in the eventset */
  if ((rv = PAPI_cleanup_eventset(EventSet)) != PAPI_OK)
    handle_error(rv);

  /* Free all memory and data structures, EventSet must be empty. */
  if ((rv = PAPI_destroy_eventset(&EventSet)) != PAPI_OK)
    handle_error(rv);

  return retval;
}
int main(int argc, char** argv) {
  init_papi();
  int EventSet = begin_papi(PAPI_TOT_INS);
  DoTest();  // the code being measured (not defined here)
  long_long r = end_papi(EventSet);
  std::cout << "Total instructions: " << r << "\n";
  return 0;
}

HINTS:

You only need to model the L1 data cache in Dinero. We don't care about the instruction cache and it would be a lot more work to generate an address trace for that cache anyway.

Don't forget that *C += ... is a read, then a write of C.
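
To make the hint concrete, here is a small sketch that counts the references one I-J-K multiply would generate if every access to C is traced as a read followed by a write. `TraceCounter` and `trace_ijk` are hypothetical names; in your simulator code the `read` and `write` calls would instead forward the address to Dinero (for example through functions like `doread` and `dowrite` from the sample above). The flat row-major layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the simulator: just counts references by type.
struct TraceCounter {
  unsigned long reads = 0;
  unsigned long writes = 0;
  void read(const void*) { ++reads; }
  void write(const void*) { ++writes; }
};

// I-J-K multiply that reports every data reference to the counter.
void trace_ijk(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& C, std::size_t n, TraceCounter& t) {
  assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k) {
        t.read(&A[i * n + k]);
        t.read(&B[k * n + j]);
        t.read(&C[i * n + j]);   // C += ... reads C first
        C[i * n + j] += A[i * n + k] * B[k * n + j];
        t.write(&C[i * n + j]);  // ... then writes it back
      }
}
```

Under this convention each inner iteration produces three reads and one write, so an N x N multiply yields 3N^3 reads and N^3 writes.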

For both parts, run the experiments with N = 1 .. 512.

The L1 cache on lonestar is not 12 MB; /proc/cpuinfo only shows the L3 cache. That file has other information that you can use to find the L1 cache parameters, and there are other ways to do it.
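
As one example of "another way", glibc exposes L1 data cache parameters through `sysconf`. The sketch below assumes a Linux system with glibc; the `_SC_LEVEL1_DCACHE_*` names are glibc extensions, and the call may return 0 or -1 when the value is unknown, so cross-check the result against the processor's documentation. The `L1Params` struct, `query_l1d`, and `lg2_exact` are hypothetical names. The `lg2_exact` helper converts a power-of-two size into the log2 form that Dinero fields such as `lg2size` and `lg2blocksize` expect.

```cpp
#include <cassert>
#include <unistd.h>

// Hypothetical container for the queried values (-1 means "not available").
struct L1Params {
  long size;   // total size in bytes
  long assoc;  // associativity
  long line;   // line (block) size in bytes
};

// Queries the L1 data cache of the machine this code runs on.
L1Params query_l1d() {
  L1Params p = {-1, -1, -1};
#ifdef _SC_LEVEL1_DCACHE_SIZE
  p.size = sysconf(_SC_LEVEL1_DCACHE_SIZE);
  p.assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);
  p.line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
  return p;
}

// Returns log2(x) for a power of two x.
int lg2_exact(long x) {
  assert(x > 0 && (x & (x - 1)) == 0);
  int lg = 0;
  while (x > 1) {
    x >>= 1;
    ++lg;
  }
  return lg;
}
```

Remember that the values describe the host the program runs on, so the query is only meaningful when executed on lonestar itself.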

There is no requirement that you wrap the matrix in a class. The amount of syntactic sugar you want to apply is up to you, as long as it doesn't negatively impact performance (say by requiring more pointer dereferences than necessary). The memory representation is up to you as long as it is sensible for a dense matrix and is row-major. One can think of two representations (with a couple variations) that are sensible under these constraints and they should both give you about the same answer.
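
If you do want a wrapper, one possible shape is sketched below. It is only an assumption about what a `Matrix` type compatible with the sample code's `A.size()` and `A(i,i)` could look like, not the class used by the course staff. It stores the elements in a single contiguous row-major block, so `operator()` costs one index computation and no extra pointer dereference per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense square matrix of doubles in one contiguous row-major block.
class Matrix {
  std::size_t n_;
  std::vector<double> data_;

 public:
  explicit Matrix(std::size_t n) : n_(n), data_(n * n, 0.0) {}

  // Number of rows (equal to the number of columns).
  std::size_t size() const { return n_; }

  // Element (i, j); returns a reference so &A(i, j) gives its address.
  double& operator()(std::size_t i, std::size_t j) {
    assert(i < n_ && j < n_);
    return data_[i * n_ + j];
  }
  const double& operator()(std::size_t i, std::size_t j) const {
    assert(i < n_ && j < n_);
    return data_[i * n_ + j];
  }
};
```

Because `operator()` returns a reference, the address of an element can be passed straight to a trace function, and consecutive elements of a row sit at consecutive addresses.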