You can work in groups. Each group should not have more than 2 students. Each group should do the assignment independently.
(10 points) Consider the following loop-nest, which is referred
to as
"wavefront" in the compiler literature. Assume A stores doubles.
for I = 2,N
for J = 1, N-1
A(I,J) = A(I,J) + A(I-1,J+1)
//mini-kernel for(int j = 0; j < NB; j += NU) for (int i = 0; i < NB; i += MU) load C[i..i+MU-1, j..j+NU-1] into registers for (int k = 0; k < NB; k++) //micro-kernel load A[i..i+MU-1,k] into registers load B[k,j..j+NU-1] into registers multiply A's and B's and add to C's store C[i..i+MU-1, j..j+NU-1](a) (10 points) Write a function that takes three NxN matrices of doubles as input and uses the straight-forward ikj 3-nested loop version of MMM to perform matrix multiplication. Use the allocation routine given in the implementation notes to allocate storage for the three arrays. Run this code to multiply matrices of various sizes, and draw a graph of GFlops vs. matrix size. What fraction of peak performance do you get for large matrices? Explain your results briefly.
We will care about matrix multiply in the range of problems starting with those that do not fit in L2 to those which do not fit in L3. The exact sizes will not be given, but do expect that some tests will be on odd sized matrixes.
All matrixes will be square.
For the competition, we will have a test harness which will
call your code. Thus you MUST provide a function called matmult with the following signature:
void matmult(double* A, double* B, double* C, unsigned N)which multiplies A by B and puts the result in C. A,B, and C are square matrixes of size N.
We will test on lonestar.
You are free to use any vector intrinsics available. You are free to prefetch.
We will compile with gcc -O3.
Implementation notes:
1) In the C programming language, a 2-dimensional array of
doubles is
implemented as a flattened 2-dimensional array of
doubles. When the compiler knows the sizes of the dimentions it
generates row-major indexing expressions for that. Thus for
A[anything][8], it will generate Aflat[r*8 + c] for A[r][c]. Note
that you can always elide the left most dimention off of a type in
C since the compiler doesn't need it. Unfortunately, C does not
pass variable sized multidimentional arrays to functions well
since the number of colums is part of the type. Note that the
type A[][] denotes an array of arrays and not a multidimentional
array. The simple solution is to simple deal with an explicitly
flattened array and generate the index expressions yourself. This
will make the part of the assignment dealing with column-major
easy to do as you will just change the indexing function. For
example, assuming square matrixes:
double* alloc(int N) { return malloc(sizeof(double)*N*N); } double* index(double* matrix, int row, int column, int N) { return &matrix[row*N + column]; }
//compile using: //module load intel //module load boost //module load mkl //icpc -I $TACC_BOOST_INC -L $TACC_BOOST_LIB -O3 -mkl -lboost_timer -DNDEBUG mkl_test.cpp #include#include #include #include using namespace boost::numeric::ublas; void benchmark(int size) { matrix m(size, size); matrix n(size, size); for (int i = 0; i < size; ++i) { for (int j = 0; j < size; ++j) { m(i,j) = 2*i-j; } } matrix r(size, size); std::cout << "Trying " << size << " "; { boost::timer::auto_cpu_timer t; r = prod(m,n); } } int main(int argc, char *argv[]) { mkl_set_num_threads ( 1 ); for (int x = 100; x < 2000; x += 100) benchmark(x); return 0; } ====================== results: size time (s) 100 0.0008 200 0.0085 300 0.027987 400 0.06655 500 0.13358 600 1.1599 700 2.68 800 4.015 900 5.87 1000 8.15 1100 10.97 1200 14.18 1300 18.632 1400 23.088 1500 29.432 1600 36.54 1700 43.67