CS 378: Programming for Performance

Assignment 1

Due date: February 19th, 2008

Note: The contents of last question have changed since assignment was posted on 2/09

You can do this assignment alone or with someone else from class.
Each group can have a maximum of two students.
Each group should turn in one submission.

1. Pipeline forwarding paths (10 points)

Consider the simple 5 stage pipeline (Fetch, Decode, Execute, Memory, Write-back) discussed in class. For the purpose of this problem, make the following assumptions about the baseline implementation of the processor pipeline:

every instruction takes 5 cycles to complete unless it stalls,
hardware interlocks are available wherever needed, and
there is no forwarding along any path.

For each of the following pairs of instructions, determine how many cycles it would take to complete execution in the baseline implementation. Then identify any forwarding paths in the processor that would speed up execution of that instruction pair, and determine how many cycles would be saved by that addition.

a) ADD R1, R2, R3
    ADD R4, R1, R2

b) LOAD R1, (R2)
    ADD R3, R1, R2

c) LOAD R1, (R2)
    STORE (R3), R1

d) Are there any other forwarding paths that might be useful? If so, specify those paths and for each one, show an instruction sequence for which it reduces the cycle count.

2. Interlocks (5 points)

As mentioned in class, early versions of the MIPS processor had an "exposed pipeline" (that is, the assembly language programmer needed to know the latencies of operations and had to insert NO-OPS or other operations between dependent instructions to guarantee correctness). Later versions of the MIPS processor abandoned this idea. Argue from a software engineering perspective why this was the right move.

3. Execution time (25 points)

Consider the following straight-line program to compute values for variables D and H. All variables are memory-resident,integer-valued, and located at distinct memory addresses.

D = (A*X+B)*X+C
H = (E*X+F)*X+G

A compiler for a DLX machine generates the following code for this program fragment. The memory addresses are symbolic names.

1: LW R1, X
2: LW R2, A
3: MULT R10, R1, R2
4: LW R3, B
5: ADD R10, R10, R3
6: MULT R10, R10, R1
7: LW R4, C
8: ADD R10, R10, R4
9: SW D, R10
10: LW R5, E
11: MULT R11, R1, R5
12: LW R6, F
13: ADD R11, R11, R6
14: MULT R11, R11, R1
15: LW R7, G
16: ADD R11, R11, R7
17: SW H, R11

(a) Enumerate all the data hazards that exist among the 17 instructions. Classify them as RAW, WAR, and WAW hazards.

(b) Draw the precedence graph for the program.

(c) Assume a five-stage linear pipelined implementation of DLX with the following additional assumptions (both admittedly unrealistic): all operations spend a single cycle in the EX stage, and there are no forwarding paths whatsoever. Represent the execution of the program above with the help of a space-time diagram or equivalent, and determine the number of clock cycles it will take to complete execution. Indicate a stalled instruction by circling or boxing the table entry.

(d) Still assuming no forwarding, rearrange the order of instructions in this program to reduce execution time without violating any precedence constraints. Represent the execution of the rearranged program with the help of a space-time diagram or equivalent, and determine the number of clock cycles it will take to complete execution.

(e) Now assume that all reasonable forwardings are available. Represent the execution of the original program with the help of a space-time diagram or equivalent, and determine the number of clock cycles it will take to complete execution.

(f) Still assuming that all reasonable forwardings are available, represent the execution of the rearranged program with the help of a space-time diagram or equivalent, and determine the number of clock cycles it will take to complete execution.

(g) Summarize the execution times of the four program executions in a two-dimensional table. Label one axis with the lack (or availability) of forwarding, and the other with the lack (or availability) of compile-time scheduling.

4. Optimizing loops for pipelined execution (60 points)

In this problem, you will optimize a loop for execution on a simple pipelined processor. The processor we will use is an implementation of the IBM PowerPC architecture that is part of the IBM Cell engine.
We will use a program called mambo to simulate program execution on this processor; mambo provides detailed information about program execution such as the number of instructions executed, the number of cycles it took to execute the program, cache misses at various levels, etc. Information about the processor and the simulator are given at the end of this assignment. You need to compile the code using ppu-gcc. The Mambo setup instructions ask you to install it. The Makefile of template program uses ppu-gcc. The O2 level gcc optimization needs to be used, to make sure that program variables are allocated in registers.

(a) The following C program initializes all the elements of an array containing floating-point numbers.
Instead of using array indexing, it uses pointers into the array so that we can use a relatively simple compiler and still get reasonable code.

#define N 1000
int main()
{
    register double sum = 0;
    double x[N];
    register double *p = &x[0], *q = &x[N]; //p and q are pointers to start and end of array
    while (p != q) {
         *p = 1.0;
           p++;
    }
    return *(p-1); //Avoid dead code elimination due to compiler optimizations
}

Compile this program and study the assembly listing and make sure you understand its structure.
Use mambo to determine the number of instructions and the total execution time in cycles of this program. Note that the counts returned by mambo include a lot of instructions that are executed in the system before control is transferred to your program.

(b) Modify this C program by inserting after the initialization loop a "reduction" loop that sums up all the elements of the array. The structure of this loop should be similar to that of the initialization loop shown above. Make sure that dead code elimination technique used by compiler does not eliminate the reduction loop. You can use similar trick as used by example code above. Using mambo, determine the number of instructions and total execution time in cycles of this new program. By subtracting the quantities you measured in part (a) from these, you can determine the number of the instructions executed and execution time in cycles of the reduction loop. Study the assembly listing again and see if you can explain these values.

(c) One advantage of using a simulator is that it is less susceptible to measurement noise than a real machine would be. To reduce the impact of noise on measurements, it is standard to repeat a measurement some number of times, and compute averages or some other statistics. Modify your code in (b) so that the reduction loop is executed 10 times (you can wrap your reduction loop with another loop to accomplish this). Using mambo and some calculation, compute the average number of instructions and execution time in cycles required for the reduction loop. Are these quantities close to those you measured in part (b) or is there a lot of variation?

(d) In the PowerPC implementation you are using, floating-point adds take 10 cycles and the adder is pipelined. One problem with the reduction loop you wrote in part (b) is that the flow dependence between the additions in successive iterations limits exploitation of the pipelined adder. As mentioned in class, one way to improve performance is to split the overall reduction into some number of smaller reductions of subsets of the array elements, and overlap the execution of the smaller reductions. For example, you can split the overall sum into two sums, one of which adds the even elements of the array while the other adds the odd elements of the array, and then sum the two results at the end. As shown in class, this can be implemented by loop unrolling and renaming of variables, and the hardware will be able to overlap the execution of the two sums.

By hand, unroll the reduction loop in (b) by the following amounts successively: 2,4,8,12, and measure the number of instructions and execution time in cycles for each unrolling amount. From your results, determine the best unrolling amount. Is this value reasonable, given what you know what about the instruction latencies of your machine?

Paper on Cell engine (has description of Power processor)

Here are some relevant facts about the processor:

latency of simple integer operations: 2 cycles
latency of loads (L1 hits): 2 cycles
double-precision floating-point adds: 10 cycles
L1 instruction and data cache capacities: 32 KB
L2 cache capacity: 512 KB

mambo installation instructions

Sample program to run on mambo (MMM)