CS 378: Programming for Performance
Assignment 1
Due date: February 19th, 2008
Note: The contents of the last question have changed since the assignment was posted on 2/09.
You can do this assignment alone or with someone else from class.
Each group can have a maximum of two students.
Each group should turn in one submission.
1. Pipeline forwarding paths (10 points)
Consider the simple five-stage pipeline (Fetch, Decode, Execute, Memory, Write-back) discussed in class. For the purpose of this problem, make the following assumptions about the baseline implementation of the processor pipeline:
- every instruction takes 5 cycles to complete unless it stalls,
- hardware interlocks are available wherever needed, and
- there is no forwarding along any path.
For each of the following pairs of instructions, determine how many cycles it would take to complete execution in the baseline implementation. Then identify any forwarding paths in the processor that would speed up execution of that instruction pair, and determine how many cycles would be saved by that addition.
a) ADD R1, R2, R3
ADD R4, R1, R2
b) LOAD R1, (R2)
ADD R3, R1, R2
c) LOAD R1, (R2)
STORE (R3), R1
d) Are there any other forwarding paths that might be useful? If so, specify those paths and, for each one, show an instruction sequence for which it reduces the cycle count.
2. Interlocks (5 points)
As mentioned in class, early versions of the MIPS processor
had an "exposed pipeline" (that is, the assembly language programmer
needed to know the latencies of operations and had to insert NO-OPS or
other operations between dependent instructions to guarantee
correctness). Later versions of the MIPS processor abandoned this
idea. Argue from a software engineering perspective why this was
the right move.
3. Execution time (25 points)
Consider the following straight-line program to compute values for
variables D and H. All variables are memory-resident, integer-valued, and located at distinct memory addresses.
D = (A*X+B)*X+C
H = (E*X+F)*X+G
A compiler for a DLX machine generates the following code for this
program fragment. The memory addresses are symbolic names.
1: LW R1, X
2: LW R2, A
3: MULT R10, R1, R2
4: LW R3, B
5: ADD R10, R10, R3
6: MULT R10, R10, R1
7: LW R4, C
8: ADD R10, R10, R4
9: SW D, R10
10: LW R5, E
11: MULT R11, R1, R5
12: LW R6, F
13: ADD R11, R11, R6
14: MULT R11, R11, R1
15: LW R7, G
16: ADD R11, R11, R7
17: SW H, R11
(a) Enumerate all the data hazards that exist among the 17
instructions. Classify them as RAW, WAR, and WAW hazards.
(b) Draw the precedence graph for the program.
(c) Assume a five-stage linear pipelined implementation of DLX with the following additional assumptions (both admittedly unrealistic): all operations spend a single cycle in the EX stage, and there are no forwarding paths whatsoever. Represent the execution of the program above with the help of a space-time diagram or equivalent, and determine the number of clock cycles it will take to complete execution. Indicate a stalled instruction by circling or boxing the table entry.
(d) Still assuming no forwarding, rearrange the order of instructions
in this program to reduce execution time without violating any
precedence constraints. Represent the execution of the rearranged
program with the help of a space-time diagram or equivalent, and
determine the number of clock cycles it will take to complete
execution.
(e) Now assume that all reasonable forwardings are available. Represent
the execution of the original program with the help of a space-time
diagram or equivalent, and determine the number of clock cycles it will
take to complete execution.
(f) Still assuming that all reasonable forwardings are available,
represent the execution of the rearranged program with the help of a
space-time diagram or equivalent, and determine the number of clock
cycles it will take to complete execution.
(g) Summarize the execution times of the four program executions in a
two-dimensional table. Label one axis with the lack (or availability)
of forwarding, and the other with the lack (or availability) of
compile-time scheduling.
4. Optimizing loops for pipelined execution (60 points)
In this problem, you will optimize a loop for execution on a simple
pipelined processor. The processor we will use is an implementation of
the IBM PowerPC architecture that is part of the IBM Cell engine.
We will use a program called mambo to simulate program execution on
this processor; mambo provides detailed information about program
execution such as the number of instructions executed, the number of
cycles it took to execute the program, cache misses at various levels,
etc. Information about the processor and the simulator is given at the end of this assignment. You need to compile the code using ppu-gcc, which the Mambo setup instructions ask you to install; the Makefile of the template program already uses it. Compile at gcc optimization level -O2 to make sure that program variables are allocated in registers.
(a) The following C program initializes all the elements of an array
containing floating-point numbers.
Instead of using array indexing, it uses pointers into the array so
that we can use a relatively simple compiler and still get reasonable
code.
#define N 1000
int main()
{
    register double sum = 0;
    double x[N];
    register double *p = &x[0], *q = &x[N]; // p points to the start of the array, q one past its end
    while (p != q) {
        *p = 1.0;
        p++;
    }
    return *(p-1); // avoid dead code elimination due to compiler optimizations
}
Compile this program, study the assembly listing, and make sure you understand its structure.
Use mambo to determine the number of instructions and the total
execution time in cycles of this program. Note that the counts returned
by mambo include a lot of instructions that are executed in the system
before control is transferred to your program.
(b) Modify this C program by inserting after the initialization loop a
"reduction" loop that sums up all the elements of the array. The
structure of this loop should be similar to that of the initialization
loop shown above. Make sure that the compiler's dead-code elimination does not remove the reduction loop; you can use a trick similar to the one in the example code above (a possible sketch appears after this part). Using mambo, determine the number of
instructions and total execution time in cycles of this new program. By
subtracting the quantities you measured in part (a) from these, you can
determine the number of the instructions executed and execution time in
cycles of the reduction loop. Study the assembly listing again and see
if you can explain these values.
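A possible shape for this modified program is sketched below. This is just one way to structure it, assuming the same pointer style as the part (a) code; the cast of sum in the return statement is one illustrative way to keep the reduction loop from being eliminated as dead code.
#define N 1000
int main()
{
    register double sum = 0;
    double x[N];
    register double *p = &x[0], *q = &x[N];
    while (p != q) {      // initialization loop from part (a)
        *p = 1.0;
        p++;
    }
    p = &x[0];            // reset p to the start of the array
    while (p != q) {      // reduction loop: add up all the elements
        sum += *p;
        p++;
    }
    return (int) sum;     // using sum in the return value keeps the loop live
}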
(c) One advantage of using a simulator is that it is less susceptible
to measurement noise than a real machine would be. To reduce the impact
of noise on measurements, it is standard to repeat a measurement some
number of times, and compute averages or some other statistics. Modify
your code in (b) so that the reduction loop is executed 10 times (you
can wrap your reduction loop with another loop to accomplish this).
Using mambo and some calculation, compute the average number of
instructions and execution time in cycles required for the reduction
loop. Are these quantities close to those you measured in part (b) or
is there a lot of variation?
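One possible shape for this repetition, assuming the variables from the sketch in part (b); the counter name reps is illustrative, and the pointer and accumulator are reset on every pass so that each pass does the same work:
    int reps;    // declared with the other locals in main

    for (reps = 0; reps < 10; reps++) {   // repeat the reduction 10 times
        register double *r = &x[0];       // restart at the beginning of the array
        sum = 0;                          // reset the accumulator for each pass
        while (r != q) {
            sum += *r;
            r++;
        }
    }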
(d) In the PowerPC implementation you are using, floating-point adds
take 10 cycles and the adder is pipelined. One problem with the
reduction loop you wrote in part (b) is that the flow dependence
between the additions in successive iterations limits exploitation of
the pipelined adder. As mentioned in class, one way to improve
performance is to split the overall reduction into some number of
smaller reductions of subsets of the array elements, and overlap the
execution of the smaller reductions. For example, you can split the
overall sum into two sums, one of which adds the even elements of the
array while the other adds the odd elements of the array, and then sum
the two results at the end. As shown in class, this can be implemented
by loop unrolling and renaming of variables, and the hardware will be
able to overlap the execution of the two sums.
By hand, unroll the reduction loop in (b) by the following amounts successively: 2, 4, 8, and 12, and measure the number of instructions and execution time in cycles for each unrolling amount. From your results, determine the best unrolling amount. Is this value reasonable, given what you know about the instruction latencies of your machine?
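For illustration, a 2-way unrolling of the reduction loop might look like the sketch below; the accumulator names are made up, and the same pattern extends to 4, 8, and 12. Note that this sketch assumes N is a multiple of the unrolling amount; for amounts that do not divide N evenly (such as 12 with N = 1000), a short cleanup loop is needed for the leftover elements.
    register double sum0 = 0, sum1 = 0;   // independent accumulators
    register double *r = &x[0];
    while (r != q) {                      // assumes N is a multiple of 2
        sum0 += *r;                       // one dependence chain: even elements
        sum1 += *(r + 1);                 // a second, independent chain: odd elements
        r += 2;
    }
    sum = sum0 + sum1;                    // combine the partial sums at the end
Because sum0 and sum1 are updated independently, the pipelined floating-point adder can make progress on both chains at once instead of waiting 10 cycles between dependent additions.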
Paper on Cell engine (has description of Power processor)
Here are some relevant facts about the processor:
- latency of simple integer operations: 2 cycles
- latency of loads (L1 hits): 2 cycles
- double-precision floating-point adds: 10 cycles
- L1 instruction and data cache capacities: 32 KB
- L2 cache capacity: 512 KB
mambo installation instructions
Sample program to run on mambo (MMM)