Part 1 due: Friday, February 13, 5pm. Part 2 due: Friday, February 20, 5pm. No late homework will be accepted.
Work in groups of two. Each group should turn in one homework.
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Second Edition.
This question examines a new application of prediction called "load value prediction." This problem is mostly self-contained (it was an exam problem last year), but you may find it interesting to read the paper "Value Locality and Load Value Prediction" by Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen of Carnegie Mellon University (ASPLOS 1996).
Consider an in-order superscalar, superpipelined machine. The machine issues up to 2 instructions per cycle into an 8-stage pipeline (IF: first instruction fetch, IS: second instruction fetch, RF: register fetch, EX: execute, DF: first data fetch, DS: second data fetch, TC: data fetch tag check, WB: write back). As with the MIPS R4000, cache accesses are pipelined over three stages (DF, DS, TC), with the result available in the second cycle and the tag checked in the third. E.g.:
instr1 | IF | IS | RF | EX | DF | DS | TC | WB |    |
instr2 | IF | IS | RF | EX | DF | DS | TC | WB |    |
instr3 |    | IF | IS | RF | EX | DF | DS | TC | WB |
instr4 |    | IF | IS | RF | EX | DF | DS | TC | WB |
The following table gives the probability that the i-th instruction after a load is the first instruction to use the load's result:
Instruction after load | Use probability |
1     | 35%        |
2     | 20%        |
3     | 15%        |
4     | 10%        |
5     | 6%         |
6     | 4%         |
7     | 2%         |
8-10  | 1% (each)  |
11-15 | 0.5% (each)|
16-25 | 0.1% (each)|
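A table like this can be turned into an average stall count mechanically. The sketch below is illustrative only: the load delay is left as a parameter (computing it for this pipeline is part (a) of the problem), and the exact stall formula depends on the convention you settle on there.

```python
# Probability that the i-th instruction after a load is the first to use
# its result, transcribed from the table above.
use_prob = {1: 0.35, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.06, 6: 0.04, 7: 0.02}
use_prob.update({i: 0.01 for i in range(8, 11)})
use_prob.update({i: 0.005 for i in range(11, 16)})
use_prob.update({i: 0.001 for i in range(16, 26)})

def expected_stall_instructions(delay):
    """Expected stall (in instruction slots) per load, ASSUMING a use at
    distance i < delay stalls for (delay - i) slots; adjust this formula
    to match your pipeline's exact hazard convention."""
    return sum(p * (delay - i) for i, p in use_prob.items() if i < delay)
```

Note that the probabilities sum to less than 1: with some probability no instruction within 25 slots uses the load, and those loads never stall anyone.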
The base CPI of the machine is 0.5 and for the target workload branch stalls add an additional 0.52 cycles per instruction. 25% of all instructions are loads. We will not consider floating point stalls in this problem. For simplicity, we will assume all instructions and data accesses hit in the L1 cache.
a. What is the load delay for this machine?
i. In cycles
ii. In instructions
b. What is the CPI in the base machine?
One proposal to reduce the load delay in such a machine is to take advantage of the fact that a load instruction at a particular address is somewhat likely to load the same value from memory repeatedly. A Load Value Prediction table (LVP) maps the address of load instructions to a predicted value for loads.
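A minimal sketch of such a table, assuming a direct-mapped structure indexed by the low bits of the load's PC (the indexing scheme and default sizing are illustrative, not specified by the problem):

```python
class LVPTable:
    """Load Value Prediction table: maps a load's PC to the last value
    that load produced. A dict stands in for the hardware SRAM."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}  # index -> last loaded value

    def _index(self, pc):
        # Word-aligned instructions: drop the low 2 PC bits before indexing.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted load value, or None if no prediction exists."""
        return self.table.get(self._index(pc))

    def update(self, pc, actual_value):
        """After the real load completes, remember its value for next time."""
        self.table[self._index(pc)] = actual_value
```

Because the table is direct-mapped and untagged in this sketch, two loads whose PCs alias to the same index will overwrite each other's predictions; a real design might add tags or tolerate the aliasing.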
c. Describe how you could use a LVP to reduce the load delay in this CPU. (For now, do not describe how you would recover from mispredictions; we'll talk about that later.)
d. Suppose we add a LVP to the above pipeline. What is the minimum load delay that can be achieved on a LVP hit?
i. In cycles?
ii. In instructions?
e. In such a design, over how many cycles can we pipeline the LVP lookup?
Dealing with mispredictions

Suppose we access the LVP to predict a load value for each instruction, whether it is a load or not. By the time the value would be used, we know whether the instruction was a load, so if it was not a load, we can throw away the prediction with no harm done. However, if the instruction is a load, we must compare the prediction against the real value and roll back on a mismatch. To simplify this problem (and the pipeline), assume that you will flush and refetch all instructions following a mispredicted load, whether or not they depend on the load.
We can compare the prediction and the result during TC. If the prediction was incorrect, squash all instructions in the pipeline that are after the load. The load instruction does its WB normally, and the two instructions immediately after the load restart in IF. (There are quite a few ways we might make restart more efficient, but there are complexity trade-offs.)
f. What is the misprediction penalty when a load value prediction is found to be incorrect under this scheme? (Consider both the case when the load is in the "top half" and the "bottom half" of a cycle.)
g. Assume that a 1024-entry LVP predicts loads correctly 70% of the time. What is the CPI of the machine using such a LVP?
To reduce the cost of mispredictions, researchers have proposed adding a load classification table (LCT) that tracks whether the load associated with an address is predictable or not. Each entry of the LCT is a two-bit saturating counter that assigns the four states 0-3 as "don't predict", "don't predict", "predict", and "predict." The counter is incremented when the predicted value is correct and decremented otherwise. A 256-entry LCT classifies 93% of unpredictable loads as "don't predict" and classifies 82% of predictable loads as "predict."
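The counter described above can be sketched as follows (the class and method names are illustrative):

```python
class LCTEntry:
    """One Load Classification Table entry: a 2-bit saturating counter.
    States 0-1 mean "don't predict"; states 2-3 mean "predict"."""

    def __init__(self, state=0):
        self.state = state  # 0..3

    def should_predict(self):
        return self.state >= 2

    def train(self, prediction_was_correct):
        # Increment toward "predict" on a correct value prediction,
        # decrement toward "don't predict" otherwise; saturate at 0 and 3.
        if prediction_was_correct:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Saturation is what gives the counter hysteresis: a load that has been correctly predicted many times must be wrong twice in a row (from state 3) before the LCT stops trusting its predictions.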
h. Using the LCT, what is the new CPI of the machine?
Part 2

a. Write a program to simulate a branch-prediction buffer.
Let the buffer entry be indexed by the least significant bits of the branch instruction address. As input to the program, use this memory reference trace.
The trace consists of the sequence of memory references in one execution of the tex program. Each memory reference in the sequence is a two field record. The first field is a label which gives the access type of the reference: 0 is read data, 1 is write data, 2 is instruction fetch. The second field is the address accessed by the memory reference; it is specified as a hexadecimal byte-address between (and including) 0 and ffffffff. Assume that any branch instruction in the program results in a taken branch at least once in the trace.
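A parser for this record format might look like the following sketch (whitespace-separated fields are assumed; check against the actual trace file):

```python
# Access-type labels from the trace format described above.
READ, WRITE, IFETCH = 0, 1, 2

def parse_trace(lines):
    """Yield (access_type, address) pairs from an iterable of trace lines.

    Each record is "label address": label is 0 (read data), 1 (write data),
    or 2 (instruction fetch); address is a hexadecimal byte-address in
    0..ffffffff.
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        access_type = int(fields[0])
        address = int(fields[1], 16)  # hex byte-address
        yield access_type, address
```

For the branch-prediction simulation, you will likely filter for instruction fetches (label 2) and detect branches as fetches whose successor address is not the next sequential instruction.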
b. Use your program to calculate the following statistics:
c. Perform your runs first assuming a size of 2K bits for the prediction buffer with each of the three prediction strategies. Then repeat these runs with a size of 1K bits and 4K bits for each strategy. What are your conclusions regarding the effectiveness of the simulated strategies, based on the results of your runs?
d. Hand in well-documented code, the results of your runs (three runs with varying buffer sizes for each of the three prediction strategies), and your conclusions.