## A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems

Stephen W. Keckler<sup>1</sup>, Doug Burger<sup>1</sup>, **Charles R. Moore**<sup>1</sup>, Ramadass Nagarajan<sup>1</sup>, Karthikeyan Sankaralingam<sup>1</sup>, Vikas Agarwal<sup>2</sup>, M.S. Hrishikesh<sup>2</sup>, Nitya Ranganathan<sup>1</sup>, and Premkishore Shivakumar<sup>2</sup>

Computer Architecture and Technology Laboratory; <sup>1</sup>Department of Computer Sciences; <sup>2</sup>Department of Electrical and Computer Engineering; The University of Texas at Austin

# <u>Outline</u>

- Progress and Limitations of Conventional Superscalar Designs
- Grid Processor Architecture (GPA) Overview
  - Block Compilation
  - Block Execution Flow
  - Results
- Extending the GPA with *Polymorphism*
- Conclusion and Future Work

#### <u>Superscalar Core – "Spot the ALU"</u>



Only 12% of Non-Cache, Non-TLB Core Area is Execution Units

# Looking Back: Conventional Superscalar

- Enormous gains in frequency
  - **1998**: 500MHz → **2002**: 3000MHz
  - Equal contributions from pipelining and technology
- IPC Basically Unchanged
  - **1998**: ~1 IPC → **2002**: ~1 IPC
  - uArch innovations just overcome losses due to pipelining
  - Issue width remains at 4 instructions
- Pushing the limits of Complexity Management
  - uArch innovations  $\rightarrow$  Verification is the Gate
  - Hundreds of full custom macros
  - 250-500 person design teams
  - Execution units are a small % of processor area

#### Faster, Higher IPC Superscalar Processors?

<u>Faster</u> → Deeper Pipelines (8 FO4)

- Key latencies increase ... IPC decreases
  - Pipeline bubbles
  - uArch innovations mitigate losses, but ...
    - Increases *complexity* and *performance anomalies*
- After 8 FO4 jump, frequency growth limited to technology only

**<u>Higher IPC</u>**  $\rightarrow$  Wide Issue (16) and Large Window (512+)

- Growth is *quadratic* but gain is *logarithmic* 
  - Broadcast results to all pending instructions
  - Studies indicate only incremental performance gains
- Wire delay limits size of monolithic structures
  - Large structures must be partitioned to meet cycle time
  - Key latencies increase, reducing IPC gain (again!)
  - Additional logical and circuit complexity
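The "quadratic growth" point can be made concrete with a rough count of wakeup tag comparisons. This is a minimal sketch under a simplified model, not a figure from the talk: it assumes every result produced in a cycle is broadcast to every source operand of every waiting instruction. The function name, the two-source-operand default, and the 64-entry baseline window are illustrative assumptions; only the 16-issue, 512-entry configuration comes from the slide.

```python
def wakeup_comparisons(issue_width: int, window_size: int,
                       sources_per_inst: int = 2) -> int:
    """Tag comparisons per cycle under a full-broadcast wakeup model.

    Each of `issue_width` results is compared against every source
    operand of every instruction waiting in the window.
    """
    return issue_width * window_size * sources_per_inst


# Assumed baseline: 4-issue with a 64-entry window (hypothetical).
baseline = wakeup_comparisons(issue_width=4, window_size=64)
# Configuration named on the slide: 16-issue with a 512-entry window.
wide = wakeup_comparisons(issue_width=16, window_size=512)

print(baseline, wide, wide // baseline)  # 512 16384 32
```

Under these assumptions, a 4x wider machine with an 8x larger window needs 32x the comparisons per cycle, which is the kind of growth that forces partitioning.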

#### <u>Superscalar Cores – Key Circuit Elements</u>

|             | Conventional 4-Issue          | Hypothetical 16-issue           |
|-------------|-------------------------------|---------------------------------|
| Execution   | 2 FP, 2 INT, 2 LD/ST          | 8 FP, 8 INT, 8 LD/ST            |
| I-Cache     | 64KB 1 Port, 64B (1 instance) | 128KB <b>2 Ports</b> , 128B (1) |
| Mapper      | 8 port x 72-entry CAM (2)     | 32 port x 512-entry CAM (2)     |
| Issue Queue | 4P x 20-entry dual CAM (3)    | 4P x 40-entry dual CAM (12)     |
| RegFiles    | 72-entry, 4R, 5W ports (4)    | 512-entry, 4R, 18W ports (8)    |
| D-Cache     | 32KB 2R/1W ports (1)          | 128KB <b>8R/4W ports</b> (1)    |

... and pipeline these to use only 8 FO4 delays per cycle!

# What is Going Wrong?

#### **1.** Superscalar MicroArchitecture: Scalability is Limited

- Relies on large, centralized structures that want to grow larger
- Partitioning is a slippery slope: Complexity, IPC loss...

#### **2.** Architecture: Conventional Binary Interface is outdated!

- Linear sequence of instructions
- Defined for simple, single-issue machines
- Not natural for the compiler:
  - Internally builds and optimizes 2D Control Flow Graph
  - Forced to map CFG into 1D linear sequence
  - Lots of useful information gets thrown away
- Not natural for instruction-parallel machines:
  - Instruction relationships scattered throughout linear sequence
  - Must be dynamically re-established by scanning linear sequence
  - N<sup>2</sup> problem  $\rightarrow$  large, centralized structures

#### **Grid Processor Overview**

- Wire-delay constraints exposed at the <u>architecture</u> level
- Renegotiate the Compiler / HW Binary Interface



#### **GPA Execution Model**

- Compiler structures program into sequence of *hyperblocks* 
  - Atomic unit of fetch / schedule / execute / commit
- Blocks specify *explicit instruction placement* in the GRID
  - Critical path placed to minimize communication delays
  - Less critical instructions placed in remaining positions
- Instructions specify consumers as *explicit targets* 
  - CFG cast into instruction encoding
  - Point-to-point results forwarding
  - In-GRID storage expands register space
  - Only block outputs written back to RF
- Dynamic Instruction Issue
  - GRID forms large distributed window with independent issue controls
  - Instructions execute in original *dataflow-order*

- $\rightarrow$  no HW dependency analysis!
- $\rightarrow$  no associative issue queues!
- $\rightarrow$  no global bypass network!
- $\rightarrow$  no register renaming!
- $\rightarrow$ fewer RF ports needed!
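The execution model above can be illustrated with a toy interpreter. This is a sketch under stated assumptions, not the GPA encoding or hardware: the `Inst` class, the `run_block` function, the `(row, col)` keys, and the target tuple format are all hypothetical names chosen for illustration. It shows the two ideas from the slide: each instruction names its consumers explicitly, and an instruction fires as soon as its operands have arrived, with only block outputs reaching the register file.

```python
import operator
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Inst:
    op: Callable[..., int]
    n_operands: int
    # Explicit targets: ("inst", grid_position, operand_slot) forwards the
    # result point-to-point; ("reg", register_name, None) is a block output.
    targets: list
    operands: dict = field(default_factory=dict)


def run_block(insts: dict, inputs: list) -> dict:
    """Execute one block in dataflow order and return its register outputs.

    `inputs` is a list of (grid_position, operand_slot, value) tuples
    standing in for values read from the register file.
    """
    regs = {}
    ready = deque()

    def deliver(pos, slot, value):
        inst = insts[pos]
        inst.operands[slot] = value
        if len(inst.operands) == inst.n_operands:
            ready.append(pos)  # all operands present: instruction can issue

    for pos, slot, value in inputs:
        deliver(pos, slot, value)

    while ready:
        inst = insts[ready.popleft()]
        result = inst.op(*(inst.operands[i] for i in range(inst.n_operands)))
        for kind, dest, slot in inst.targets:
            if kind == "reg":
                regs[dest] = result  # only block outputs are written back
            else:
                deliver(dest, slot, result)
    return regs


# Block computing (a + b) * (a - b), placed on a hypothetical 2x2 grid.
block = {
    (0, 0): Inst(operator.add, 2, [("inst", (1, 0), 0)]),
    (0, 1): Inst(operator.sub, 2, [("inst", (1, 0), 1)]),
    (1, 0): Inst(operator.mul, 2, [("reg", "r1", None)]),
}
inputs = [((0, 0), 0, 7), ((0, 0), 1, 3), ((0, 1), 0, 7), ((0, 1), 1, 3)]
outputs = run_block(block, inputs)
print(outputs)  # {'r1': 40}
```

No dependency analysis, renaming, or associative lookup appears in the sketch because the producer already knows where each result goes. The intermediate sum and difference never touch `regs`.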

### **Block Compilation**



### **Block Compilation** (cont)



#### **Block Compilation** (cont)



#### **Block Execution**



#### Instruction Buffers - frames

- Instruction Buffers add *depth* and define *frames* 
  - 2D GRID of execution units; 3D scheduling of instructions
  - Allows very large blocks to be mapped onto GRID
  - Result addresses explicitly specified in 3-dimensions (x,y,z)



# Using frames for Speculation and ILP



- Map A onto GRID; start executing A
- Predict C is the next block; speculatively execute C
- Predict D is after C; speculatively execute D
- Predict E is after D; speculatively execute E

#### 16 total frames (4 sets of 4)



#### Result:

- Enormous effective instruction window for extracting ILP
- Increased utilization of execution units (accuracy counts!)
- Latency tolerance for GRID delays and Load instructions
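The A, C, D, E sequence can be sketched as a small bookkeeping model. This is an illustrative sketch only: the `FrameSets` class and its methods are hypothetical, and the squash-on-misprediction policy is an assumption, since the slides do not describe recovery. The only figure taken from the talk is the count of 4 frame sets.

```python
class FrameSets:
    """Tracks which blocks occupy the frame sets, oldest first."""

    def __init__(self, n_sets: int = 4):
        self.n_sets = n_sets
        self.blocks = []  # blocks[0] is non-speculative; the rest are predicted

    def map_block(self, name: str) -> bool:
        """Map a block onto a free frame set; return False if none is free."""
        if len(self.blocks) == self.n_sets:
            return False
        self.blocks.append(name)
        return True

    def commit_oldest(self) -> str:
        """Commit the non-speculative block and free its frame set."""
        return self.blocks.pop(0)

    def squash_after(self, name: str) -> list:
        """Assumed recovery: discard every block younger than `name`."""
        idx = self.blocks.index(name)
        squashed = self.blocks[idx + 1:]
        del self.blocks[idx + 1:]
        return squashed


frames = FrameSets()
for block_name in ["A", "C", "D", "E"]:
    frames.map_block(block_name)

print(frames.map_block("F"))     # False: all 4 frame sets are occupied
print(frames.squash_after("C"))  # ['D', 'E']: prediction after C was wrong
print(frames.commit_oldest())    # A
print(frames.blocks)             # ['C']
```

The squash step shows why prediction accuracy matters for utilization: work done in discarded frame sets is wasted.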

#### <u>Results – GPA Instructions per Cycle</u>



#### <u>Using frames for Thread-Level Parallelism</u>



- Divide frame space among threads
  - Each thread's share can be further divided to enable some degree of speculation
  - Shown: 2 threads, with 1 speculative block
  - Alternate configuration might provide 4 threads

#### Result:

- Simultaneous Multithreading (SMT) for Grid Processors
- <u>Polymorphism</u>: Use same resources in different ways for different workloads ("T-morph")
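The two configurations mentioned above follow from simple arithmetic on the frame sets. The sketch below assumes the same 4 frame sets used for speculation are split evenly among threads, and that each thread uses one set for its non-speculative block. The function name and return layout are hypothetical.

```python
def partition_frame_sets(n_threads: int, n_sets: int = 4) -> dict:
    """Split frame sets evenly among threads (assumed policy).

    Each thread needs one frame set for its non-speculative block;
    any remaining sets in its share hold speculative blocks.
    """
    if n_threads < 1 or n_sets % n_threads != 0:
        raise ValueError("frame sets must divide evenly among threads")
    per_thread = n_sets // n_threads
    return {
        t: {
            "frame_sets": list(range(t * per_thread, (t + 1) * per_thread)),
            "speculative_blocks": per_thread - 1,
        }
        for t in range(n_threads)
    }


print(partition_frame_sets(2))
# {0: {'frame_sets': [0, 1], 'speculative_blocks': 1},
#  1: {'frame_sets': [2, 3], 'speculative_blocks': 1}}
print(partition_frame_sets(4)[0])
# {'frame_sets': [0], 'speculative_blocks': 0}
```

With 2 threads each gets one speculative block, matching the configuration shown. With 4 threads no frame sets remain for speculation.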

## Using frames for Data-Level Parallelism



#### Result:

- The instruction buffers act as a *distributed I-Cache*
- Ability to absorb and process large amounts of streaming data
- Another type of *Polymorphism* ("S-morph")

# **Conclusions**

- Technology and Architecture Trends:
  - Good News: Lots of transistors, faster transistors
  - Bad News:
    - Global wire delays are growing
    - Pipeline depth near optimal
    - Superscalar pushing the limits of complexity

- GPA Represents a Promising Technology Direction
  - Wire-delay constraints addressed in the microarchitecture <u>and</u> the architecture
  - Eliminates difficult centralized structures dominating today's designs
  - Architectural partitioning encourages regularity and re-use
  - Enhanced information flow between compiler and hardware
  - Polymorphic features: performance on a wide range of workloads

# Future Work

- Architectural Refinement
  - Block-oriented predictors
  - Selective re-execution
- Enhance Compilation and Scheduling Tools
  - *Hyperblock* formation
  - 3D Instruction Scheduling algorithms
- Compatibility bridge to existing architectures
- Hardware Prototype (currently in planning stage)
  - Four 4x4 GPA cores + NUCA L2 cache on chip
  - 0.10 µm, ~350 mm<sup>2</sup>, 1000+ signal I/O, 300 MHz
  - 4Q 2004 tape-out