# GPUs to the left GPUs to the right GPUs all day GPUs all night

Chris Rossbach cs378



## **L0 Instruction Cache**

Warp Scheduler (32 thread/clk)

Dispatch Unit (32 thread/clk)

Register File (16,384 x 32-bit)



# Outline for Today

Questions?

FP64

FP6

LD

- Administrivia
  - Exam graded
- Agenda
  - CUDA continued
  - **CUDA Performance**

#### Acknowledgements:

Regi

- http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/cuda language/Introduction to CUDA C.pptx
- http://www.seas.upenn.edu/~cis565/LECTURES/CUDA%20Tricks.pptx
- http://www.cs.utexas.edu/~pingali/CS378/2015sp/lectures/GPU%20Programming.p ptx

 FP64
 INT
 INT
 FP32
 FP32

 FP64
 INT
 INT
 FP32
 FP32

 FP64
 INT
 INT
 FP32
 FP32

## Exam Stats



Average: ~60

High: 91

Low: 33

Stdev: ~14



## Faux Quiz Questions

- How is occupancy defined (in CUDA nomenclature)?
- What's the difference between a block scheduler (e.g. Giga-Thread Engine) and a warp scheduler?
- Modern CUDA supports UVM to eliminate the need for cudaMalloc and cudaMemcpy\*. Under what conditions might you want to use or not use it and why?
- What is control flow divergence? How does it impact performance?
- What is a bank conflict?
- What is work efficiency?
- What is the difference between a thread block scheduler and a warp scheduler?
- How are atomics implemented in modern GPU hardware?
- How is \_\_shared\_\_ memory implemented by modern GPU hardware?
- Why is \_\_shared \_\_ memory necessary if GPUs have an L1 cache? When will an L1 cache provide all the benefit of \_\_shared \_\_ memory and when will it not?
- Is cudaDeviceSynchronize still necessary after copyback if I have just one CUDA stream?

## Review: Blocks and Threads

With M threads/block, unique index per thread is :

```
int index = threadIdx.x + blockIdx.x * M;
```

# How many threads/blocks should I use?

```
// Copy inputs to device
cudaMemcpy(d a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on
add <<n/th>
Add <<n/td>
Add <<td>Add <<td>Add 
Add 
Add </td
// Copy result back to host
cudaMemcpy(c, d c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d a); cudaFree(d b); cudaFree(d c);
return 0;
```

• Usually things are correct if grid\*block dims >= input size

Getting good performance is another matter

## Internals

```
__host__
void vecAdd()
{
   dim3 DimGrid = (ceil(n/256,1,1);
   dim3 DimBlock = (256,1,1);
   addKernel<<<<DGrid,DBlock>>>(A_d,B_d,C_d,n);
}
```



How are threads scheduled?

## Kernel Launch

- Commands by host issued through streams
  - Kernels in the same stream executed sequentially
  - Kernels in different streams may be executed concurrently
- Streams mapped to GPU HW queues
  - Done by "kernel management unit" (KMU)
  - ❖ Multiple streams mapped to each queue → serializes some kernels
- Kernel launch distributes thread blocks to SMs



## SIMD vs. SIMT Single Scalar Thread <u>Register F</u>ile Flynn Taxonomy e.g., SSE/AVX **Data Streams** Instruction Streams SISD SIMD Synchronous operation MISD MIMD **SIMT** Loosely synchronized threads Multiple threads e.g., pthreads e.g., PTX, HSA

## GPU Performance Metric: Occupancy

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel Shouldn't we just create as many
  - Enables device-specific online tuning of kernel parameter, threads as possible?

Shouldn't we just create as many threads as possible?



## A Taco Bar





• Where is the parallelism here?

## GPU: a multi-lane Taco Bar

• Where is the parallelism here?

























## • Where is the parallelism here?

## GPU: a multi-lane Taco Bar











Goal: Increase Occupancy!

There's none!

















## • Where is the parallelism here?

## GPU: a multi-lane Taco Bar



## GPU Performance Metric: Occupancy

- Occupancy = (#Active Warps) /(#MaximumActive Warps)
  - Measures how well concurrency/parallelism is utilized
- Occupancy captures
  - which resources can be dynamically shared
  - how to reason about resource demands of a CUDA kernel Shouldn't we just create as many
  - Enables device-specific online tuning of kernel parameter, threads as possible?

Shouldn't we just create as many threads as possible?







## Hardware Resources Are Finite



## Occupancy:

- (#Active Warps) /(#MaximumActive Warps)
- Limits on the numerator:
  - Registers/thread
  - Shared memory/thread block
  - Number of scheduling slots: blocks, warps
- Limits on the denominator:
  - Memory bandwidth
  - Scheduler slots



What is the performance impact of varying kernel resource demands?

# Impact of Thread Block Size

## Example: v100:

- max active warps/SM == 64 (limit: warp context)
- max active blocks/SM == 32 (limit: block control)
  - With 512 threads/block how many blocks can execute (per SM) concurrently?
  - Max active warps \* threads/warp = 64\*32 = 2048 threads  $\rightarrow$  4
  - With 128 threads/block? → 16
- Consider HW limit of 32 thread blocks/SM @ 32 threads/block:
  - Blocks are maxed out, but max active threads = 32\*32 = 1024
  - Occupancy = .5 (1024/2048)
- To maximize utilization, thread block size should balance
  - Limits on active thread blocks vs.
  - Limits on active warps



# Impact of #Registers Per Thread

Registers/thread can limit number of active threads! V100:

- Registers per thread max: 255
- 64K registers per SM

Assume a kernel uses 32 registers/thread, thread block size of 256

- Thus, A TB requires 8192 registers for a maximum of 8 thread blocks per SM
  - Uses all 2048 thread slots (8 blocks \* 256 threads/block)
  - 8192 regs/block \* 8 block/SM = 64k registers
  - FULLY Occupied!
- What is the impact of increasing number of registers by 2?
  - Recall: granularity of management is a thread block!
  - Loss of concurrency of 256 threads!
  - 34 regs/thread \* 256 threads/block \* 7 blocks/SM = 60k registers,
  - 8 blocks would over-subscribe register file
  - Occupancy drops to .875!

# Impact of Shared Memory

- Shared memory is allocated per thread block
  - Can limit the number of thread blocks executing concurrently per SM
  - Shared mem/block \* # blocks <= total shared mem per SM</li>
- gridDim and blockDim parameters impact demand for
  - shared memory
  - number of thread slots
  - number of thread block slots

## Balance

template < class T >

\_\_host\_\_ <u>cudaError t</u> cudaOccupancyMaxActiveBlocksPerMultiprocessor (int\* numBlocks, T func, int blockSize, size\_t dynamicSMemSize) [inline]

Returns occupancy for a device function.

#### **Parameters**

#### numBlocks

- Returned occupancy

#### func

- Kernel function for which occupancy is calulated

#### blockSize

- Block size the kernel is intended to be launched with

#### dynamicSMemSize

- Per-block dynamic shared memory usage intended, in bytes
  - Navigate the tradeoffs
    - maximize core utilization and memory bandwidth utilization
    - Device-specific
  - Goal: Increase occupancy until one or the other is saturated

# Parallel Memory Accesses

- Coalesced main memory access (16/32x faster)
  - HW combines multiple warp memory accesses into a single coalesced access
- Bank-conflict-free shared memory access (16/32)
  - No alignment or contiguity requirements
    - CC 1.3: 16 different banks per half warp or same word
    - CC 2.x+3.0 : 32 different banks + 1-word broadcast each

CUDA Optimization Tutorial 21

# Parallel Memory Architecture

- In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized



# Coalesced Main Memory Accesses

NVIDIA

## single coalesced access

NVIDIA



one and two coalesced accesses\*



# Bank Addressing Examples



# Bank Addressing Examples



# Linear Addressing

• Given:

```
__shared__ float shared[256];
float foo =
   shared[baseIndex + s *
   threadIdx.x];
```



- This is only bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd



## Layered abstractions



- \* 1:1 correspondence between OS-level and user-level abstractions
- \* Diverse HW support enabled HAL

## GPU abstractions



## No OS support $\rightarrow$ No isolation

## **GPU** benchmark throughput



CPU+GPU schedulers not integrated! ...other pathologies abundant

ge-convolution in CUDA dows 7 x64 8GB RAM I Core 2 Quad 2.66GHz dia GeForce GT230

# Composition: Gestural Interface



## What We'd Like To Do

```
#> capture | xform | filter | detect &
CPU GPU GPU CPU
```

- Modular design
  - flexibility, reuse
- Utilize heterogeneous hardware
  - ▶ Data-parallel components → GPU
  - ▶ Sequential components → CPU
- Using OS provided tools
  - processes, pipes

## GPU Execution model

- GPUs cannot run OS:
  - different ISA
  - Memories have different coherence guarantees
    - (disjoint, or require fence instructions)
- Host CPU must "manage" GPU execution
  - Program inputs explicitly transferred/bound at runtime
- Device buffers pre-allocated

  Wain memory

  Copy inputs

  Copy inputs

  Copy outputs

  GPU

  Main CPU

  GPU

  GPU

  Memory

# Data migration



## Device-centric APIs considered harmful

```
Matrix
gemm(Matrix A, Matrix B) {
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

What happens if I want the following? Matrix  $D = A \times B \times C$ 

# Composed matrix multiplication

Matrix

## Composed matrix multiplication

```
Matrix
                                        gemm(Matrix A, Matrix B) {
                AxB copied from
                                          copyToGPU(A);
                                          copyToGPU(B);
                 GPU memory...
                                          invokeGPU();
                                          Matrix C = new Matrix();
                                          copyFromGPU(C);
Matrix
                                          return C;
AxBxC(Matrix A, B, C) {
    Matrix(AXB) = gemm(A,B);
    Matrix AxBxC = gemm(AxB,C);
    return AxBxC;
```

### Composed matrix multiplication

```
gemm(matrix A, Matrix B) {
                                           copyToGPU(A);
                                          copyToGPU(B);
                                           invoke [PU();
                                           Matrix C = new Matrix();
                                           copyFromGPU(C);
Matrix
                                                C;
                                           returr
AxBxC(Matrix A, B, C) {
    Matrix AxB = gemm(A,B);
    Matrix AxBxC = gemm(AxB,C);
    return AxBxC;
                                       ...only to be copied
                                       right back!
```

Matrix

## What if I have many GPUs?

```
Matrix
gemm(Matrix A, Matrix B) {
   copyToGPU(A);
   copyToGPU(B);
   invokeGPU();
   Matrix C = new Matrix();
   copyFromGPU(C);
   return C;
}
```

## What if I have many GPUs?

```
Matrix
gemm(GPU dev, Matrix A, Matrix B) {
   copyToGPU(dev, A);
   copyToGPU(dev, B);
   invokeGPU(dev);
   Matrix C = new Matrix();
   copyFromGPU(dev, C);
   return C;
}
```

What happens if I want the following? Matrix  $D = A \times B \times C$ 

## Composition with many GPUs

```
gemm(GPU dev, Matrix A, Matrix B)
                                      copyToGPU(A);
                                      copyToGPU(B);
                                      invokeGPU();
                                      Matrix C = new Matrix();
                                      copyFromGPU(C);
                                      return C;
Matrix
AxBxC(Matrix A,B,C) {
   Matrix AxB = gemm(???, A,B);
   Matrix AxBxC = gemm(????, AxB,C);
    return AxBxC;
```

Matrix

## Composition with many GPUs



```
Matrix
gemm(GPU dev, Matrix A, Matrix B)
{
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

```
Matrix
AxBxC(GPU dev, Matrix A,B,C) {
    Matrix AxB = gemm(dev, A,B);
    Matrix AxBxC = gemm(dev, AxB,C);
    return AxBxC;
}
```

### Composition with many GPUs

This will never be manageable for many GPUs.

Programmer implements scheduling using static view!

```
Matrix
gemm(GPU dev, Matrix A, Matrix B)
{
    copyToGPU(A);
    copyToGPU(B);
    invokeGPU();
    Matrix C = new Matrix();
    copyFromGPU(C);
    return C;
}
```

```
Matrix
AxBxC(GPU devA, GPU devB, Matrix A,B,C) {
   Matrix AxB = gemm(devA, A,B);
   Matrix AxBxC = gemm(devB, AxB,C);
   return AxBxC;
}

Why don't we have this problem with CPUs?
```

#### Dataflow: a better abstraction



- Minimal specification of data movement: runtime does it.
- asynchrony is a runtime concern (not programmer concern)
- No specification of compute → device mapping: like threads!

### Faux Quiz Questions

- How is occupancy defined (in CUDA nomenclature)?
- What's the difference between a block scheduler (e.g. Giga-Thread Engine) and a warp scheduler?
- Modern CUDA supports UVM to eliminate the need for cudaMalloc and cudaMemcpy\*. Under what conditions might you want to use or not use it and why?
- What is control flow divergence? How does it impact performance?
- What is a bank conflict?
- What is work efficiency?
- What is the difference between a thread block scheduler and a warp scheduler?
- How are atomics implemented in modern GPU hardware?
- How is \_\_shared\_\_ memory implemented by modern GPU hardware?
- Why is \_\_shared\_\_ memory necessary if GPUs have an L1 cache? When will an L1 cache provide all the benefit of \_\_shared\_\_ memory and when will it not?
- Is cudaDeviceSynchronize still necessary after copyback if I have just one CUDA stream?

## GPU Cache Coherence Challenges



## GPU Cache Coherence Challenges



# Temporal Coherence (TC)





# TC-Strong vs TC-Weak



