

María Garzarán, Saeed Maleki, William Gropp and David Padua

Department of Computer Science, University of Illinois at Urbana-Champaign

#### Program Optimization Through Loop Vectorization

Materials for this tutorial can be found at: http://polaris.cs.uiuc.edu/~garzaran/pldi-polv.zip

Questions? Send an email to garzaran@uiuc.edu


#### Topics covered in this tutorial

- What are the microprocessor vector extensions or SIMD (Single Instruction Multiple Data Units)
- How to use them
  - Through the compiler via automatic vectorization
    - Manual transformations that enable vectorization
    - Directives to guide the compiler
  - Through intrinsics
- Main focus on vectorizing through the compiler.
  - Code more readable
  - Code portable

#### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics

























### Loop Transformations

| S136 | S136_1 | S136_2 |
|------|--------|--------|
| <pre>for (int i=0;i&lt;LEN;i++){<br/>  sum = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    sum += A[j][i];<br/>  }<br/>  B[i] = sum;<br/>}</pre> | <pre>for (int i=0;i&lt;LEN;i++){<br/>  sum[i] = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    sum[i] += A[j][i];<br/>  }<br/>  B[i] = sum[i];<br/>}</pre> | <pre>for (int i=0;i&lt;LEN;i++){<br/>  B[i] = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    B[i] += A[j][i];<br/>  }<br/>}</pre> |
| Intel Nehalem<br>Compiler report: Loop was not vectorized. Vectorization possible but seems inefficient<br>Exec. Time scalar code: 3.7<br>Exec. Time vector code: --<br>Speedup: -- | Intel Nehalem<br>Compiler report: Permuted loop was vectorized.<br>Exec. Time scalar code: 1.6<br>Exec. Time vector code: 0.6<br>Speedup: 2.6 | Intel Nehalem<br>Compiler report: Permuted loop was vectorized.<br>Exec. Time scalar code: 1.6<br>Exec. Time vector code: 0.6<br>Speedup: 2.6 |






























### Dependences in Loops (I)

Dependences in loops are easy to understand if the loops are unrolled. Now the dependences are between statement "executions".

For the dependences shown here, we assume that arrays do not overlap in memory (no aliasing). Compilers must know that there is no aliasing in order to vectorize.
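To make "dependences between statement executions" concrete, here is a minimal sketch that is not part of the original slides (the function names are hypothetical). It runs a loop with a loop-carried dependence in two orders: sequentially, and in a vector-like order where all operands are read before any result is written. The two orders give different results, which is why such a loop cannot be vectorized as written.

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Sequential semantics. Unrolled, the executions are:
 *   S(1): a[1] = a[0] + b[1];
 *   S(2): a[2] = a[1] + b[2];   <- reads the value written by S(1)
 *   ...                                                              */
void run_sequential(float a[N], const float b[N]) {
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}

/* Vector-like order: read every operand first, then write every result.
 * This is what a[1:N-1] = a[0:N-2] + b[1:N-1] would compute. */
void run_vector_order(float a[N], const float b[N]) {
    float old[N];
    memcpy(old, a, sizeof old);          /* snapshot of a before any write */
    for (int i = 1; i < N; i++)
        a[i] = old[i - 1] + b[i];
}

/* With a[i] = b[i] = 1 the sequential loop yields 1,2,3,...,8 while the
 * vector-like order yields 1,2,2,...,2. */
void demo_dependence(void) {
    float a1[N], a2[N], b[N];
    for (int i = 0; i < N; i++) { a1[i] = a2[i] = 1.0f; b[i] = 1.0f; }
    run_sequential(a1, b);
    run_vector_order(a2, b);
    assert(a1[N - 1] == 8.0f);
    assert(a2[N - 1] == 2.0f);
}
```

Calling `demo_dependence()` passes both assertions; the mismatch between the two arrays is the visible effect of the flow dependence from execution S(i-1) to execution S(i).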


### Dependences in Loops (II)



































## Data dependences and vectorization

- Main idea: A statement inside a loop which is not in a cycle of the dependence graph can be vectorized.



# Data dependences and transformations

- When cycles are present, vectorization can be achieved by:
  - Separating (distributing) the statements not in a cycle
  - Removing dependences
  - Freezing loops
  - Changing the algorithm

**Distributing**

<pre>
for (i=1; i&lt;n; i++){
S1  b[i] = b[i] + c[i];
S2  a[i] = a[i-1]*a[i-2]+b[i];
S3  c[i] = a[i] + 1;
}
</pre>

<pre>
b[1:n-1] = b[1:n-1] + c[1:n-1];
for (i=1; i&lt;n; i++)
    a[i] = a[i-1]*a[i-2] + b[i];
c[1:n-1] = a[1:n-1] + 1;
</pre>
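The distribution above can be written in plain C as three separate loops. This is a sketch under stated assumptions, not code from the tutorial materials: the function names are hypothetical, and the loops start at `i = 2` so that `a[i-2]` stays inside the array. S1 and S3 are not in a dependence cycle, so their loops are candidates for vectorization; S2 stays sequential.

```c
#include <assert.h>

/* Original loop: S2 depends on its own earlier executions (a[i-1], a[i-2]). */
void fused(int n, float *a, float *b, float *c) {
    for (int i = 2; i < n; i++) {
        b[i] = b[i] + c[i];                  /* S1 */
        a[i] = a[i - 1] * a[i - 2] + b[i];   /* S2 */
        c[i] = a[i] + 1.0f;                  /* S3 */
    }
}

/* Distributed loops: the order S1, S2, S3 respects every dependence. */
void distributed(int n, float *a, float *b, float *c) {
    for (int i = 2; i < n; i++)              /* S1: no cycle, vectorizable */
        b[i] = b[i] + c[i];
    for (int i = 2; i < n; i++)              /* S2: recurrence, sequential */
        a[i] = a[i - 1] * a[i - 2] + b[i];
    for (int i = 2; i < n; i++)              /* S3: no cycle, vectorizable */
        c[i] = a[i] + 1.0f;
}

/* Both versions must produce the same arrays. */
void demo_distribution(void) {
    float a1[6] = {1, 1, 0, 0, 0, 0}, b1[6] = {1, 1, 1, 1, 1, 1}, c1[6] = {1, 1, 1, 1, 1, 1};
    float a2[6] = {1, 1, 0, 0, 0, 0}, b2[6] = {1, 1, 1, 1, 1, 1}, c2[6] = {1, 1, 1, 1, 1, 1};
    fused(6, a1, b1, c1);
    distributed(6, a2, b2, c2);
    for (int i = 0; i < 6; i++)
        assert(a1[i] == a2[i] && b1[i] == b2[i] && c1[i] == c2[i]);
}
```

The distribution is legal here because S1 only reads the old value of `c[i]`, which no earlier iteration has overwritten, and S3 only reads values of `a` that the S2 loop has already produced.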





#### Changing the algorithm

- When there is a recurrence, it is necessary to change the algorithm in order to vectorize.
- Compilers use pattern matching to identify the recurrence and then replace it with a parallel version.
- Examples of recurrences include:
  - Reductions (S+=A[i])
  - Linear recurrences (A[i]=B[i]\*A[i-1]+C[i])
  - Boolean recurrences (if (A[i]>max) max = A[i])
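As an illustration of "changing the algorithm" for a reduction, the sketch below rewrites `S += A[i]` with four independent partial sums, which is the shape a 4-way SIMD unit can execute one vector at a time. It is an assumption-laden sketch rather than what any particular compiler emits: the function names are hypothetical, `n` is assumed to be a multiple of 4, and reassociating floating-point additions can change rounding in general.

```c
#include <assert.h>

/* Scalar reduction: each iteration needs the s produced by the previous one. */
float sum_scalar(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums; lane k accumulates a[k], a[k+4], a[k+8], ...
 * The inner loop over k has no loop-carried dependence. */
float sum_partial4(const float *a, int n) {
    assert(n % 4 == 0);                  /* sketch: no remainder loop */
    float p[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)
            p[k] += a[i + k];
    return (p[0] + p[1]) + (p[2] + p[3]);   /* final horizontal add */
}
```

For small integer-valued inputs both functions return exactly the same value; for arbitrary floats the results may differ in the last bits because the additions are performed in a different order.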







#### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics


#### **Loop Vectorization**

- Loop vectorization is not always a legal and profitable transformation.
- The compiler needs to:
  - Compute the dependences
    - The compiler figures out dependences by
      - Solving a system of (integer) equations (with constraints)
      - Demonstrating that there is no solution to the system of equations
  - Remove cycles in the dependence graph
  - Determine data alignment
  - Determine that vectorization is profitable
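One classic way of "demonstrating that there is no solution" is the GCD test, sketched below. The slides do not say which tests ICC or XLC actually apply, so treat this as an illustrative assumption; the function names are hypothetical. For a write to `a[c1*i + k1]` and a read of `a[c2*j + k2]`, a dependence requires an integer solution of `c1*i + k1 == c2*j + k2`, which exists only if `gcd(c1, c2)` divides `k2 - k1`. The test ignores loop bounds, so a "possible" answer is conservative.

```c
#include <assert.h>
#include <stdlib.h>

static int gcd(int x, int y) {
    x = abs(x);
    y = abs(y);
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* Returns 0 when c1*i + k1 == c2*j + k2 has no integer solution (no
 * dependence), 1 when a solution may exist (dependence must be assumed). */
int gcd_test_may_depend(int c1, int k1, int c2, int k2) {
    int g = gcd(c1, c2);
    if (g == 0)                      /* both subscripts are constants */
        return k1 == k2;
    return (k2 - k1) % g == 0;
}

void demo_gcd_test(void) {
    /* a[2*i] = ... a[2*j + 1] ...: even and odd elements never coincide */
    assert(gcd_test_may_depend(2, 0, 2, 1) == 0);
    /* a[i] = ... a[j - 1] ...: i == j - 1 has solutions */
    assert(gcd_test_may_depend(1, 0, 1, -1) == 1);
}
```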



































































































































### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Reductions
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization using intrinsics





















### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics

### Data Alignment

- Vector load/store from aligned data requires one memory access
- Vector load/store from unaligned data requires multiple memory accesses and some shift operations



Reading 4 bytes from address 1 requires two loads
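The "two loads" count can be checked with a little address arithmetic. The sketch below assumes the model suggested by the slide, in which memory is accessed in aligned blocks of a fixed width; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned `width`-byte blocks touched by an access of `size`
 * bytes that starts at address `addr`. */
unsigned blocks_touched(uintptr_t addr, unsigned size, unsigned width) {
    assert(size > 0 && width > 0);
    uintptr_t first = addr / width;               /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) / width;  /* block holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

With 4-byte blocks, `blocks_touched(1, 4, 4)` is 2 (bytes 1 to 4 straddle two blocks), while `blocks_touched(0, 4, 4)` is 1. The same arithmetic with `width = 16` describes a 16-byte SSE load.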







### Data Alignment - Example

<pre>
float A[N] __attribute__((aligned(16)));
float B[N] __attribute__((aligned(16)));
float C[N] __attribute__((aligned(16)));

/* Aligned data, aligned loads and stores */
void test1(){
  __m128 rA, rB, rC;
  for (int i = 0; i &lt; N; i+=4){
    rA = _mm_load_ps(&amp;A[i]);
    rB = _mm_load_ps(&amp;B[i]);
    rC = _mm_add_ps(rA, rB);
    _mm_store_ps(&amp;C[i], rC);
}}

/* Aligned data, unaligned loads and stores */
void test2(){
  __m128 rA, rB, rC;
  for (int i = 0; i &lt; N; i+=4){
    rA = _mm_loadu_ps(&amp;A[i]);
    rB = _mm_loadu_ps(&amp;B[i]);
    rC = _mm_add_ps(rA, rB);
    _mm_storeu_ps(&amp;C[i], rC);
}}

/* Unaligned data (starts at i = 1), unaligned loads and stores */
void test3(){
  __m128 rA, rB, rC;
  for (int i = 1; i &lt; N-3; i+=4){
    rA = _mm_loadu_ps(&amp;A[i]);
    rB = _mm_loadu_ps(&amp;B[i]);
    rC = _mm_add_ps(rA, rB);
    _mm_storeu_ps(&amp;C[i], rC);
}}
</pre>

Nanoseconds per iteration:

|                        | Core 2 Duo | Intel i7 | Power 7 |
|------------------------|------------|----------|---------|
| Aligned                | 0.577      | 0.580    | 0.156   |
| Aligned (unaligned ld) | 0.689      | 0.581    | 0.241   |
| Unaligned              | 2.176      | 0.629    | 0.243   |

### Alignment in a struct

<pre>
#include &lt;stdio.h&gt;

struct st{
    char A;
    int B[64];
    float C;
    int D[64];
};

int main(){
    struct st s1;
    printf("%p, %p, %p, %p\n", &amp;s1.A, s1.B, &amp;s1.C, s1.D);}
</pre>

Output:
0x7fffe6765f00, 0x7fffe6765f04, 0x7fffe6766004, 0x7fffe6766008

- Arrays B and D are not 16-byte aligned (see the addresses)
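One way to get the arrays aligned is to request the alignment on the members themselves. The sketch below uses the standard C11 `alignas` specifier as a stand-in for the compiler-specific `__attribute__((aligned(16)))` used elsewhere in these slides; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

struct st_aligned {
    char A;
    alignas(16) int B[64];   /* padding is inserted after A */
    float C;
    alignas(16) int D[64];   /* padding is inserted after C */
};

/* The offsets of B and D are multiples of 16, and the struct itself is
 * 16-byte aligned, so the member addresses are 16-byte aligned too. */
void check_struct_alignment(void) {
    struct st_aligned s;
    assert(offsetof(struct st_aligned, B) % 16 == 0);
    assert(offsetof(struct st_aligned, D) % 16 == 0);
    assert((uintptr_t)s.B % 16 == 0);
    assert((uintptr_t)s.D % 16 == 0);
}
```

The price is padding: the struct grows so that each aligned member can start on a 16-byte boundary.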



### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics



















### Aliasing – Multidimensional Arrays

Three solutions when \_\_restrict\_\_ does not enable vectorization:

- 1. Static and global arrays
- 2. Linearize the arrays and use the \_\_restrict\_\_ keyword
- 3. Use compiler directives

### Aliasing – Multidimensional arrays 1. Static and Global declaration

<pre>
__attribute__ ((aligned(16))) float a[N][N];
void t(){
  /* ... a[i][j] ... */
}
int main() {
  t();
}
</pre>
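Solution 2 in the list above (linearize the arrays and use the restrict keyword) can be sketched as follows. This is an illustration under assumptions, not code from the tutorial: the function names are hypothetical, the matrices are stored row-major in one contiguous block each, and the standard C99 `restrict` qualifier is used where the slides write `__restrict__`.

```c
#include <assert.h>

/* a, b and c each hold n rows of len floats in one contiguous block.
 * restrict tells the compiler that the three blocks do not overlap. */
void mul_linear(int n, int len,
                const float *restrict a,
                const float *restrict b,
                float *restrict c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < len; j++)
            c[i * len + j] = b[i * len + j] * a[i * len + j];
}

void demo_mul_linear(void) {
    float a[2 * 3] = {1, 2, 3, 4, 5, 6};
    float b[2 * 3] = {2, 2, 2, 3, 3, 3};
    float c[2 * 3];
    mul_linear(2, 3, a, b, c);
    assert(c[0] == 2.0f);    /* row 0, column 0: 1 * 2 */
    assert(c[5] == 18.0f);   /* row 1, column 2: 6 * 3 */
}
```

With `float **` arguments each row pointer could alias any other; a single `restrict` pointer per matrix removes that ambiguity, at the cost of doing the index arithmetic by hand.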



### Aliasing – Multidimensional arrays 3. Use compiler directives:

- #pragma ivdep (Intel ICC)
- #pragma disjoint (IBM XLC)

<pre>
void func1(float **a, float **b, float **c) {
  for (int i=0; i&lt;m; i++) {
#pragma ivdep
    for (int j=0; j&lt;LEN; j++)
      c[i][j] = b[i][j] * a[i][j];
}}
</pre>















### Non-unit Stride – Example II

| S136 | S136_1 | S136_2 |
|------|--------|--------|
| <pre>for (int i=0;i&lt;LEN;i++){<br/>  sum = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    sum += A[j][i];<br/>  }<br/>  B[i] = sum;<br/>}</pre> | <pre>for (int i=0;i&lt;LEN;i++){<br/>  sum[i] = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    sum[i] += A[j][i];<br/>  }<br/>  B[i] = sum[i];<br/>}</pre> | <pre>for (int i=0;i&lt;LEN;i++){<br/>  B[i] = (float) 0.0;<br/>  for (int j=0;j&lt;LEN;j++){<br/>    B[i] += A[j][i];<br/>  }<br/>}</pre> |
| Intel Nehalem<br>Compiler report: Loop was not vectorized. Vectorization possible but seems inefficient<br>Exec. Time scalar code: 3.7<br>Exec. Time vector code: --<br>Speedup: -- | Intel Nehalem<br>Compiler report: Permuted loop was vectorized.<br>Exec. Time scalar code: 1.6<br>Exec. Time vector code: 0.6<br>Speedup: 2.6 | Intel Nehalem<br>Compiler report: Permuted loop was vectorized.<br>Exec. Time scalar code: 1.6<br>Exec. Time vector code: 0.6<br>Speedup: 2.6 |
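The report "Permuted loop was vectorized" suggests that the compiler interchanged the two loops. The sketch below writes that interchange out by hand for the S136_2 form; it is an illustration with hypothetical function names and a small `LEN`, not the compiler's actual output. After the interchange the inner loop walks along a row, so `A[j][i]` and `B[i]` are both accessed with unit stride.

```c
#include <assert.h>

#define LEN 4

/* Inner loop walks down a column of A: consecutive accesses are LEN floats apart. */
void colsum_strided(float A[LEN][LEN], float B[LEN]) {
    for (int i = 0; i < LEN; i++) {
        B[i] = 0.0f;
        for (int j = 0; j < LEN; j++)
            B[i] += A[j][i];
    }
}

/* Interchanged loops: the inner loop over i is unit stride in A[j][i] and B[i]. */
void colsum_interchanged(float A[LEN][LEN], float B[LEN]) {
    for (int i = 0; i < LEN; i++)
        B[i] = 0.0f;
    for (int j = 0; j < LEN; j++)
        for (int i = 0; i < LEN; i++)
            B[i] += A[j][i];
}

void demo_colsum(void) {
    float A[LEN][LEN], B1[LEN], B2[LEN];
    for (int j = 0; j < LEN; j++)
        for (int i = 0; i < LEN; i++)
            A[j][i] = (float)(j * LEN + i);
    colsum_strided(A, B1);
    colsum_interchanged(A, B2);
    for (int i = 0; i < LEN; i++)
        assert(B1[i] == B2[i]);
}
```

Each `B[i]` still receives its terms in the same order (increasing `j`), so the interchange does not change the floating-point result. It is only possible once the scalar `sum` has been replaced by `sum[i]` or `B[i]`, as in S136_1 and S136_2.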



### Outline

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics





### **Conditional Statements – I**

- Loops with conditions need #pragma vector always
  - Since the compiler does not know if vectorization will be profitable
  - The condition may be guarding against an exception

<pre>
#pragma vector always
for (int i = 0; i &lt; LEN; i++){
  if (c[i] &lt; (float) 0.0)
    a[i] = a[i] * b[i] + d[i];
}
</pre>

### Conditional Statements

<pre>
for (int i=0;i&lt;1024;i++){
    if (c[i] &lt; (float) 0.0)
        a[i]=a[i]*b[i]+d[i];
}
</pre>

<pre>
vector bool int rCmp;
vector float r0={0.,0.,0.,0.};
vector float rA,rB,rC,rD,rS,rT,rThen,rElse;
for (int i=0;i&lt;1024;i+=4){
    // rCmp (lanes where c[i] &lt; 0) and rT (a[i]*b[i]+d[i]) are computed here
    rThen = vec_and(rT, rCmp);
    rElse = vec_andc(rA, rCmp);
    rS = vec_or(rThen, rElse);
    //store rS
}
</pre>

- Speedups will depend on the values in c[i]
- The compiler tends to be conservative, as the condition may be guarding against segmentation faults
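The same mask-and-select idea can be sketched with Intel SSE intrinsics, in line with the rest of this tutorial. This is an illustrative version, not code from the tutorial materials: the function names are hypothetical, `n` is assumed to be a multiple of 4, and unaligned loads are used so that no alignment is required.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Scalar reference. */
void cond_scalar(int n, float *a, const float *b, const float *c, const float *d) {
    for (int i = 0; i < n; i++)
        if (c[i] < 0.0f)
            a[i] = a[i] * b[i] + d[i];
}

/* SSE version: compute the "then" value for all four lanes, then select
 * per lane with a bit mask. */
void cond_sse(int n, float *a, const float *b, const float *c, const float *d) {
    assert(n % 4 == 0);                            /* sketch: no remainder loop */
    const __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 rA = _mm_loadu_ps(&a[i]);
        __m128 rB = _mm_loadu_ps(&b[i]);
        __m128 rC = _mm_loadu_ps(&c[i]);
        __m128 rD = _mm_loadu_ps(&d[i]);
        __m128 rCmp  = _mm_cmplt_ps(rC, zero);     /* all ones where c[i] < 0 */
        __m128 rT    = _mm_add_ps(_mm_mul_ps(rA, rB), rD);
        __m128 rThen = _mm_and_ps(rCmp, rT);       /* keep rT where the mask is set */
        __m128 rElse = _mm_andnot_ps(rCmp, rA);    /* keep rA where it is not       */
        _mm_storeu_ps(&a[i], _mm_or_ps(rThen, rElse));
    }
}
```

Note that `cond_sse` reads `b[i]` and `d[i]` for every element, including those where the condition is false. That is exactly the behavior a compiler cannot assume to be safe on its own.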

### Compiler Directives

- Compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used

| Compiler Hints for Intel ICC           | Semantics                                  |
|----------------------------------------|--------------------------------------------|
| #pragma ivdep                          | Ignore assumed data dependences            |
| #pragma vector always                  | override efficiency heuristics             |
| #pragma novector                       | disable vectorization                      |
| \_\_restrict\_\_                       | assert exclusive access through<br>pointer |
| \_\_attribute\_\_ ((aligned(int-val))) | request memory alignment                   |
| memalign(int-val,size);                | malloc aligned memory                      |
| \_\_assume\_aligned(exp, int-val)      | assert alignment property                  |

### **Compiler Directives**

- Compiler vectorizes many loops, but many more can be vectorized if the appropriate directives are used

| Compiler Hints for IBM XLC              | Semantics                                  |
|-----------------------------------------|--------------------------------------------|
| #pragma ibm independent_loop            | Ignore assumed data dependences            |
| #pragma nosimd                          | disable vectorization                      |
| \_\_restrict\_\_                        | assert exclusive access through<br>pointer |
| \_\_attribute\_\_ ((aligned(int-val)))  | request memory alignment                   |
| memalign(int-val,size);                 | malloc aligned memory                      |
| \_\_alignx (int-val, exp)               | assert alignment property                  |
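A few of the hints from these tables can be combined in one short sketch. It rests on several assumptions that are not in the slides: the function names are hypothetical, the standard C99 `restrict` stands in for `__restrict__`, the C11 function `aligned_alloc` stands in for `memalign`, and the ICC pragma is guarded by the `__INTEL_COMPILER` macro so that other compilers never see it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define N 1024

/* Request 16-byte alignment for a global array. */
float A[N] __attribute__((aligned(16)));

/* restrict asserts that out and in are only accessed through these pointers. */
void scale(int n, float *restrict out, const float *restrict in, float k) {
#if defined(__INTEL_COMPILER)
#pragma ivdep                          /* Intel ICC: ignore assumed dependences */
#endif
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Heap allocation of n floats on a 16-byte boundary. aligned_alloc requires
 * the size to be a multiple of the alignment, so it is rounded up. */
float *alloc_aligned_floats(size_t n) {
    assert(n > 0);
    size_t bytes = (n * sizeof(float) + 15) / 16 * 16;
    return aligned_alloc(16, bytes);
}
```

A caller would allocate with `alloc_aligned_floats(N)`, pass the result to `scale` together with `A`, and release it with `free`. Whether the loop is then vectorized still depends on the compiler and its flags.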


### **Outline**

- 1. Intro
- 2. Data Dependences (Definition)
- 3. Overcoming limitations to SIMD-Vectorization
  - Data Dependences
  - Data Alignment
  - Aliasing
  - Non-unit strides
  - Conditional Statements
- 4. Vectorization with intrinsics

### Access the SIMD through intrinsics

- Intrinsics are vendor/architecture specific
- We will focus on the Intel vector intrinsics
- Intrinsics are useful when
  - the compiler fails to vectorize
  - the programmer thinks it is possible to generate better code than the one produced by the compiler

### The Intel SSE intrinsics Header file

- SSE can be accessed using intrinsics.
- You must use one of the following header files:
  - `#include <xmmintrin.h>` (for SSE)
  - `#include <emmintrin.h>` (for SSE2)
  - `#include <pmmintrin.h>` (for SSE3)
  - `#include <smmintrin.h>` (for SSE4)
- These include the prototypes of the intrinsics.





### Intel SSE intrinsics Instructions – Examples

- Load four 16-byte aligned single precision values in a vector:

<pre>
float a[4]={1.0, 2.0, 3.0, 4.0}; //a must be 16-byte aligned
__m128 x = _mm_load_ps(a);
</pre>

- Add two vectors containing four single precision values:

<pre>
__m128 a, b;
__m128 c = _mm_add_ps(a, b);
</pre>
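Put together, the load and add intrinsics above give a complete, if tiny, vector routine. The sketch below is illustrative: the function names are hypothetical, and the arrays are aligned with `__attribute__((aligned(16)))` because `_mm_load_ps` and `_mm_store_ps` require 16-byte aligned addresses.

```c
#include <assert.h>
#include <xmmintrin.h>

/* c[0..3] = a[0..3] + b[0..3]; all three pointers must be 16-byte aligned. */
void add4(const float *a, const float *b, float *c) {
    __m128 x = _mm_load_ps(a);
    __m128 y = _mm_load_ps(b);
    _mm_store_ps(c, _mm_add_ps(x, y));
}

void demo_add4(void) {
    float a[4] __attribute__((aligned(16))) = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] __attribute__((aligned(16))) = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4] __attribute__((aligned(16)));
    add4(a, b, c);
    assert(c[0] == 11.0f && c[1] == 22.0f && c[2] == 33.0f && c[3] == 44.0f);
}
```

If the alignment of the data cannot be guaranteed, `_mm_loadu_ps` and `_mm_storeu_ps` are the unaligned counterparts.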

















### Summary

- Microprocessor vector extensions can contribute to improve program performance, and the amount of this contribution is likely to increase in the future as vector lengths grow.
- Compilers are only partially successful at vectorizing
- When the compiler fails, programmers can
  - add compiler directives
  - apply loop transformations
- If, after transforming the code, the compiler still fails to vectorize (or the performance of the generated code is poor), the only option is to program the vector extensions directly using intrinsics or assembly language.



### Data Dependences

- The correctness of many loop transformations, including vectorization, can be decided using dependences.
- A good introduction to the notion of dependence and its applications can be found in D. Kuck, R. Kuhn, D. Padua, B. Leasure, M. Wolfe: Dependence Graphs and Compiler Optimizations. POPL 1981.

### **Algorithms**

- W. Daniel Hillis and Guy L. Steele, Jr. 1986. Data parallel algorithms. *Commun. ACM* 29, 12 (December 1986), 1170-1183.
- Shyh-Ching Chen, D.J. Kuck, "Time and Parallel Processor Bounds for Linear Recurrence Systems," IEEE Transactions on Computers, pp. 701-717, July, 1975




### Program Optimization Through Loop Vectorization

María Garzarán, Saeed Maleki, William Gropp and David Padua
{garzaran,maleki,wgropp,padua}@illinois.edu

Department of Computer Science, University of Illinois at Urbana-Champaign



















| Vendor   | Name                | n-ways  | Precision | Introduced with        |
|----------|---------------------|---------|-----------|------------------------|
| Intel    | SSE                 | 4-way   | single    | Pentium III            |
|          | SSE2                | + 2-way | double    | Pentium 4              |
|          | SSE3                |         |           | Pentium 4 (Prescott)   |
|          | SSSE3               |         |           | Core Duo               |
|          | SSE4                |         |           | Core2 Extreme (Penryn) |
|          | AVX                 | + 4-way | double    | Sandy Bridge, 2011     |
| Intel    | IPF                 | 2-way   | single    | Itanium                |
| AMD      | 3DNow!              | 2-way   | single    | K6                     |
|          | Enhanced 3DNow!     |         |           | K7                     |
|          | 3DNow! Professional | + 4-way | single    | AthlonXP               |
|          | AMD64               | + 2-way | double    | Opteron                |
| Motorola | AltiVec             | 4-way   | single    | MPC 7400 G4            |
| IBM      | VMX                 | 4-way   | single    | PowerPC 970 G5         |
|          | SPU                 | + 2-way | double    | Cell BE                |
|          | Double FPU          | 2-way   | double    | PowerPC 440 FP2        |

Similar architecture is found in game consoles (PS2, PS3) and GPUs (NVIDIA's ...)



















### How well do compilers vectorize?

- Results for the Test Suite for Vectorizing Compilers by David Callahan, Jack Dongarra and David Levine.
  - IBM XLC compiler, version 11

- Total: 159 loops
  - Auto Vectorized: 74
  - Not Vectorized: 85
    - Vectorizable by: 60
      - Classic Transformation: 35
      - New Transformation: 6
      - Manual Vectorization: 19
    - Impossible to Vectorize: 25
      - Non-unit Stride Access: 16
      - Data Dependence: 5
      - Other: 4

### How well do compilers vectorize?

| Loops                            | Percentage |
|----------------------------------|------------|
| Vectorizable                     | 84.3%      |
| - Vectorized                     | 46.5%      |
| - Classic transformation applied | 22.0%      |
| - New transformation applied     | 3.8%       |
| - Manual vector code             | 11.9%      |

| Transformations                          | Explanation |
|------------------------------------------|-------------|
| Classic transformation<br>(source level) | Loop Interchange<br>Scalar Expansion<br>Scalar and Array Renaming<br>Node Splitting<br>Reduction<br>Loop Peeling<br>Loop Distribution<br>Run-Time Symbolic Resolution<br>Speculating Conditional Statements |
| New Transformation<br>(Intrinsics)       | Manually vectorized Matrix Transposition<br>Manually vectorized Prefix Sum |
| Manual Transformation<br>(Intrinsics)    | Auto vectorization is inefficient<br>Vectorization of the transformed code is inefficient<br>No transformation found to enable auto vectorization |

### What are the speedups?

- Speedups obtained by XLC compiler

| Test Suite Collection             | Average Speed Up |
|-----------------------------------|------------------|
| Automatically by XLC              | 1.73             |
| By adding classic transformations | 3.48             |
| By adding new transformations     | 3.64             |
| By adding manual vectorization    | 3.78             |



| Compiler                   | XLC but<br>not ICC | ICC but<br>not XLC |
|----------------------------|--------------------|--------------------|
| Vectorized                 | 25                 | 26                 |
| Aliasing                   | 5                  | 0                  |
| Acyclic data dependence    | 4                  | 0                  |
| Unrolled loop              | 3                  | 0                  |
| Loop interchange           | 2                  | 0                  |
| Conditional statements     | 4                  | 7                  |
| Unsupported loop structure | 0                  | 5                  |
| Wrap around variable       | 0                  | 3                  |
| Reduction                  | 1                  | 2                  |
| Non-unit stride access     | 1                  | 2                  |
| Other                      | 5                  | 6                  |












































































### Intel SSE intrinsics Instructions – Examples II

- Add two vectors containing four single precision values:
   \_\_m128 a, b;
   \_\_m128 c = \_mm\_add\_ps(a, b);
- Multiply two vectors containing four floats:

```
__m128 a, b;
__m128 c = _mm_mul_ps(a, b);
```

- Add two vectors containing two doubles:

```
__m128d x, y;
__m128d z = _mm_add_pd(x, y);
```


### Intel SSE intrinsics Instructions – Examples III

- Add two vectors of 8 16-bit signed integers using saturation arithmetic (\*)

```
__m128i r, s;
__m128i t = _mm_adds_epi16(r, s);
```

- Compare two vectors of 16 8-bit signed integers

```
__m128i a, b;
__m128i c = _mm_cmpgt_epi8(a, b);
```

(\*) [In saturation arithmetic] all operations such as addition and multiplication are limited to a fixed range between a minimum and maximum value. If the result of an operation is greater than the maximum it is set (*clamped*) to the maximum, while if it is below the minimum it is clamped to the minimum. [From Wikipedia]
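The clamping behavior of saturation arithmetic is easy to observe with `_mm_adds_epi16`. The sketch below is illustrative, with hypothetical function names; it uses the unaligned SSE2 load and store intrinsics so that ordinary arrays can be passed in.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* t[k] = saturate(r[k] + s[k]) for eight 16-bit signed integers. */
void adds8(const int16_t r[8], const int16_t s[8], int16_t t[8]) {
    __m128i vr = _mm_loadu_si128((const __m128i *)r);
    __m128i vs = _mm_loadu_si128((const __m128i *)s);
    _mm_storeu_si128((__m128i *)t, _mm_adds_epi16(vr, vs));
}

void demo_adds8(void) {
    int16_t r[8] = {32767, -32768, 100, 0, 0, 0, 0, 0};
    int16_t s[8] = {    1,     -1,  23, 0, 0, 0, 0, 0};
    int16_t t[8];
    adds8(r, s, t);
    assert(t[0] == 32767);    /* clamped to the maximum instead of wrapping */
    assert(t[1] == -32768);   /* clamped to the minimum                     */
    assert(t[2] == 123);      /* in range: ordinary addition                */
}
```

With the non-saturating `_mm_add_epi16`, the first two lanes would wrap around to -32768 and 32767 instead.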









