# Weak Memory Models: A Tutorial 

Jade Alglave

University College London
February 3rd, 2014

## Sequential Consistency

A comfortable model for concurrent programming would be Sequential Consistency (SC), as defined by Leslie Lamport in 1979:

The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

## Example

Consider the following example, where initially $x=y=0$ :

| sb |  |
| :--- | :--- |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 1$ | $(c) \mathrm{y} \leftarrow 1$ |
| $(b) \mathrm{r} 1 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 2 \leftarrow \mathrm{x}$ |
| $\mathrm{r} 1=? ; \mathrm{r} 2=? ;$ |  |

Following SC, we expect three possible outcomes:

$$
\begin{array}{l|l}
(a)(b)(c)(d) & \mathrm{r} 1=0 \wedge \mathrm{r} 2=1 \\
\hline(c)(d)(a)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=0 \\
\hline(a)(c)(b)(d) & \\
(a)(c)(d)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=1 \\
(c)(a)(b)(d) & \\
(c)(a)(d)(b) &
\end{array}
$$

## Example

Consider the following example, where initially $x=y=0$ :

| sb |  |
| :--- | :--- |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 1$ | $(c) \mathrm{y} \leftarrow 1$ |
| (b) $\mathrm{r} 1 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 2 \leftarrow \mathrm{x}$ |
| $\mathrm{r} 1=? ; \mathrm{r} 2=? ;$ |  |

Following SC, we expect three possible outcomes:

$$
\begin{array}{l|l}
(a)(b)(c)(d) & \mathrm{r} 1=0 \wedge \mathrm{r} 2=1 \\
\hline(c)(d)(a)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=0 \\
\hline(a)(c)(b)(d) & \\
(a)(c)(d)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=1 \\
(c)(a)(b)(d) & \\
(c)(a)(d)(b) &
\end{array}
$$

## Example

Consider the following example, where initially $x=y=0$ :

| sb |  |
| :--- | :--- |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 1$ | $(c) \mathrm{y} \leftarrow 1$ |
| $(b) \mathrm{r} 1 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 2 \leftarrow \mathrm{x}$ |
| $\mathrm{r} 1=0 ; \mathrm{r} 2=? ;$ |  |

Following SC, we expect three possible outcomes:

$$
\begin{array}{l|l}
(a)(b)(c)(d) & \mathrm{r} 1=0 \wedge \mathrm{r} 2=1 \\
\hline(c)(d)(a)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=0 \\
\hline(a)(c)(b)(d) & \\
(a)(c)(d)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=1 \\
(c)(a)(b)(d) & \\
(c)(a)(d)(b) &
\end{array}
$$

## Example

Consider the following example, where initially $x=y=0$ :

| sb |  |
| :--- | :--- |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 1$ | $(c) \mathrm{y} \leftarrow 1$ |
| $(b) \mathrm{r} 1 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 2 \leftarrow \mathrm{x}$ |
| $\mathrm{r} 1=0 ; \mathrm{r} 2=? ;$ |  |

Following SC, we expect three possible outcomes:

$$
\begin{array}{l|l}
(a)(b)(c)(d) & \mathrm{r} 1=0 \wedge \mathrm{r} 2=1 \\
\hline(c)(d)(a)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=0 \\
\hline(a)(c)(b)(d) & \\
(a)(c)(d)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=1 \\
(c)(a)(b)(d) & \\
(c)(a)(d)(b) &
\end{array}
$$

## Example

Consider the following example, where initially $x=y=0$ :

| sb |  |
| :--- | :--- |
| $P_{0}$ | $P_{1}$ |
| (a) $\mathrm{x} \leftarrow 1$ | $(c) \mathrm{y} \leftarrow 1$ |
| (b) $\mathrm{r} 1 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 2 \leftarrow \mathrm{x}$ |
| $\mathrm{r} 1=0 ; \mathrm{r} 2=1 ;$ |  |

Following SC, we expect three possible outcomes:

$$
\begin{array}{l|l}
(a)(b)(c)(d) & \mathrm{r} 1=0 \wedge \mathrm{r} 2=1 \\
\hline(c)(d)(a)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=0 \\
\hline(a)(c)(b)(d) & \\
(a)(c)(d)(b) & \mathrm{r} 1=1 \wedge \mathrm{r} 2=1 \\
(c)(a)(b)(d) & \\
(c)(a)(d)(b) &
\end{array}
$$

## Experiment

On an Intel Core 2 Duo:

$$
\{x=0 ; y=0 ;\}
$$

| P0 | P1 |
| :---: | :---: |
| MOV [y],\$1 | MOV [x],\$1 |
| MOV EAX, [x] | MOV EAX, [y] |

exists ( $0: E A X=0 / \backslash 1: E A X=0$ )
Certain instructions appear to be reordered w.r.t. the program order.
Let us check that on my machine.

## Weak memory models

For performance reasons, modern architectures provide several features that are weakenings of SC:

For some applications, achieving sequential consistency may not be worth the price of slowing down the processors. In this case, one must be aware that conventional methods for designing multiprocess algorithms cannot be relied upon to produce correctly executing programs.

## How can we make sure that we write correct programs?

- We need to understand precisely what memory models guarantee to write correct concurrent programs.
- This problem spreads to high level languages and is potentially much worse, due to compiler optimisations.


## Surely there are specs?

Documentation is (at least) ambiguous, since written in natural language.

## Surely there are specs?

> "all that horrible horribly incomprehensible and confusing [...] text that no-one can parse or reason with
> - not even the people who wrote it"
> Anonymous Processor Architect, 2011

## Describing executions

## Style of modelling

Memory models roughly fall into two classes:

- Operational
- Axiomatic


## Building an execution

| rlns |  |
| :---: | :---: |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 2$ | $(b) \mathrm{x} \leftarrow 1$ <br> $(c) \mathrm{r} 1 \leftarrow \mathrm{x}$ |
| Allowed: 1: $\mathrm{r} 1=1 ; \mathrm{x}=2 ;$ |  |

## Building an execution : Events $\mathbb{E}$ and program order po

## rlns

| $P_{0}$ | $P_{1}$ |
| :---: | :---: |
| $(a) \mathrm{x} \leftarrow 2$ | $(b) \mathrm{x} \leftarrow 1$ |
|  | $(c) \mathrm{r} 1 \leftarrow \mathrm{x}$ |,

$$
\mathrm{a}: \mathrm{W}[\mathrm{x}]=2 \quad \mathrm{~b}: \mathrm{W}[\mathrm{x}]=1
$$

We write $E \triangleq(\mathbb{E}, \mathrm{po})$ for such a structure.

## Building an execution: Coherence co

| rlns |  |
| :---: | :---: |
| $P_{0}$ | $P_{1}$ |
| $(a) \mathrm{x} \leftarrow 2$ | $(b) \mathrm{x} \leftarrow 1$ <br> $(c) \mathrm{r} 1 \leftarrow \mathrm{x}$ |
| Allowed: 1: $\mathrm{r} 1=1 ; \mathrm{x}=2 ;$ |  |

$$
\begin{aligned}
\mathrm{a}: \mathrm{W}[\mathrm{x}]=2 \mathrm{co} \mathrm{~b}: \mathrm{W}[\mathrm{x}]=1 \\
\mathrm{po} \\
\mathrm{c}: \mathrm{R}[\mathrm{x}]=1
\end{aligned}
$$

The coherence co orders totally all the write events to the same memory location.

## Building an execution: Read-from rf



The read-from map rf links a write and any read that reads from it.

## Building an execution: From-read map fr

rlns

| $P_{0}$ | $P_{1}$ |
| :---: | :---: |
| (a) $\mathrm{x} \leftarrow 2$ | $(b) \mathrm{x} \leftarrow 1$ |
|  | $(c) \mathrm{r} 1 \leftarrow \mathrm{x}$ |, | Allowed: $1: \mathrm{r} 1=1 ; \mathrm{x}=2 ;$ |
| :---: |



We derive the from-read map fr from co and rf.

## Building an execution : Execution witness $X \triangleq(c o, r f)$

## rlns

| $P_{0}$ | $P_{1}$ |
| :---: | :---: |
| $(a) \mathrm{x} \leftarrow 2$ | $(b) \mathrm{x} \leftarrow 1$ |
|  | $(c) \mathrm{r} 1 \leftarrow \mathrm{x}$ |,

$$
\mathrm{a}: \mathrm{W}[\mathrm{x}]=2 \mathrm{ci}_{\mathrm{co}}^{\mathrm{co}} \mathrm{~b}: \mathrm{W}[\mathrm{x}]=1
$$

We define an execution witness as $X \triangleq(c o, r f)$.

## Describing architectures

## Four axioms

- Uniproc
- No thin air
- Causality
- Propagation


## Uniproc (Coherence)

All the models I have studied preserve SC per location.

```
a: W[x]=1
co\ \po
b: W[x]=2
```


## Uniproc (Coherence)

All the models I have studied preserve SC per location.

$$
\begin{aligned}
& \text { a: } \mathrm{R}[\mathrm{x}]=1 \\
& \mathrm{rf} \prod_{\mathrm{W}[\mathrm{x}=1}^{\mathrm{b}} \mathrm{~b}=
\end{aligned}
$$

## Uniproc (Coherence)

All the models I have studied preserve SC per location.


## Uniproc (Coherence)

All the models I have studied preserve SC per location.


## Uniproc (Coherence)

All the models I have studied preserve SC per location.


## Uniproc (Coherence)

All the models I have studied preserve SC per location.

This ensures that non-relational analyses are sound on weak memory.

## No thin air

All the models I have studied define a happens-before relation:


## No thin air

All the models I have studied define a happens-before relation:

which should be acyclic

## Causality (mp)

This happens-before relation determines which message passing idioms work as intended:


## Causality (wrc)

This happens-before relation determines which write-to-read causality idioms work as intended:

$$
a: W x=1
$$

## Propagation $(2+2 w)$

Fences constrain the order in which writes to different locations propagate:


## Propagation $(w+r w+2 w)$

Fences constrain the order in which writes to different locations propagate:

## A real-world excerpt

## PostgreSQL developers' discussions



## Synchronisation in PostgreSQL

```
void worker(int i)
    { while(! latch [i ]);
        for (;;)
        { assert (! latch [i] || flag [i ]);
            latch[i] = 0;
            if ( flag [i])
            { flag[i] = 0;
            flag [( i+1)%WORKERS] = 1;
            latch [( i+1)%WORKERS] = 1;
        }
        while(! latch [i ]);
        }
    }
```

Each element of the array latch is a shared boolean variable dedicated to interprocess communication.

A process waits to have its latch set then should have work to do, namely passing around a token via the array flag (line 8).
Once the process is done, it sets the latch of the process the token was passed to (line 9).

## Synchronisation in PostgreSQL

void worker(int i)
\{ while(! latch [i]);
for (; ;)
\{ assert (! latch [i] || flag[i]);
latch [i] $=0$;
if (flag [i])
$\{$ flag $[i]=0$; flag $[(\mathrm{i}+1) \%$ WORKERS $]=1$; latch $[(i+1) \%$ WORKERS $]=1$; \}
while(! latch [i]); \}
\}

Starvation seemingly cannot occur: when a process is woken up, it has work to do. Yet, the developers observed that the wait in line 11
would time out,
i.e. starvation of the ring of processes.
The processor can delay the write in line 8 until after the latch had been set in line 9 .

## Message passing idiom in PostgreSQL

This corresponds to the message passing idiom

| pgsql (mp) |  |
| :--- | :--- |
| Worker 0 | Worker 1 |
| $(8) f[1]=1 ;$ | $(2)$ while $(!1[1])$; |
| $(9) 1[1]=1 ;$ | $(6)$ if $(f[1])$ |
| Observed: $1[1]=1 ; \quad \mathrm{f}[1]=0$ |  |



## Message passing idiom in PostgreSQL

This corresponds to the message passing idiom which requires synchronisation to behave as on SC

| pgsql (mp) |  |
| :---: | :---: |
| Worker 0 | Worker 1 |
| $(8) \mathrm{f}[1]=1 ;$ | $(2)$ while (! $1[1])$; |
| lwsync | dependency |
| $(9)][1]=1 ;$ | $(6)$ if ( $f[1])$ |



Forbidden: $1[1]=1 ; f[1]=0$

## Verification

## Porte ouverte à deux battants

We propose two ways of verifying concurrent software running on weak memory:

- we instrument the program to embed the weak memory semantics inside it, then feed the transformed program to an SC verification tool;
- we explicitly build partial order models representing the possible executions of the program on weak memory.


## Independent Reads of Independent Writes

| iriw |  |  |  |
| :--- | :---: | :---: | :---: |
| $P_{0}$ | $P_{1}$ | $P_{2}$ | $P_{3}$ |
| (a) $\mathrm{r} 1 \leftarrow \mathrm{x}$ | $(c) \mathrm{r} 3 \leftarrow \mathrm{y}$ | $(e) \mathrm{x} \leftarrow 1$ | $(f) \mathrm{y} \leftarrow 2$ |
| $(b) \mathrm{r} 2 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 4 \leftarrow \mathrm{x}$ |  |  |
| $\mathrm{r} 1=1 ; \mathrm{r} 2=0 ; \mathrm{r} 3=2 ; \mathrm{r} 4=0 ;$ |  |  |  |



## iriw on SC

iriw

| $P_{0}$ | $P_{1}$ | $P_{2}$ | $P_{3}$ |
| :---: | :---: | :---: | :---: |
| (a) $\mathrm{r} 1 \leftarrow \mathrm{x}$ <br> $(b) \mathrm{r} 2 \leftarrow \mathrm{y}$ | $(c) \mathrm{r} 3 \leftarrow \mathrm{y}$ | $(d) \mathrm{r} 4 \leftarrow \mathrm{x}$ | $(e) \leftarrow 1$ |
| $\mathrm{r} 1=1 ; \mathrm{r} 2=0 ; \mathrm{r} 3=2 ; \mathrm{r} 4=0 ;$ |  |  |  |



## iriw on Power

iriw

| $P_{0}$ | $P_{1}$ | $P_{2}$ | $P_{3}$ |
| :---: | :---: | :---: | :---: |
| (a) $\mathrm{r} 1 \leftarrow \mathrm{x}$ | (c) $\mathrm{r} 3 \leftarrow \mathrm{y}$ | $(e) \mathrm{x} \leftarrow 1$ | $(f) \mathrm{y} \leftarrow 2$ |
| (b) $\mathrm{r} 2 \leftarrow \mathrm{y}$ | (d) $\mathrm{r} 4 \leftarrow \mathrm{x}$ |  |  |



## Validity of an execution

- An execution is valid on an architecture if it does not show certain cycles.
- So we assign a clock to each event
- Then see if we can order these clocks w.r.t. less-than over $\mathbb{N}$


## On iriw



$$
\begin{align*}
& \left(\operatorname{lpo} P_{0}\right) c_{a b} \quad\left(\text { po } P_{1}\right) c_{c d} \\
& (\mathrm{rf} x) s_{e a} \wedge s_{i_{0} d} \quad(\mathrm{rfy} y) s_{f c} \wedge s_{i_{1} b} \\
& (\mathrm{ws} x) c_{i_{0} e} \quad(\mathrm{ws} y) c_{i_{1} f} \\
& (\mathrm{fr} x) \quad\left(s_{i_{0} d} \wedge c_{i_{0} e}\right) \Rightarrow c_{d e} \quad(\mathrm{fr} y) \quad\left(s_{i_{1} b} \wedge c_{i_{1} f}\right) \Rightarrow c_{b f} \\
& (\operatorname{grf} x) \quad\left(s_{e a} \Rightarrow c_{e a}\right) \quad(\operatorname{grf} y) \quad\left(s_{f c} \Rightarrow c_{f c}\right) \tag{1}
\end{align*}
$$

## iriw on SC



$$
\begin{align*}
& \left(\operatorname{lpo} P_{0}\right) c_{a b} \quad\left(\text { po } P_{1}\right) c_{c d} \\
& (\mathrm{rf} x) s_{e a} \wedge s_{i_{0} d} \quad(\mathrm{rfy} y) s_{f c} \wedge s_{i_{1} b} \\
& (\mathrm{ws} x) c_{i_{0} e} \quad(\mathrm{ws} y) c_{i_{1} f} \\
& (\mathrm{fr} x) \quad\left(s_{i_{0} d} \wedge c_{i_{0} e}\right) \Rightarrow c_{d e} \quad(\mathrm{fr} y) \quad\left(s_{i_{1} b} \wedge c_{i_{1} f}\right) \Rightarrow c_{b f} \\
& (\operatorname{grf} x) \quad\left(s_{e a} \Rightarrow c_{e a}\right) \quad(\text { grf } y) \quad\left(s_{f c} \Rightarrow c_{f c}\right) \tag{2}
\end{align*}
$$

## iriw on Power



$$
\begin{align*}
& \left(\operatorname{lpo} P_{0}\right) c_{a b} \quad\left(\text { po } P_{1}\right) c_{c d} \\
& (\mathrm{rf} x) s_{e a} \wedge s_{i_{0} d} \quad(\mathrm{rfy} y) s_{f c} \wedge s_{i_{1} b} \\
& (\mathrm{ws} x) c_{i_{0} e} \quad(\mathrm{ws} y) c_{i_{1} f} \\
& (\mathrm{fr} x) \quad\left(s_{i_{0} d} \wedge c_{i_{0} e}\right) \Rightarrow c_{d e} \quad(\mathrm{fr} y)\left(s_{i_{1} b} \wedge c_{i_{1} f}\right) \Rightarrow c_{b f} \\
& (\operatorname{grf} x) \quad\left(s_{e a} \Rightarrow c_{e a}\right) \quad(\text { grf } y) \quad\left(s_{f c} \Rightarrow c_{f c}\right) \tag{3}
\end{align*}
$$

## Tools

Testing hardware, simulating models:
http://diy.inria.fr

Verifying software:
www. cprover.org/wmm

Thanks!

