# Foundations: Synchronization Execution Abstractions

Chris Rossbach CS378H Fall 2018 9/10/18

## Today

- Questions?
- Administrivia
  - Lab 1 due sooner than you'd like
- Foundations
  - Threads/Processes/Fibers
  - Cache coherence (maybe)
- Acknowledgments: some materials in this lecture borrowed from
  - Emmett Witchel (who borrowed them from: Kathryn McKinley, Ron Rockhold, Tom Anderson, John Carter, Mike Dahlin, Jim Kurose, Hank Levy, Harrick Vin, Thomas Narten, and Emery Berger)
  - Andy Tanenbaum



## Faux Quiz (answer any 2, 5 min)

- What is the maximum possible speedup of a 75% parallelizable program on 8 CPUs?
- What is super-linear speedup? List two ways in which super-linear speedup can occur.
- What is the difference between strong and weak scaling?
- Define Safety, Liveness, Bounded Waiting, Failure Atomicity
- What is the difference between processes and threads?
- What's a fiber? When and why might fibers be a better abstraction than threads?
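
For the first question, a minimal sketch, assuming Amdahl's law is the intended model: with parallel fraction `p` on `n` CPUs, speedup is `1 / ((1 - p) + p / n)`. The function names below are illustrative and not from the course materials.

```c
#include <assert.h>

/* Amdahl's law: the serial fraction (1 - p) is not sped up,
 * the parallel fraction p is divided evenly across n CPUs. */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound as n grows without limit: only the serial part remains. */
static double amdahl_limit(double p) {
    return 1.0 / (1.0 - p);
}
```

With `p = 0.75` and `n = 8` this gives 1 / (0.25 + 0.09375), roughly 2.91; no number of CPUs can push it past 4.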

## Processes and Threads and Fibers...

- Abstractions
- Containers
- State
  - Where is shared state?
  - How is it accessed?
  - Is it mutable?










## Programming and Machines: a mental model



```
struct machine_state {
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
  ...
} machine;

// The CPU repeats this loop forever.
while (1) {
  i = fetch_instruction(machine.pc);
  decode_instruction(i);
  execute_instruction(i);
}

void execute_instruction(i) {
  switch (i.opcode) {
  case add_rr:
    machine.Registers[i.dst] += machine.Registers[i.src];
    break;
  ...
  }
}
```

## Parallel Machines: a mental model





```
// CPU 0: its own pc and registers, running its own loop
struct machine_state {
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
  ...
} machine;

while (1) {
  i = fetch_instruction(machine.pc);
  decode_instruction(i);
  execute_instruction(i);
}

void execute_instruction(i) {
  switch (i.opcode) {
  case add_rr:
    machine.Registers[i.dst] += machine.Registers[i.src];
    break;
  ...
  }
}
```

```
// CPU 1: a second copy of the same state and loop, running at the same time
struct machine_state {
  uint64 pc;
  uint64 Registers[16];
  uint64 cr[6]; // control registers cr0-cr4 and EFER on AMD
  ...
} machine;

while (1) {
  i = fetch_instruction(machine.pc);
  decode_instruction(i);
  execute_instruction(i);
}

void execute_instruction(i) {
  switch (i.opcode) {
  case add_rr:
    machine.Registers[i.dst] += machine.Registers[i.src];
    break;
  ...
  }
}
```

## Processes

#### Model



- Multiprogramming of four programs
- Conceptual model of 4 independent, sequential processes
- Only one program active at any instant

#### Implementation

| Process management        | Memory management        | File management   |
|---------------------------|--------------------------|-------------------|
| Registers                 | Pointer to text segment  | Root directory    |
| Program counter           | Pointer to data segment  | Working directory |
| Program status word       | Pointer to stack segment | File descriptors  |
| Stack pointer             |                          | User ID           |
| Process state             |                          | Group ID          |
| Priority                  |                          |                   |
| Scheduling parameters     |                          |                   |
| Process ID                |                          |                   |
| Parent process            |                          |                   |
| Process group             |                          |                   |
| Signals                   |                          |                   |
| Time when process started |                          |                   |
| CPU time used             |                          |                   |
| Children's CPU time       |                          |                   |
| Time of next alarm        |                          |                   |













(a) Three processes each with one thread

(b) One process with three threads



When might (a) be better than (b)? Vice versa?

Could you do lab 1 with processes instead of threads?

Threads simplify sharing and reduce context overheads
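
A minimal sketch of the sharing difference between (a) and (b), assuming a POSIX system with pthreads and `fork()`. The global `counter` and the helper names are made up for illustration: a write made by a second thread is visible to the caller, while a write made by a child process lands in the child's own copy of the address space.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter;   /* a global variable: per-process state */

static void *bump(void *arg) {
    (void)arg;
    counter++;
    return NULL;
}

/* (b) One process, two threads: the new thread's write is visible here. */
static int counter_after_thread(void) {
    pthread_t t;
    counter = 0;
    if (pthread_create(&t, NULL, bump, NULL) != 0)
        return -1;
    pthread_join(t, NULL);   /* join orders the write before our read */
    return counter;
}

/* (a) Two processes: the child increments its private copy only. */
static int counter_after_fork(void) {
    counter = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {          /* child */
        counter++;
        _exit(0);
    }
    waitpid(pid, NULL, 0);   /* parent waits, then reads its own copy */
    return counter;
}
```

The thread version returns 1 and the fork version returns 0. Getting a result back from the child would need explicit machinery such as a pipe or shared memory, which is one way to think about the lab 1 question.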




| Per process items           | Per thread items |
|-----------------------------|------------------|
| Address space               | Program counter  |
| Global variables            | Registers        |
| Open files                  | Stack            |
| Child processes             | State            |
| Pending alarms              |                  |
| Signals and signal handlers |                  |
| Accounting information      |                  |

- Items shared by all threads in a process
- Items private to each thread
- Decouples memory and control abstractions!
- What is the advantage of that?

(Figure: one process with three threads above the kernel; each thread has its own stack.)


(Figure: user space vs. kernel space. (a) A user-level threads package. (b) A threads package managed by the kernel.)

## Execution Context Management

- "Task" == "Flow of Control", but with less typing
- "Stack" == Task State

#### Task Management

- Preemptive
  - Interleave on uniprocessor
  - Overlap on multiprocessor
- Serial
  - One at a time, no conflict
- Cooperative
  - Yields at well-defined points
  - E.g. wait for long-running I/O

#### Stack Management

- Manual
  - Inherent in cooperative
  - Changing at quiescent points
- Automatic
  - Inherent in preemptive
  - Downside: hidden concurrency assumptions

These dimensions can be orthogonal
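
One corner of that design space, cooperative task management with manual stack management, can be sketched as below. This is an illustration, not code from the lecture: `count_task`, `count_task_step`, and `run_round_robin` are hypothetical names. Each task returns to the scheduler at its yield point, so whatever would have been a local variable has to be kept by hand in a struct.

```c
#include <assert.h>

/* A task that counts from 0 up to `limit`, yielding after every step.
 * Its live state is stored here instead of on the C stack. */
struct count_task {
    int id;
    int next;    /* would have been a loop variable */
    int limit;
};

/* Run one step. Returns 1 if the task wants to run again, 0 when done. */
static int count_task_step(struct count_task *t, int *trace, int *trace_len) {
    if (t->next >= t->limit)
        return 0;
    trace[(*trace_len)++] = t->id * 100 + t->next;   /* record (id, step) */
    t->next++;
    return 1;   /* yield point: return to the scheduler */
}

/* Cooperative round-robin scheduler: one task runs at a time, and
 * control changes hands only when a step function returns. */
static int run_round_robin(struct count_task *tasks, int ntasks, int *trace) {
    int trace_len = 0;
    int active = 1;
    while (active) {
        active = 0;
        for (int i = 0; i < ntasks; i++)
            active |= count_task_step(&tasks[i], trace, &trace_len);
    }
    return trace_len;
}
```

Because tasks switch only at the `return`, no locking is needed around `trace`; the cost is that every piece of per-task state must be moved into the struct by the programmer.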




- Cooperative tasks
  - most desirable when reasoning about concurrency
  - usually associated with event-driven programming
- Automatic stack management
  - most desirable when reading/maintaining code
  - usually associated with threaded (or serial) programming

Fibers: cooperative threading with automatic stack management




- Like threads, just an abstraction for flow of control
- Lighter weight than threads
  - In Windows, just a stack, subset of arch. registers, non-preemptive
  - \*Not\* just threads without exception support
  - stack management/impl has interplay with exceptions
  - Can be completely exception safe
- *Takeaway*: diversity of abstractions/containers for execution flows
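
A minimal fiber sketch, assuming a Linux/glibc system where the POSIX `ucontext` calls (`getcontext`, `makecontext`, `swapcontext`) are available. The slides mention Windows fibers; `ucontext` is used here only as a stand-in, and `fiber_body`, `run_fiber_demo`, and `trace` are illustrative names. The fiber yields cooperatively, yet its loop variable lives on its own stack and survives each yield.

```c
#include <assert.h>
#include <ucontext.h>

#define FIBER_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[FIBER_STACK_SIZE];
static int trace[8];
static int trace_len;

/* Fiber body: `i` is an ordinary local on the fiber's private stack. */
static void fiber_body(void) {
    for (int i = 0; i < 3; i++) {
        trace[trace_len++] = i;
        swapcontext(&fiber_ctx, &main_ctx);   /* cooperative yield */
    }
}

/* Create one fiber and resume it four times; returns entries traced. */
static int run_fiber_demo(void) {
    trace_len = 0;
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
    fiber_ctx.uc_link = &main_ctx;            /* resumed when the body returns */
    makecontext(&fiber_ctx, fiber_body, 0);

    for (int resume = 0; resume < 4; resume++) {
        trace[trace_len++] = 100 + resume;    /* scheduler side */
        swapcontext(&main_ctx, &fiber_ctx);   /* switch to the fiber */
    }
    return trace_len;
}
```

The switch happens entirely in user space and only where the code asks for it. The fourth resume lets the loop finish, after which `uc_link` hands control back to the caller.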

## x86\_64 Architectural Registers



- Register map diagram courtesy of Immae (own work), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525

#### Linux x86_64 context switch excerpt

```
/*
 * switch_to(x,y) should switch tasks from x to y.
 *
 * This could still be optimized:
 * - fold all the options into a flag word and test it with a single test.
 * - could test fs/gs bitsliced
 *
 * Kprobes not supported here. Set the probe on schedule instead.
 * Function graph tracer not supported too.
 */
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread;
	struct thread_struct *next = &next_p->thread;
	struct fpu *prev_fpu = &prev->fpu;
	struct fpu *next_fpu = &next->fpu;
	int cpu = smp_processor_id();
	struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);

	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
		     this_cpu_read(irq_count) != -1);

	switch_fpu_prepare(prev_fpu, cpu);

	/* We must save %fs and %gs before load_TLS() because
	 * %fs and %gs may be cleared by load_TLS().
	 *
	 * (e.g. xen_load_tls())
	 */
	save_fsgs(prev_p);

	/*
	 * Load TLS before restoring any segments so that segment loads
	 * reference the correct GDT entries.
	 */
	load_TLS(next, cpu);

	/*
	 * Leave lazy mode, flushing any hypercalls made here. This
	 * must be done after loading TLS entries in the GDT but before
	 * loading segments that might reference them, and it must
	 * be done before fpu_restore(), so the TS bit is up to
	 * date.
	 */
	arch_end_context_switch(next_p);

	/* Switch DS and ES.
	 *
	 * Reading them only returns the selectors, but writing them (if
	 * nonzero) loads the full descriptor from the GDT or LDT. The
	 * LDT for next is loaded in switch_mm, and the GDT is loaded
	 * above.
	 *
	 * We therefore need to write new values to the segment
	 * registers on every context switch unless both the new and old
	 * values are zero.
	 *
	 * Note that we don't need to do anything for CS and SS, as
	 * those are saved and restored as part of pt_regs.
	 */
	savesegment(es, prev->es);
	if (unlikely(next->es | prev->es))
		loadsegment(es, next->es);

	savesegment(ds, prev->ds);
	if (unlikely(next->ds | prev->ds))
		loadsegment(ds, next->ds);

	load_seg_legacy(prev->fsindex, prev->fsbase,
			next->fsindex, next->fsbase, FS);
	load_seg_legacy(prev->gsindex, prev->gsbase,
			next->gsindex, next->gsbase, GS);

	/* excerpt ends here; the function continues */
}
```

#### Complete fiber context switch on Unix and Windows

```
/*
 * The AMD64 architecture provides 16 general 64-bit registers together with 16
 * 128-bit SSE registers, overlapping with 8 legacy 80-bit x87 floating point
 * registers.
 *
 * Register   Both                               Unix only         Windows only
 * rax        Result register
 * rbx        Must be preserved
 * rcx                                           Fourth argument   First argument
 * rdx                                           Third argument    Second argument
 * rsp        Stack pointer, must be preserved
 * rbp        Frame pointer, must be preserved
 * rsi                                           Second argument   Must be preserved
 * rdi                                           First argument    Must be preserved
 * r8                                            Fifth argument    Third argument
 * r9                                            Sixth argument    Fourth argument
 * r10-r11    Volatile
 * r12-r15    Must be preserved
 * xmm0-5     Volatile
 * xmm6-15                                       Volatile          Must be preserved
 * fpcsr      Non volatile
 * mxcsr      Non volatile
 *
 * Thus for the two architectures we get slightly different lists of registers
 * to preserve.
 *
 * Registers "owned" by caller:
 * Unix:    rbx, rsp, rbp, r12-r15, mxcsr (control bits), x87 CW
 * Windows: rbx, rsp, rbp, rsi, rdi, r12-r15, xmm6-15
 */
```

# x86\_64 Registers and Threads



switch\_to(x,y) should switch tasks from x to y.

This could still be optimized:
- fold all the options into a flag word and test it with a single test.
- could test fs/gs bitsliced

Kprobes not supported here. Set the probe on schedule instead.

Function graph tracer not supported too.

int cpu = smp\_processor\_id(); struct tss\_struct \*tss = &per\_cpu(cpu\_tss\_rw, cpu);

this\_cpu\_read(irq\_count)

switch\_fpu\_prepare(prev\_fpu, cpu);

save\_fsgs(prev\_p);

• Register map diagram courtesy of: By Immae - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=32745525


## x86\_64 Registers and Fibers



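The AMD64 architecture provides 16 general 64-bit registers together with 16 128-bit SSE registers, overlapping with 8 legacy 80-bit x87 floating point registers. Thus for the two architectures we get slightly different lists of registers to preserve:

| Register | Unix (System V) ABI | Windows x64 ABI |
|----------|---------------------|-----------------|
| rdx | Third argument | Second argument |
| rsi | Second argument | Must be preserved |
| rdi | First argument | Must be preserved |
| r8 | Fifth argument | Third argument |
| r9 | Sixth argument | Fourth argument |
| rsp | Stack pointer, must be preserved | Stack pointer, must be preserved |
| rbp | Frame pointer, must be preserved | Frame pointer, must be preserved |
| r10-r11 | Volatile | Volatile |
| r12-r15 | Must be preserved | Must be preserved |
| xmm0-5 | Volatile | Volatile |
| xmm6-15 | Volatile | Must be preserved |
| fpcsr | Non volatile | Non volatile |
| mxcsr | Non volatile | Non volatile |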


#### Pthreads

- POSIX standard thread model
- Specifies the API and call semantics
- Popular: most thread libraries are Pthreads-compatible

## Can you find the bug here?

```
// What is printed for myNum?
void *threadFunc(void *pArg) {
  int *p = (int *)pArg;
  int myNum = *p;
  printf("Thread number %d\n", myNum);
  return NULL;
}

// from main():
for (int i = 0; i < numThreads; i++) {
  pthread_create(&tid[i], NULL, threadFunc, &i);
}
```
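The bug is in the argument: every thread is handed `&i`, the address of the one loop variable that `main()` keeps incrementing, so the value of `myNum` depends on timing (and the unsynchronized read of `i` is a data race). A minimal sketch of one common fix follows. The `ids` array, the `seen` array, and the `run_threads` helper are assumptions added for illustration and are not part of the slide.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_THREADS 16

static int seen[MAX_THREADS];   /* seen[k] becomes 1 when the thread given id k runs */

static void *threadFunc(void *pArg) {
    int myNum = *(int *)pArg;   /* each thread reads its own slot, not a shared i */
    printf("Thread number %d\n", myNum);
    seen[myNum] = 1;
    return NULL;
}

/* Hypothetical helper: starts numThreads threads with distinct ids, joins
 * them, and returns how many distinct ids were observed. */
static int run_threads(int numThreads) {
    pthread_t tid[MAX_THREADS];
    int ids[MAX_THREADS];       /* one argument slot per thread, alive until join */
    int count = 0;

    assert(numThreads >= 0 && numThreads <= MAX_THREADS);
    for (int i = 0; i < numThreads; i++) {
        seen[i] = 0;
        ids[i] = i;
        int rc = pthread_create(&tid[i], NULL, threadFunc, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < numThreads; i++) {
        pthread_join(tid[i], NULL);
        count += seen[i];
    }
    return count;
}
```

Calling `run_threads(4)` prints the numbers 0 through 3, each exactly once; the order is still up to the scheduler.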

- Type: pthread\_mutex\_t

```
int pthread_mutex_init(pthread_mutex_t *mutex,...);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
```

- Attributes: for shared mutexes/condition vars among processes, for priority inheritance, etc.
  - use defaults
- Important: Mutex scope must be visible to all threads!
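As a rough illustration of the calls above, the sketch below protects a shared counter with a mutex that uses default attributes. The names `counter_lock`, `counter`, `worker`, and `run_mutex_demo` are hypothetical; the mutex is declared at file scope so that it is visible to all threads.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4
#define NITERS   100000

/* File scope, so the mutex and the counter are visible to every thread. */
static pthread_mutex_t counter_lock;
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                          /* protected update */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_mutex_demo(void) {
    pthread_t tid[NTHREADS];

    counter = 0;
    int rc = pthread_mutex_init(&counter_lock, NULL);  /* NULL: default attributes */
    assert(rc == 0);
    for (int i = 0; i < NTHREADS; i++) {
        rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&counter_lock);
    return counter;
}
```

With the lock in place, every increment is kept, so the result is `NTHREADS * NITERS`.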

- Type: pthread\_spinlock\_t

```
int pthread_spin_init(pthread_spinlock_t *lock, int pshared);
int pthread_spin_destroy(pthread_spinlock_t *lock);
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);
int pthread_spin_trylock(pthread_spinlock_t *lock);
```

Wait...what's the difference?

```
int pthread_mutex_init(pthread_mutex_t *mutex,...);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
```
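The two APIs are used the same way; the difference is in how a thread waits. A spin lock busy-waits on the CPU until the lock is free, while a mutex is allowed to put the waiting thread to sleep. The sketch below is the spin-lock version of a shared counter, assuming a platform that provides POSIX spin locks (Linux with glibc does). The names `spin`, `spin_counter`, `spin_worker`, and `run_spin_demo` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define SPIN_THREADS 4
#define SPIN_ITERS   50000

static pthread_spinlock_t spin;
static long spin_counter;

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < SPIN_ITERS; i++) {
        pthread_spin_lock(&spin);       /* busy-waits while another thread holds it */
        spin_counter++;
        pthread_spin_unlock(&spin);
    }
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_spin_demo(void) {
    pthread_t tid[SPIN_THREADS];

    spin_counter = 0;
    /* PTHREAD_PROCESS_PRIVATE: the lock is only used by threads of this process. */
    int rc = pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);
    for (int i = 0; i < SPIN_THREADS; i++) {
        rc = pthread_create(&tid[i], NULL, spin_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < SPIN_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_spin_destroy(&spin);
    return spin_counter;
}
```

Spinning tends to pay off only when critical sections are very short and the lock holder is actually running on another core.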

```
while(1) {
    Entry section
    Critical section
    Exit section
    Non-critical section
}
```

- Safety
  - At most one thread is in the critical region at a time
- Liveness
  - Some thread that enters the entry section eventually enters the critical region
  - Even if another thread takes forever in the non-critical region
- Bounded waiting
  - A thread that enters the entry section enters the critical section within some bounded number of operations.
  - If a thread i is in the entry section, then there is a bound on the number of times that other threads are allowed to enter the critical section before thread i's request is granted
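The four sections of the loop can be mapped onto code. The sketch below uses a Pthreads mutex for the entry and exit sections and counts how often more than one thread is observed inside the critical section. It only checks the safety property, and a test like this can find violations but cannot prove their absence. The names `cs_lock`, `in_critical`, `violations`, `cs_worker`, and `count_safety_violations` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define CS_THREADS 4
#define CS_ITERS   20000

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_critical;   /* threads currently inside the critical section */
static atomic_int violations;    /* times a thread found someone else inside */

static void *cs_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < CS_ITERS; i++) {
        pthread_mutex_lock(&cs_lock);                /* entry section */
        if (atomic_fetch_add(&in_critical, 1) != 0)  /* critical section */
            atomic_fetch_add(&violations, 1);
        atomic_fetch_sub(&in_critical, 1);
        pthread_mutex_unlock(&cs_lock);              /* exit section */
        /* non-critical section: nothing to do in this sketch */
    }
    return NULL;
}

/* Hypothetical driver: returns the number of observed safety violations. */
static int count_safety_violations(void) {
    pthread_t tid[CS_THREADS];

    atomic_store(&in_critical, 0);
    atomic_store(&violations, 0);
    for (int i = 0; i < CS_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, cs_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < CS_THREADS; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&violations);
}
```

Liveness and bounded waiting are not checked here; they are properties of all possible executions and are much harder to observe in a test.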

Mutex, spinlock, etc. are ways to implement these.

Did we get all the important conditions? Why is correctness defined in terms of locks?

Theorem: Every property is a combination of a safety property and a liveness property.

-Bowen Alpern & Fred Schneider

https://www.cs.cornell.edu/fbs/publications/defliveness.pdf

```
int lock_value = 0;
int* lock = &lock_value;

// Naive lock: the test and the set are two separate steps.
void Lock::Acquire() {
   while (*lock == 1)
        ; //spin
   *lock = 1;
}

void Lock::Release() {
    *lock = 0;
}
```

#### What are the problem(s) with this?

- A. CPU usage
- B. Memory usage
- C. Lock::Acquire() latency
- D. Memory bus usage
- E. Does not work

Completely and utterly broken. How can we fix it?
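Why it does not work: the test (`*lock == 1`) and the set (`*lock = 1`) are separate steps, so two threads can both pass the test before either one stores. The sketch below replays that interleaving by hand in a single thread so that the outcome is deterministic; `replay_bad_interleaving` and the `t1`/`t2` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Replays one bad interleaving of the naive Acquire() step by step.
 * Returns how many "threads" end up believing they hold the lock. */
static int replay_bad_interleaving(void) {
    int lock_value = 0;
    int *lock = &lock_value;
    int holders = 0;

    bool t1_saw_free = (*lock == 0);  /* T1 leaves its while loop */
    bool t2_saw_free = (*lock == 0);  /* T2 leaves its loop before T1 stores */

    if (t1_saw_free) {                /* T1: *lock = 1 */
        *lock = 1;
        holders++;
    }
    if (t2_saw_free) {                /* T2: *lock = 1, unaware of T1's store */
        *lock = 1;
        holders++;
    }
    return holders;
}
```

Both threads end up in the critical section, which violates safety.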

IDEA: hardware implements something like:

```
bool rmw(addr, value) {
  atomic {
    tmp = *addr;
    newval = modify(tmp);
    *addr = newval;
  }
}
```

Why is that hard? How can we do it?

#### Preview of Techniques:

- Bus locking
- Single-instruction ISA extensions
  - Test&Set
  - CAS: Compare & swap
  - Exchange, locked increment, locked decrement (x86)
- Multi-instruction ISA extensions:
  - LLSC: (PowerPC, Alpha, MIPS)
  - Transactional Memory (x86, PowerPC)
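From C, these hardware operations are usually reached through the C11 `<stdatomic.h>` interface rather than written by hand. The sketch below uses `atomic_fetch_add`, which performs the read, the modify, and the write as one atomic step. The names `rmw_counter`, `rmw_worker`, and `run_rmw_demo` are hypothetical, and which instruction the compiler emits depends on the target.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define RMW_THREADS 4
#define RMW_ITERS   100000

static atomic_long rmw_counter;

static void *rmw_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < RMW_ITERS; i++)
        atomic_fetch_add(&rmw_counter, 1);  /* read, add, write as one atomic step */
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_rmw_demo(void) {
    pthread_t tid[RMW_THREADS];

    atomic_store(&rmw_counter, 0);
    for (int i = 0; i < RMW_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, rmw_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < RMW_THREADS; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&rmw_counter);
}
```

No lock is needed: because each increment is a single atomic RMW, no update is lost.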

```
int lock_value = 0;
int* lock = &lock_value;

void Lock::Acquire() {
  while (test&set(lock) == 1)
    ; //spin
}

void Lock::Release() {
    *lock = 0;
}
```

(test & set ~= CAS ~= LLSC)

TST: *Test&set*

- Reads a value from memory
- Writes "1" back to the memory location

#### What are the problem(s) with this?

- A. CPU usage
- B. Memory usage
- C. Lock::Acquire() latency
- D. Memory bus usage
- E. Does not work
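A minimal sketch of the same test&set lock in portable C11, using `atomic_flag_test_and_set`, which returns the previous value and sets the flag in one atomic step. The names `tas_lock`, `tas_acquire`, `tas_release`, `tas_worker`, and `run_tas_demo` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define TAS_THREADS 4
#define TAS_ITERS   50000

static atomic_flag tas_lock = ATOMIC_FLAG_INIT;  /* clear means unheld */
static long tas_counter;                         /* protected by tas_lock */

static void tas_acquire(void) {
    /* Returns the old value: true means someone else already holds the lock. */
    while (atomic_flag_test_and_set(&tas_lock))
        ;  /* spin */
}

static void tas_release(void) {
    atomic_flag_clear(&tas_lock);
}

static void *tas_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < TAS_ITERS; i++) {
        tas_acquire();
        tas_counter++;
        tas_release();
    }
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_tas_demo(void) {
    pthread_t tid[TAS_THREADS];

    tas_counter = 0;
    for (int i = 0; i < TAS_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, tas_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < TAS_THREADS; i++)
        pthread_join(tid[i], NULL);
    return tas_counter;
}
```

This version is safe, but every failed attempt in the spin loop is still a write to the lock word, which is one reason CPU usage and memory bus usage remain on the list of problems.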

# Multiprocessor Cache Coherence

| Physics | Concurrency |
|---------|-------------|
| F = ma  | ~ coherence |





- P1: read X
- P2: read X
- P2: X++
- P3: read X











- Each cache line has a state (M, E, S, I)
- Processors "snoop" bus to maintain states
- Initially → 'I' → Invalid
- Read one → 'E' → exclusive
- Reads → 'S' → multiple copies possible
- Write → 'M' → single copy → lots of cache coherence traffic
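The transitions above can be sketched as a tiny simulation of one cache line held by three caches. This is a simplified model of textbook MESI under stated assumptions (one line, no write-backs or memory modeled); the names `mesi_state`, `line`, `mesi_reset`, `mesi_read`, `mesi_write`, and `mesi_state_of` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state;

#define NCACHES 3
static mesi_state line[NCACHES];  /* state of the line holding X in each cache */

static void mesi_reset(void) {
    for (int i = 0; i < NCACHES; i++)
        line[i] = MESI_I;         /* initially invalid everywhere */
}

/* Processor p reads X. */
static void mesi_read(int p) {
    if (line[p] != MESI_I)
        return;                   /* hit: M, E and S all satisfy a read */
    bool shared = false;
    for (int q = 0; q < NCACHES; q++) {
        if (q != p && line[q] != MESI_I) {
            line[q] = MESI_S;     /* snooped read: M/E/S holders drop to S */
            shared = true;
        }
    }
    line[p] = shared ? MESI_S : MESI_E;
}

/* Processor p writes X. */
static void mesi_write(int p) {
    for (int q = 0; q < NCACHES; q++)
        if (q != p)
            line[q] = MESI_I;     /* snooped write: other copies are invalidated */
    line[p] = MESI_M;
}

static mesi_state mesi_state_of(int p) {
    return line[p];
}
```

Replaying the sequence from the earlier slide (P1 read, P2 read, P2 X++, P3 read) with indices 0, 1, 2 leaves the three caches in I, S, S.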





[Figure: MESI state-transition diagram. Labels recovered from the image: EXCLUSIVE, PrWr/BusRdX, PrRd/-, BusRd(S).]

[Figure: animation of P1 and P2 each running the straw-person lock below while the MESI state of the `lock` line changes. Labels recovered from the diagram: EXCLUSIVE, PrWr/BusRdX, PrRd/-, BusRd(S), BusRdX/Flush. One frame is annotated "SAFE!".]

```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
     test R0
     bnz try
     store lock, 1
}
```

# Read-Modify-Write (RMW)

- Implementing locks requires read-modify-write operations
- Required effect is:
  - An atomic and isolated action
    1. read memory location AND
    2. write a new value to the location
- RMW is *very tricky* in multi-processors
  - Cache coherence alone doesn't solve it

```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
          test R0
          bnz try
          store lock, 1
}
```



# Essence of HW-supported RMW

```
// (straw-person lock impl)
// Initially, lock == 0 (unheld)
lock() {
try: load lock, R0
     test R0
     bnz try
     store lock, 1
}
```

Make this into a single (atomic) hardware instruction.

# HW Support for Read-Modify-Write (RMW)

| Test & Set | CAS | Exchange, locked increment/decrement | LLSC: load-linked store-conditional |
|------------|-----|--------------------------------------|-------------------------------------|
| Most architectures | Many architectures | x86 | PPC, Alpha, MIPS |
| <pre>int TST(addr) {   atomic {     ret = *addr;     if(!*addr)       *addr = 1;     return ret;   } }</pre> | <pre>bool cas(addr, old, new) {   atomic {     if(*addr == old) {       *addr = new;       return true;     }     return false;   } }</pre> | <pre>int XCHG(addr, val) {   atomic {     ret = *addr;     *addr = val;     return ret;   } }</pre> | <pre>bool LLSC(addr, val) {   ret = *addr;   atomic {     if(*addr == ret) {       *addr = val;       return true;     }     return false;   } }</pre> |

```
void CAS_lock(lock) {
  // lock is the address of the lock word: spin until we swap 0 (unheld) for 1 (held)
  while(cas(lock, 0, 1) != true);
}
```
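A minimal sketch of the CAS lock in C11, using `atomic_compare_exchange_strong` in place of the slide's `cas`. The names `cas_lock_word`, `cas_lock`, `cas_unlock`, `cas_worker`, and `run_cas_demo` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define CAS_THREADS 4
#define CAS_ITERS   50000

static atomic_int cas_lock_word;  /* 0 = unheld, 1 = held */
static long cas_counter;          /* protected by cas_lock_word */

static void cas_lock(atomic_int *lock) {
    int expected = 0;
    /* On failure the call overwrites expected with the value it saw,
     * so it has to be reset to 0 before the next attempt. */
    while (!atomic_compare_exchange_strong(lock, &expected, 1))
        expected = 0;
}

static void cas_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}

static void *cas_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < CAS_ITERS; i++) {
        cas_lock(&cas_lock_word);
        cas_counter++;
        cas_unlock(&cas_lock_word);
    }
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_cas_demo(void) {
    pthread_t tid[CAS_THREADS];

    atomic_store(&cas_lock_word, 0);
    cas_counter = 0;
    for (int i = 0; i < CAS_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, cas_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < CAS_THREADS; i++)
        pthread_join(tid[i], NULL);
    return cas_counter;
}
```

Note that the C11 call takes the expected value by pointer, unlike the three-argument pseudocode in the table.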

# HW Support for RMW: LL-SC

### **LLSC:** load-linked store-conditional

PPC, Alpha, MIPS

```
bool LLSC(addr, val) {
  ret = *addr;
  atomic {
    if(*addr == ret) {
      *addr = val;
      return true;
    }
    return false;
  }
}
```

```
void LLSC_lock(lock) {
  while(1) {
    old = load-linked(lock);
    if(old == 0 && store-cond(lock, 1))
      return;
  }
}
```

- Load-linked is a load that is "linked" to a subsequent store-conditional
- Store-conditional succeeds only if the value from the load-linked is unchanged
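C has no portable way to issue load-linked and store-conditional directly. As a stand-in, the sketch below uses `atomic_compare_exchange_weak`, which the C11 standard allows to fail spuriously and which therefore has to be retried in a loop, much like a store-conditional. The mapping to `ll`/`sc` is an analogy, and the names `llsc_word`, `llsc_style_lock`, `llsc_style_unlock`, `llsc_worker`, and `run_llsc_demo` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define LLSC_THREADS 4
#define LLSC_ITERS   50000

static atomic_int llsc_word;  /* 0 = unheld, 1 = held */
static long llsc_counter;     /* protected by llsc_word */

static void llsc_style_lock(atomic_int *lock) {
    for (;;) {
        int old = atomic_load(lock);  /* plays the role of load-linked */
        /* The weak CAS plays the role of store-conditional: it may fail
         * because the word changed, or spuriously; either way we retry. */
        if (old == 0 && atomic_compare_exchange_weak(lock, &old, 1))
            return;
    }
}

static void llsc_style_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}

static void *llsc_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < LLSC_ITERS; i++) {
        llsc_style_lock(&llsc_word);
        llsc_counter++;
        llsc_style_unlock(&llsc_word);
    }
    return NULL;
}

/* Hypothetical driver: runs the workers and returns the final counter. */
static long run_llsc_demo(void) {
    pthread_t tid[LLSC_THREADS];

    atomic_store(&llsc_word, 0);
    llsc_counter = 0;
    for (int i = 0; i < LLSC_THREADS; i++) {
        int rc = pthread_create(&tid[i], NULL, llsc_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < LLSC_THREADS; i++)
        pthread_join(tid[i], NULL);
    return llsc_counter;
}
```

The loop has the same shape as `LLSC_lock`: read the lock word, and only attempt the conditional store when the lock looked free.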



[Figure: animation of P1 and P2 each running the `ll`/`sc` lock loop, with the State and Data of the `lock` line shown for two copies. Frames recovered, in order:]

- First pass: `lock: 0` → `lock: S[L] 0` / `lock: 0` → `lock: M 1` / `lock: 0`
- Second pass: `lock: S[L] 0` / `lock: 0` → `lock: S[L] 0` / `lock: S[L] 0` → `lock: M 1` / `lock: 0`

```
// Store conditional fails
lock(lock) {
 while(1) {
    old = ll(lock);
    if(old == 0)
      if(sc(lock, 1))
        return;
```