diff --git a/book.toml b/book.toml index 693aca4a..cf98b4e6 100644 --- a/book.toml +++ b/book.toml @@ -31,5 +31,8 @@ git-repository-url = "https://github.com/rust-lang/nomicon" "./arc-layout.html" = "./arc-mutex/arc-layout.html" "./arc.html" = "./arc-mutex/arc.html" +# Atomics chapter +"./atomics.html" = "./atomics/atomics.html" + [rust] edition = "2021" diff --git a/src/SUMMARY.md b/src/SUMMARY.md index f1d15a71..01304c5e 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -41,7 +41,12 @@ * [Concurrency](concurrency.md) * [Races](races.md) * [Send and Sync](send-and-sync.md) - * [Atomics](atomics.md) + * [Atomics](./atomics/atomics.md) + * [Multithreaded Execution](./atomics/multithread.md) + * [Relaxed](./atomics/relaxed.md) + * [Acquire and Release](./atomics/acquire-release.md) + * [SeqCst](./atomics/seqcst.md) + * [Fences](./atomics/fences.md) * [Implementing Vec](./vec/vec.md) * [Layout](./vec/vec-layout.md) * [Allocating](./vec/vec-alloc.md) diff --git a/src/arc-mutex/arc-clone.md b/src/arc-mutex/arc-clone.md index 1adc6c9e..29cb5c77 100644 --- a/src/arc-mutex/arc-clone.md +++ b/src/arc-mutex/arc-clone.md @@ -28,7 +28,7 @@ happens-before relationship but is atomic. When `Drop`ping the Arc, however, we'll need to atomically synchronize when decrementing the reference count. This is described more in [the section on the `Drop` implementation for `Arc`](arc-drop.md). For more information on atomic relationships and Relaxed -ordering, see [the section on atomics](../atomics.md). +ordering, see [the section on atomics](../atomics/atomics.md). Thus, the code becomes this: diff --git a/src/atomics.md b/src/atomics.md deleted file mode 100644 index 72a2d56f..00000000 --- a/src/atomics.md +++ /dev/null @@ -1,239 +0,0 @@ -# Atomics - -Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not -due to this model being particularly excellent or easy to understand. 
Indeed, -this model is quite complex and known to have [several flaws][C11-busted]. -Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad -at modeling atomics. At very least, we can benefit from existing tooling and -research around the C/C++ memory model. -(You'll often see this model referred to as "C/C++11" or just "C11". C just copies -the C++ memory model; and C++11 was the first version of the model but it has -received some bugfixes since then.) - -Trying to fully explain the model in this book is fairly hopeless. It's defined -in terms of madness-inducing causality graphs that require a full book to -properly understand in a practical way. If you want all the nitty-gritty -details, you should check out the [C++ specification][C++-model]. -Still, we'll try to cover the basics and some of the problems Rust developers -face. - -The C++ memory model is fundamentally about trying to bridge the gap between the -semantics we want, the optimizations compilers want, and the inconsistent chaos -our hardware wants. *We* would like to just write programs and have them do -exactly what we said but, you know, fast. Wouldn't that be great? - -## Compiler Reordering - -Compilers fundamentally want to be able to do all sorts of complicated -transformations to reduce data dependencies and eliminate dead code. In -particular, they may radically change the actual order of events, or make events -never occur! If we write something like: - - -```rust,ignore -x = 1; -y = 3; -x = 2; -``` - -The compiler may conclude that it would be best if your program did: - - -```rust,ignore -x = 2; -y = 3; -``` - -This has inverted the order of events and completely eliminated one event. -From a single-threaded perspective this is completely unobservable: after all -the statements have executed we are in exactly the same state. But if our -program is multi-threaded, we may have been relying on `x` to actually be -assigned to 1 before `y` was assigned. 
We would like the compiler to be -able to make these kinds of optimizations, because they can seriously improve -performance. On the other hand, we'd also like to be able to depend on our -program *doing the thing we said*. - -## Hardware Reordering - -On the other hand, even if the compiler totally understood what we wanted and -respected our wishes, our hardware might instead get us in trouble. Trouble -comes from CPUs in the form of memory hierarchies. There is indeed a global -shared memory space somewhere in your hardware, but from the perspective of each -CPU core it is *so very far away* and *so very slow*. Each CPU would rather work -with its local cache of the data and only go through all the anguish of -talking to shared memory only when it doesn't actually have that memory in -cache. - -After all, that's the whole point of the cache, right? If every read from the -cache had to run back to shared memory to double check that it hadn't changed, -what would the point be? The end result is that the hardware doesn't guarantee -that events that occur in some order on *one* thread, occur in the same -order on *another* thread. To guarantee this, we must issue special instructions -to the CPU telling it to be a bit less smart. - -For instance, say we convince the compiler to emit this logic: - -```text -initial state: x = 0, y = 1 - -THREAD 1 THREAD 2 -y = 3; if x == 1 { -x = 1; y *= 2; - } -``` - -Ideally this program has 2 possible final states: - -* `y = 3`: (thread 2 did the check before thread 1 completed) -* `y = 6`: (thread 2 did the check after thread 1 completed) - -However there's a third potential state that the hardware enables: - -* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) - -It's worth noting that different kinds of CPU provide different guarantees. It -is common to separate hardware into two categories: strongly-ordered and weakly-ordered. 
-Most notably x86/64 provides strong ordering guarantees, while ARM -provides weak ordering guarantees. This has two consequences for concurrent -programming: - -* Asking for stronger guarantees on strongly-ordered hardware may be cheap or - even free because they already provide strong guarantees unconditionally. - Weaker guarantees may only yield performance wins on weakly-ordered hardware. - -* Asking for guarantees that are too weak on strongly-ordered hardware is - more likely to *happen* to work, even though your program is strictly - incorrect. If possible, concurrent algorithms should be tested on - weakly-ordered hardware. - -## Data Accesses - -The C++ memory model attempts to bridge the gap by allowing us to talk about the -*causality* of our program. Generally, this is by establishing a *happens -before* relationship between parts of the program and the threads that are -running them. This gives the hardware and compiler room to optimize the program -more aggressively where a strict happens-before relationship isn't established, -but forces them to be more careful where one is established. The way we -communicate these relationships are through *data accesses* and *atomic -accesses*. - -Data accesses are the bread-and-butter of the programming world. They are -fundamentally unsynchronized and compilers are free to aggressively optimize -them. In particular, data accesses are free to be reordered by the compiler on -the assumption that the program is single-threaded. The hardware is also free to -propagate the changes made in data accesses to other threads as lazily and -inconsistently as it wants. Most critically, data accesses are how data races -happen. Data accesses are very friendly to the hardware and compiler, but as -we've seen they offer *awful* semantics to try to write synchronized code with. -Actually, that's too weak. 
- -**It is literally impossible to write correct synchronized code using only data -accesses.** - -Atomic accesses are how we tell the hardware and compiler that our program is -multi-threaded. Each atomic access can be marked with an *ordering* that -specifies what kind of relationship it establishes with other accesses. In -practice, this boils down to telling the compiler and hardware certain things -they *can't* do. For the compiler, this largely revolves around re-ordering of -instructions. For the hardware, this largely revolves around how writes are -propagated to other threads. The set of orderings Rust exposes are: - -* Sequentially Consistent (SeqCst) -* Release -* Acquire -* Relaxed - -(Note: We explicitly do not expose the C++ *consume* ordering) - -TODO: negative reasoning vs positive reasoning? TODO: "can't forget to -synchronize" - -## Sequentially Consistent - -Sequentially Consistent is the most powerful of all, implying the restrictions -of all other orderings. Intuitively, a sequentially consistent operation -cannot be reordered: all accesses on one thread that happen before and after a -SeqCst access stay before and after it. A data-race-free program that uses -only sequentially consistent atomics and data accesses has the very nice -property that there is a single global execution of the program's instructions -that all threads agree on. This execution is also particularly nice to reason -about: it's just an interleaving of each thread's individual executions. This -does not hold if you start using the weaker atomic orderings. - -The relative developer-friendliness of sequential consistency doesn't come for -free. Even on strongly-ordered platforms sequential consistency involves -emitting memory fences. - -In practice, sequential consistency is rarely necessary for program correctness. -However sequential consistency is definitely the right choice if you're not -confident about the other memory orders. 
Having your program run a bit slower -than it needs to is certainly better than it running incorrectly! It's also -mechanically trivial to downgrade atomic operations to have a weaker -consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of -course, proving that this transformation is *correct* is a whole other matter. - -## Acquire-Release - -Acquire and Release are largely intended to be paired. Their names hint at their -use case: they're perfectly suited for acquiring and releasing locks, and -ensuring that critical sections don't overlap. - -Intuitively, an acquire access ensures that every access after it stays after -it. However operations that occur before an acquire are free to be reordered to -occur after it. Similarly, a release access ensures that every access before it -stays before it. However operations that occur after a release are free to be -reordered to occur before it. - -When thread A releases a location in memory and then thread B subsequently -acquires *the same* location in memory, causality is established. Every write -(including non-atomic and relaxed atomic writes) that happened before A's -release will be observed by B after its acquisition. However no causality is -established with any other threads. Similarly, no causality is established -if A and B access *different* locations in memory. - -Basic use of release-acquire is therefore simple: you acquire a location of -memory to begin the critical section, and then release that location to end it. -For instance, a simple spinlock might look like: - -```rust -use std::sync::Arc; -use std::sync::atomic::{AtomicBool, Ordering}; -use std::thread; - -fn main() { - let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?" - - // ... distribute lock to threads somehow ... - - // Try to acquire the lock by setting it to true - while lock.compare_and_swap(false, true, Ordering::Acquire) { } - // broke out of the loop, so we successfully acquired the lock! 
- - // ... scary data accesses ... - - // ok we're done, release the lock - lock.store(false, Ordering::Release); -} -``` - -On strongly-ordered platforms most accesses have release or acquire semantics, -making release and acquire often totally free. This is not the case on -weakly-ordered platforms. - -## Relaxed - -Relaxed accesses are the absolute weakest. They can be freely re-ordered and -provide no happens-before relationship. Still, relaxed operations are still -atomic. That is, they don't count as data accesses and any read-modify-write -operations done to them occur atomically. Relaxed operations are appropriate for -things that you definitely want to happen, but don't particularly otherwise care -about. For instance, incrementing a counter can be safely done by multiple -threads using a relaxed `fetch_add` if you're not using the counter to -synchronize any other accesses. - -There's rarely a benefit in making an operation relaxed on strongly-ordered -platforms, since they usually provide release-acquire semantics anyway. However -relaxed operations can be cheaper on weakly-ordered platforms. - -[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf -[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md new file mode 100644 index 00000000..6d6f0b95 --- /dev/null +++ b/src/atomics/acquire-release.md @@ -0,0 +1,354 @@ +# Acquire and Release + +Next, we’re going to try and implement one of the simplest concurrent utilities +possible — a mutex, but without support for waiting (since that’s not really +related to what we’re doing now). It will hold both an atomic flag that +indicates whether it is locked or not, and the protected data itself. 
In code
+this translates to:
+
+```rust
+use std::cell::UnsafeCell;
+use std::sync::atomic::AtomicBool;
+
+pub struct Mutex<T> {
+    locked: AtomicBool,
+    data: UnsafeCell<T>,
+}
+
+impl<T> Mutex<T> {
+    pub const fn new(data: T) -> Self {
+        Self {
+            locked: AtomicBool::new(false),
+            data: UnsafeCell::new(data),
+        }
+    }
+}
+```
+
+Now for the lock function. We need to use an RMW here, since we need to both
+check whether it is locked and lock it if it isn’t in a single atomic step; this
+can be most simply done with a `compare_exchange` (unlike before, it doesn’t
+need to be in a loop this time). For the ordering, we’ll just use `Relaxed`
+since we don’t know of any others yet.
+
+```rust
+# use std::cell::UnsafeCell;
+# use std::sync::atomic::{self, AtomicBool};
+# pub struct Mutex<T> {
+#     locked: AtomicBool,
+#     data: UnsafeCell<T>,
+# }
+impl<T> Mutex<T> {
+    pub fn lock(&self) -> Option<Guard<'_, T>> {
+        match self.locked.compare_exchange(
+            false,
+            true,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        ) {
+            Ok(_) => Some(Guard(self)),
+            Err(_) => None,
+        }
+    }
+}
+
+pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
+// Deref impl omitted…
+```
+
+We also need to implement `Drop` for `Guard` to make sure the lock on the mutex
+is released once the guard is destroyed. Again we’re just using the `Relaxed`
+ordering.
+
+```rust
+# use std::cell::UnsafeCell;
+# use std::sync::atomic::{self, AtomicBool};
+# pub struct Mutex<T> {
+#     locked: AtomicBool,
+#     data: UnsafeCell<T>,
+# }
+# pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
+impl<T> Drop for Guard<'_, T> {
+    fn drop(&mut self) {
+        self.0.locked.store(false, atomic::Ordering::Relaxed);
+    }
+}
+```
+
+Great! In normal operation, then, this primitive should allow unique access
+to the data of the mutex to be transferred across different threads.
Usual usage +could look like this: + +```rust,ignore +// Initial state +let mutex = Mutex::new(0); +// Thread 1 +if let Some(guard) = mutex.lock() { + *guard += 1; +} +// Thread 2 +if let Some(guard) = mutex.lock() { + println!("{}", *guard); +} +``` + +Now, there are many possible executions of this code. For example, Thread 2 (the +reader thread) could lock the mutex first, and Thread 1 (the writer thread) +could fail to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 ├╌┐ ┌─┤ cas │ +╰───────╯ │ └────────┘ └───┘ ┊ │ ╰───╥───╯ + │ ┌────────┬───────┼─┘ ╭───⇓───╮ + └─┤ true │ └╌╌╌┤ guard │ + └────────┘ ╰───╥───╯ + ┌────────┬─────────┐ ╭───⇓───╮ + │ false │ └─┤ store │ + └────────┘ ╰───────╯ +``` + +Or potentially Thread _1_ could lock the mutex first, and Thread _2_ could fail +to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ ┌─│ 0 │───┤ cas │ +╰───╥───╯ │ └────────┘ │┌┼╌╌╌┤ ╰───────╯ +╭───⇓───╮ └─┬────────┐ │├┼╌╌╌┤ +│ += 1; ├╌┐ │ true ├─┘┊│ 1 │ +╰───╥───╯ ┊ └────────┘ ┊└───┘ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ +│ store ├───┬────────┐ +╰───────╯ │ false │ + └────────┘ +``` + +But the interesting case comes in when Thread 1 successfully locks and unlocks +the mutex, and then Thread 2 locks it. 
Let’s draw that one out too:
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌────────┐ ┌───┐ ╭───────╮
+│ cas ├─┐ │ false │ │ 0 │ ┌───┤ cas │
+╰───╥───╯ │ └────────┘ ┌┼╌╌╌┤ │ ╰───╥───╯
+╭───⇓───╮ └─┬────────┐ ├┼╌╌╌┤ │ ╭───⇓───╮
+│ += 1; ├╌┐ │ true │ ┊│ 1 │ │ ?╌┤ guard │
+╰───╥───╯ ┊ └────────┘ ┊└───┘ │ ╰───╥───╯
+╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮
+│ store ├───┬────────┐ │ ┌─┤ store │
+╰───────╯ │ false │ │ │ ╰───────╯
+ └────────┘ │ │
+ ┌────────┬─────────┘ │
+ │ true │ │
+ └────────┘ │
+ ┌────────┬───────────┘
+ │ false │
+ └────────┘
+```
+
+Look at the second operation Thread 2 performs (the read of `data`), for which
+we haven’t yet joined the line. Where should it connect to? Well actually, it
+has multiple options…wait, we’ve seen this before! It’s a data race!
+
+That’s not good. Last time the solution was to use atomics instead — but in this
+case that doesn’t seem to be enough, since even if atomics were used it still
+would have the _option_ of reading `0` instead of `1`, and really if we want our
+mutex to be sane, it should only be able to read `1`.
+
+So it seems that what we _want_ is to be able to apply the coherence rules from
+before to completely rule out zero from the set of the possible values — if we
+were able to draw a large arrow from Thread 1’s `+= 1;` to Thread 2’s
+`guard`, then we could then trivially use the rule to rule out `0` as a value
+that could be read.
+
+This is where the `Acquire` and `Release` orderings come in. Informally put, a
+_release store_ will cause an arrow instead of a line to be drawn from the
+operation to the destination; and similarly an _acquire load_ will cause an
+arrow to be drawn from the destination to the operation.
To give a useless +example that illustrates this, for the given program: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +// Initial state +let a = AtomicU32::new(0); +// Thread 1 +a.store(1, atomic::Ordering::Release); +// Thread 2 +a.load(atomic::Ordering::Acquire); +``` + +The two possible executions look like this: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ +Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 +╭───────╮ ┌───┐ ╭──────╮ ┃ ╭───────╮ ┌───┐ ╭──────╮ +│ store ├─┐ │ 0 │ ┌─→ load │ ┃ │ store ├─┐ │ 0 ├───→ load │ +╰───────╯ │ └───┘ │ ╰──────╯ ┃ ╰───────╯ │ └───┘ ╰──────╯ + └─↘───┐ │ ┃ └─↘───┐ + │ 1 ├─┘ ┃ │ 1 │ + └───┘ ┃ └───┘ +``` + +These arrows are a new kind of arrow we haven’t seen yet; they are known as +_happens-before_ (or happens-after) relations and are represented as thin arrows +(→) on these diagrams. They are weaker than the _sequenced-before_ +double-arrows (⇒) that occur inside a single thread, but can still be used with +the coherence rules to determine which values of a memory location are valid to +read. + +When a happens-before arrow stores a data value to an atomic (via a release +operation) which is then loaded by another happens-before arrow (via an acquire +operation) we say that the release operation _synchronized-with_ the acquire +operation, which in doing so establishes that the release operation +_happens-before_ the acquire operation. Therefore, we can say that in the first +possible execution, Thread 1’s `store` synchronizes-with Thread 2’s `load`, +which causes that `store` and everything sequenced-before it to happen-before +the `load` and everything sequenced-after it. + +> More formally, we can say that A happens-before B if any of the following +> conditions are true: +> 1. A is sequenced-before B (i.e. A occurs before B on the same thread) +> 2. A synchronizes-with B (i.e. A is a `Release` operation and B is an +> `Acquire` operation that reads the value written by A) +> 3. 
A happens-before X, and X happens-before B (transitivity) + +There is one more rule required for these to be useful, and that is _release +sequences_: after a release store is performed on an atomic, happens-before +arrows will connect together each subsequent value of the atomic as long as the +new value is caused by an RMW and not just a plain store (this means any +subsequent normal store, no matter the ordering, will end the sequence). + +> In the C++11 memory model, any subsequent store by the same thread that +> performed the original `Release` store would also contribute to the release +> sequence. However, this was removed in C++20 for simplicity and better +> optimizations and so **must not** be relied upon. + +With those rules in mind, converting Thread 1’s second store to use a `Release` +ordering as well as converting Thread 2’s CAS to use an `Acquire` ordering +allows us to effectively draw that arrow we needed before: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 │ ┌───→ cas │ +╰───╥───╯ │ └───────┘ ┌┼╌╌╌┤ │ ╰───╥───╯ +╭───⇓───╮ └─┬───────┐ ├┼╌╌╌┤ │ ╭───⇓───╮ +│ += 1; ├╌┐ │ true │ ┊│ 1 ├╌│╌╌╌┤ guard │ +╰───╥───╯ ┊ └───────┘ ┊└───┘ │ ╰───╥───╯ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮ +│ store ├───↘───────┐ │ ┌─┤ store │ +╰───────╯ │ false │ │ │ ╰───────╯ + └───┬───┘ │ │ + ┌───↓───┬─────────┘ │ + │ true │ │ + └───────┘ │ + ┌───────┬───────────┘ + │ false │ + └───────┘ +``` + +We now can trace back along the reverse direction of arrows from the `guard` +bubble to the `+= 1` bubble; we have established that Thread 2’s load +happens-after the `+= 1` side effect, because Thread 2’s CAS synchronizes-with +Thread 1’s store. This both avoids the data race and gives the guarantee that +`1` will be always read by Thread 2 (as long as it locks after Thread 1, of +course). + +However, that is not the only execution of the program possible. 
Even with this
+setup, there is another execution that can also cause UB: if Thread 2 locks the
+mutex before Thread 1 does.
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
+│ cas ├───┐ │ false │┌──│ 0 │────→ cas │
+╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯
+╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮
+│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │
+╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
+╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
+│ store ├─┐ │ ┌───────┬────────────┤ store │
+╰───────╯ │ │ │ false │ ╰───────╯
+ │ │ └───────┘
+ │ └─┬───────┐
+ │ │ true │
+ │ └───────┘
+ └───↘───────┐
+ │ false │
+ └───────┘
+```
+
+Once again `guard` has multiple options for values to read. This one’s a bit
+more counterintuitive than the previous one, since it requires “travelling
+forward in time” to understand why the `1` is even there in the first place —
+but since the abstract machine has no concept of time, it’s just as valid a
+source of UB as any other execution.
+
+Luckily, we’ve already solved this problem once, so it’s easy to solve again:
+just like before, we’ll have the CAS become acquire and the store become
+release, and then we can use the second coherence rule from before to follow
+_forward_ the arrow from the `guard` bubble all the way to the `+= 1;`,
+determining that it is only possible for that read to see `0` as its value, as
+in the execution below.
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
+│ cas ←───┐ │ false │┌──│ 0 ├╌┐──→ cas │
+╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ┊ ╰───╥───╯
+╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ┊ ╭───⇓───╮
+│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ └─╌┤ guard │
+╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
+╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
+│ store ├─┐ │ ┌───────↙────────────┤ store │
+╰───────╯ │ │ │ false │ ╰───────╯
+ │ │ └───┬───┘
+ │ └─┬───↓───┐
+ │ │ true │
+ │ └───────┘
+ └───↘───────┐
+ │ false │
+ └───────┘
+```
+
+This leads us to the proper memory orderings for any mutex (and other locks like
+RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So
+let’s go back and update our original mutex definition with this knowledge.
+
+But wait, `compare_exchange` takes two ordering parameters, not just one! That’s
+right — it also takes a second one to apply when the exchange fails (in our case,
+when the mutex is already locked). But we don’t need an `Acquire` here, since in
+that case we won’t be reading from the `data` value anyway, so we’ll just stick
+with `Relaxed`.
+
+```rust,ignore
+impl<T> Mutex<T> {
+    pub fn lock(&self) -> Option<Guard<'_, T>> {
+        match self.locked.compare_exchange(
+            false,
+            true,
+            atomic::Ordering::Acquire,
+            atomic::Ordering::Relaxed,
+        ) {
+            Ok(_) => Some(Guard(self)),
+            Err(_) => None,
+        }
+    }
+}
+
+impl<T> Drop for Guard<'_, T> {
+    fn drop(&mut self) {
+        self.0.locked.store(false, atomic::Ordering::Release);
+    }
+}
+```
+
+Note that similarly to how atomic operations only make sense when paired with
+other atomic operations on the same locations, `Acquire` only makes sense when
+paired with `Release` and vice versa. That is, both an `Acquire` with no
+corresponding `Release` and a `Release` with no corresponding `Acquire` are
+useless, since the arrows will be unable to connect to anything.
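To see these orderings doing real work, here is a self-contained sketch of the mutex developed above, filled out with the `Deref`/`DerefMut` impls that the listing omitted and a `Sync` impl so it can actually be shared between threads (both are our additions, not part of the chapter's listing). Four threads each take the lock a thousand times and increment the protected counter:

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// Safety: the Acquire/Release lock protocol ensures exclusive access to `data`.
unsafe impl<T: Send> Sync for Mutex<T> {}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self { locked: AtomicBool::new(false), data: UnsafeCell::new(data) }
    }

    pub fn lock(&self) -> Option<Guard<'_, T>> {
        match self.locked.compare_exchange(
            false,
            true,
            Ordering::Acquire, // success: synchronize with the previous unlock
            Ordering::Relaxed, // failure: we won't touch `data`, Relaxed is fine
        ) {
            Ok(_) => Some(Guard(self)),
            Err(_) => None,
        }
    }
}

pub struct Guard<'mutex, T>(&'mutex Mutex<T>);

impl<T> Deref for Guard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.0.data.get() }
    }
}

impl<T> DerefMut for Guard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.0.data.get() }
    }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        // Release: publish our writes to whoever acquires the lock next.
        self.0.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let mutex = Arc::new(Mutex::new(0_u32));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mutex = Arc::clone(&mutex);
            thread::spawn(move || {
                for _ in 0..1000 {
                    // Spin until the lock is free.
                    loop {
                        if let Some(mut guard) = mutex.lock() {
                            *guard += 1;
                            break;
                        }
                        std::hint::spin_loop();
                    }
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*mutex.lock().unwrap(), 4000);
}
```

If the unlock were `Relaxed` instead of `Release`, the increments themselves would race exactly as described in the diagrams above.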
diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md new file mode 100644 index 00000000..979f102b --- /dev/null +++ b/src/atomics/atomics.md @@ -0,0 +1,125 @@ +# Atomics + +Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not +due to this model being particularly excellent or easy to understand. Indeed, +this model is quite complex and known to have [several flaws][C11-busted]. +Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad +at modeling atomics. At very least, we can benefit from existing tooling and +research around the C/C++ memory model. +(You'll often see this model referred to as "C/C++11" or just "C11". C just copies +the C++ memory model; and C++11 was the first version of the model but it has +received some bugfixes since then.) + +Trying to fully explain the model in this book is fairly hopeless. It's defined +in terms of madness-inducing causality graphs that require a full book to +properly understand in a practical way. If you want all the nitty-gritty +details, you should check out the [C++ specification][C++-model] — +note that Rust atomics correspond to C++’s `atomic_ref`, since Rust allows +accessing atomics via non-atomic operations when it is safe to do so. +In this section we aim to give an informal overview of the topic to cover the +common problems that Rust developers face. + +## Motivation + +The C++ memory model is very large and confusing with lots of seemingly +arbitrary design decisions. To understand the motivation behind this, it can +help to look at what got us in this situation in the first place. There are +three main factors at play here: + +1. Users of the language, who want fast, cross-platform code; +2. compilers, who want to optimize code to make it fast; +3. and the hardware, which is ready to unleash a wrath of inconsistent chaos on + your program at a moment's notice. 
+
+The memory model is fundamentally about trying to bridge the gap between these
+three, allowing users to write the algorithms they want while the compiler and
+hardware perform the arcane magic necessary to make them run fast.
+
+### Compiler Reordering
+
+Compilers fundamentally want to be able to do all sorts of complicated
+transformations to reduce data dependencies and eliminate dead code. In
+particular, they may radically change the actual order of events, or make events
+never occur! If we write something like:
+
+
+```rust,ignore
+x = 1;
+y = 3;
+x = 2;
+```
+
+The compiler may conclude that it would be best if your program did:
+
+
+```rust,ignore
+x = 2;
+y = 3;
+```
+
+This has inverted the order of events and completely eliminated one event.
+From a single-threaded perspective this is completely unobservable: after all
+the statements have executed we are in exactly the same state. But if our
+program is multi-threaded, we may have been relying on `x` to actually be
+assigned to 1 before `y` was assigned. We would like the compiler to be
+able to make these kinds of optimizations, because they can seriously improve
+performance. On the other hand, we'd also like to be able to depend on our
+program *doing the thing we said*.
+
+### Hardware Reordering
+
+On the other hand, even if the compiler totally understood what we wanted and
+respected our wishes, our hardware might instead get us in trouble. Trouble
+comes from CPUs in the form of memory hierarchies. There is indeed a global
+shared memory space somewhere in your hardware, but from the perspective of each
+CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
+with its local cache of the data and only go through all the anguish of
+talking to shared memory when it doesn't actually have that memory in
+cache.
+
+After all, that's the whole point of the cache, right?
If every read from the +cache had to run back to shared memory to double check that it hadn't changed, +what would the point be? The end result is that the hardware doesn't guarantee +that events that occur in some order on *one* thread, occur in the same +order on *another* thread. To guarantee this, we must issue special instructions +to the CPU telling it to be a bit less smart. + +For instance, say we convince the compiler to emit this logic: + +```text +initial state: x = 0, y = 1 + +THREAD 1 THREAD 2 +y = 3; if x == 1 { +x = 1; y *= 2; + } +``` + +Ideally this program has 2 possible final states: + +* `y = 3`: (thread 2 did the check before thread 1 completed) +* `y = 6`: (thread 2 did the check after thread 1 completed) + +However there's a third potential state that the hardware enables: + +* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) + +It's worth noting that different kinds of CPU provide different guarantees. It +is common to separate hardware into two categories: strongly-ordered and +weakly-ordered, where strongly-ordered hardware implements weak orderings like +`Relaxed` using strong orderings like `Acquire`, while weakly-ordered hardware +makes use of the optimization potential that weak orderings like `Relaxed` give. +Most notably, x86/64 provides strong ordering guarantees, while ARM provides +weak ordering guarantees. This has two consequences for concurrent programming: + +* Asking for stronger guarantees on strongly-ordered hardware may be cheap or + even free because they already provide strong guarantees unconditionally. + Weaker guarantees may only yield performance wins on weakly-ordered hardware. + +* Asking for guarantees that are too weak on strongly-ordered hardware is + more likely to *happen* to work, even though your program is strictly + incorrect. If possible, concurrent algorithms should be tested on + weakly-ordered hardware. 
+ +[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf +[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order diff --git a/src/atomics/fences.md b/src/atomics/fences.md new file mode 100644 index 00000000..6bc08c4f --- /dev/null +++ b/src/atomics/fences.md @@ -0,0 +1,257 @@ +# Fences + +As well as loads, stores, and RMWs, there is one more kind of atomic operation +to be aware of: fences. Fences can be triggered by the +[`core::sync::atomic::fence`] function, which accepts a single ordering +parameter and returns nothing. They don’t do anything on their own, but can be +thought of as events that strengthen the ordering of nearby atomic operations. + +## Acquire fences + +The most common kind of fence is an _acquire fence_, which can be triggered in +three different ways: +1. `atomic::fence(atomic::Ordering::Acquire)` +1. `atomic::fence(atomic::Ordering::AcqRel)` +1. `atomic::fence(atomic::Ordering::SeqCst)` + +An acquire fence retroactively makes every single non-`Acquire` operation that +was sequenced-before it act like an `Acquire` operation that occurred at the +fence — in other words, it causes every prior `Release`d value that was +previously loaded on the thread to synchronize-with the fence. 
For example, the +following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); + +// t_2 +let value = X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +Can result in two possible executions: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ + t_1 X t_2 ┃ t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ ┃ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ ┃ │ store ├─┐ │ 0 ├───┤ load │ +╰───────╯ │ └───┘ │ ╰───╥───╯ ┃ ╰───────╯ │ └───┘ ╰───╥───╯ + └─↘───┐ │ ╭───⇓───╮ ┃ └─↘───┐ ╭───⇓───╮ + │ 1 ├─┘┌→ fence │ ┃ │ 1 │ │ fence │ + └───┴──┘╰───────╯ ┃ └───┘ ╰───────╯ +``` + +In the first execution, `t_1`’s store synchronizes-with and therefore +happens-before `t_2`’s fence due to the prior load, but note that it does _not_ +happen-before `t_2`’s load. + +Acquire fences work on any number of atomics, and on release sequences too. A +more complex example is as follows: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); +static Y: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); +X.fetch_add(1, atomic::Ordering::Relaxed); + +// t_2 +Y.store(1, atomic::Ordering::Release); + +// t_3 +let x = X.load(atomic::Ordering::Relaxed); +let y = Y.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +This can result in an execution like so: + +```text + t_1 X t_3 Y t_2 +╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ │ 0 │ ┌─┤ store │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ ┌───↙─┘ +│ rmw ├─┐ │ 1 │ │ │ load ├───┤ 1 │ +╰───────╯ │ └─┬─┘ │ ╰───╥───╯ ┌─┴───┘ + └─┬─↓─┐ │ ╭───⇓───╮ │ + │ 2 ├─┘┌→ fence ←─┘ + └───┴──┘╰───────╯ +``` + +There are two common scenarios in which acquire fences are used: +1. When an `Acquire` ordering is only necessary when a specific value is loaded. 
+   For example, you may only wish to acquire when an `initialized` boolean is
+   `true`, since otherwise you won’t be reading the shared state at all. In
+   this case, you can load with a `Relaxed` ordering and then issue an
+   `Acquire` fence afterward only if that condition is met, which can aid in
+   performance sometimes (since the acquire operation is avoided when
+   `initialized == false`).
+2. When several `Acquire` operations on different locations need to be performed
+   in a row, but individually each operation doesn’t need `Acquire` ordering;
+   it is often faster to perform all the loads as `Relaxed` first and use a
+   single `Acquire` fence at the end than it is to make each one separately use
+   `Acquire`.
+
+## Release fences
+
+Release fences are the natural complement to acquire fences, and they similarly
+can be triggered in three different ways:
+1. `atomic::fence(atomic::Ordering::Release)`
+1. `atomic::fence(atomic::Ordering::AcqRel)`
+1. `atomic::fence(atomic::Ordering::SeqCst)`
+
+Release fences convert every subsequent atomic access in the same thread into a
+release operation that has its arrow starting from the fence — in other words,
+every `Acquire` operation that sees a value that was written by the fence’s
+thread after the release fence will synchronize-with the release fence. 

For +example, the following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Acquire); +``` + +Can result in this execution: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─→ load │ +╰───╥───╯ │ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ +│ store ├───┤ 1 ├─┘ +╰───────╯ └───┘ +``` + +As well as it being possible for a release fence to synchronize-with an acquire +load (fence–atomic synchronization) and a release store to synchronize-with an +acquire fence (atomic–fence synchronization), it is also possible for release +fences to synchronize with acquire fences (fence–fence synchronization). In this +code snippet, only fences and `Relaxed` operations are used to establish a +happens-before relation (in some executions): + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +The execution with the relation looks like this: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─┤ load │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ +│ store ├───┤ 1 ├─┘┌→ fence │ +╰───────╯ └───┴──┘╰───────╯ +``` + +Like with acquire fences, release fences can be used to optimize over a series +of atomic stores that don’t individually need to be `Release`, since in some +conditions and on some architectures it’s faster to put a single release fence +at the start and use `Relaxed` from that point on than it is to use `Release` +every time. + +## `AcqRel` fences + +`AcqRel` fences are just the combined behaviour of an `Acquire` fence and a +`Release` fence in one operation. 
There isn’t much special to note about them,
+other than that they behave more like an acquire fence followed by a release
+fence than the other way around, which is useful to know in situations like the
+following:
+
+```text
+   t_1        X        t_2        Y        t_3
+╭───────╮   ┌───┐   ╭───────╮   ┌───┐   ╭───────╮
+│   A   │   │ 0 │ ┌─┤ load  │   │ 0 │ ┌─→ load  │
+╰───╥───╯   └───┘ │ ╰───╥───╯   └───┘ │ ╰───╥───╯
+╭───⇓───╮ ┌─↘───┐ │ ╭───⇓───╮┌──↘───┐ │ ╭───⇓───╮
+│ store ├─┘ │ 1 ├─┘┌→ fence ├┘┌─┤ 1 ├─┘ │   B   │
+╰───────╯   └───┴──┘╰───╥───╯ │ └───┘   ╰───────╯
+                    ╭───⇓───╮ │
+                    │ store ├─┘
+                    ╰───────╯
+```
+
+Here, A happens-before B, which is singularly due to the `AcqRel` fence’s
+ability to “carry over” happens-before relations within itself.
+
+## `SeqCst` fences
+
+`SeqCst` fences are the strongest kind of fence. They first of all inherit the
+behaviour from an `AcqRel` fence, meaning they have both acquire and release
+semantics at the same time, but being `SeqCst` operations they also participate
+in _S_. Just as with all other `SeqCst` operations, their placement in _S_ is
+primarily determined by strongly happens-before relations (including the
+[mixed-`SeqCst` caveat] that comes with it), which then gives additional
+guarantees to your code.
+
+Namely, the power of `SeqCst` fences can be summarized in three points:
+
+* Everything that happens-before a `SeqCst` fence is not coherence-ordered-after
+  any `SeqCst` operation that the fence precedes in _S_.
+* Everything that happens-after a `SeqCst` fence is not coherence-ordered-before
+  any `SeqCst` operation that the fence succeeds in _S_.
+* Everything that happens-before a `SeqCst` fence X is not
+  coherence-ordered-after anything that happens-after another `SeqCst` fence
+  Y, if X precedes Y in _S_. 
+ +> In C++11, the above three statements were similar, except they only talked +> about what was sequenced-before and sequenced-after the `SeqCst` fences; C++20 +> strengthened this to also include happens-before, because in practice this +> theoretical optimization was not being exploited by anybody. However do note +> that as of the time of writing, [Miri only implements the old, weaker +> semantics][miri scfix] and so you may see false positives when testing with +> it. + +The “motivating use-case” for `SeqCst` demonstrated in the `SeqCst` chapter can +also be rewritten to use exclusively `SeqCst` fences and `Relaxed` operations, +by inserting fences in between the operations in the two threads: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ *fence* │ │ true │ │ true │ │ *fence* │ +╰────╥────╯ └───────┘ └───────┘ ╰────╥────╯ +╭────⇓────╮ ╭────⇓────╮ +│ load Y ├─? ?─┤ load X │ +╰─────────╯ ╰─────────╯ +``` + +There are two executions to consider here, depending on which way round the +fences appear in _S_. Should `a`’s fence appear first, the fence–fence `SeqCst` +guarantee tells us that `b`’s load of `X` is not coherence-ordered-after `a`’s +store of `X`, which forbids `b`’s load of `X` from seeing the value `false`. The +same logic can be applied should the fences appear the other way around, proving +that at least one thread must load `true` in the end. 
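As a sketch (the thread structure and variable names here are our own, not part of the chapter), the fence-only version can be written out in Rust like so; whichever way round the fences land in _S_, at least one of the two `Relaxed` loads must return `true`:

```rust
use std::sync::atomic::{self, AtomicBool};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    // Only `Relaxed` accesses, with a `SeqCst` fence between the store and
    // the load on each thread.
    let a = thread::spawn(|| {
        X.store(true, atomic::Ordering::Relaxed);
        atomic::fence(atomic::Ordering::SeqCst);
        Y.load(atomic::Ordering::Relaxed)
    });
    let b = thread::spawn(|| {
        Y.store(true, atomic::Ordering::Relaxed);
        atomic::fence(atomic::Ordering::SeqCst);
        X.load(atomic::Ordering::Relaxed)
    });
    let a_loaded_y = a.join().unwrap();
    let b_loaded_x = b.join().unwrap();
    // The fence–fence `SeqCst` guarantee forbids both loads seeing `false`.
    assert!(a_loaded_y || b_loaded_x);
}
```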
+ +[`core::sync::atomic::fence`]: https://doc.rust-lang.org/stable/core/sync/atomic/fn.fence.html +[mixed-`SeqCst` caveat]: seqcst.md#the-mixed-seqcst-special-case +[miri scfix]: https://github.com/rust-lang/miri/issues/2301 diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md new file mode 100644 index 00000000..4a4ce3d6 --- /dev/null +++ b/src/atomics/multithread.md @@ -0,0 +1,291 @@ +# Multithreaded Execution + +When you write Rust code to run on your computer, it may surprise you but you’re +not actually writing Rust code to run on your computer — instead, you’re writing +Rust code to run on the _abstract machine_ (or AM for short). The abstract +machine, to be contrasted with the physical machine, is an abstract +representation of a theoretical computer: it doesn’t actually exist _per se_, +but the combination of a compiler, target architecture and target operating +system is capable of emulating a subset of its possible behaviours. + +The Abstract Machine has a few properties that are essential to understand: +1. It is architecture and OS-independent. The Abstract Machine doesn’t care + whether you’re on x86_64 or iOS or a Nintendo 3DS, the rules are the same + for everyone. This enables you to write code without having to think about + what the underlying system does or how it does it, as long as you obey the + Abstract Machine’s rules you know you’ll be fine. +1. It is the lowest common denominator of all supported computer systems. This + means it is allowed to result in executions no sane computer would actually + generate in real life. It is also purposefully built with forward + compatibility in mind, giving compilers the opportunity to make better and + more aggressive optimizations in the future. 
As a result, it can be quite + hard to test code, especially if you’re on a system that exploits fewer of + the AM’s allowed semantics, so it is highly recommended to utilize tools + that intentionally produce these executions like [Loom] and [Miri]. +1. Its model is highly formalized and not representative of what goes on + underneath. Because C++ needs to be defined by a formal specification and + not just hand-wavy rules about “this is what is allowed and this is what + isn’t”, the Abstract Machine defines things in a very mathematical and, + well, _abstract_, way; instead of saying things like “the compiler is + allowed to do X” it will find a way to define the system such that the + compiler’s ability to do X simply follows as a natural consequence. This + makes it very elegant and keeps the mathematicians happy, but you should + keep in mind that this is not how computers actually function, it is merely + a representation of it. + +With that out of the way, let’s look into how the C++20 Abstract Machine is +actually defined. + +The first important thing to understand is that **the abstract machine has no +concept of time**. You might expect there to be a single global ordering of +events across the program where each happens at the same time or one after the +other, but under the abstract model no such ordering exists; instead, a possible +execution of the program must be treated as a single event that happens +instantaneously. There is never any such thing as “now”, or a “latest value”, +and using that terminology will only lead you to more confusion. Of course, in +reality there does exist a concept of time, but you must keep in mind that +you’re not programming for the hardware, you’re programming for the AM. + +However, while no global ordering of operations exists _between_ threads, there +does exist a single total ordering _within_ each thread, which is known as its +_sequence_. 
For example, given this simple Rust program: + +```rust +println!("A"); +println!("B"); +``` + +its sequence during one possible execution can be visualized like so: + +```text +╭───────────────╮ +│ println!("A") │ +╰───────╥───────╯ +╭───────⇓───────╮ +│ println!("B") │ +╰───────────────╯ +``` + +That double arrow in between the two boxes (`⇒`) represents that the second +statement is _sequenced-after_ the first (and similarly the first statement is +_sequenced-before_ the second). This is the strongest kind of ordering guarantee +between any two operations, and only comes about when those two operations +happen one after the other and on the same thread. + +If we add a second thread to the mix: + +```rust +// Thread 1: +println!("A"); +println!("B"); +// Thread 2: +eprintln!("01"); +eprintln!("02"); +``` + +it will simply coexist in parallel, with each thread getting its own independent +sequence: + +```text + Thread 1 Thread 2 +╭───────────────╮ ╭─────────────────╮ +│ println!("A") │ │ eprintln!("01") │ +╰───────╥───────╯ ╰────────╥────────╯ +╭───────⇓───────╮ ╭────────⇓────────╮ +│ println!("B") │ │ eprintln!("02") │ +╰───────────────╯ ╰─────────────────╯ +``` + +We can say that the prints of `A` and `B` are _unsequenced_ with regard to the +prints of `01` and `02` that occur in the second thread, since they have no +sequenced-before arrows connecting the boxes together. + +Note that these diagrams are **not** a representation of multiple things that +_could_ happen at runtime — instead, this diagram describes exactly what _did_ +happen when the program ran once. This distinction is key, because it highlights +that even the lowest-level representation of a program’s execution does not have +a global ordering between threads; those two disconnected chains are all there +is. + +Now let’s make things more interesting by introducing some shared data, and have +both threads read it. 
+
+```rust
+// Initial state
+let data = 0;
+// Thread 1:
+println!("{data}");
+// Thread 2:
+eprintln!("{data}");
+```
+
+Each memory location, similarly to threads, can be shown as another column on
+our diagram, but holding values instead of instructions, and each access (read
+or write) manifests as a line from the instruction that performed the access to
+the associated value in the column. So this code can produce (and is in fact
+guaranteed to produce) the following execution:
+
+```text
+Thread 1     data     Thread 2
+╭──────╮    ┌────┐    ╭──────╮
+│ data ├╌╌╌╌┤ 0  ├╌╌╌╌┤ data │
+╰──────╯    └────┘    ╰──────╯
+```
+
+That is, both threads read the same value of `0` from `data`, and the two
+operations are unsequenced — they have no relative ordering between them.
+
+That’s reads done, so we’ll look at the other kind of data access next: writes.
+We’ll also return to a single thread for now, just to keep things simple.
+
+```rust
+let mut data = 0;
+data = 1;
+```
+
+Here, we have a single variable that the main thread writes to once — this means
+that in its lifetime, it holds two values, first `0`, and then `1`.
+Diagrammatically, this code’s execution can be represented like so:
+
+```text
+ Thread 1        data
+╭───────╮       ┌────┐
+│  = 1  ├╌╌╌┐   │ 0  │
+╰───────╯   ├╌╌╌┼╌╌╌╌┤
+            └╌╌╌┼╌╌╌╌┤
+                │ 1  │
+                └────┘
+```
+
+Note the use of dashed padding in between the values of `data`’s column. Those
+spaces won’t ever contain a value, but they’re used to represent an
+unsynchronized (non-atomic) write — it is garbage data and attempting to read it
+would result in a data race.
+
+Now let’s put all of our knowledge thus far together, and make a program that
+both reads _and_ writes data — woah, scary!
+
+```rust
+let mut data = 0;
+data = 1;
+println!("{data}");
+data = 2;
+```
+
+Working out executions of code like this is rather like solving a Sudoku puzzle:
+you must first lay out all the facts that you know, and then fill in the blanks
+with logical reasoning. 
The initial information we’ve been given is both the +initial value of `data` and the sequential order of Thread 1; we also know that +over its lifetime, `data` takes on a total of three different values that were +caused by two different non-atomic writes. This allows us to start drawing out +some boxes: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌? │ 0 │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ data ├╌? │ ? │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ = 2 ├╌? │ ? │ +╰───────╯ └────┘ +``` + +We know all of those lines need to be joined _somewhere_, but we don’t quite +know _where_ yet. This is where we need to bring in our first rule, a rule that +universally governs all accesses to every location in memory: + +> From the point at which the access occurs, find every other point that can be +> reached by following the reverse direction of arrows, then for each one of +> those, take a single step across every line that connects to the relevant +> memory location. **It is not allowed for the access to read or write any value +> that appears above any one of these points**. + +In our case, there are two potential executions: one, where the first write +corresponds to the first value in `data`, and two, where the first write +corresponds to the second value in `data`. Considering the second case for a +moment, it would also force the second write to correspond to the first +value in `data`. Therefore its diagram would look something like this: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌┐ │ 0 │ +╰───╥───╯ ┊ ┌╌╌┼╌╌╌╌┤ +╭───⇓───╮ ┊ ├╌╌┼╌╌╌╌┤ +│ data ├╌?┊ ┊ │ 2 │ +╰───╥───╯ ├╌┼╌╌┼╌╌╌╌┤ +╭───⇓───╮ └╌┼╌╌┼╌╌╌╌┤ +│ = 2 ├╌╌╌╌┘ │ 1 │ +╰───────╯ └────┘ +``` + +However, that second line breaks the rule we just established! 
Following up the +arrows from the third operation in Thread 1, we reach the first operation, and +from there we can take a single step to reach the space in between the `2` and +the `1`, which excludes the third access from writing any value above that point +— including the `2` that it is currently writing! + +So evidently, this execution is no good. We can therefore conclude that the only +possible execution of this program is the other one, in which the `1` appears +above the `2`: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌┐ │ 0 │ +╰───╥───╯ ├╌╌┼╌╌╌╌┤ +╭───⇓───╮ └╌╌┼╌╌╌╌┤ +│ data ├╌? │ 1 │ +╰───╥───╯ ┌╌╌┼╌╌╌╌┤ +╭───⇓───╮ ├╌╌┼╌╌╌╌┤ +│ = 2 ├╌╌┘ │ 2 │ +╰───────╯ └────┘ +``` + +Now to sort out the read operation in the middle. We can use the same rule as +before to trace up to the first write and rule out us reading either the `0` +value or the garbage that exists between it and `1`, but how do we choose +between the `1` and the `2`? Well, as it turns out there is a complement to the +rule we already defined which gives us the exact answer we need: + +> From the point at which the access occurs, find every other point that can be +> reached by following the _forward_ direction of arrows, then for each one of +> those, take a single step across every line that connects to the relevant +> memory location. **It is not allowed for the access to read or write any value +> that appears below any one of these points**. + +Using this rule, we can follow the arrow downwards and then across and finally +rule out `2` as well as the garbage before it. 
This leaves us with exactly _one_
+value that the read operation can return, and exactly one possible execution
+guaranteed by the Abstract Machine:
+
+```text
+ Thread 1      data
+╭───────╮     ┌────┐
+│  = 1  ├╌╌┐  │ 0  │
+╰───╥───╯  ├╌╌┼╌╌╌╌┤
+╭───⇓───╮  └╌╌┼╌╌╌╌┤
+│ data  ├╌╌╌╌╌┤ 1  │
+╰───╥───╯  ┌╌╌┼╌╌╌╌┤
+╭───⇓───╮  ├╌╌┼╌╌╌╌┤
+│  = 2  ├╌╌┘  │ 2  │
+╰───────╯     └────┘
+```
+
+These two rules combined make up the more generalized rule known as _coherence_,
+which is put in place to guarantee that a thread will never see a value earlier
+than the last one it read, or later than one it will write in the future.
+Coherence is basically required for any program to act in a sane way, so luckily
+the C++20 standard guarantees it as one of its most fundamental principles.
+
+You might be thinking that all this has been is the longest, most convoluted
+explanation ever of the most basic intuitive semantics of programming — and
+you’d be absolutely right. But it’s essential to grasp these fundamentals,
+because once you have this model in mind, the extension into multiple threads
+and the complicated semantics of real atomics becomes completely natural.
+
+[Loom]: https://docs.rs/loom
+[Miri]: https://github.com/rust-lang/miri
diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md
new file mode 100644
index 00000000..1ee18193
--- /dev/null
+++ b/src/atomics/relaxed.md
@@ -0,0 +1,452 @@
+# Relaxed
+
+Now we’ve got single-threaded mutation semantics out of the way, we can try
+reintroducing a second thread. We’ll have one thread perform a write to the
+memory location, and a second thread read from it, like so:
+
+```rust
+// Initial state
+let mut data = 0;
+// Thread 1:
+data = 1;
+// Thread 2:
+println!("{data}");
+```
+
+Of course, any Rust programmer will immediately tell you that this code doesn’t
+compile, and indeed it definitely does not, and for good reason. But suspend
+your disbelief for a moment, and imagine what would happen if it did. 
Let’s draw +a diagram, leaving out the reading lines for now: + +```text +Thread 1 data Thread 2 +╭───────╮ ┌────┐ ╭───────╮ +│ = 1 ├╌┐ │ 0 │ ?╌┤ data │ +╰───────╯ ├╌┼╌╌╌╌┤ ╰───────╯ + └╌┼╌╌╌╌┤ + │ 1 │ + └────┘ +``` + +Unfortunately, coherence doesn’t help us in finding out where Thread 2’s line +joins up to, since there are no arrows connecting that operation to anything and +therefore we can’t immediately rule any values out. As a result, we end up +facing a situation we haven’t faced before: there is _more than one_ potential +value for Thread 2 to read. + +And this is where we encounter the big limitation with unsynchronized data +accesses: the price we pay for their speed and optimization capability is that +this situation is considered **Undefined Behavior**. For an unsynchronized read +to be acceptable, there has to be _exactly one_ potential value for it to read, +and when there are multiple like in this situation it is considered a data race. + +So what can we do about this? Well, two things need to be changed. First of all, +Thread 1 has to use an atomic store instead of an unsynchronized write, and +secondly Thread 2 has to use an atomic load instead of an unsynchronized read. +You’ll also notice that all the atomic functions accept one (and sometimes two) +parameters of `atomic::Ordering`s — we’ll explore the details of the differences +between them later, but for now we’ll use `Relaxed` because it is by far the +simplest of the lot. 
+
+```rust
+# use std::sync::atomic::{self, AtomicU32};
+// Initial state
+let data = AtomicU32::new(0);
+// Thread 1:
+data.store(1, atomic::Ordering::Relaxed);
+// Thread 2:
+data.load(atomic::Ordering::Relaxed);
+```
+
+The use of the atomic store provides one additional ability in comparison to an
+unsynchronized store, and that is that there is no “in-between” state between
+the old and new values — instead, it immediately updates, resulting in a diagram
+that looks a bit more like this:
+
+```text
+Thread 1     data
+╭───────╮   ┌────┐
+│  = 1  ├─┐ │ 0  │
+╰───────╯ │ └────┘
+          └─┬────┐
+            │ 1  │
+            └────┘
+```
+
+We have now established a _modification order_ for `data`: a total, ordered list
+of distinct, separated values that it takes over its lifetime.
+
+On the loading side, we also obtain one additional ability: when there are
+multiple possible values to choose from in the modification order, instead of it
+triggering UB, exactly one (but it is unspecified which) value is chosen. This
+means that there are now _two_ potential executions of our program, with no way
+for us to control which one occurs:
+
+```text
+     Possible Execution 1      ┃      Possible Execution 2
+                               ┃
+Thread 1     data    Thread 2  ┃ Thread 1     data    Thread 2
+╭───────╮   ┌────┐   ╭───────╮ ┃ ╭───────╮   ┌────┐   ╭───────╮
+│ store ├─┐ │ 0  ├───┤ load  │ ┃ │ store ├─┐ │ 0  │ ┌─┤ load  │
+╰───────╯ │ └────┘   ╰───────╯ ┃ ╰───────╯ │ └────┘ │ ╰───────╯
+          └─┬────┐             ┃           └─┬────┐ │
+            │ 1  │             ┃             │ 1  ├─┘
+            └────┘             ┃             └────┘
+```
+
+Note that **both sides must be atomic to avoid the data race**: if only the
+writing side used atomic operations, the reading side would still have multiple
+values to choose from (UB), and if only the reading side used atomic operations
+it could end up reading the garbage data “in-between” `0` and `1` (also UB).
+
+> **NOTE:** This description of why both sides need to be atomic
+> operations, while neat and intuitive, is not strictly correct: in reality the
+> answer is simply “because the spec says so”. 
However, it is functionally
+> equivalent to the real rules, so it can aid in understanding.
+
+## Read-modify-write operations
+
+Loads and stores are pretty neat in avoiding data races, but you can’t get very
+far with them. For example, suppose you wanted to implement a global shared
+counter that can be used to assign unique IDs to objects. Naïvely, you might try
+to write code like this:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> u64 {
+    let value = COUNTER.load(atomic::Ordering::Relaxed);
+    COUNTER.store(value + 1, atomic::Ordering::Relaxed);
+    value
+}
+```
+
+But then calling that function from multiple threads opens you up to an
+execution like the one below, which results in two threads obtaining the same
+ID (note that the duplication of `1` in the modification order is intentional;
+even if two values are the same, they always get separate entries in the order
+if they were caused by different accesses):
+
+```text
+Thread 1   COUNTER   Thread 2
+╭───────╮   ┌───┐   ╭───────╮
+│ load  ├───┤ 0 ├───┤ load  │
+╰───╥───╯   └───┘   ╰───╥───╯
+╭───⇓───╮ ┌─┬───┐   ╭───⇓───╮
+│ store ├─┘ │ 1 │ ┌─┤ store │
+╰───────╯   └───┘ │ ╰───────╯
+            ┌───┬─┘
+            │ 1 │
+            └───┘
+```
+
+This is known as a **race condition** — a logic error in a program caused by a
+specific unintended execution of concurrent code. Note that this is distinct
+from a _data race_: while a data race is caused by two threads performing
+unsynchronized operations at the same time and is always undefined behaviour,
+race conditions are totally OK and defined behaviour from the AM’s perspective,
+but are only harmful because the programmer didn’t expect it to be possible. You
+can think of the distinction between the two as analogous to the difference
+between indexing out-of-bounds and indexing in-bounds, but to the wrong element:
+both are bugs, but only one is universally a bug, and the other is merely a
+logic problem.
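The unlucky interleaving in the diagram above can even be replayed by hand on a single thread (the sequencing below is artificial, and exists purely to illustrate the logic error — there is no data race and no undefined behaviour anywhere in it):

```rust
use std::sync::atomic::{self, AtomicU64};

fn main() {
    let counter = AtomicU64::new(0);
    // Both “threads” load before either stores — the racy interleaving
    // from the diagram, performed sequentially.
    let a = counter.load(atomic::Ordering::Relaxed);
    let b = counter.load(atomic::Ordering::Relaxed);
    counter.store(a + 1, atomic::Ordering::Relaxed);
    counter.store(b + 1, atomic::Ordering::Relaxed);
    // Both calls handed out the same ID: a race condition, not a data race.
    assert_eq!(a, b);
}
```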
+ +Technically, I believe it is _possible_ to solve this problem with just loads +and stores, if you try hard enough and use several atomics. But luckily, you +don’t have to because there also exists another kind of operation, the +read-modify-write, which is specifically suited to this purpose. + +A read-modify-write operation (shortened to RMW) is a special kind of atomic +operation that reads, changes and writes back a value _in one step_. This means +that there are guaranteed to exist no other values in the modification order in +between the read and the write; it happens as a single operation. I would also +like to point out that this is true of **all** atomic orderings, since a common +misconception is that the `Relaxed` ordering somehow negates this guarantee. + +> Another common confusion about RMWs is that they are guaranteed to “see the +> latest value” of an atomic, which I believe came from a misinterpretation of +> the C++ specification and was later spread by rumour. Of course, this makes no +> sense, since atomics have no latest value due to the lack of the concept of +> time. The original statement in the specification was actually just specifying +> that atomic RMWs are atomic: they only consider the directly previous value in +> the modification order and not any value before it, and gave no additional +> guarantee. + +There are many different RMW operations to choose from, but the one most +appropriate for this use case is `fetch_add`, which adds a number to the atomic, +as well as returns the old value. So our code can be rewritten as this: + +```rust +# use std::sync::atomic::{self, AtomicU64}; +static COUNTER: AtomicU64 = AtomicU64::new(0); +pub fn get_id() -> u64 { + COUNTER.fetch_add(1, atomic::Ordering::Relaxed) +} +``` + +And then, no matter how many threads there are, that race condition from earlier +can never occur. 
Executions will have to look more like this:
+
+```text
+   Thread 1      COUNTER       Thread 2
+╭───────────╮     ┌───┐     ╭───────────╮
+│ fetch_add ├─┐   │ 0 │   ┌─┤ fetch_add │
+╰───────────╯ │   └───┘   │ ╰───────────╯
+              └─┬───┐     │
+                │ 1 │     │
+                └───┘     │
+                    ┌───┬─┘
+                    │ 2 │
+                    └───┘
+```
+
+There is one problem with this code, however: if `get_id()` is
+called over 18 446 744 073 709 551 615 times, the counter will overflow and it
+will start generating duplicate IDs. Of course, this won’t feasibly happen, but
+it can be problematic if you need to _prove_ that it can’t happen (e.g. for
+safety purposes) or you’re using a smaller integer type like `u32`.
+
+So we’re going to modify this function so that instead of returning a plain
+`u64` it returns an `Option<u64>`, where `None` is used to indicate that an
+overflow occurred and no more IDs could be generated. Additionally, it’s not
+enough to just return `None` once, because if there are multiple threads
+involved they will not see that result if it just occurs on a single thread —
+instead, it needs to continue to return `None` _until the end of time_ (or,
+well, this execution of the program).
+
+That means we have to do away with `fetch_add`, because `fetch_add` will always
+overflow and there’s no `checked_fetch_add` equivalent. We’ll return to our racy
+algorithm for a minute, this time thinking more about what went wrong. The steps
+look something like this:
+
+1. Load a value of the atomic
+1. Perform the checked add, propagating `None`
+1. Store the new value in the atomic
+
+The problem here is that the store does not necessarily occur directly after the
+load in the atomic’s modification order, and that leads to the races. What we
+need is some way to say, “add this new value to the modification order, but
+_only if_ it occurs directly after the value we loaded”. And luckily for us,
+there exists a function that does exactly\* this: `compare_exchange`.
+
+`compare_exchange` is a bit like a store, but instead of unconditionally storing
+the value, it will first check the value directly before the `compare_exchange`
+in the modification order to see whether it is what we expect, and if not it
+will simply tell us that and not make any changes. It is an RMW operation, so
+all of this happens fully atomically — there is no chance for a race condition.
+
+> \* It’s not quite the same, because `compare_exchange` can suffer from ABA
+> problems in which it will see a later value in the modification order that
+> just happened to be the same and succeed. For example, if the modification
+> order contained `1, 2, 1` and a thread loaded the first `1`,
+> `compare_exchange(1, 3)` could succeed in replacing either the first or second
+> `1`, giving either `1, 3, 2, 1` or `1, 2, 1, 3`.
+>
+> For some algorithms, this is problematic and needs to be taken into account
+> with additional checks; however for us, values can never be reused so we don’t
+> have to worry about it.
+
+In our case, we can simply replace the store with a compare exchange of the old
+value and itself plus one (returning `None` instead if the addition overflowed,
+to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
+that some other thread inserted a value in the modification order after the
+value we loaded. This isn’t really a problem — we can just try again and again
+until we succeed, and `compare_exchange` is even nice enough to give us the
+updated value so we don’t have to load again. Also note that after we’ve updated
+our value of the atomic, we’re guaranteed to never see the old value again, by
+the coherence rules from the previous chapter.
+
+So here’s how it looks with these changes applied:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    // Load the counter’s initial value from some place in the modification
+    // order (it doesn’t matter where, because the compare exchange makes sure
+    // that our new value appears directly after it).
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        // Attempt to add one to the atomic.
+        let res = COUNTER.compare_exchange(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        );
+        // Check what happened…
+        match res {
+            // If there was no value in between the value we loaded and our
+            // newly written value in the modification order, the compare
+            // exchange succeeded and so we are done.
+            Ok(_) => break,
+
+            // Otherwise, there was a value in between and so we need to retry
+            // the addition and continue looping.
+            Err(updated_value) => value = updated_value,
+        }
+    }
+    Some(value)
+}
+```
+
+This `compare_exchange` loop enables the algorithm to succeed even under
+contention; it will simply try again (and again and again). In the below
+execution, Thread 1 gets raced to storing its value of `1` to the counter, but
+that’s okay because it will just add `1` to the `1`, making `2`, and retry the
+compare exchange with that, eventually resulting in a unique ID.
+
+```text
+Thread 1       COUNTER      Thread 2
+╭───────╮      ┌───┐       ╭───────╮
+│ load  ├──────┤ 0 ├───────┤ load  │
+╰───╥───╯      └───┘       ╰───╥───╯
+╭───⇓───╮    ┌───┬─┐       ╭───⇓───╮
+│  cas  ├────┤ 1 │ └───────┤  cas  │
+╰───╥───╯    └───┘         ╰───────╯
+╭───⇓───╮  ┌─┬───┐
+│  cas  ├──┘ │ 2 │
+╰───────╯    └───┘
+```
+
+> `compare_exchange` is abbreviated to CAS here (which stands for
+> compare-and-swap), since that is the more general name for the operation.
It
+> is not to be confused with `compare_and_swap`, a deprecated method on Rust
+> atomics that performs the same task as `compare_exchange` but has an inferior
+> design in some ways.
+
+There are two additional improvements we can make here. First, because our
+algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
+even when there wasn’t a value inserted in the modification order in between,
+since we’ll just run it again. This allows us to switch out our call to
+`compare_exchange` with a call to the weaker `compare_exchange_weak`, which,
+unlike the former function, is allowed to _spuriously_ (i.e. randomly, from the
+programmer’s perspective) fail. This often results in better performance on
+architectures like ARM, since their `compare_exchange` is really just a loop
+around the underlying `compare_exchange_weak`. x86\_64 however will see no
+difference in performance.
+
+The second improvement is that this pattern is so common that the standard
+library even provides a helper function for it, called `fetch_update`. It
+implements the boilerplate `load`-`loop`-`match` parts for us, so all we have to
+do is provide the closure that calls `checked_add(1)` and it will all just work.
+This leads us to our final code for this example:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    COUNTER.fetch_update(
+        atomic::Ordering::Relaxed,
+        atomic::Ordering::Relaxed,
+        |value| value.checked_add(1),
+    )
+    .ok()
+}
+```
+
+These CAS loops are the absolute bread and butter of concurrent programming;
+they’re absolutely everywhere and essential to know about. Every other RMW
+operation on atomics can be (and often is, if the hardware doesn’t have a more
+efficient implementation) implemented via a CAS loop. This is why CAS is seen
+as the canonical example of an RMW — it’s pretty much the most fundamental
+operation you can get on atomics.
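+
+For illustration, here’s the earlier loop with `compare_exchange_weak`
+substituted in — a sketch only, since `fetch_update` already does effectively
+this for us. A spurious failure simply takes the `Err` arm with an unchanged
+value and retries, so the result is the same:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        match COUNTER.compare_exchange_weak(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        ) {
+            // Our increment made it in; `value` is our unique ID.
+            Ok(_) => return Some(value),
+            // Either another thread beat us to it, or the CAS failed
+            // spuriously — in both cases, retry with the current value.
+            Err(updated_value) => value = updated_value,
+        }
+    }
+}
+```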
+
+I’d also like to briefly bring attention to the atomic orderings used in this
+section. They were mostly glossed over, but we were exclusively using `Relaxed`,
+and that’s because for something as simple as a global ID counter, _you never
+need more than `Relaxed`_. The more complex cases which we’ll look at later
+definitely do need stronger orderings, but as a general rule, if:
+
+- you only have one atomic, and
+- you have no other related pieces of data
+
+`Relaxed` is more than sufficient.
+
+## “Out-of-thin-air” values
+
+One peculiar consequence of the semantics of `Relaxed` operations is that it is
+theoretically possible for values to come into existence “out-of-thin-air”
+(commonly abbreviated to OOTA) — that is, a value could appear despite not ever
+being calculated anywhere in code. In particular, consider this setup:
+
+```rust
+# use std::sync::atomic::{self, AtomicU32};
+let x = AtomicU32::new(0);
+let y = AtomicU32::new(0);
+
+// Thread 1:
+let r1 = y.load(atomic::Ordering::Relaxed);
+x.store(r1, atomic::Ordering::Relaxed);
+
+// Thread 2:
+let r2 = x.load(atomic::Ordering::Relaxed);
+y.store(r2, atomic::Ordering::Relaxed);
+```
+
+When starting to draw a diagram for a possible execution of this program, we
+have to first lay out the basic facts that we know:
+
+- `x` and `y` both start out as zero
+- Thread 1 performs a load of `y` followed by a store of `x`
+- Thread 2 performs a load of `x` followed by a store of `y`
+- Each of `x` and `y` takes on exactly two values in its lifetime
+
+Then we can start to construct boxes:
+
+```text
+Thread 1      x       y     Thread 2
+╭───────╮   ┌───┐   ┌───┐   ╭───────╮
+│ load  ├─┐ │ 0 │   │ 0 │ ┌─┤ load  │
+╰───╥───╯ │ └───┘   └───┘ │ ╰───╥───╯
+    ║     │   ?───────────┘     ║
+╭───⇓───╮ └───────────?     ╭───⇓───╮
+│ store ├───┬───┐   ┌───┬───┤ store │
+╰───────╯   │ ? │   │ ? │   ╰───────╯
+            └───┘   └───┘
+```
+
+At this point, if either of those lines were to connect to the higher box then
+the execution would be simple: that thread would forward the value to its lower
+box, which the other thread would then either read, or load the same value
+(zero) from the box above it, and we’d end up with zero in both atomics. But
+what if they were to connect downwards? Then we’d end up with an execution that
+looks like this:
+
+```text
+Thread 1      x       y     Thread 2
+╭───────╮   ┌───┐   ┌───┐   ╭───────╮
+│ load  ├─┐ │ 0 │   │ 0 │ ┌─┤ load  │
+╰───╥───╯ │ └───┘   └───┘ │ ╰───╥───╯
+    ║     │   ┌───────────┘     ║
+╭───⇓───╮ └───┼───────┐     ╭───⇓───╮
+│ store ├───┬─┴─┐   ┌─┴─┬───┤ store │
+╰───────╯   │ ? │   │ ? │   ╰───────╯
+            └───┘   └───┘
+```
+
+But hang on — it’s not fully resolved yet, we still haven’t put in a value in
+those lower question marks. So what value should it be? Well, the second value
+of `x` is just copied from the second value of `y`, so we just have to find the
+value of that — but the second value of `y` is itself copied from the second
+value of `x`! This means that we can actually put any value we like in that box,
+including `0` or `42`, and the logic will check out perfectly fine — meaning if
+this program were to execute in this fashion, it would end up reading a value
+produced out of thin air!
+
+Now, if we were to strictly follow the rules we’ve laid out thus far, then this
+would be a totally valid thing to happen. But luckily, the authors of the C++
+specification have recognized this as a problem, and as such refined the
+semantics of `Relaxed` to implement a thorough, logically sound, mathematically
+proven formal model that prevents it, that’s just too complex and technical to
+explain here—
+
+> No “out-of-thin-air” values can be computed that circularly depend on their
+> own computations.
+
+Just kidding. Turns out, it’s a *really* difficult problem to solve, and to my
+knowledge even now there is no known formal way to express how to prevent it.
So
+in the specification they just kind of hand-wave and say that it shouldn’t
+happen, and that the above program must always give zero in both atomics,
+despite the theoretical execution that could result in something else. Well, it
+generally works in practice so I can’t complain — it’s just a very interesting
+detail to know about.
diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md
new file mode 100644
index 00000000..38e3d1ab
--- /dev/null
+++ b/src/atomics/seqcst.md
@@ -0,0 +1,432 @@
+# SeqCst
+
+`SeqCst` is probably the most interesting ordering, because it is simultaneously
+the simplest and most complex atomic memory ordering in existence. It’s
+simple, because if you only use `SeqCst` everywhere then you can kind of
+maybe pretend like the Abstract Machine has a concept of time; phrases like
+“latest value” make sense, the program can be thought of as a set of steps that
+interleave, there is a universal “now” and “before” and wouldn’t that be nice?
+But it’s also the most complex, because as soon as you look under the hood you
+realize just how incredibly convoluted and hard to follow the actual rules
+behind it are, and it gets really ugly really fast as soon as you try to mix it
+with any other ordering.
+
+To understand `SeqCst`, we first have to understand the problem it exists to
+solve. A simple example used to show where weaker orderings produce
+counterintuitive results is this:
+
+```rust
+# use std::sync::atomic::{self, AtomicBool};
+use std::thread;
+
+// Set this to Relaxed, Acquire, Release, AcqRel, doesn’t matter — the result is
+// the same (modulo panics caused by attempting acquire stores or release
+// loads).
+const ORDERING: atomic::Ordering = atomic::Ordering::Relaxed; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); + +# return; +// This assert is allowed to fail. +assert!(a || b); +``` + +The basic setup of this code, for all of its possible executions, looks like +this: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +In other words, `a` and `b` are guaranteed to store `true` into `X` and `Y` +respectively, and then attempt to load from the other thread’s atomic. The +question now is: is it possible for them _both_ to load `false`? + +And looking at this diagram, there’s absolutely no reason why not. There isn’t +even a single arrow connecting the left and right hand sides so far, so the +loads have no coherence-based restrictions on which values they are allowed to +pick, and we could end up with an execution like this: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├┐ │ false ├─┐┌┤ false │ ┌┤ store Y │ +╰────╥────╯│ └───────┘┌─┘└───────┘ │╰────╥────╯ + ║ │ ┌─────────┘└───────────┐│ ║ +╭────⇓────╮└─│┬───────┐ ┌───────┬─│┘╭────⇓────╮ +│ load Y ├──┘│ true │ │ true │ └─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +Which results in a failed assert. This execution is brought about because the +model of separate modification orders means that there is no relative ordering +between `X` and `Y` being changed, and so each thread is allowed to “see” either +order. 
However, some algorithms will require a globally agreed-upon ordering,
+and this is where `SeqCst` can come in useful.
+
+This ordering, first and foremost, inherits the guarantees from all the other
+orderings — it is an acquire operation for loads, a release operation for stores
+and an acquire-release operation for RMWs. In addition to this, it gives some
+guarantees unique to `SeqCst` about what values it is allowed to load. Note that
+these guarantees are not about preventing data races: unless you have some
+unrelated code that triggers a data race given an unexpected condition, using
+`SeqCst` can only protect you from race conditions, because its guarantees only
+apply to other `SeqCst` operations rather than all data accesses.
+
+## S
+
+`SeqCst` is fundamentally about _S_, which is the global ordering of all
+`SeqCst` operations in an execution of the program. It is consistent between
+every atomic and every thread, and all stores, fences and RMWs that use a
+sequentially consistent ordering have a place in it (but no other operations
+do). This is in contrast to modification orders, which are similarly total but
+only scoped to a single atomic rather than the whole program.
+
+Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed
+later on), _S_ is primarily controlled by the happens-before relations in a
+program: this means that if an action _A_ happens-before an action _B_, it is
+also guaranteed to appear before _B_ in _S_. Other than that restriction, _S_ is
+unspecified and will be chosen arbitrarily during execution.
+
+Once a particular _S_ has been established, every atomic’s modification order is
+then guaranteed to be consistent with it, so a `SeqCst` load will never see a
+value that has been overwritten by a write that occurred before it in _S_, or a
+value that has been written by a write that occurred after it in _S_ (note that
+a `Relaxed`/`Acquire` load however might, since there is no “before” or “after”
+as it is not in _S_ in the first place).
+
+More formally, this guarantee can be described with _coherence orderings_, a
+relation which expresses which of two operations appears before the other in an
+atomic’s modification order. It is said that an operation _A_ is
+_coherence-ordered-before_ another operation _B_ if any of the following
+conditions are met:
+
+1. _A_ is a store or RMW, _B_ is a store or RMW, and _A_ appears before _B_ in
+   the modification order.
+1. _A_ is a store or RMW, _B_ is a load, and _B_ reads the value stored by _A_.
+1. _A_ is a load, _B_ is a store or RMW, and _A_ takes its value from a place in
+   the modification order that appears before _B_.
+1. _A_ is coherence-ordered-before a different operation _X_, and _X_ is
+   coherence-ordered-before _B_ (the basic transitivity property).
+ +The following diagram gives examples for the main three rules (in each case _A_ +is coherence-ordered-before _B_): + +```text + Rule 1 ┃ Rule 2 ┃ Rule 3 + ┃ ┃ +╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌───┐ ╭───╮ +│ A ├─┘ │ │ ┌─┤ B │ ┃ │ A ├─┘ │ ├───┤ B │ ┃ │ A ├───┤ │ ┌─┤ B │ +╰───╯ └───┘ │ ╰───╯ ┃ ╰───╯ └───┘ ╰───╯ ┃ ╰───╯ └───┘ │ ╰───╯ + ┌───┬─┘ ┃ ┃ ┌───┬─┘ + │ │ ┃ ┃ │ │ + └───┘ ┃ ┃ └───┘ +``` + +The only important thing to note is that for two loads of the same value in the +modification order, neither is coherence-ordered-before the other, as in the +following example where _A_ has no coherence ordering relation to _B_: + +```text +╭───╮ ┌───┐ ╭───╮ +│ A ├───┤ ├───┤ B │ +╰───╯ └───┘ ╰───╯ +``` + +Because of this, “_A_ is coherence-ordered-before _B_” is subtly different from +“_A_ is not coherence-ordered-after _B_”; only the latter phrase includes the +above situation, and is synonymous with “either _A_ is coherence-ordered-before +_B_ or _A_ and _B_ are both loads, and see the same value in the atomic’s +modification order”. “Not coherence-ordered-after” is generally a more useful +relation than “coherence-ordered-before”, and so it’s important to understand +what it means. + +With this terminology applied, we can use a more precise definition of +`SeqCst`’s guarantee: for two `SeqCst` operations on the same atomic _A_ and +_B_, where _A_ precedes _B_ in _S_, _A_ is not coherence-ordered-after _B_. +Effectively, this one rule ensures that _S_’s order “propagates” +throughout all the atomics of the program — you can imagine each operation in +_S_ as storing a snapshot of the world, so that every subsequent operation is +consistent with it. + +## Applying `SeqCst` + +So, looking back at our program, let’s consider how we could use `SeqCst` to +make that execution invalid. 
As a refresher, here’s the framework for every +possible execution of the program: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +First of all, both the final loads (`a` and `b`’s second operations) need to +become `SeqCst`, because they need to be aware of the total ordering that +determines whether `X` or `Y` becomes `true` first. And secondly, we need to +establish that ordering in the first place, and that needs to be done by making +sure that there is always one operation in _S_ that both sees one of the atomics +as `true` and precedes both final loads in _S_, so that the coherence ordering +guarantee will apply (the final loads themselves don’t work for this since +although they “know” that their corresponding atomic is `true` they don’t +interact with it directly so _S_ doesn’t care) — for this, we must set both +stores to use the `SeqCst` ordering. + +This leaves us with the correct version of the above program, which is +guaranteed to never panic: + +```rust +# use std::sync::atomic::{self, AtomicBool}; +use std::thread; + +const ORDERING: atomic::Ordering = atomic::Ordering::SeqCst; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); + +# return; +// This assert is **not** allowed to fail. 
+assert!(a || b); +``` + +As there are four `SeqCst` operations with a partial order between two pairs in +them (caused by the sequenced-before relation), there are six possible +executions of this program: + +- All of `a`’s operations precede `b`’s operations: + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (gives `false`) + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) +- All of `b`’s operations precede `a`’s operations: + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (gives `false`) + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `a`’s load precedes `b`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `a` loads `Y` (required to give `true`) + 1. `b` loads `X` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `b`’s load precedes `a`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `b`’s store precedes `a`’s and `a`’s load precedes `b`’s: + 1. `b` stores `true` into `Y` + 1. `a` stores `true` to `X` + 1. `a` loads `Y` (required to give `true`) + 1. `b` loads `X` (required to give `true`) +- The stores precede the loads, + `b`’s store precedes `a`’s and `b`’s load precedes `a`’s: + 1. `b` stores `true` into `Y` + 1. `a` stores `true` to `X` + 1. `b` loads `X` (required to give `true`) + 1. `a` loads `Y` (required to give `true`) + +All the places where the load was required to give `true` were caused by a +preceding store in _S_ of the same atomic of `true` — otherwise, the load would +be coherence-ordered-before a store which precedes it in _S_, and that is +impossible. 
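+
+As an informal sanity check (not a proof — a test that fails to observe a
+violation demonstrates nothing), the `SeqCst` version can be exercised
+repeatedly with fresh atomics each round; `thread::scope` is used here so the
+closures can borrow the locals:
+
+```rust
+use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
+use std::thread;
+
+// One round of the store-buffering test with fresh atomics.
+fn run_round() -> (bool, bool) {
+    let x = AtomicBool::new(false);
+    let y = AtomicBool::new(false);
+    thread::scope(|s| {
+        let a = s.spawn(|| { x.store(true, SeqCst); y.load(SeqCst) });
+        let b = s.spawn(|| { y.store(true, SeqCst); x.load(SeqCst) });
+        (a.join().unwrap(), b.join().unwrap())
+    })
+}
+
+// With SeqCst everywhere, at least one load must see `true` every time.
+for _ in 0..1000 {
+    let (a, b) = run_round();
+    assert!(a || b);
+}
+```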
+
+## The mixed-`SeqCst` special case
+
+As I’ve been alluding to for a while, I wasn’t being totally truthful when I
+said that _S_ is consistent with happens-before relations — in reality, it is
+only consistent with _strongly happens-before_ relations, which are a subtly
+defined subset of happens-before relations. In particular, it excludes two
+situations:
+
+1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation
+   B which is sequenced-before another `SeqCst` operation C. Here, despite the
+   fact that A happens-before C, A does not _strongly_ happen-before C and so is
+   not guaranteed to precede C in _S_.
+2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel`
+   operation B, which synchronizes-with another `SeqCst` operation C. Similarly,
+   despite the fact that A happens-before C, A might not precede C in _S_.
+
+The first situation is illustrated below, with `SeqCst` accesses represented
+with asterisks:
+
+```text
+ t_1        x       t_2
+╭─────╮  ┌─↘───┐  ╭─────╮
+│ *A* ├──┘ │ 1 ├───→  B  │
+╰─────╯    └───┘  ╰──╥──╯
+                  ╭──⇓──╮
+                  │ *C* │
+                  ╰─────╯
+```
+
+A happens-before, but does not strongly happen-before, C — and anything
+sequenced-after C will have the same treatment (unless more synchronization is
+used). This means that C is actually allowed to _precede_ A in _S_, despite
+conceptually occurring after it. However, anything sequenced-before A, because
+there is at least one sequence on either side of the synchronization, will
+strongly happen-before C.
+
+But this is all highly theoretical at the moment, so let’s make an example to
+show how that rule can actually affect the execution of code. So, if C were to
+precede A in _S_ (and they are not both loads) then that means C is always
+coherence-ordered-before A.
Let’s say then that C loads from `x` (the atomic +that A has to access), it may load the value that came before A if it were to +precede A in _S_: + +```text + t_1 x t_2 +╭─────╮ ┌───┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├─┐┌→ B │ +╰─────╯ │ └───┘ ││╰──╥──╯ + └─↘───┐┌─┘╭──⇓──╮ + │ 1 ├┘└─→ *C* │ + └───┘ ╰─────╯ +``` + +Ah wait no, that doesn’t work because regular coherence still mandates that `1` +is the only value that can be loaded. In fact, once `1` is loaded _S_’s required +consistency with coherence orderings means that A _is_ required to precede C in +_S_ after all. + +So somehow, to observe this difference we need to have a _different_ `SeqCst` +operation, let’s call it E, be the one that loads from `x`, where C is +guaranteed to precede it in _S_ (so we can observe the “weird” state in between +C and A) but C also doesn’t happen-before it (to avoid coherence getting in the +way) — and to do that, all we have to do is have C appear before a `SeqCst` +operation D in the modification order of another atomic, but have D be a store +so as to avoid C synchronizing with it, and then our desired load E can simply +be sequenced-after D (this will carry over the “precedes in _S_” guarantee, but +does not restore the happens-after relation to C since that was already dropped +by having D be a store). + +In diagram form, that looks like this: + +```text + t_1 x t_2 helper t_3 +╭─────╮ ┌───┐ ╭─────╮ ┌─────┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├┐┌─→ B │ ┌─┤ 0 │ ┌─┤ *D* │ +╰─────╯ │ └───┘││ ╰──╥──╯ │ └─────┘ │ ╰──╥──╯ + │ └│────║────│─────────│┐ ║ + └─↘───┐ │ ╭──⇓──╮ │ ┌─────↙─┘│╭──⇓──╮ + │ 1 ├─┘ │ *C* ←─┘ │ 1 │ └→ *E* │ + └───┘ ╰─────╯ └─────┘ ╰─────╯ + +S = C → D → E → A +``` + +C is guaranteed to precede D in _S_, and D is guaranteed to precede E, but +because this exception means that A is _not_ guaranteed to precede C, it is +totally possible for it to come at the end, resulting in the surprising but +totally valid outcome of E loading `0` from `x`. 
In code, this can be expressed +as the following code _not_ being guaranteed to panic: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Acquire, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_1 +X.store(1, SeqCst); // A + +// thread_2 +assert_eq!(X.load(Acquire), 1); // B +assert_eq!(HELPER.load(SeqCst), 0); // C + +// thread_3 +HELPER.store(1, SeqCst); // D +assert_eq!(X.load(SeqCst), 0); // E +``` + +The second situation listed above has very similar consequences. Its abstract +form is the following execution in which A is not guaranteed to precede C in +_S_, despite A happening-before C: + +```text + t_1 x t_2 +╭─────╮ ┌─↘───┐ ╭─────╮ +│ *A* │ │ │ 0 ├───→ *C* │ +╰──╥──╯ │ └───┘ ╰─────╯ +╭──⇓──╮ │ +│ B ├─┘ +╰─────╯ +``` + +Similarly to before, we can’t just have A access `x` to show why A not +necessarily preceding C in _S_ matters; instead, we have to introduce a second +atomic and third thread to break the happens-before chain first. And finally, a +single relaxed load F at the end is added just to prove that the weird execution +actually happened (leaving `x` as 2 instead of 1). + +```text + t_3 helper t_1 x t_2 +╭─────╮ ┌─────┐ ╭─────╮ ┌───┐ ╭─────╮ +│ *D* ├┐┌─┤ 0 │ ┌─┤ *A* │ │ 0 │ ┌─→ *C* │ +╰──╥──╯││ └─────┘ │ ╰──╥──╯ └───┘ │ ╰──╥──╯ + ║ └│─────────│────║─────┐ │ ║ +╭──⇓──╮ │ ┌─────↙─┘ ╭──⇓──╮ ┌─↘───┐ │ ╭──⇓──╮ +│ *E* ←─┘ │ 1 │ │ B ├─┘││ 1 ├─┘┌┤ F │ +╰─────╯ └─────┘ ╰─────╯ │└───┘ │╰─────╯ + └↘───┐ │ + │ 2 ├──┘ + └───┘ +S = C → D → E → A +``` + +This execution mandates both C preceding A in _S_ and A happening-before C, +something that is only possible through these two mixed-`SeqCst` special +exceptions. 
It can be expressed in code as well: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Release, Relaxed, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_3 +X.store(2, SeqCst); // D +assert_eq!(HELPER.load(SeqCst), 0); // E + +// thread_1 +HELPER.store(1, SeqCst); // A +X.store(1, Release); // B + +// thread_2 +assert_eq!(X.load(SeqCst), 1); // C +assert_eq!(X.load(Relaxed), 2); // F +``` + +If this seems ridiculously specific and obscure, that’s because it is. +Originally, back in C++11, this special case didn’t exist — but then six years +later it was discovered that in practice atomics on Power, Nvidia GPUs and +sometimes ARMv7 _would_ have this special case, and fixing the implementations +would make atomics significantly slower. So instead, in C++20 they simply +encoded it into the specification. + +Generally however, this rule is so complex it’s best to just avoid it entirely +by never mixing `SeqCst` and non-`SeqCst` on a single atomic in the first place.