From 0347b0183f659a1401a7a193448637fd36f96479 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Thu, 4 Aug 2022 15:50:18 +0100 Subject: [PATCH 01/34] =?UTF-8?q?Write=20=E2=80=9CMultithreaded=20Executio?= =?UTF-8?q?n=E2=80=9D=20and=20add=20simplified=20atomic=20spec?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/SUMMARY.md | 9 +- src/atomics/acquire-release.md | 1 + src/{ => atomics}/atomics.md | 26 ++- src/atomics/fences.md | 1 + src/atomics/multithread.md | 220 ++++++++++++++++++++ src/atomics/relaxed.md | 43 ++++ src/atomics/seqcst.md | 1 + src/atomics/signals.md | 3 + src/atomics/specification.md | 354 +++++++++++++++++++++++++++++++++ 9 files changed, 651 insertions(+), 7 deletions(-) create mode 100644 src/atomics/acquire-release.md rename src/{ => atomics}/atomics.md (93%) create mode 100644 src/atomics/fences.md create mode 100644 src/atomics/multithread.md create mode 100644 src/atomics/relaxed.md create mode 100644 src/atomics/seqcst.md create mode 100644 src/atomics/signals.md create mode 100644 src/atomics/specification.md diff --git a/src/SUMMARY.md b/src/SUMMARY.md index f1d15a71..a65f21ef 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -41,7 +41,14 @@ * [Concurrency](concurrency.md) * [Races](races.md) * [Send and Sync](send-and-sync.md) - * [Atomics](atomics.md) + * [Atomics](./atomics/atomics.md) + * [Multithreaded Execution](./atomics/multithread.md) + * [Relaxed](./atomics/relaxed.md) + * [Acquire and Release](./atomics/acquire-release.md) + * [SeqCst](./atomics/seqcst.md) + * [Fences](./atomics/fences.md) + * [Signals](./atomics/signals.md) + * [Specification](./atomics/specification.md) * [Implementing Vec](./vec/vec.md) * [Layout](./vec/vec-layout.md) * [Allocating](./vec/vec-alloc.md) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md new file mode 100644 index 00000000..7dc85bde --- /dev/null +++ b/src/atomics/acquire-release.md @@ -0,0 +1 @@ +# Acquire and Release diff --git a/src/atomics.md b/src/atomics/atomics.md similarity index 93% rename from src/atomics.md rename to src/atomics/atomics.md index 6aef6aee..7402d38f 100644 --- a/src/atomics.md +++ b/src/atomics/atomics.md @@ -17,12 +17,24 @@ details, you should check out the [C++ specification][C++-model]. Still, we'll try to cover the basics and some of the problems Rust developers face. -The C++ memory model is fundamentally about trying to bridge the gap between the -semantics we want, the optimizations compilers want, and the inconsistent chaos -our hardware wants. *We* would like to just write programs and have them do -exactly what we said but, you know, fast. Wouldn't that be great? +## Motivation -## Compiler Reordering +The C++ memory model is very large and confusing with lots of seemingly +arbitrary design decisions. To understand the motivation behind this, it can +help to look at what got us in this situation in the first place. There are +three main factors at play here: + +1. Users of the language, who want fast, cross-platform code; +2. compilers, who want to optimize code to make it fast; +3. and the hardware, which is ready to unleash a wrath of inconsistent chaos on + your program at a moment's notice. + +The C++ memory model is fundamentally about trying to bridge the gap between +these three, allowing users to write code for a logical and consistent abstract +machine while the compiler and hardware deal with the madness underneath that +makes it run fast. 
+ +### Compiler Reordering Compilers fundamentally want to be able to do all sorts of complicated transformations to reduce data dependencies and eliminate dead code. In @@ -53,7 +65,7 @@ able to make these kinds of optimizations, because they can seriously improve performance. On the other hand, we'd also like to be able to depend on our program *doing the thing we said*. -## Hardware Reordering +### Hardware Reordering On the other hand, even if the compiler totally understood what we wanted and respected our wishes, our hardware might instead get us in trouble. Trouble @@ -106,6 +118,8 @@ programming: incorrect. If possible, concurrent algorithms should be tested on weakly-ordered hardware. +--- + ## Data Accesses The C++ memory model attempts to bridge the gap by allowing us to talk about the diff --git a/src/atomics/fences.md b/src/atomics/fences.md new file mode 100644 index 00000000..04445512 --- /dev/null +++ b/src/atomics/fences.md @@ -0,0 +1 @@ +# Fences diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md new file mode 100644 index 00000000..21008926 --- /dev/null +++ b/src/atomics/multithread.md @@ -0,0 +1,220 @@ +# Multithreaded Execution + +The first important thing to understand about C++20 atomics is that **the +abstract machine has no concept of time**. You might expect there to be a single +global ordering of events across the program where each happens at the same time +or one after the other, but under the abstract model no such ordering exists; +instead, a possible execution of the program must be treated as a single event +that happens instantaneously — there is never any such thing as “now”, or a +“latest value”, and using that terminology will only lead you to more confusion. +(Of course, in reality there does exist a concept of time, but you must keep in +mind that you’re not programming for the hardware, you’re programming for the +AM.) + +However, while no global ordering of operations exists _between_ threads, there +does exist a single total ordering _within_ each thread, which is known as its +_sequence_. For example, given this simple Rust program: + +```rs +println!("A"); +println!("B"); +``` + +its sequence during one possible execution can be visualized like so: + +```text +╭───────────────╮ +│ println!("A") │ +╰───────╥───────╯ +╭───────⇓───────╮ +│ println!("B") │ +╰───────────────╯ +``` + +That double arrow in between the two boxes (`⇒`) represents that the second +statement is _sequenced after_ the first (and similarly the first statement is +_sequenced before_ the second). This is the strongest kind of ordering guarantee +between any two operations, and only comes about when those two operations +happen one after the other and on the same thread. + +If we add a second thread to the mix: + +```rs +// Thread 1: +println!("A"); +println!("B"); +// Thread 2: +eprintln!("01"); +eprintln!("02"); +``` + +it will simply coexist in parallel, with each thread getting its own independent +sequence: + +```text + Thread 1 Thread 2 +╭───────────────╮ ╭─────────────────╮ +│ println!("A") │ │ eprintln!("01") │ +╰───────╥───────╯ ╰────────╥────────╯ +╭───────⇓───────╮ ╭────────⇓────────╮ +│ println!("B") │ │ eprintln!("02") │ +╰───────────────╯ ╰─────────────────╯ +``` + +Note that this is **not** a representation of multiple things that _could_ +happen at runtime — instead, this diagram describes exactly what _did_ happen +when the program ran once. 
This distinction is key, because it highlights that +even the lowest-level representation of a program’s execution does not have +a global ordering between threads; those two disconnected chains are all there +is. + +Now let’s make things more interesting by introducing some shared data, and have +both threads read it. + +```rs +// Initial state +let data = 0; +// Thread 1: +data; +// Thread 2: +data; +``` + +Each memory location, similarly to threads, can be shown as another column on +our diagram, but holding values instead of instructions, and each access (read +or write) manifests as a line from the instruction that performed the access to +the associated value in the column. So this code can produce (and is in fact +guaranteed to produce) the following execution: + +```text +Thread 1 data Thread 2 +╭──────╮ ┌────┐ ╭──────╮ +│ data ├╌╌╌╌┤ 0 ├╌╌╌╌┤ data │ +╰──────╯ └────┘ ╰──────╯ +``` + +That is, both threads read the same value of `0` from `data`, with no relative +ordering between them. This is the simple case, for when the data doesn’t ever +change — but that’s no fun, so let’s add some mutability in the mix (we’ll also +return to a single thread, just to keep things simple). + +Consider this code, which we’re going to attempt to draw a diagram for like +above: + +```rs +let mut data = 0; +data = 1; +data; +data = 2; +``` + +Working out executions of code like this is rather like solving a Sudoku puzzle: +you must first lay out all the facts that you know, and then fill in the blanks +with logical reasoning. The initial information we’ve been given is both the +initial value of `data` and the sequential order of Thread 1; we also know that +over its lifetime, `data` takes on a total of three different values that were +caused by two different non-atomic writes. This allows us to start drawing out +some boxes: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌? │ 0 │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ data ├╌? │ ? │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ = 2 ├╌? │ ? │ +╰───────╯ └────┘ +``` + +Note the use of dashed padding in between the values of `data`’s column. Those +spaces won’t ever contain a value, but they’re used to represent an +unsynchronized (non-atomic) write — it is garbage data and attempting to read it +would result in a data race. + +To solve this puzzle, we first need to bring in a new rule that governs all +memory accesses to a particular location: +> From the point at which the access occurs, find every other point that can be +> reached by following the reverse direction of arrows, then for each one of +> those, take a single step across every line that connects to the relevant +> memory location. **It is not allowed for the access to read or write any value +> that appears above any one of these points**. + +In our case, there are two potential executions: one, where the first write +corresponds to the first value in `data`, and two, where the first write +corresponds to the second value in `data`. Considering the second case for a +moment, it would also force the second write to correspond to the first +value in `data`. Therefore its diagram would look something like this: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌┐ │ 0 │ +╰───╥───╯ ┊ ┌╌╌┼╌╌╌╌┤ +╭───⇓───╮ ┊ ├╌╌┼╌╌╌╌┤ +│ data ├╌?┊ ┊ │ 2 │ +╰───╥───╯ ├╌┼╌╌┼╌╌╌╌┤ +╭───⇓───╮ └╌┼╌╌┼╌╌╌╌┤ +│ = 2 ├╌╌╌╌┘ │ 1 │ +╰───────╯ └────┘ +``` + +However, that second line breaks the rule we just established! 
Following up the
+arrows from the third operation in Thread 1, we reach the first operation, and
+from there we can take a single step to reach the space in between the `2` and
+the `1`, which excludes this access from writing any value above that point.
+
+So evidently, this execution is no good. We can therefore conclude that the only
+possible execution of this program is the other one, in which the `1` appears
+above the `2`:
+
+```text
+ Thread 1      data
+╭───────╮     ┌────┐
+│  = 1  ├╌╌┐  │  0 │
+╰───╥───╯  ├╌╌┼╌╌╌╌┤
+╭───⇓───╮  └╌╌┼╌╌╌╌┤
+│ data  ├╌?   │  1 │
+╰───╥───╯  ┌╌╌┼╌╌╌╌┤
+╭───⇓───╮  ├╌╌┼╌╌╌╌┤
+│  = 2  ├╌╌┘  │  2 │
+╰───────╯     └────┘
+```
+
+Now to sort out the read operation in the middle. We can use the same rule as
+before to trace up to the first write and rule out reading either the `0`
+value or the garbage that exists between it and `1`, but how do we choose
+between the `1` and the `2`? Well, as it turns out there is a complement to the
+rule we already defined which gives us the exact answer we need:
+
+> From the point at which the access occurs, find every other point that can be
+> reached by following the _forward_ direction of arrows, then for each one of
+> those, take a single step across every line that connects to the relevant
+> memory location. **It is not allowed for the access to read or write any value
+> that appears below any one of these points**.
+
+Using this rule, we can follow the arrow downwards and then across and finally
+rule out `2` as well as the garbage before it. This leaves us with exactly _one_
+value that the read operation can return, and exactly one possible execution
+guaranteed by the Abstract Machine:
+
+```text
+ Thread 1      data
+╭───────╮     ┌────┐
+│  = 1  ├╌╌┐  │  0 │
+╰───╥───╯  ├╌╌┼╌╌╌╌┤
+╭───⇓───╮  └╌╌┼╌╌╌╌┤
+│ data  ├╌╌╌╌╌┤  1 │
+╰───╥───╯  ┌╌╌┼╌╌╌╌┤
+╭───⇓───╮  ├╌╌┼╌╌╌╌┤
+│  = 2  ├╌╌┘  │  2 │
+╰───────╯     └────┘
+```
+
+You might be thinking that all this has been is the longest, most convoluted
+explanation ever of the most basic intuitive semantics of programming — and
+you’d be absolutely right. But it’s essential to grasp these fundamentals,
+because once you have this model in mind, the extension into multiple threads
+and the complicated semantics of real atomics becomes completely natural.
diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md
new file mode 100644
index 00000000..887b5be5
--- /dev/null
+++ b/src/atomics/relaxed.md
@@ -0,0 +1,43 @@
+# Relaxed
+
+Now we’ve got single-threaded mutation semantics out of the way, we can try
+reintroducing a second thread. We’ll have one thread perform a write to the
+memory location, and a second thread read from it, like so:
+
+```rs
+// Initial state
+let mut state = 0;
+// Thread 1:
+data = 1;
+// Thread 2:
+data;
+```
+
+Of course, any Rust programmer will immediately tell you that this code doesn’t
+compile, and indeed it definitely does not, and for good reason. But suspend
+your disbelief for a moment, and imagine what would happen if it did. Let’s draw
+a diagram, leaving out the reading lines for now:
+
+```text
+Thread 1     data       Thread 2
+╭───────╮   ┌────┐     ╭───────╮
+│  = 1  ├╌┐ │  0 │   ?╌┤ data  │
+╰───────╯ ├╌┼╌╌╌╌┤     ╰───────╯
+          └╌┼╌╌╌╌┤
+            │  1 │
+            └────┘
+```
+
+Let’s try to figure out where the line in Thread 2’s access joins up. The rules
+from before don’t help us much unfortunately since there are no arrows
+connecting that operation to anything, so we can’t immediately rule anything
+out.
As a result, we end up facing a situation we haven’t faced before: there is +_more than one_ potential value for Thread 2 to read. + +And this is where we encounter the big limitation with unsynchronized data +accesses: the price we pay for their speed and optimization capability is that +this situation is considered **Undefined Behavior**. For an unsynchronized read +to be acceptable, there has to be _exactly one_ potential value for it to read, +and when there are multiple like in this situation it is considered a data race. + +## “Out-of-thin-air” values diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md new file mode 100644 index 00000000..a9f12e59 --- /dev/null +++ b/src/atomics/seqcst.md @@ -0,0 +1 @@ +# SeqCst diff --git a/src/atomics/signals.md b/src/atomics/signals.md new file mode 100644 index 00000000..4e539928 --- /dev/null +++ b/src/atomics/signals.md @@ -0,0 +1,3 @@ +# Signals + +(and compiler fences) diff --git a/src/atomics/specification.md b/src/atomics/specification.md new file mode 100644 index 00000000..741d802c --- /dev/null +++ b/src/atomics/specification.md @@ -0,0 +1,354 @@ +# Specification + +Below is a modified C++20 specification draft (as it was on 2022-07-16), edited +to remove C++-only features like consume orderings and `sig_atomic_t`. + +Note that although this has been checked, atomics are very difficult to get +right and so there may be subtle mistakes. If you want to more formally check +your software, read the [\[intro.races\]], [\[atomics.order\]] and +[\[atomics.fences\]] sections of the real C++ specification. + +[\[intro.races\]]: https://eel.is/c++draft/intro.races +[\[atomics.order\]]: https://eel.is/c++draft/atomics.order +[\[atomics.fences\]]: https://eel.is/c++draft/atomics.fences + +## Data races + +The value of an object visible to a thread _T_ at a particular point is the +initial value of the object, a value assigned to the object by _T_, or a value +assigned to the object by another thread, according to the rules below. + +> _Note 1_: In some cases, there might instead be undefined behavior. Much of +> this subclause is motivated by the desire to support atomic operations with +> explicit and detailed visibility constraints. However, it also implicitly +> supports a simpler view for more restricted programs. + +Two expression evaluations _conflict_ if one of them modifies a memory location +and the other one reads or modifies the same memory location. + +The library defines a number of atomic operations and operations on mutexes that +are specially identified as synchronization operations. These operations play a +special role in making assignments in one thread visible to another. A +synchronization operation on one or more memory locations is either an acquire +operation, a release operation, or both an acquire and release operation. A +synchronization operation without an associated memory location is a fence and +can be either an acquire fence, a release fence, or both an acquire and release +fence. In addition, there are relaxed atomic operations, which are not +synchronization operations, and atomic read-modify-write operations, which have +special characteristics. + +> _Note 2_: For example, a call that acquires a mutex will perform an acquire +> operation on the locations comprising the mutex. Correspondingly, a call that +> releases the same mutex will perform a release operation on those same +> locations. 
Informally, performing a release operation on _A_ forces prior side +> effects on other memory locations to become visible to other threads that +> later perform an acquire operation on _A_. “Relaxed” atomic operations are not +> synchronization operations even though, like synchronization operations, they +> cannot contribute to data races. + +All modifications to a particular atomic object _M_ occur in some particular +total order, called the _modification order_ of _M_. + +> _Note 3_: There is a separate order for each atomic object. There is no +> requirement that these can be combined into a single total order for all +> objects. In general this will be impossible since different threads can +> observe modifications to different objects in inconsistent orders. + +A _release sequence_ headed by a release operation _A_ on an atomic object _M_ +is a maximal contiguous sub-sequence of side effects in the modification order +of _M_, where the first operation is _A_, and every subsequent operation is an +atomic read-modify-write operation. + +Certain library calls _synchronize with_ other library calls performed by +another thread. For example, an atomic store-release synchronizes with a +load-acquire that takes its value from the store. + +> _Note 4_: Except in the specified cases, reading a later value does not +> necessarily ensure visibility as described below. Such a requirement would +> sometimes interfere with efficient implementation. + +> _Note 5_: The specifications of the synchronization operations define when one +> reads the value written by another. For atomic objects, the definition is +> clear. All operations on a given mutex occur in a single total order. Each +> mutex acquisition “reads the value written” by the last mutex release. + +An evaluation _A_ _happens before_ an evaluation _B_ (or, equivalently, _B_ +_happens after_ _A_) if either: +- _A_ is sequenced before _B_, or +- _A_ synchronizes with _B_, or +- for some evaluation _X_, _A_ happens before _X_ and _X_ happens before _B_. + +An evaluation _A_ _strongly happens before_ an evaluation _D_ if, either +- _A_ is sequenced before _D_, or +- _A_ synchronizes with _D_, and both _A_ and _D_ and sequentially consistent + atomic operations, or +- there are evaluations _B_ and _C_ such that _A_ is sequenced before _B_, _B_ + happens before _C_, and _C_ is sequenced before _D_, or +- there is an evaluation _B_ such that _A_ strongly happens before _B_, and _B_ + strongly happens before _D_. + +> _Note 11_: Informally, if _A_ strongly happens before _B_, then _A_ appears to +> be evaluated before _B_ in all contexts. + +A _visible side effect_ _A_ on a scalar object _M_ with respect to a value +computation _B_ of _M_ satisfies the conditions: +- _A_ happens before _B_ and +- there is no other side effect _X_ to _M_ such that _A_ happens before _X_ and + _X_ happens before _B_. + +The value of a non-atomic scalar object _M_, as determined by evaluation _B_, +shall be the value stored by the visible side effect _A_. + +> _Note 12_: If there is ambiguity about which side effect to a non-atomic +> object is visible, then the behavior is either unspecified or undefined. + +> _Note 13_: This states that operations on ordinary objects are not visibly +> reordered. This is not actually detectable without data races, but it is +> necessary to ensure that data races, as defined below, and with suitable +> restrictions on the use of atomics, correspond to data races in a simple +> interleaved (sequentially consistent) execution. 
+ +The value of an atomic object _M_, as determined by evaluation _B_, shall be the +value stored by some side effect _A_ that modifies _M_, where _B_ does not +happen before _A_. + +> _Note 14_: The set of such side effects is also restricted by the rest of the +> rules described here, and in particular, by the coherence requirements below. + +If an operation _A_ that modifies an atomic object _M_ happens before an +operation _B_ that modifies _M_, then _A_ shall be earlier than _B_ in the +modification order of _M_. + +> _Note 15_: This requirement is known as write-write coherence. + +If a value computation _A_ of an atomic object _M_ happens before a value +computation _B_ of _M_, and _A_ takes its value from a side effect _X_ on _M_, +then the value computed by _B_ shall either be the value stored by _X_ or the +value stored by a side effect _Y_ on _M_, where _Y_ follows _X_ in the +modification order of _M_. + +> _Note 16_: This requirement is known as read-read coherence. + +If a value computation _A_ of an atomic object _M_ happens before an operation +_B_ that modifies _M_, then _A_ shall take its value from a side effect _X_ on +_M_, where _X_ precedes _B_ in the modification order of _M_. + +> _Note 17_: This requirement is known as read-write coherence. + +If a side effect _X_ on an atomic object _M_ happens before a value computation +_B_ of _M_, then the evaluation _B_ shall take its value from _X_ or from a side +effect _Y_ that follows _X_ in the modification order of _M_. + +> _Note 18_: This requirement is known as write-read coherence. + +> _Note 19_: The four preceding coherence requirements effectively disallow +> compiler reordering of atomic operations to a single object, even if both +> operations are relaxed loads. This effectively makes the cache coherence +> guarantee provided by most hardware available to C++ atomic operations. + +> _Note 20_: The value observed by a load of an atomic depends on the “happens +> before” relation, which depends on the values observed by loads of atomics. +> The intended reading is that there must exist an association of atomic loads +> with modifications they observe that, together with suitably chosen +> modification orders and the “happens before” relation derived as described +> above, satisfy the resulting constraints as imposed here. + +Two actions are _potentially concurrent_ if +- they are performed by different threads, or +- they are unsequenced, at least one is performed by a signal handler, and they + are not both performed by the same signal handler invocation. + +The execution of a program contains a _data race_ if it contains two potentially +concurrent conflicting actions, at least one of which is not atomic, and neither +happens before the other. Any such data race results in undefined behavior. + +> _Note 21_: It can be shown that programs that correctly use mutexes and +> `SeqCst` operations to prevent all data races and use no other synchronization +> operations behave as if the operations executed by their constituent threads +> were simply interleaved, with each value computation of an object being taken +> from the last side effect on that object in that interleaving. This is normally +> referred to as “sequential consistency”. However, this applies only to +> data-race-free programs, and data-race-free programs cannot observe most +> program transformations that do not change single-threaded program semantics. 
+> In fact, most single-threaded program transformations continue to be allowed, +> since any program that behaves differently as a result has undefined behavior. + +> _Note 22_: Compiler transformations that introduce assignments to a +> potentially shared memory location that would not be modified by the abstract +> machine are generally precluded by this document, since such an assignment +> might overwrite another assignment by a different thread in cases in which an +> abstract machine execution would not have encountered a data race. This +> includes implementations of data member assignment that overwrite adjacent +> members in separate memory locations. Reordering of atomic loads in cases in +> which the atomics in question might alias is also generally precluded, since +> this could violate the coherence rules. + +> _Note 23_: Transformations that introduce a speculative read of a potentially +> shared memory location might not preserve the semantics of the C++ program as +> defined in this document, since they potentially introduce a data race. +> However, they are typically valid in the context of an optimizing compiler +> that targets a specific machine with well-defined semantics for data races. +> They would be invalid for a hypothetical machine that is not tolerant of races +> or provides hardware race detection. + +## Atomic orderings + +```rs +// in ::core::sync::atomic +#[non_exhaustive] +pub enum Ordering { + Relaxed, + Release, + Acquire, + AcqRel, + SeqCst, +} +``` + +The enumeration `Ordering` specifies the detailed regular (non-atomic) memory +synchronization order as defined in this document and may provide for operation +ordering. Its enumerated values and their meanings are as follows: +- `Relaxed`: no operation orders memory. +- `Release`, `AcqRel`, and `SeqCst`: a store operation performs a release + operation on the affected memory location. +- `Acquire`, `AcqRel`, and `SeqCst`: a load operation performs an acquire + operation on the affected memory location. + +> _Note 2_: Atomic operations specifying `Relaxed` are relaxed with respect to +> memory ordering. Implementations must still guarantee that any given atomic +> access to a particular atomic object be indivisible with respect to all other +> atomic accesses to that object. + +An atomic operation _A_ that performs a release operation on an atomic object +_M_ synchronizes with an atomic operation _B_ that performs an acquire operation +on _M_ and takes its value from any side effect in the release sequence headed +by _A_. + +An atomic operation _A_ on some atomic object _M_ is coherence-ordered before +another atomic operation _B_ on _M_ if +- _A_ is a modification, and _B_ reads the value stored by _A_, or +- _A_ precedes _B_ in the modification order of _M_, or +- _A_ and _B_ are not the same atomic read-modify-write operation, and there + exists an atomic modification _X_ of _M_ such that _A_ reads the value + stored by _X_ and _X_ precedes _B_ in the modification order of _M_, or +- there exists an atomic modification _X_ of _M_ such that _A_ is + coherence-ordered before _X_ and _X_ is coherence-ordered before _B_. + +There is a single total order _S_ on all `SeqCst` operations, including fences, +that satisfies the following constraints. First, if _A_ and _B_ are `SeqCst` +operations and _A_ strongly happens before _B_, then _A_ precedes _B_ in _S_. 
+Second, for every pair of atomic operations _A_ and _B_ on an object _M_, where +_A_ is coherence-ordered before _B_, the following four conditions are required +to be satisfied by _S_: +- if _A_ and _B_ are both `SeqCst` operations, then _A_ precedes _B_ in _S_; and +- if _A_ is a `SeqCst` operation and _B_ happens before a `SeqCst` fence _Y_, + then _A_ precedes _Y_ in _S_; and +- if a `SeqCst` fence _X_ happens before _A_ and _B_ is a `SeqCst` operation, + then _X_ precedes _B_ in _S_; and +- if an `SeqCst` fence _X_ happens before _A_ and _B_ happens before a `SeqCst` + fence _Y_, then _X_ precedes _Y_ in _S_. + +> _Note 3_: This definition ensures that _S_ is consistent with the modification +> order of any atomic object _M_. It also ensures that a `SeqCst` load _A_ of +> _M_ gets its value either from the last modification of _M_ that precedes _A_ +> in _S_ or from some non-`SeqCst` modification of _M_ that does not happen +> before any modification of _M_ that precedes _A_ in _S_. + +> _Note 4_: We do not require that _S_ be consistent with “happens before”. This +> allows more efficient implementation of `Acquire` and `Release` on some +> machine architectures. It can produce surprising results when these are mixed +> with `SeqCst` accesses. + +> _Note 5_: `SeqCst` ensures sequential consistency only for a program that is +> free of data races and uses exclusively `SeqCst` atomic operations. Any use of +> weaker ordering will invalidate this guarantee unless extreme care is used. In +> many cases, `SeqCst` atomic operations are reorderable with respect to other +> atomic operations performed by the same thread. + +Implementations should ensure that no “out-of-thin-air” values are computed that +circularly depend on their own computation. + +> _Note 6_: For example, with `x` and `y` initially zero, +> ```rs +> // Thread 1: +> let r1 = y.load(atomic::Ordering::Relaxed); +> x.store(r1, atomic::Ordering::Relaxed); +> // Thread 2: +> let r2 = x.load(atomic::Ordering::Relaxed); +> y.store(r2, atomic::Ordering::Relaxed); +> ``` +> this recommendation discourages producing `r1 == r2 == 42`, since the store of +> 42 to `y` is only possible if the store to `x` stores `42`, which circularly +> depends on the store to `y` storing `42`. Note that without this restriction, +> such an execution is possible. + +> _Note 7_: The recommendation similarly disallows `r1 == r2 == 42` in the +> following example, with `x` and `y` again initially zero: +> ```rs +> // Thread 1: +> let r1 = x.load(atomic::Ordering::Relaxed); +> if r1 == 42 { +> y.store(42, atomic::Ordering::Relaxed); +> } +> // Thread 2: +> let r2 = y.load(atomic::Ordering::Relaxed); +> if r2 == 42 { +> x.store(42, atomic::Ordering::Relaxed); +> } +> ``` + +Atomic read-modify-write operations shall always read the last value (in the +modification order) written before the write associated with the +read-modify-write operation. + +Implementations should make atomic stores visible to atomic loads within a +reasonable amount of time. + +## Atomic fences + +This subclause introduces synchronization primitives called _fences_. Fences can +have acquire semantics, release semantics, or both. A fence with acquire +semantics is called an _acquire fence_. A fence with release semantics is called +a _release fence_. 
+ +A release fence _A_ synchronizes with an acquire fence _B_ if there exist atomic +operations _X_ and _Y_, both operating on some atomic object _M_, such that _A_ +is sequenced before _X_, _X_ modifies _M_, _Y_ is sequenced before _B_, and _Y_ +reads the value written by _X_ or a value written by any side effect in the +hypothetical release sequence _X_ would head if it were a release operation. + +A release fence _A_ synchronizes with an atomic operation _B_ that performs an +acquire operation on an atomic object _M_ if there exists an atomic operation +_X_ such that _A_ is sequenced before _X_, _X_ modifies _M_, and _B_ reads the +value written by _X_ or a value written by any side effect in the hypothetical +release sequence _X_ would head if it were a release operation. + +An atomic operation _A_ that is a release operation on an atomic object _M_ +synchronizes with an acquire fence _B_ if there exists some atomic operation _X_ +on _M_ such that _X_ is sequenced before _B_ and reads the value written by _A_ +or a value written by any side effect in the release sequence headed by _A_. + +```rs +pub fn fence(order: Ordering); +``` + +_Effects_: Depending on the value of `order`, this operation: +- has no effects, if `order == Relaxed`; +- is an acquire fence, if `order == Acquire`; +- is a release fence, if `order == Release`; +- is both an acquire and a release fence, if `order == AcqRel`; +- is a sequentially consistent acquire and release fence, if `order == SeqCst`. + +```rs +pub fn compiler_fence(order: Ordering); +``` + +_Effects_: Equivalent to `fence(order)`, except that the resulting ordering +constraints are established only between a thread and a signal handler executed +in the same thread. + +> _Note 1_: `compiler_fence` can be used to specify the order in which actions +> performed by the thread become visible to the signal handler. Compiler +> optimizations and reorderings of loads and stores are inhibited in the same +> way as with `fence` but the hardware fence instructions that `fence` would +> have inserted are not emitted. From 42f46d205445e8eca33e75b0bc1d2e77306b35e9 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Thu, 4 Aug 2022 16:15:20 +0100 Subject: [PATCH 02/34] Fix one broken link --- src/arc-mutex/arc-clone.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/arc-mutex/arc-clone.md b/src/arc-mutex/arc-clone.md index 1adc6c9e..29cb5c77 100644 --- a/src/arc-mutex/arc-clone.md +++ b/src/arc-mutex/arc-clone.md @@ -28,7 +28,7 @@ happens-before relationship but is atomic. When `Drop`ping the Arc, however, we'll need to atomically synchronize when decrementing the reference count. This is described more in [the section on the `Drop` implementation for `Arc`](arc-drop.md). For more information on atomic relationships and Relaxed -ordering, see [the section on atomics](../atomics.md). +ordering, see [the section on atomics](../atomics/atomics.md). 
Thus, the code becomes this: From 46f31aeaab34fed9e2da419846ca9b0c68190ca5 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Thu, 4 Aug 2022 18:35:11 +0100 Subject: [PATCH 03/34] Replace accidental rs code blocks with rust --- src/atomics/multithread.md | 8 ++++---- src/atomics/relaxed.md | 4 ++-- src/atomics/specification.md | 10 +++++----- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index 21008926..d22f89d5 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -15,7 +15,7 @@ However, while no global ordering of operations exists _between_ threads, there does exist a single total ordering _within_ each thread, which is known as its _sequence_. For example, given this simple Rust program: -```rs +```rust println!("A"); println!("B"); ``` @@ -39,7 +39,7 @@ happen one after the other and on the same thread. If we add a second thread to the mix: -```rs +```rust // Thread 1: println!("A"); println!("B"); @@ -71,7 +71,7 @@ is. Now let’s make things more interesting by introducing some shared data, and have both threads read it. -```rs +```rust // Initial state let data = 0; // Thread 1: @@ -101,7 +101,7 @@ return to a single thread, just to keep things simple). Consider this code, which we’re going to attempt to draw a diagram for like above: -```rs +```rust let mut data = 0; data = 1; data; diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index 887b5be5..ad8c4674 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -4,9 +4,9 @@ Now we’ve got single-threaded mutation semantics out of the way, we can try reintroducing a second thread. We’ll have one thread perform a write to the memory location, and a second thread read from it, like so: -```rs +```rust // Initial state -let mut state = 0; +let mut data = 0; // Thread 1: data = 1; // Thread 2: diff --git a/src/atomics/specification.md b/src/atomics/specification.md index 741d802c..738b283b 100644 --- a/src/atomics/specification.md +++ b/src/atomics/specification.md @@ -193,7 +193,7 @@ happens before the other. Any such data race results in undefined behavior. ## Atomic orderings -```rs +```rust // in ::core::sync::atomic #[non_exhaustive] pub enum Ordering { @@ -269,7 +269,7 @@ Implementations should ensure that no “out-of-thin-air” values are computed circularly depend on their own computation. > _Note 6_: For example, with `x` and `y` initially zero, -> ```rs +> ```rust,ignore > // Thread 1: > let r1 = y.load(atomic::Ordering::Relaxed); > x.store(r1, atomic::Ordering::Relaxed); @@ -284,7 +284,7 @@ circularly depend on their own computation. > _Note 7_: The recommendation similarly disallows `r1 == r2 == 42` in the > following example, with `x` and `y` again initially zero: -> ```rs +> ```rust,ignore > // Thread 1: > let r1 = x.load(atomic::Ordering::Relaxed); > if r1 == 42 { @@ -328,7 +328,7 @@ synchronizes with an acquire fence _B_ if there exists some atomic operation _X_ on _M_ such that _X_ is sequenced before _B_ and reads the value written by _A_ or a value written by any side effect in the release sequence headed by _A_. -```rs +```rust,ignore pub fn fence(order: Ordering); ``` @@ -339,7 +339,7 @@ _Effects_: Depending on the value of `order`, this operation: - is both an acquire and a release fence, if `order == AcqRel`; - is a sequentially consistent acquire and release fence, if `order == SeqCst`. 
-```rs +```rust,ignore pub fn compiler_fence(order: Ordering); ``` From 103a733af9d4dfd9f8207c74c91a850e287cad45 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Thu, 4 Aug 2022 18:37:14 +0100 Subject: [PATCH 04/34] Replace reads with explicit `println!`s --- src/atomics/multithread.md | 6 +++--- src/atomics/relaxed.md | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index d22f89d5..e26fb60c 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -75,9 +75,9 @@ both threads read it. // Initial state let data = 0; // Thread 1: -data; +println!("{data}"); // Thread 2: -data; +eprintln!("{data}"); ``` Each memory location, similarly to threads, can be shown as another column on @@ -104,7 +104,7 @@ above: ```rust let mut data = 0; data = 1; -data; +println!("{data}"); data = 2; ``` diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index ad8c4674..f20b69a7 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -10,7 +10,7 @@ let mut data = 0; // Thread 1: data = 1; // Thread 2: -data; +println!("{data}"); ``` Of course, any Rust programmer will immediately tell you that this code doesn’t From a26eab47d283b7a9a66a38ab7b6a432cbfbb85e6 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 5 Aug 2022 12:14:36 +0100 Subject: [PATCH 05/34] =?UTF-8?q?Write=20the=20=E2=80=9CRelaxed=E2=80=9D?= =?UTF-8?q?=20section?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/relaxed.md | 390 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 385 insertions(+), 5 deletions(-) diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index f20b69a7..4c4c4c6a 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -28,11 +28,11 @@ Thread 1 data Thread 2 └────┘ ``` -Let’s try to figure out where the line in Thread 2’s access joins up. The rules -from before don’t help us much unfortunately since there are no arrows -connecting that operation to anything, so we can’t immediately rule anything -out. As a result, we end up facing a situation we haven’t faced before: there is -_more than one_ potential value for Thread 2 to read. +Unfortunately, the rules from before don’t help us in finding out where Thread +2’s line joins up to, since there are no arrows connecting that operation to +anything and therefore we can’t immediately rule any values out. As a result, we +end up facing a situation we haven’t faced before: there is _more than one_ +potential value for Thread 2 to read. And this is where we encounter the big limitation with unsynchronized data accesses: the price we pay for their speed and optimization capability is that @@ -40,4 +40,384 @@ this situation is considered **Undefined Behavior**. For an unsynchronized read to be acceptable, there has to be _exactly one_ potential value for it to read, and when there are multiple like in this situation it is considered a data race. +So what can we do about this? Well, two things need to be changed. First of all, +Thread 1 has to use an atomic store instead of an unsynchronized write, and +secondly Thread 2 has to use an atomic load instead of an unsynchronized read. +You’ll also notice that all the atomic functions accept one (and sometimes two) +parameters of `atomic::Ordering`s — we’ll explore the details of the differences +between them later, but for now we’ll use `Relaxed` because it is by far the +simplest of the lot. 
+ +```rust +# use std::sync::atomic::{self, AtomicU32}; +// Initial state +let data = AtomicU32::new(0); +// Thread 1: +data.store(1, atomic::Ordering::Relaxed); +// Thread 2: +data.load(atomic::Ordering::Relaxed); +``` + +The use of the atomic store provides one additional ability in comparison to an +unsynchronized store, and that is that there is no “in-between” state between +the old and new values — instead, it immediately updates, resulting in a diagram +that look a bit more like this: + +```text +Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├─┐ │ 0 │ +╰───────╯ │ └────┘ + └─┬────┐ + │ 1 │ + └────┘ +``` + +We have now established a _modification order_ for `data`: a total, ordered list +of distinct, separated values that it takes over its lifetime. + +On the loading side, we also obtain one additional ability: when there are +multiple possible values to choose from in the modification order, instead of it +triggering UB, exactly one (but it is unspecified which) value is chosen. This +means that there are now _two_ potential executions of our program, with no way +for us to control which one occurs: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ +Thread 1 data Thread 2 ┃ Thread 1 data Thread 2 +╭───────╮ ┌────┐ ╭───────╮ ┃ ╭───────╮ ┌────┐ ╭───────╮ +│ store ├─┐ │ 0 ├───┤ load │ ┃ │ store ├─┐ │ 0 │ ┌─┤ load │ +╰───────╯ │ └────┘ ╰───────╯ ┃ ╰───────╯ │ └────┘ │ ╰───────╯ + └─┬────┐ ┃ └─┬────┐ │ + │ 1 │ ┃ │ 1 ├─┘ + └────┘ ┃ └────┘ +``` + +Note that **both sides must be atomic to avoid the data race**: if only the +writing side used atomic operations, the reading side would still have multiple +values to choose from (UB), and if only the reading side used atomic operations +it could end up reading the garbage data “in-between” `0` and `1` (also UB). + +> **NOTE:** This description of why both sides are needed to be atomic +> operations, while neat and intuitive, is not strictly correct: in reality the +> answer is simply “because the spec says so”. However, it is isomorphic to the +> real rules, so it can aid in understanding. + +## Read-modify-write operations + +Loads and stores are pretty neat in avoiding data races, but you can’t get very +far with them. For example, suppose you wanted to implement a global shared +counter that can be used to assign unique IDs to objects. Naïvely, you might try +to write code like this: + +```rust +# use std::sync::atomic::{self, AtomicU64}; +static COUNTER: AtomicU64 = AtomicU64::new(0); +pub fn get_id() -> u64 { + let value = COUNTER.load(atomic::Ordering::Relaxed); + COUNTER.store(value + 1, atomic::Ordering::Relaxed); + value +} +``` + +But then calling that function from multiple threads opens you up to an +execution like below that results in two threads obtaining the same ID: + +```text +Thread 1 COUNTER Thread 2 +╭───────╮ ┌───┐ ╭───────╮ +│ load ├───┤ 0 ├───┤ load │ +╰───╥───╯ └───┘ ╰────╥──╯ +╭───⇓───╮ ┌─┬───┐ ╭────⇓──╮ +│ store ├─┘ │ 1 │ ┌─┤ store │ +╰───────╯ └───┘ │ ╰───────╯ + ┌───┬─┘ + │ 1 │ + └───┘ +``` + +Technically, I believe it is _possible_ to implement this kind of thing with +just loads and stores, if you try hard enough and use several atomics. But +luckily, you don’t have to because there also exists another kind of operation, +the read-modify-write, which is specifically suited to this purpose. + +A read-modify-write operation (shortened to RMW) is a special kind of atomic +operation that reads, changes and writes back a value _in one step_. 
This means
+that there are guaranteed to exist no other values in the modification order in
+between the read and the write; it happens as a single operation. I would also
+like to point out that this is true of **all** atomic orderings, since a common
+misconception is that the `Relaxed` ordering somehow negates this guarantee.
+
+There are many different RMW operations to choose from, but the one most
+appropriate for this use case is `fetch_add`, which adds a number to the atomic,
+as well as returning the old value. So our code can be rewritten as this:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> u64 {
+    COUNTER.fetch_add(1, atomic::Ordering::Relaxed)
+}
+```
+
+And then, no matter how many threads there are, that race condition from earlier
+can never occur. Executions will have to look more like this:
+
+```text
+  Thread 1     COUNTER     Thread 2
+╭───────────╮   ┌───┐   ╭───────────╮
+│ fetch_add ├─┐ │ 0 │ ┌─┤ fetch_add │
+╰───────────╯ │ └───┘ │ ╰───────────╯
+              └─┬───┐ │
+                │ 1 │ │
+                └───┘ │
+                ┌───┬─┘
+                │ 2 │
+                └───┘
+```
+
+There is one problem with this code however, and that is that if `get_id()` is
+called over 18 446 744 073 709 551 615 times, the counter will overflow and it
+will start generating duplicate IDs. Of course, this won’t feasibly happen, but
+it can be problematic if you need to _prove_ that it can’t happen (e.g. for
+safety purposes) or you’re using a smaller integer type like `u32`.
+
+So we’re going to modify this function so that instead of returning a plain
+`u64` it returns an `Option<u64>`, where `None` is used to indicate that an
+overflow occurred and no more IDs could be generated. Additionally, it’s not
+enough to just return `None` once, because if there are multiple threads
+involved they will not see that result if it just occurs on a single thread —
+instead, it needs to continue to return `None` _until the end of time_ (or,
+well, this execution of the program).
+
+That means we have to do away with `fetch_add`, because `fetch_add` will always
+overflow and there’s no `checked_fetch_add` equivalent. We’ll return to our racy
+algorithm for a minute, this time thinking more about what went wrong. The steps
+look something like this:
+
+1. Load a value of the atomic
+1. Perform the checked add, propagating `None`
+1. Store the new value in the atomic
+
+The problem here is that the store does not necessarily occur directly after the
+load in the atomic’s modification order, and that leads to the races. What we
+need is some way to say, “add this new value to the modification order, but
+_only if_ it occurs directly after the value we loaded”. And luckily for us,
+there exists a function that does exactly\* this: `compare_exchange`.
+
+`compare_exchange` is a bit like a store, but instead of unconditionally storing
+the value, it will first check the previous value in the modification order to
+see whether it is what we expect, and if not it will simply tell us that and not
+make any changes. It is an RMW operation, so all of this happens fully
+atomically — there is no chance for a race condition.
+
+> \* It’s not quite the same, because `compare_exchange` can suffer from ABA
+> problems in which it will see a later value in the modification order that
+> just happened to be the same and succeed. However, in this code values can
+> never be reused so we don’t have to worry about that.
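+
+To get a feel for the shape of the API before we use it, here’s a minimal
+sketch of two isolated `compare_exchange` calls (the concrete values are just
+for illustration):
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+let a = AtomicU64::new(5);
+
+// The previous value is 5, as we expected, so the exchange succeeds: 6 is
+// inserted into the modification order directly after the 5, and the old
+// value is handed back to us in `Ok`.
+assert_eq!(
+    a.compare_exchange(5, 6, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed),
+    Ok(5),
+);
+
+// The value is now 6, not 5, so this exchange fails: nothing is written, and
+// the value that was actually found is handed back to us in `Err`.
+assert_eq!(
+    a.compare_exchange(5, 7, atomic::Ordering::Relaxed, atomic::Ordering::Relaxed),
+    Err(6),
+);
+```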
In our case, we can simply replace the store with a compare exchange of the old
+value and itself plus one (returning `None` instead if the addition overflowed,
+to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
+that some other thread inserted a value in the modification order after the
+value we loaded. This isn’t really a problem — we can just try again and again
+until we succeed, and `compare_exchange` is even nice enough to give us the
+updated value so we don’t have to load again. Also note that after we’ve updated
+our value of the atomic, we’re guaranteed to never see the old value again, by
+the arrow rules from the previous chapter.
+
+So here’s how it looks with these changes applied:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    // Load the counter’s initial value from some place in the modification
+    // order (it doesn’t matter where, because the compare exchange makes sure
+    // that our new value appears directly after it).
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        // Attempt to add one to the atomic.
+        let res = COUNTER.compare_exchange(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        );
+        // Check what happened…
+        match res {
+            // If there was no value in between the value we loaded and our
+            // newly written value in the modification order, the compare
+            // exchange succeeded and so we are done.
+            Ok(_) => break,
+
+            // Otherwise, there was a value in between and so we need to retry
+            // the addition and continue looping.
+            Err(updated_value) => value = updated_value,
+        }
+    }
+    Some(value)
+}
+```
+
+This `compare_exchange` loop enables the algorithm to succeed even under
+contention; it will simply try again (and again and again). In the below
+execution, Thread 1 gets raced to storing its value of `1` to the counter, but
+that’s okay because it will just add `1` to the `1`, making `2`, and retry the
+compare exchange with that, eventually resulting in a unique ID.
+
+```text
+Thread 1   COUNTER   Thread 2
+╭───────╮   ┌───┐   ╭───────╮
+│ load  ├───┤ 0 ├───┤ load  │
+╰───╥───╯   └───┘   ╰───╥───╯
+╭───⇓───╮   ┌───┬─┐ ╭───⇓───╮
+│  cas  ├───┤ 1 │ └─┤  cas  │
+╰───╥───╯   └───┘   ╰───────╯
+╭───⇓───╮ ┌─┬───┐
+│  cas  ├─┘ │ 2 │
+╰───────╯   └───┘
+```
+
+> `compare_exchange` is abbreviated to CAS here (which stands for
+> compare-and-swap), since that is the more general name for the operation. It
+> is not to be confused with `compare_and_swap`, a deprecated method on Rust
+> atomics that performs the same task as `compare_exchange` but has an inferior
+> design in some ways.
+
+There are two additional improvements we can make here. First, because our
+algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
+even when there wasn’t a value inserted in the modification order in between,
+since we’ll just run it again. This allows us to switch out our call to
+`compare_exchange` with a call to the weaker `compare_exchange_weak`, which
+unlike the former function is allowed to _spuriously_ (i.e. randomly, from the
+programmer’s perspective) fail. This often results in better performance on
+architectures like ARM, since their `compare_exchange` is really just a loop
+around the underlying `compare_exchange_weak`. x86\_64 however will see no
+difference in performance.
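+
+As a rough sketch, the loop from above with that one change applied might look
+something like this (behaviorally it’s the same, we just tolerate the spurious
+failures by retrying):
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        // `compare_exchange_weak` takes the same arguments as
+        // `compare_exchange`, but is additionally allowed to fail spuriously;
+        // since we retry on failure anyway, that costs us nothing.
+        match COUNTER.compare_exchange_weak(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        ) {
+            Ok(_) => return Some(value),
+            Err(updated_value) => value = updated_value,
+        }
+    }
+}
+```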
The second improvement is that this pattern is so common that the standard
+library even provides a helper function for it, called `fetch_update`. It
+implements the boilerplate `load`-`loop`-`match` parts for us, so all we have to
+do is provide the closure that calls `checked_add(1)` and it will all just work.
+This leads us to our final code for this example:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    COUNTER.fetch_update(
+        atomic::Ordering::Relaxed,
+        atomic::Ordering::Relaxed,
+        |value| value.checked_add(1),
+    )
+    .ok()
+}
+```
+
+These CAS loops are the absolute bread and butter of concurrent programming;
+they’re absolutely everywhere and essential to know about. Every other RMW
+operation on atomics can be (and often is, if the hardware doesn’t have a more
+efficient implementation) implemented via a CAS loop. This is why CAS is seen
+as the canonical example of an RMW — it’s pretty much the most fundamental
+operation you can get on atomics.
+
+I’d also like to briefly bring attention to the atomic orderings used in this
+section. They were mostly glossed over, but we were exclusively using `Relaxed`,
+and that’s because for something as simple as a global ID counter, _you never
+need more than `Relaxed`_. The more complex cases which we’ll look at later
+definitely do need stronger orderings, but as a general rule, if:
+
+- you only have one atomic, and
+- you have no other related pieces of data
+
+then `Relaxed` is more than sufficient.
+
+## “Out-of-thin-air” values
+
+One peculiar consequence of the semantics of `Relaxed` operations is that it is
+theoretically possible for values to come into existence “out-of-thin-air”
+(commonly abbreviated to OOTA) — that is, a value could appear despite not ever
+being calculated anywhere in code. In particular, consider this setup:
+
+```rust
+# use std::sync::atomic::{self, AtomicU32};
+let x = AtomicU32::new(0);
+let y = AtomicU32::new(0);
+
+// Thread 1:
+let r1 = y.load(atomic::Ordering::Relaxed);
+x.store(r1, atomic::Ordering::Relaxed);
+
+// Thread 2:
+let r2 = x.load(atomic::Ordering::Relaxed);
+y.store(r2, atomic::Ordering::Relaxed);
+```
+
+When starting to draw a diagram for a possible execution of this program, we
+have to first lay out the basic facts that we know:
+- `x` and `y` both start out as zero
+- Thread 1 performs a load of `y` followed by a store of `x`
+- Thread 2 performs a load of `x` followed by a store of `y`
+- Each of `x` and `y` take on exactly two values in their lifetime
+
+Then we can start to construct boxes:
+
+```text
+Thread 1       x       y       Thread 2
+╭───────╮    ┌───┐   ┌───┐    ╭───────╮
+│ load  ├─┐  │ 0 │   │ 0 │  ┌─┤ load  │
+╰───╥───╯ │  └───┘   └───┘  │ ╰───╥───╯
+    ║     │    ?────────────┘     ║
+╭───⇓───╮ └────────────?      ╭───⇓───╮
+│ store ├───┬───┐     ┌───┬───┤ store │
+╰───────╯   │ ? │     │ ? │   ╰───────╯
+            └───┘     └───┘
+```
+
+At this point, if either of those lines were to connect to the higher box then
+the execution would be simple: that thread would forward the value to its lower
+box, which the other thread would then either read, or load the same value
+(zero) from the box above it, and we’d end up with zero in both atomics. But
+what if they were to connect downwards?
Then we’d end up with an execution that
+looks like this:
+
+```text
+Thread 1       x       y       Thread 2
+╭───────╮    ┌───┐   ┌───┐    ╭───────╮
+│ load  ├─┐  │ 0 │   │ 0 │  ┌─┤ load  │
+╰───╥───╯ │  └───┘   └───┘  │ ╰───╥───╯
+    ║     │   ┌─────────────┘     ║
+╭───⇓───╮ └───┼─────────┐     ╭───⇓───╮
+│ store ├───┬─┴─┐     ┌─┴─┬───┤ store │
+╰───────╯   │ ? │     │ ? │   ╰───────╯
+            └───┘     └───┘
+```
+
+But hang on — it’s not fully resolved yet, we still haven’t put a value in
+those lower question marks. So what value should it be? Well, the second value
+of `x` is just copied from the second value of `y`, so we just have to find
+the value of that — but the second value of `y` is itself copied from the second
+value of `x`! This means that we can actually put any value we like in that box,
+including `0` or `42`, and the logic will check out perfectly fine — meaning if
+this program were to execute in this fashion, it would end up reading a value
+produced out of thin air!
+
+Now, if we were to strictly follow the rules we’ve laid out thus far, then this
+would be a totally valid thing to happen. But luckily, the authors of the C++
+specification have recognized this as a problem, and as such refined the
+semantics of `Relaxed` to implement a thorough, logically sound, mathematically
+proven formal model that prevents it, that’s just too complex and technical to
+explain here—
+
+> No “out-of-thin-air” values can be computed that circularly depend on their
+> own computations.
+
+Just kidding. Turns out, it’s a *really* difficult problem to solve, and to my
+knowledge even now there is no known formal way to express how to prevent it. So
+in the specification they just kind of hand-wave and say that it shouldn’t
+happen, and that the above program must always give zero in both atomics,
+despite the theoretical execution that could result in something else. Well, it
+generally works in practice so I can’t complain — it’s just a very interesting
+detail to know about.

From d01fb667fd577a37be786367502d37feed85a511 Mon Sep 17 00:00:00 2001
From: SabrinaJewson
Date: Fri, 5 Aug 2022 12:15:07 +0100
Subject: [PATCH 06/34] Remove specification chapter

This is not allowed for copyright reasons.

---
 src/SUMMARY.md               |   1 -
 src/atomics/specification.md | 354 -----------------------------------
 2 files changed, 355 deletions(-)
 delete mode 100644 src/atomics/specification.md

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index a65f21ef..e767c8e5 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -48,7 +48,6 @@
     * [SeqCst](./atomics/seqcst.md)
     * [Fences](./atomics/fences.md)
     * [Signals](./atomics/signals.md)
-    * [Specification](./atomics/specification.md)
 * [Implementing Vec](./vec/vec.md)
 * [Layout](./vec/vec-layout.md)
 * [Allocating](./vec/vec-alloc.md)
diff --git a/src/atomics/specification.md b/src/atomics/specification.md
deleted file mode 100644
index 738b283b..00000000
--- a/src/atomics/specification.md
+++ /dev/null
@@ -1,354 +0,0 @@
-# Specification
-
-Below is a modified C++20 specification draft (as it was on 2022-07-16), edited
-to remove C++-only features like consume orderings and `sig_atomic_t`.
-
-Note that although this has been checked, atomics are very difficult to get
-right and so there may be subtle mistakes. If you want to more formally check
-your software, read the [\[intro.races\]], [\[atomics.order\]] and
-[\[atomics.fences\]] sections of the real C++ specification.
- -[\[intro.races\]]: https://eel.is/c++draft/intro.races -[\[atomics.order\]]: https://eel.is/c++draft/atomics.order -[\[atomics.fences\]]: https://eel.is/c++draft/atomics.fences - -## Data races - -The value of an object visible to a thread _T_ at a particular point is the -initial value of the object, a value assigned to the object by _T_, or a value -assigned to the object by another thread, according to the rules below. - -> _Note 1_: In some cases, there might instead be undefined behavior. Much of -> this subclause is motivated by the desire to support atomic operations with -> explicit and detailed visibility constraints. However, it also implicitly -> supports a simpler view for more restricted programs. - -Two expression evaluations _conflict_ if one of them modifies a memory location -and the other one reads or modifies the same memory location. - -The library defines a number of atomic operations and operations on mutexes that -are specially identified as synchronization operations. These operations play a -special role in making assignments in one thread visible to another. A -synchronization operation on one or more memory locations is either an acquire -operation, a release operation, or both an acquire and release operation. A -synchronization operation without an associated memory location is a fence and -can be either an acquire fence, a release fence, or both an acquire and release -fence. In addition, there are relaxed atomic operations, which are not -synchronization operations, and atomic read-modify-write operations, which have -special characteristics. - -> _Note 2_: For example, a call that acquires a mutex will perform an acquire -> operation on the locations comprising the mutex. Correspondingly, a call that -> releases the same mutex will perform a release operation on those same -> locations. Informally, performing a release operation on _A_ forces prior side -> effects on other memory locations to become visible to other threads that -> later perform an acquire operation on _A_. “Relaxed” atomic operations are not -> synchronization operations even though, like synchronization operations, they -> cannot contribute to data races. - -All modifications to a particular atomic object _M_ occur in some particular -total order, called the _modification order_ of _M_. - -> _Note 3_: There is a separate order for each atomic object. There is no -> requirement that these can be combined into a single total order for all -> objects. In general this will be impossible since different threads can -> observe modifications to different objects in inconsistent orders. - -A _release sequence_ headed by a release operation _A_ on an atomic object _M_ -is a maximal contiguous sub-sequence of side effects in the modification order -of _M_, where the first operation is _A_, and every subsequent operation is an -atomic read-modify-write operation. - -Certain library calls _synchronize with_ other library calls performed by -another thread. For example, an atomic store-release synchronizes with a -load-acquire that takes its value from the store. - -> _Note 4_: Except in the specified cases, reading a later value does not -> necessarily ensure visibility as described below. Such a requirement would -> sometimes interfere with efficient implementation. - -> _Note 5_: The specifications of the synchronization operations define when one -> reads the value written by another. For atomic objects, the definition is -> clear. All operations on a given mutex occur in a single total order. 
Each -> mutex acquisition “reads the value written” by the last mutex release. - -An evaluation _A_ _happens before_ an evaluation _B_ (or, equivalently, _B_ -_happens after_ _A_) if either: -- _A_ is sequenced before _B_, or -- _A_ synchronizes with _B_, or -- for some evaluation _X_, _A_ happens before _X_ and _X_ happens before _B_. - -An evaluation _A_ _strongly happens before_ an evaluation _D_ if, either -- _A_ is sequenced before _D_, or -- _A_ synchronizes with _D_, and both _A_ and _D_ and sequentially consistent - atomic operations, or -- there are evaluations _B_ and _C_ such that _A_ is sequenced before _B_, _B_ - happens before _C_, and _C_ is sequenced before _D_, or -- there is an evaluation _B_ such that _A_ strongly happens before _B_, and _B_ - strongly happens before _D_. - -> _Note 11_: Informally, if _A_ strongly happens before _B_, then _A_ appears to -> be evaluated before _B_ in all contexts. - -A _visible side effect_ _A_ on a scalar object _M_ with respect to a value -computation _B_ of _M_ satisfies the conditions: -- _A_ happens before _B_ and -- there is no other side effect _X_ to _M_ such that _A_ happens before _X_ and - _X_ happens before _B_. - -The value of a non-atomic scalar object _M_, as determined by evaluation _B_, -shall be the value stored by the visible side effect _A_. - -> _Note 12_: If there is ambiguity about which side effect to a non-atomic -> object is visible, then the behavior is either unspecified or undefined. - -> _Note 13_: This states that operations on ordinary objects are not visibly -> reordered. This is not actually detectable without data races, but it is -> necessary to ensure that data races, as defined below, and with suitable -> restrictions on the use of atomics, correspond to data races in a simple -> interleaved (sequentially consistent) execution. - -The value of an atomic object _M_, as determined by evaluation _B_, shall be the -value stored by some side effect _A_ that modifies _M_, where _B_ does not -happen before _A_. - -> _Note 14_: The set of such side effects is also restricted by the rest of the -> rules described here, and in particular, by the coherence requirements below. - -If an operation _A_ that modifies an atomic object _M_ happens before an -operation _B_ that modifies _M_, then _A_ shall be earlier than _B_ in the -modification order of _M_. - -> _Note 15_: This requirement is known as write-write coherence. - -If a value computation _A_ of an atomic object _M_ happens before a value -computation _B_ of _M_, and _A_ takes its value from a side effect _X_ on _M_, -then the value computed by _B_ shall either be the value stored by _X_ or the -value stored by a side effect _Y_ on _M_, where _Y_ follows _X_ in the -modification order of _M_. - -> _Note 16_: This requirement is known as read-read coherence. - -If a value computation _A_ of an atomic object _M_ happens before an operation -_B_ that modifies _M_, then _A_ shall take its value from a side effect _X_ on -_M_, where _X_ precedes _B_ in the modification order of _M_. - -> _Note 17_: This requirement is known as read-write coherence. - -If a side effect _X_ on an atomic object _M_ happens before a value computation -_B_ of _M_, then the evaluation _B_ shall take its value from _X_ or from a side -effect _Y_ that follows _X_ in the modification order of _M_. - -> _Note 18_: This requirement is known as write-read coherence. 
- -> _Note 19_: The four preceding coherence requirements effectively disallow -> compiler reordering of atomic operations to a single object, even if both -> operations are relaxed loads. This effectively makes the cache coherence -> guarantee provided by most hardware available to C++ atomic operations. - -> _Note 20_: The value observed by a load of an atomic depends on the “happens -> before” relation, which depends on the values observed by loads of atomics. -> The intended reading is that there must exist an association of atomic loads -> with modifications they observe that, together with suitably chosen -> modification orders and the “happens before” relation derived as described -> above, satisfy the resulting constraints as imposed here. - -Two actions are _potentially concurrent_ if -- they are performed by different threads, or -- they are unsequenced, at least one is performed by a signal handler, and they - are not both performed by the same signal handler invocation. - -The execution of a program contains a _data race_ if it contains two potentially -concurrent conflicting actions, at least one of which is not atomic, and neither -happens before the other. Any such data race results in undefined behavior. - -> _Note 21_: It can be shown that programs that correctly use mutexes and -> `SeqCst` operations to prevent all data races and use no other synchronization -> operations behave as if the operations executed by their constituent threads -> were simply interleaved, with each value computation of an object being taken -> from the last side effect on that object in that interleaving. This is normally -> referred to as “sequential consistency”. However, this applies only to -> data-race-free programs, and data-race-free programs cannot observe most -> program transformations that do not change single-threaded program semantics. -> In fact, most single-threaded program transformations continue to be allowed, -> since any program that behaves differently as a result has undefined behavior. - -> _Note 22_: Compiler transformations that introduce assignments to a -> potentially shared memory location that would not be modified by the abstract -> machine are generally precluded by this document, since such an assignment -> might overwrite another assignment by a different thread in cases in which an -> abstract machine execution would not have encountered a data race. This -> includes implementations of data member assignment that overwrite adjacent -> members in separate memory locations. Reordering of atomic loads in cases in -> which the atomics in question might alias is also generally precluded, since -> this could violate the coherence rules. - -> _Note 23_: Transformations that introduce a speculative read of a potentially -> shared memory location might not preserve the semantics of the C++ program as -> defined in this document, since they potentially introduce a data race. -> However, they are typically valid in the context of an optimizing compiler -> that targets a specific machine with well-defined semantics for data races. -> They would be invalid for a hypothetical machine that is not tolerant of races -> or provides hardware race detection. 
- -## Atomic orderings - -```rust -// in ::core::sync::atomic -#[non_exhaustive] -pub enum Ordering { - Relaxed, - Release, - Acquire, - AcqRel, - SeqCst, -} -``` - -The enumeration `Ordering` specifies the detailed regular (non-atomic) memory -synchronization order as defined in this document and may provide for operation -ordering. Its enumerated values and their meanings are as follows: -- `Relaxed`: no operation orders memory. -- `Release`, `AcqRel`, and `SeqCst`: a store operation performs a release - operation on the affected memory location. -- `Acquire`, `AcqRel`, and `SeqCst`: a load operation performs an acquire - operation on the affected memory location. - -> _Note 2_: Atomic operations specifying `Relaxed` are relaxed with respect to -> memory ordering. Implementations must still guarantee that any given atomic -> access to a particular atomic object be indivisible with respect to all other -> atomic accesses to that object. - -An atomic operation _A_ that performs a release operation on an atomic object -_M_ synchronizes with an atomic operation _B_ that performs an acquire operation -on _M_ and takes its value from any side effect in the release sequence headed -by _A_. - -An atomic operation _A_ on some atomic object _M_ is coherence-ordered before -another atomic operation _B_ on _M_ if -- _A_ is a modification, and _B_ reads the value stored by _A_, or -- _A_ precedes _B_ in the modification order of _M_, or -- _A_ and _B_ are not the same atomic read-modify-write operation, and there - exists an atomic modification _X_ of _M_ such that _A_ reads the value - stored by _X_ and _X_ precedes _B_ in the modification order of _M_, or -- there exists an atomic modification _X_ of _M_ such that _A_ is - coherence-ordered before _X_ and _X_ is coherence-ordered before _B_. - -There is a single total order _S_ on all `SeqCst` operations, including fences, -that satisfies the following constraints. First, if _A_ and _B_ are `SeqCst` -operations and _A_ strongly happens before _B_, then _A_ precedes _B_ in _S_. -Second, for every pair of atomic operations _A_ and _B_ on an object _M_, where -_A_ is coherence-ordered before _B_, the following four conditions are required -to be satisfied by _S_: -- if _A_ and _B_ are both `SeqCst` operations, then _A_ precedes _B_ in _S_; and -- if _A_ is a `SeqCst` operation and _B_ happens before a `SeqCst` fence _Y_, - then _A_ precedes _Y_ in _S_; and -- if a `SeqCst` fence _X_ happens before _A_ and _B_ is a `SeqCst` operation, - then _X_ precedes _B_ in _S_; and -- if an `SeqCst` fence _X_ happens before _A_ and _B_ happens before a `SeqCst` - fence _Y_, then _X_ precedes _Y_ in _S_. - -> _Note 3_: This definition ensures that _S_ is consistent with the modification -> order of any atomic object _M_. It also ensures that a `SeqCst` load _A_ of -> _M_ gets its value either from the last modification of _M_ that precedes _A_ -> in _S_ or from some non-`SeqCst` modification of _M_ that does not happen -> before any modification of _M_ that precedes _A_ in _S_. - -> _Note 4_: We do not require that _S_ be consistent with “happens before”. This -> allows more efficient implementation of `Acquire` and `Release` on some -> machine architectures. It can produce surprising results when these are mixed -> with `SeqCst` accesses. - -> _Note 5_: `SeqCst` ensures sequential consistency only for a program that is -> free of data races and uses exclusively `SeqCst` atomic operations. 
Any use of -> weaker ordering will invalidate this guarantee unless extreme care is used. In -> many cases, `SeqCst` atomic operations are reorderable with respect to other -> atomic operations performed by the same thread. - -Implementations should ensure that no “out-of-thin-air” values are computed that -circularly depend on their own computation. - -> _Note 6_: For example, with `x` and `y` initially zero, -> ```rust,ignore -> // Thread 1: -> let r1 = y.load(atomic::Ordering::Relaxed); -> x.store(r1, atomic::Ordering::Relaxed); -> // Thread 2: -> let r2 = x.load(atomic::Ordering::Relaxed); -> y.store(r2, atomic::Ordering::Relaxed); -> ``` -> this recommendation discourages producing `r1 == r2 == 42`, since the store of -> 42 to `y` is only possible if the store to `x` stores `42`, which circularly -> depends on the store to `y` storing `42`. Note that without this restriction, -> such an execution is possible. - -> _Note 7_: The recommendation similarly disallows `r1 == r2 == 42` in the -> following example, with `x` and `y` again initially zero: -> ```rust,ignore -> // Thread 1: -> let r1 = x.load(atomic::Ordering::Relaxed); -> if r1 == 42 { -> y.store(42, atomic::Ordering::Relaxed); -> } -> // Thread 2: -> let r2 = y.load(atomic::Ordering::Relaxed); -> if r2 == 42 { -> x.store(42, atomic::Ordering::Relaxed); -> } -> ``` - -Atomic read-modify-write operations shall always read the last value (in the -modification order) written before the write associated with the -read-modify-write operation. - -Implementations should make atomic stores visible to atomic loads within a -reasonable amount of time. - -## Atomic fences - -This subclause introduces synchronization primitives called _fences_. Fences can -have acquire semantics, release semantics, or both. A fence with acquire -semantics is called an _acquire fence_. A fence with release semantics is called -a _release fence_. - -A release fence _A_ synchronizes with an acquire fence _B_ if there exist atomic -operations _X_ and _Y_, both operating on some atomic object _M_, such that _A_ -is sequenced before _X_, _X_ modifies _M_, _Y_ is sequenced before _B_, and _Y_ -reads the value written by _X_ or a value written by any side effect in the -hypothetical release sequence _X_ would head if it were a release operation. - -A release fence _A_ synchronizes with an atomic operation _B_ that performs an -acquire operation on an atomic object _M_ if there exists an atomic operation -_X_ such that _A_ is sequenced before _X_, _X_ modifies _M_, and _B_ reads the -value written by _X_ or a value written by any side effect in the hypothetical -release sequence _X_ would head if it were a release operation. - -An atomic operation _A_ that is a release operation on an atomic object _M_ -synchronizes with an acquire fence _B_ if there exists some atomic operation _X_ -on _M_ such that _X_ is sequenced before _B_ and reads the value written by _A_ -or a value written by any side effect in the release sequence headed by _A_. - -```rust,ignore -pub fn fence(order: Ordering); -``` - -_Effects_: Depending on the value of `order`, this operation: -- has no effects, if `order == Relaxed`; -- is an acquire fence, if `order == Acquire`; -- is a release fence, if `order == Release`; -- is both an acquire and a release fence, if `order == AcqRel`; -- is a sequentially consistent acquire and release fence, if `order == SeqCst`. 
-
```rust,ignore
pub fn compiler_fence(order: Ordering);
```

_Effects_: Equivalent to `fence(order)`, except that the resulting ordering
constraints are established only between a thread and a signal handler executed
in the same thread.

> _Note 1_: `compiler_fence` can be used to specify the order in which actions
> performed by the thread become visible to the signal handler. Compiler
> optimizations and reorderings of loads and stores are inhibited in the same
> way as with `fence` but the hardware fence instructions that `fence` would
> have inserted are not emitted.

From 715e67ffbeeec272012252d5b1a0715b3dd19822 Mon Sep 17 00:00:00 2001
From: SabrinaJewson
Date: Sun, 14 Aug 2022 15:07:11 +0100
Subject: [PATCH 07/34] Write about `Acquire` and `Release`

---
 src/atomics/acquire-release.md | 333 +++++++++++++++++++++++++++++++++
 1 file changed, 333 insertions(+)

diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md
index 7dc85bde..9a7e24e5 100644
--- a/src/atomics/acquire-release.md
+++ b/src/atomics/acquire-release.md
@@ -1 +1,334 @@
 # Acquire and Release

Next, we’re going to try and implement one of the simplest concurrent utilities
possible — a mutex, but without support for waiting (since that’s not really
related to what we’re doing now). It will hold both an atomic flag that
indicates whether it is locked or not, and the protected data itself. In code
this translates to:

```rs
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicBool;

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self {
            locked: AtomicBool::new(false),
            data: UnsafeCell::new(data),
        }
    }
}
```

Now for the lock function. We need to use an RMW here, since we need to both
check whether it is locked and lock it if it is not in a single atomic step;
this can be most simply done with a `compare_exchange` (unlike before, it
doesn’t need to be in a loop this time). For the ordering, we’ll just use
`Relaxed` since we don’t know of any others yet.

```rust
# use std::cell::UnsafeCell;
# use std::sync::atomic::{self, AtomicBool};
# pub struct Mutex<T> {
#     locked: AtomicBool,
#     data: UnsafeCell<T>,
# }
impl<T> Mutex<T> {
    pub fn lock(&self) -> Option<Guard<'_, T>> {
        match self.locked.compare_exchange(
            false,
            true,
            atomic::Ordering::Relaxed,
            atomic::Ordering::Relaxed,
        ) {
            Ok(_) => Some(Guard(self)),
            Err(_) => None,
        }
    }
}

pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
// Deref impl omitted…
```

We also need to implement `Drop` for `Guard` to make sure the lock on the mutex
is released once the guard is destroyed. Again we’re just using the `Relaxed`
ordering.

```rust
# use std::cell::UnsafeCell;
# use std::sync::atomic::{self, AtomicBool};
# pub struct Mutex<T> {
#     locked: AtomicBool,
#     data: UnsafeCell<T>,
# }
# pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        self.0.locked.store(false, atomic::Ordering::Relaxed);
    }
}
```

Great! In normal operation, then, this primitive should allow unique access to
the data of the mutex to be transferred across different threads. Usual usage
could look like this:

```rust,ignore
// Initial state
let mutex = Mutex::new(0);
// Thread 1
if let Some(guard) = mutex.lock() {
    *guard += 1;
}
// Thread 2
if let Some(guard) = mutex.lock() {
    println!("{}", *guard);
}
```

Now, there are many possible executions of this code.
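Before we look at those executions, one practical aside: the listing above
leaves the `Deref` impl for `Guard` out, so `*guard += 1` will not actually
compile as written. If you want to run the example yourself, a minimal sketch of
one way to fill that gap is below (it assumes the impls live next to the `Mutex`
definition so they can see its private fields; the `unsafe` dereferences are
sound only because holding a `Guard` proves the lock is held):

```rust,ignore
use std::ops::{Deref, DerefMut};

impl<T> Deref for Guard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        // SAFETY: this guard exists, so the mutex is locked and no other
        // thread can be accessing `data` right now.
        unsafe { &*self.0.data.get() }
    }
}

impl<T> DerefMut for Guard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        // SAFETY: as above, and `&mut self` guarantees this is the only
        // reference handed out through this particular guard.
        unsafe { &mut *self.0.data.get() }
    }
}
```

You would also need something along the lines of
`unsafe impl<T: Send> Sync for Mutex<T> {}` before the mutex could be shared
between threads at all. With that aside out of the way, let’s get back to the
possible executions.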
For example, Thread 2 (the +reader thread) could lock the mutex first, and Thread 1 (the writer thread) +could fail to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 ├╌┐ ┌─┤ cas │ +╰───────╯ │ └────────┘ └───┘ ┊ │ ╰───╥───╯ + │ ┌────────┬───────┼─┘ ╭───⇓───╮ + └─┤ true │ └╌╌╌┤ guard │ + └────────┘ ╰───╥───╯ + ┌────────┬─────────┐ ╭───⇓───╮ + │ false │ └─┤ store │ + └────────┘ ╰───────╯ +``` + +Or potentially Thread _1_ could lock the mutex first, and Thread _2_ could fail +to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ ┌─│ 0 │───┤ cas │ +╰───╥───╯ │ └────────┘ │┌┼╌╌╌┤ ╰───────╯ +╭───⇓───╮ └─┬────────┐ │├┼╌╌╌┤ +│ += 1; ├╌┐ │ true ├─┘┊│ 1 │ +╰───╥───╯ ┊ └────────┘ ┊└───┘ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ +│ store ├───┬────────┐ +╰───────╯ │ false │ + └────────┘ +``` + +But the interesting case comes in when Thread 1 successfully locks and unlocks +the mutex, and then Thread 2 locks it. Let’s draw that one out too: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 │ ┌───┤ cas │ +╰───╥───╯ │ └────────┘ ┌┼╌╌╌┤ │ ╰───╥───╯ +╭───⇓───╮ └─┬────────┐ ├┼╌╌╌┤ │ ╭───⇓───╮ +│ += 1; ├╌┐ │ true │ ┊│ 1 │ │ ?╌┤ guard │ +╰───╥───╯ ┊ └────────┘ ┊└───┘ │ ╰───╥───╯ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮ +│ store ├───┬────────┐ │ ┌─┤ store │ +╰───────╯ │ false │ │ │ ╰───────╯ + └────────┘ │ │ + ┌────────┬─────────┘ │ + │ true │ │ + └────────┘ │ + ┌────────┬───────────┘ + │ false │ + └────────┘ +``` + +Look at the second operation Thread 2 performs (the read of `data`), for which +we haven’t yet joined the line. Where should it connect to? Well actually, it +has multiple options…wait, we’ve seen this before! It’s a data race! + +That’s not good. Last time the solution was to use atomics instead — but in this +case that doesn’t seem to be enough, since even if atomics were used it still +would have the _option_ of reading `0` instead of `1`, and really if we want our +mutex to be sane, it should only be able to read `1`. + +So it seems that want we _want_ is to be able to apply our arrow rules from +before to completely rule out zero from the set of the possible values — if we +were able to draw a large arrow from the Thread 1’s `+= 1;` to Thread 2’s +`guard`, then we could trivially then use the rule to rule out `0` as a value +that could be read. + +This is where the `Acquire` and `Release` orderings come in. Informally put, a +_release store_ will cause an arrow instead of a line to be drawn from the +operation to the destination; and similarly an _acquire load_ will cause an +arrow to be drawn from the destination to the operation. 
To give a useless +example that illustrates this, for the given program: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +// Initial state +let a = AtomicU32::new(0); +// Thread 1 +a.store(1, atomic::Ordering::Release); +// Thread 2 +a.load(atomic::Ordering::Acquire); +``` + +The two possible executions look like this: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ +Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 +╭───────╮ ┌───┐ ╭──────╮ ┃ ╭───────╮ ┌───┐ ╭──────╮ +│ store ├─┐ │ 0 │ ┌─→ load │ ┃ │ store ├─┐ │ 0 ├───→ load │ +╰───────╯ │ └───┘ │ ╰──────╯ ┃ ╰───────╯ │ └───┘ ╰──────╯ + └─↘───┐ │ ┃ └─↘───┐ + │ 1 ├─┘ ┃ │ 1 │ + └───┘ ┃ └───┘ +``` + +These arrows are a new kind of arrow we +haven’t seen yet; they are known as _happens before_ (or happens after) +relations and are represented as thin arrows (→) on these diagrams. They are +weaker than the _sequenced before_ double-arrows (⇒) that occur inside a single +thread, but can still be used with the arrow rules to determine which values of +a memory location are valid to read. We can say that in the first possible +execution, Thread 1’s `store` (and everything sequenced before that) _happens +before_ Thread 2’s load (and everything sequenced after that). + +There is one more rule required for these to be useful, and that is _release +sequences_: after a release store is performed on an atomic, happens before +arrows will connect together each subsequent value of the atomic as long as the +new value is caused by an RMW and not just a plain store. + +With those rules in mind, converting Thread 1’s second store to use a `Release` +ordering as well as converting Thread 2’s CAS to use an `Acquire` ordering +allows us to effectively draw that arrow we needed before: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 │ ┌───→ cas │ +╰───╥───╯ │ └───────┘ ┌┼╌╌╌┤ │ ╰───╥───╯ +╭───⇓───╮ └─┬───────┐ ├┼╌╌╌┤ │ ╭───⇓───╮ +│ += 1; ├╌┐ │ true │ ┊│ 1 ├╌│╌╌╌┤ guard │ +╰───╥───╯ ┊ └───────┘ ┊└───┘ │ ╰───╥───╯ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮ +│ store ├───↘───────┐ │ ┌─┤ store │ +╰───────╯ │ false │ │ │ ╰───────╯ + └───┬───┘ │ │ + ┌───↓───┬─────────┘ │ + │ true │ │ + └───────┘ │ + ┌───────┬───────────┘ + │ false │ + └───────┘ +``` + +We now can trace back along the reverse direction of arrows from the `guard` +bubble to the `+= 1` bubble; we have established that Thread 2’s load happens +after the `+= 1` side effect. This both avoids the data race and gives the +guarantee that `1` will be always read by Thread 2 (as long as locks after +Thread 1, of course). + +However, that is not the only execution of the program possible. Even with this +setup, there is another execution that can also cause UB: if Thread 2 locks the +mutex before Thread 1 does. + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ +│ cas ├───┐ │ false │┌──│ 0 │────→ cas │ +╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯ +╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮ +│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │ +╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯ +╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮ +│ store ├─┐ │ ┌───────┬────────────┤ store │ +╰───────╯ │ │ │ false │ ╰───────╯ + │ │ └───────┘ + │ └─┬───────┐ + │ │ true │ + │ └───────┘ + └───↘───────┐ + │ false │ + └───────┘ +``` + +Once again `guard` has multiple options for values to read. 
This one’s a bit +more counterintuitive than the previous one, since it requires “travelling +forward in time” to understand why the `1` is even there in the first place — +but since the abstract machine has no concept of time, it’s just a valid UB as +any other. + +Luckily, we’ve already solved this problem once, so it easy to solve again: just +like before, we’ll have the CAS become acquire and the store become release. + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ +│ cas ←───┐ │ false │┌──│ 0 │────→ cas │ +╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯ +╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮ +│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │ +╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯ +╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮ +│ store ├─┐ │ ┌───────↙────────────┤ store │ +╰───────╯ │ │ │ false │ ╰───────╯ + │ │ └───┬───┘ + │ └─┬───↓───┐ + │ │ true │ + │ └───────┘ + └───↘───────┐ + │ false │ + └───────┘ +``` + +We can now use the second arrow rule from before to follow _forward_ the arrow +from the `guard` bubble all the way to the `+= 1;`, determining that it is only +possible for that read to see `0` as its value. + +This leads us to the proper memory orderings for any mutex (and other locks like +RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So +let’s go back to and update our original mutex definition with this knowledge. + +But wait, `compare_exchange` takes two ordering parameters, not just one! That’s +right — it also takes a second one to apply when the exchange fails (in our case, +when the mutex is already locked). But we don’t need an `Acquire` here, since in +that case we won’t be reading from the `data` value anyway, so we’ll just stick +with `Relaxed`. + +```rust,ignore +impl Mutex { + pub fn lock(&self) -> Option> { + match self.locked.compare_exchange( + false, + true, + atomic::Ordering::Acquire, + atomic::Ordering::Relaxed, + ) { + Ok(_) => Some(Guard(self)), + Err(_) => None, + } + } +} + +impl Drop for Guard<'_, T> { + fn drop(&mut self) { + self.0.locked.store(false, atomic::Ordering::Release); + } +} +``` + +Note that similarly to how atomic operations only make sense when paired with +other atomic operations on the same locations, `Acquire` only makes sense when +paired with `Release` and vice versa. That is, both an `Acquire` with no +corresponding `Release` and a `Release` with no corresponding `Acquire` are +useless, since the arrows will be unable to connect to anything. From afe0ee2bf015ef6ad8a74b6d56ea90d23ea86aa4 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 19:58:46 +0100 Subject: [PATCH 08/34] Write the `SeqCst` section --- src/atomics/seqcst.md | 389 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 389 insertions(+) diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index a9f12e59..6be9593d 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -1 +1,390 @@ # SeqCst + +`SeqCst` is probably the most interesting ordering, because it is simultaneously +the simplest and most complex atomic memory ordering in existence. It’s +simple, because if you do only use `SeqCst` everywhere then you can kind of +maybe pretend like the Abstract Machine has a concept of time; phrases like +“latest value” make sense, the program can be thought of as a set of steps that +interleave, there is a universal “now” and “before” and wouldn’t that be nice? 
+But it’s also the most complex, because as soon as look under the hood you +realize just how incredibly convoluted and hard to follow the actual rules +behind it are, and it gets really ugly really fast as soon as you try to mix it +with any other ordering. + +To understand `SeqCst`, we first have to understand the problem it exists to +solve. The first complexity is that this problem can only be observed in the +presence of at least four different threads _and_ two separate atomic variables; +anything less and it’s not possible to notice a difference. The common example +used to show where weaker orderings produce counterintuitive results is this: + +```rust +# use std::sync::atomic::{self, AtomicBool}; +use std::thread; + +// Set this to Relaxed, Acquire, Release, AcqRel, doesn’t matter — the result is +// the same (modulo panics caused by attempting acquire stores or release +// loads). +const ORDERING: atomic::Ordering = atomic::Ordering::Relaxed; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING) }); +let c = thread::spawn(|| { while !X.load(ORDERING) {} Y.load(ORDERING) }); +let d = thread::spawn(|| { while !Y.load(ORDERING) {} X.load(ORDERING) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); +let c = c.join().unwrap(); +let d = d.join().unwrap(); + +# return; +// This assert is allowed to fail. +assert!(c || d); +``` + +The basic setup of this code, for all of its possible executions, looks like +this: + +```text + a static X c d static Y b +╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ +╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ + └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ + │ true ├─┘ │ load Y ├─? ?─┤ load X │ └─┤ true │ + └───────┘ ╰─────────╯ ╰─────────╯ └───────┘ +``` + +In other words, `a` and `b` are guaranteed to, at some point, store `true` into +`X` and `Y` respectively, and `c` and `d` are guaranteed to, at some point, load +those values of `true` from `X` and `Y` (there could also be an arbitrary number +of loads of `false` by `c` and `d`, but they’ve been omitted since they don’t +actually affect the execution at all). The question now is when `c` and `d` load +from `Y` and `X` respectively, is it possible for them _both_ to load `false`? + +And looking at this diagram, there’s absolutely no reason why not. There isn’t +even a single arrow connecting the left and right hand sides so far, so the load +has no restrictions on which value it is allowed to pick — and this goes for +both sides equally, so we could end up with an execution like this: + +```text + a static X c d static Y b +╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false ├┐ ┌┤ load X │ │ load Y ├┐ ┌┤ false │ ┌─┤ store Y │ +╰─────────╯ │ └───────┘│ │╰────╥────╯ ╰────╥────╯│ │└───────┘ │ ╰─────────╯ + └─┬───────┐└─│─────║──────┐┌─────║─────│─┘┌───────┬─┘ + │ true ├──┘╭────⇓────╮┌─┘╭────⇓────╮└──┤ true │ + └───────┘ │ load Y ├┘└─┤ load X │ └───────┘ + ╰─────────╯ ╰─────────╯ +``` + +Which results in a failed assert. This execution is brought about because the +model of separate modification orders means that there is no relative ordering +between `X` and `Y` being changed, and so each thread is allowed to “see” either +order. 
However, some algorithms will require a globally agreed-upon ordering, +and this is where `SeqCst` can come in useful. + +This ordering, first and foremost, inherits the guarantees from all the other +orderings — it is an acquire operation for loads, a release operation for stores +and an acquire-release operation for RMWs. In addition to this, it gives some +guarantees unique to `SeqCst` about what values it is allowed to load. Note that +these guarantees are not about preventing data races: unless you have some +unrelated code that triggers a data race given an unexpected condition, using +`SeqCst` can only prevent you from race conditions because its guarantees only +apply to other `SeqCst` operations rather than all data accesses. + +## S + +`SeqCst` is fundamentally about _S_, which is the global ordering of all +`SeqCst` operations in an execution of the program. It is consistent between +every atomic and every thread, and all stores, fences and RMWs that use a +sequentially consistent ordering have a place in it (but no other operations +do). It is in contrast to modification orders, which are similarly total but +only scoped to a single atomic rather than the whole program. + +Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed +in the next section), _S_ is primarily controlled by the happens before +relations in a program: this means that if an action _A_ happens before an +action _B_, it is also guaranteed to appear before _B_ in _S_. Other than that +restriction, _S_ is unspecified and will be chosen arbitrarily during execution. + +Once a particular _S_ has been established, every atomic’s modification order is +then guaranteed to be consistent with it — this means that a `SeqCst` load will +never see a value that has been overwritten by a write that occurred before it +in _S_, or a value that has been written by a write that occured after it in +_S_ (note that a `Relaxed`/`Acquire` load however might, since there is no +“before” or “after” as it is not in _S_ in the first place). + +So, looking back at our program, let’s consider how we could use `SeqCst` to +make that execution invalid. As a refresher, here’s the framework for every +possible execution of the program: + +```text + a static X c d static Y b +╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ +╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ + └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ + │ true ├─┘ │ load Y ├─? ?─┤ load X │ └─┤ true │ + └───────┘ ╰─────────╯ ╰─────────╯ └───────┘ +``` + +First of all, both the final loads (`c` and `d`’s second operations) need to +become `SeqCst`, because they need to be aware of the total ordering that +determines whether `X` or `Y` becomes `true` first. And secondly, we need to +establish that ordering in the first place, and that needs to be done by making +sure that there is always one operation in _S_ that both sees one of the atomics +as `true` and precedes both final loads (the final loads themselves don’t work +for this since although they “know” that their corresponding atomic is `true` +they don’t interact with it directly so _S_ doesn’t care). + +There are two operations in the program that could fulfill the first condition, +should they be made `SeqCst`: the stores of `true` and the first loads. 
However, +the second condition ends up ruling out using the stores, since in order to make +sure that they precede the final loads in _S_ it would be necessary to have the +first loads be `SeqCst` anyway (due to the mixed-`SeqCst` special case detailed +later), so in the end we can just leave them as `Relaxed`. + +This leaves us with the correct version of the above program, which is +guaranteed to never panic: + +```rust +# use std::sync::atomic::{AtomicBool, Ordering::{Relaxed, SeqCst}}; +use std::thread; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, Relaxed) }); +let b = thread::spawn(|| { Y.store(true, Relaxed) }); +let c = thread::spawn(|| { while !X.load(SeqCst) {} Y.load(SeqCst) }); +let d = thread::spawn(|| { while !Y.load(SeqCst) {} X.load(SeqCst) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); +let c = c.join().unwrap(); +let d = d.join().unwrap(); + +// This assert is **not** allowed to fail. +assert!(c || d); +``` + +As there are four `SeqCst` operations with a partial order between two pairs in +them (caused by the sequenced before relation), there are six possible +executions of this program: + +- All of `c`’s loads precede `d`’s loads: + 1. `c` loads `X` (gives `true`) + 1. `c` loads `Y` (gives either `false` or `true`) + 1. `d` loads `Y` (gives `true`) + 1. `d` loads `X` (required to be `true`) +- Both initial loads precede both final loads: + 1. `c` loads `X` (gives `true`) + 1. `d` loads `Y` (gives `true`) + 1. `c` loads `Y` (required to be `true`) + 1. `d` loads `X` (required to be `true`) +- As above, but the final loads occur in a different order: + 1. `c` loads `X` (gives `true`) + 1. `d` loads `Y` (gives `true`) + 1. `d` loads `X` (required to be `true`) + 1. `c` loads `Y` (required to be `true`) +- As before, but the initial loads occur in a different order: + 1. `d` loads `Y` (gives `true`) + 1. `c` loads `X` (gives `true`) + 1. `c` loads `Y` (required to be `true`) + 1. `d` loads `X` (required to be `true`) +- As above, but the final loads occur in a different order: + 1. `d` loads `Y` (gives `true`) + 1. `c` loads `X` (gives `true`) + 1. `d` loads `X` (required to be `true`) + 1. `c` loads `Y` (required to be `true`) +- All of `d`’s loads precede `c`’s loads: + 1. `d` loads `Y` (gives `true`) + 1. `d` loads `X` (gives either `false` or `true`) + 1. `c` loads `X` (gives `true`) + 1. `c` loads `Y` (required to be `true`) + +All the places were the load is requied to give `true` were caused by a +preceding load in _S_ of the same atomic which saw `true`, because otherwise _S_ +would be inconsistent with the atomic’s modification order and that is +impossible. + +## The mixed-`SeqCst` special case + +As I’ve been alluding to for a while, I wasn’t being totally truthful when I +said that _S_ is consistent with happens before relations — in reality, it is +only consistent with _strongly happens before_ relations, which presents a +subtly-defined subset of happens before relations. In particular, it excludes +two situations: + +1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation + B which is sequenced before another `SeqCst` operation C. Here, despite the + fact that A happens before C, A does not _strongly_ happen before C and so is + there not guaranteed to precede C in _S_. +2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel` + operation B, which synchronizes-with another `SeqCst` operation C. 
Similarly, + despite the fact that A happens before C, A might not precede C in _S_. + +The first situation is illustrated below, with `SeqCst` accesses repesented with +asterisks: + +```text + t_1 x t_2 +╭─────╮ ┌─↘───┐ ╭─────╮ +│ *A* ├─┘ │ 1 ├───→ B │ +╰─────╯ └───┘ ╰──╥──╯ + ╭──⇓──╮ + │ *C* │ + ╰─────╯ +``` + +A happens before, but does not strongly happen before, C — and anything +sequenced after C will have the same treatment (unless more synchronization is +used). This means that C is actually allowed to _precede_ A in _S_, despite +conceptually happening after it. However, anything sequenced before A, because +there is at least one sequence on either side of the synchronization, will +strongly happen before C. + +But this is all highly theoretical at the moment, so let’s make an example to +show how that rule can actually affect the execution of code. So, if C were to +precede A in _S_ then that means in the modification order of any atomic they +both access, C would have to come before A. Let’s say then that C loads from `x` +(the atomic that A has to access), it may load the value that came before A if +it were to precede A in _S_: + +```text + t_1 x t_2 +╭─────╮ ┌───┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├─┐┌→ B │ +╰─────╯ │ └───┘ ││╰──╥──╯ + └─↘───┐┌─┘╭──⇓──╮ + │ 1 ├┘└─→ *C* │ + └───┘ ╰─────╯ +``` + +Ah wait no, that doesn’t work because coherence still mandates that `1` is the +only value that can be loaded. In fact, once `1` is loaded _S_’s required +consistency with modification orders means that A _is_ required to precede C in +_S_ after all. + +So somehow, to observe this difference we need to have a _different_ `SeqCst` +operation, let’s call it E, be the one that loads from `x`, where C is +guaranteed to precede it in _S_ (so we can observe the “weird” state in between +C and A) but C also doesn’t happen before it (to avoid coherence getting in the +way) — and to do that, all we have to do is have C appear before a `SeqCst` +operation D in the modification order of another atomic, but have D be a store +so as to avoid C synchronizing with it, and then our desired load E can simply +be sequenced after D (this will carry over the “precedes in _S_” guarantee, but +does not restore the happens after relation to C since that was already dropped +by having D be a store). + +In diagram form, that looks like this: + +```text + t_1 x t_2 helper t_3 +╭─────╮ ┌───┐ ╭─────╮ ┌─────┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├┐┌─→ B │ ┌─┤ 0 │ ┌─┤ *D* │ +╰─────╯ │ └───┘││ ╰──╥──╯ │ └─────┘ │ ╰──╥──╯ + │ └│────║────│─────────│┐ ║ + └─↘───┐ │ ╭──⇓──╮ │ ┌─────↙─┘│╭──⇓──╮ + │ 1 ├─┘ │ *C* ←─┘ │ 1 │ └→ *E* │ + └───┘ ╰─────╯ └─────┘ ╰─────╯ + +S = C → D → E → A +``` + +C is guaranteed to precede D in _S_, and D is guaranteed to precede E, but +because this exception means that A is _not_ guaranteed to precede C, it is +totally possible for it to come at the end, resulting in the surprising but +totally valid outcome of E loading `0` from `x`. In code, this can be expressed +as the following code _not_ being guaranteed to panic: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Acquire, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_1 +X.store(1, SeqCst); // A + +// thread_2 +assert_eq!(X.load(Acquire), 1); // B +assert_eq!(HELPER.load(SeqCst), 0); // C + +// thread_3 +HELPER.store(1, SeqCst); // D +assert_eq!(X.load(SeqCst), 0); // E +``` + +The second situation listed above has very similar consequences. 
Its abstract +form is the following execution in which A is not guaranteed to precede C in +_S_, despite A happening before C: + +```text + t_1 x t_2 +╭─────╮ ┌─↘───┐ ╭─────╮ +│ *A* │ │ │ 0 ├───→ *C* │ +╰──╥──╯ │ └───┘ ╰─────╯ +╭──⇓──╮ │ +│ B ├─┘ +╰─────╯ +``` + +Similarly to before, we can’t just have A access `x` to show why A not +necessarily preceding C in _S_ matters; instead, we have to introduce a second +atomic and third thread to break the happens before chain first. And finally, a +single relaxed load F at the end is added just to prove that the weird execution +actually happened (leaving `x` as 2 instead of 1). + +```text + t_3 helper t_1 x t_2 +╭─────╮ ┌─────┐ ╭─────╮ ┌───┐ ╭─────╮ +│ *D* ├┐┌─┤ 0 │ ┌─┤ *A* │ │ 0 │ ┌─→ *C* │ +╰──╥──╯││ └─────┘ │ ╰──╥──╯ └───┘ │ ╰──╥──╯ + ║ └│─────────│────║─────┐ │ ║ +╭──⇓──╮ │ ┌─────↙─┘ ╭──⇓──╮ ┌─↘───┐ │ ╭──⇓──╮ +│ *E* ←─┘ │ 1 │ │ B ├─┘││ 1 ├─┘┌┤ F │ +╰─────╯ └─────┘ ╰─────╯ │└───┘ │╰─────╯ + └↘───┐ │ + │ 2 ├──┘ + └───┘ +S = C → D → E → A +``` + +This execution mandates both C preceding A in _S_ and A happening-before C, +something that is only possible through these two mixed-`SeqCst` special +exceptions. It can be expressed in code as well: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Release, Relaxed, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_3 +X.store(2, SeqCst); // D +assert_eq!(HELPER.load(SeqCst), 0); // E + +// thread_1 +HELPER.store(1, SeqCst); // A +X.store(1, Release); // B + +// thread_2 +assert_eq!(X.load(SeqCst), 1); // C +assert_eq!(X.load(Relaxed), 2); // F +``` + +If this seems ridiculously specific and obscure, that’s because it is. +Originally, back in C++11, this special case didn’t exist — but then six years +later it was discovered that in practice atomics on Power, Nvidia GPUs and +sometimes ARMv7 _would_ have this special case, and fixing the implementations +would make atomics significantly slower. So instead, in C++20 they simply +encoded it into the specification. + +Generally however, this rule is so complex it’s best to just avoid it entirely +by never mixing `SeqCst` and non-`SeqCst` on a single atomic in the first place +— or even better, just avoiding `SeqCst` entirely and using a stronger ordering +instead that has less complex semantics and fewer gotchas. From c1129e31c8dd2ac0b73d048b00388478922e6391 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:02:34 +0100 Subject: [PATCH 09/34] =?UTF-8?q?=E2=80=9Chappens=20before=E2=80=9D=20?= =?UTF-8?q?=E2=86=92=20=E2=80=9Chappens-before=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/acquire-release.md | 14 +++++++------- src/atomics/seqcst.md | 28 ++++++++++++++-------------- 2 files changed, 21 insertions(+), 21 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index 9a7e24e5..074fa9df 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -195,16 +195,16 @@ Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 ``` These arrows are a new kind of arrow we -haven’t seen yet; they are known as _happens before_ (or happens after) +haven’t seen yet; they are known as _happens-before_ (or happens-after) relations and are represented as thin arrows (→) on these diagrams. 
They are weaker than the _sequenced before_ double-arrows (⇒) that occur inside a single thread, but can still be used with the arrow rules to determine which values of a memory location are valid to read. We can say that in the first possible -execution, Thread 1’s `store` (and everything sequenced before that) _happens -before_ Thread 2’s load (and everything sequenced after that). +execution, Thread 1’s `store` (and everything sequenced before that) +_happens-before_ Thread 2’s load (and everything sequenced after that). There is one more rule required for these to be useful, and that is _release -sequences_: after a release store is performed on an atomic, happens before +sequences_: after a release store is performed on an atomic, happens-before arrows will connect together each subsequent value of the atomic as long as the new value is caused by an RMW and not just a plain store. @@ -233,9 +233,9 @@ Thread 1 locked data Thread 2 ``` We now can trace back along the reverse direction of arrows from the `guard` -bubble to the `+= 1` bubble; we have established that Thread 2’s load happens -after the `+= 1` side effect. This both avoids the data race and gives the -guarantee that `1` will be always read by Thread 2 (as long as locks after +bubble to the `+= 1` bubble; we have established that Thread 2’s load +happens-after the `+= 1` side effect. This both avoids the data race and gives +the guarantee that `1` will be always read by Thread 2 (as long as locks after Thread 1, of course). However, that is not the only execution of the program possible. Even with this diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index 6be9593d..7c3dd6a4 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -105,8 +105,8 @@ do). It is in contrast to modification orders, which are similarly total but only scoped to a single atomic rather than the whole program. Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed -in the next section), _S_ is primarily controlled by the happens before -relations in a program: this means that if an action _A_ happens before an +in the next section), _S_ is primarily controlled by the happens-before +relations in a program: this means that if an action _A_ happens-before an action _B_, it is also guaranteed to appear before _B_ in _S_. Other than that restriction, _S_ is unspecified and will be chosen arbitrarily during execution. @@ -214,18 +214,18 @@ impossible. ## The mixed-`SeqCst` special case As I’ve been alluding to for a while, I wasn’t being totally truthful when I -said that _S_ is consistent with happens before relations — in reality, it is -only consistent with _strongly happens before_ relations, which presents a -subtly-defined subset of happens before relations. In particular, it excludes +said that _S_ is consistent with happens-before relations — in reality, it is +only consistent with _strongly happens-before_ relations, which presents a +subtly-defined subset of happens-before relations. In particular, it excludes two situations: 1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation B which is sequenced before another `SeqCst` operation C. Here, despite the - fact that A happens before C, A does not _strongly_ happen before C and so is + fact that A happens-before C, A does not _strongly_ happen-before C and so is there not guaranteed to precede C in _S_. 2. 
The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel` operation B, which synchronizes-with another `SeqCst` operation C. Similarly, - despite the fact that A happens before C, A might not precede C in _S_. + despite the fact that A happens-before C, A might not precede C in _S_. The first situation is illustrated below, with `SeqCst` accesses repesented with asterisks: @@ -240,12 +240,12 @@ asterisks: ╰─────╯ ``` -A happens before, but does not strongly happen before, C — and anything +A happens-before, but does not strongly happen-before, C — and anything sequenced after C will have the same treatment (unless more synchronization is used). This means that C is actually allowed to _precede_ A in _S_, despite -conceptually happening after it. However, anything sequenced before A, because +conceptually occuring after it. However, anything sequenced before A, because there is at least one sequence on either side of the synchronization, will -strongly happen before C. +strongly happen-before C. But this is all highly theoretical at the moment, so let’s make an example to show how that rule can actually affect the execution of code. So, if C were to @@ -272,12 +272,12 @@ _S_ after all. So somehow, to observe this difference we need to have a _different_ `SeqCst` operation, let’s call it E, be the one that loads from `x`, where C is guaranteed to precede it in _S_ (so we can observe the “weird” state in between -C and A) but C also doesn’t happen before it (to avoid coherence getting in the +C and A) but C also doesn’t happen-before it (to avoid coherence getting in the way) — and to do that, all we have to do is have C appear before a `SeqCst` operation D in the modification order of another atomic, but have D be a store so as to avoid C synchronizing with it, and then our desired load E can simply be sequenced after D (this will carry over the “precedes in _S_” guarantee, but -does not restore the happens after relation to C since that was already dropped +does not restore the happens-after relation to C since that was already dropped by having D be a store). In diagram form, that looks like this: @@ -321,7 +321,7 @@ assert_eq!(X.load(SeqCst), 0); // E The second situation listed above has very similar consequences. Its abstract form is the following execution in which A is not guaranteed to precede C in -_S_, despite A happening before C: +_S_, despite A happening-before C: ```text t_1 x t_2 @@ -335,7 +335,7 @@ _S_, despite A happening before C: Similarly to before, we can’t just have A access `x` to show why A not necessarily preceding C in _S_ matters; instead, we have to introduce a second -atomic and third thread to break the happens before chain first. And finally, a +atomic and third thread to break the happens-before chain first. And finally, a single relaxed load F at the end is added just to prove that the weird execution actually happened (leaving `x` as 2 instead of 1). 
From b89639939fdf537096ad6e6d57b9f31edcdd933b Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:11:20 +0100 Subject: [PATCH 10/34] Introduce synchronizes-with terminology --- src/atomics/acquire-release.md | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index 074fa9df..fa048ee0 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -194,14 +194,21 @@ Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 └───┘ ┃ └───┘ ``` -These arrows are a new kind of arrow we -haven’t seen yet; they are known as _happens-before_ (or happens-after) -relations and are represented as thin arrows (→) on these diagrams. They are -weaker than the _sequenced before_ double-arrows (⇒) that occur inside a single -thread, but can still be used with the arrow rules to determine which values of -a memory location are valid to read. We can say that in the first possible -execution, Thread 1’s `store` (and everything sequenced before that) -_happens-before_ Thread 2’s load (and everything sequenced after that). +These arrows are a new kind of arrow we haven’t seen yet; they are known as +_happens-before_ (or happens-after) relations and are represented as thin arrows +(→) on these diagrams. They are weaker than the _sequenced before_ +double-arrows (⇒) that occur inside a single thread, but can still be used with +the arrow rules to determine which values of a memory location are valid to +read. + +When a happens-before arrow stores a data value to an atomic (via a release +operation) which is then loaded by another happens-before arrow (via an acquire +operation) we say that the release operation _synchronized-with_ the acquire +operation, which in doing so establishes that the release operation +_happens-before_ the acquire operation. Therefore, we can say that in the first +possible execution, Thread 1’s `store` synchronizes-with Thread 2’s `load`, +which causes that `store` and everything sequenced before it to happen-before +the `load` and everything sequenced after it. There is one more rule required for these to be useful, and that is _release sequences_: after a release store is performed on an atomic, happens-before @@ -234,9 +241,10 @@ Thread 1 locked data Thread 2 We now can trace back along the reverse direction of arrows from the `guard` bubble to the `+= 1` bubble; we have established that Thread 2’s load -happens-after the `+= 1` side effect. This both avoids the data race and gives -the guarantee that `1` will be always read by Thread 2 (as long as locks after -Thread 1, of course). +happens-after the `+= 1` side effect, because Thread 2’s CAS synchronizes-with +Thread 1’s store. This both avoids the data race and gives the guarantee that +`1` will be always read by Thread 2 (as long as locks after Thread 1, of +course). However, that is not the only execution of the program possible. 
Even with this setup, there is another execution that can also cause UB: if Thread 2 locks the From 59fde6f6bb04370499ef49ea89a4ac81a8654cb8 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:19:40 +0100 Subject: [PATCH 11/34] =?UTF-8?q?Use=20=E2=80=9Ccoherence=E2=80=9D=20termi?= =?UTF-8?q?nology=20from=20the=20start?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/acquire-release.md | 10 +++++----- src/atomics/multithread.md | 6 ++++++ src/atomics/relaxed.md | 12 ++++++------ src/atomics/seqcst.md | 5 +++-- 4 files changed, 20 insertions(+), 13 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index fa048ee0..1ee7a4a0 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -158,7 +158,7 @@ case that doesn’t seem to be enough, since even if atomics were used it still would have the _option_ of reading `0` instead of `1`, and really if we want our mutex to be sane, it should only be able to read `1`. -So it seems that want we _want_ is to be able to apply our arrow rules from +So it seems that want we _want_ is to be able to apply the coherence rules from before to completely rule out zero from the set of the possible values — if we were able to draw a large arrow from the Thread 1’s `+= 1;` to Thread 2’s `guard`, then we could trivially then use the rule to rule out `0` as a value @@ -198,7 +198,7 @@ These arrows are a new kind of arrow we haven’t seen yet; they are known as _happens-before_ (or happens-after) relations and are represented as thin arrows (→) on these diagrams. They are weaker than the _sequenced before_ double-arrows (⇒) that occur inside a single thread, but can still be used with -the arrow rules to determine which values of a memory location are valid to +the coherence rules to determine which values of a memory location are valid to read. When a happens-before arrow stores a data value to an atomic (via a release @@ -299,9 +299,9 @@ Thread 1 locked data Thread 2 └───────┘ ``` -We can now use the second arrow rule from before to follow _forward_ the arrow -from the `guard` bubble all the way to the `+= 1;`, determining that it is only -possible for that read to see `0` as its value. +We can now use the second coherence rule from before to follow _forward_ the +arrow from the `guard` bubble all the way to the `+= 1;`, determining that it is +only possible for that read to see `0` as its value. This leads us to the proper memory orderings for any mutex (and other locks like RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index e26fb60c..fa061ed5 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -213,6 +213,12 @@ guaranteed by the Abstract Machine: ╰───────╯ └────┘ ``` +These two rules combined make up the more generalized rule known as _coherence_, +which is put in place to guarantee that a thread will never see a value earlier +than the last one it read or later than a one it will in future write. Coherence +is basically required for any program to act in a sane way, so luckily the C++20 +standard guarantees it as one of its most fundamental principles. + You might be thinking that all this has been is the longest, most convoluted explanation ever of the most basic intuitive semantics of programming — and you’d be absolutely right. 
But it’s essential to grasp these fundamentals, diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index 4c4c4c6a..557cab4f 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -28,11 +28,11 @@ Thread 1 data Thread 2 └────┘ ``` -Unfortunately, the rules from before don’t help us in finding out where Thread -2’s line joins up to, since there are no arrows connecting that operation to -anything and therefore we can’t immediately rule any values out. As a result, we -end up facing a situation we haven’t faced before: there is _more than one_ -potential value for Thread 2 to read. +Unfortunately, coherence doesn’t help us in finding out where Thread 2’s line +joins up to, since there are no arrows connecting that operation to anything and +therefore we can’t immediately rule any values out. As a result, we end up +facing a situation we haven’t faced before: there is _more than one_ potential +value for Thread 2 to read. And this is where we encounter the big limitation with unsynchronized data accesses: the price we pay for their speed and optimization capability is that @@ -225,7 +225,7 @@ value we loaded. This isn’t really a problem — we can just try again and aga until we succeed, and `compare_exchange` is even nice enough to give us the updated value so we don’t have to load again. Also note that after we’ve updated our value of the atomic, we’re guaranteed to never see the old value again, by -the arrow rules from the previous chapter. +the coherence rules from the previous chapter. So here’s how it looks with these changes appplied: diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index 7c3dd6a4..9024dac8 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -66,8 +66,9 @@ from `Y` and `X` respectively, is it possible for them _both_ to load `false`? And looking at this diagram, there’s absolutely no reason why not. There isn’t even a single arrow connecting the left and right hand sides so far, so the load -has no restrictions on which value it is allowed to pick — and this goes for -both sides equally, so we could end up with an execution like this: +has no coherence-based restrictions on which value it is allowed to pick — and +this goes for both sides equally, so we could end up with an execution like +this: ```text a static X c d static Y b From 6dc3d549b12a722edd00cc83b4ac8a56e8450e04 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:26:37 +0100 Subject: [PATCH 12/34] =?UTF-8?q?Remove=20old=20sections=20and=20introduce?= =?UTF-8?q?=20=E2=80=9CAM=E2=80=9D=20in=20intro?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/atomics.md | 137 +---------------------------------------- 1 file changed, 3 insertions(+), 134 deletions(-) diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md index 7402d38f..c28310a5 100644 --- a/src/atomics/atomics.md +++ b/src/atomics/atomics.md @@ -30,9 +30,9 @@ three main factors at play here: your program at a moment's notice. The C++ memory model is fundamentally about trying to bridge the gap between -these three, allowing users to write code for a logical and consistent abstract -machine while the compiler and hardware deal with the madness underneath that -makes it run fast. +these three, allowing users to write code for a logical and consistent Abstract +Machine (AM for short) while the compiler and hardware deal with the madness +underneath that makes it run fast. 
### Compiler Reordering @@ -118,136 +118,5 @@ programming: incorrect. If possible, concurrent algorithms should be tested on weakly-ordered hardware. ---- - -## Data Accesses - -The C++ memory model attempts to bridge the gap by allowing us to talk about the -*causality* of our program. Generally, this is by establishing a *happens -before* relationship between parts of the program and the threads that are -running them. This gives the hardware and compiler room to optimize the program -more aggressively where a strict happens-before relationship isn't established, -but forces them to be more careful where one is established. The way we -communicate these relationships are through *data accesses* and *atomic -accesses*. - -Data accesses are the bread-and-butter of the programming world. They are -fundamentally unsynchronized and compilers are free to aggressively optimize -them. In particular, data accesses are free to be reordered by the compiler on -the assumption that the program is single-threaded. The hardware is also free to -propagate the changes made in data accesses to other threads as lazily and -inconsistently as it wants. Most critically, data accesses are how data races -happen. Data accesses are very friendly to the hardware and compiler, but as -we've seen they offer *awful* semantics to try to write synchronized code with. -Actually, that's too weak. - -**It is literally impossible to write correct synchronized code using only data -accesses.** - -Atomic accesses are how we tell the hardware and compiler that our program is -multi-threaded. Each atomic access can be marked with an *ordering* that -specifies what kind of relationship it establishes with other accesses. In -practice, this boils down to telling the compiler and hardware certain things -they *can't* do. For the compiler, this largely revolves around re-ordering of -instructions. For the hardware, this largely revolves around how writes are -propagated to other threads. The set of orderings Rust exposes are: - -* Sequentially Consistent (SeqCst) -* Release -* Acquire -* Relaxed - -(Note: We explicitly do not expose the C++ *consume* ordering) - -TODO: negative reasoning vs positive reasoning? TODO: "can't forget to -synchronize" - -## Sequentially Consistent - -Sequentially Consistent is the most powerful of all, implying the restrictions -of all other orderings. Intuitively, a sequentially consistent operation -cannot be reordered: all accesses on one thread that happen before and after a -SeqCst access stay before and after it. A data-race-free program that uses -only sequentially consistent atomics and data accesses has the very nice -property that there is a single global execution of the program's instructions -that all threads agree on. This execution is also particularly nice to reason -about: it's just an interleaving of each thread's individual executions. This -does not hold if you start using the weaker atomic orderings. - -The relative developer-friendliness of sequential consistency doesn't come for -free. Even on strongly-ordered platforms sequential consistency involves -emitting memory fences. - -In practice, sequential consistency is rarely necessary for program correctness. -However sequential consistency is definitely the right choice if you're not -confident about the other memory orders. Having your program run a bit slower -than it needs to is certainly better than it running incorrectly! It's also -mechanically trivial to downgrade atomic operations to have a weaker -consistency later on. 
Just change `SeqCst` to `Relaxed` and you're done! Of -course, proving that this transformation is *correct* is a whole other matter. - -## Acquire-Release - -Acquire and Release are largely intended to be paired. Their names hint at their -use case: they're perfectly suited for acquiring and releasing locks, and -ensuring that critical sections don't overlap. - -Intuitively, an acquire access ensures that every access after it stays after -it. However operations that occur before an acquire are free to be reordered to -occur after it. Similarly, a release access ensures that every access before it -stays before it. However operations that occur after a release are free to be -reordered to occur before it. - -When thread A releases a location in memory and then thread B subsequently -acquires *the same* location in memory, causality is established. Every write -(including non-atomic and relaxed atomic writes) that happened before A's -release will be observed by B after its acquisition. However no causality is -established with any other threads. Similarly, no causality is established -if A and B access *different* locations in memory. - -Basic use of release-acquire is therefore simple: you acquire a location of -memory to begin the critical section, and then release that location to end it. -For instance, a simple spinlock might look like: - -```rust -use std::sync::Arc; -use std::sync::atomic::{AtomicBool, Ordering}; -use std::thread; - -fn main() { - let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?" - - // ... distribute lock to threads somehow ... - - // Try to acquire the lock by setting it to true - while lock.compare_and_swap(false, true, Ordering::Acquire) { } - // broke out of the loop, so we successfully acquired the lock! - - // ... scary data accesses ... - - // ok we're done, release the lock - lock.store(false, Ordering::Release); -} -``` - -On strongly-ordered platforms most accesses have release or acquire semantics, -making release and acquire often totally free. This is not the case on -weakly-ordered platforms. - -## Relaxed - -Relaxed accesses are the absolute weakest. They can be freely re-ordered and -provide no happens-before relationship. Still, relaxed operations are still -atomic. That is, they don't count as data accesses and any read-modify-write -operations done to them occur atomically. Relaxed operations are appropriate for -things that you definitely want to happen, but don't particularly otherwise care -about. For instance, incrementing a counter can be safely done by multiple -threads using a relaxed `fetch_add` if you're not using the counter to -synchronize any other accesses. - -There's rarely a benefit in making an operation relaxed on strongly-ordered -platforms, since they usually provide release-acquire semantics anyway. However -relaxed operations can be cheaper on weakly-ordered platforms. 
-
 [C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf
 [C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order

From 29707ee3989b4b299a1f3dc72a23c625097de79b Mon Sep 17 00:00:00 2001
From: SabrinaJewson
Date: Sun, 28 Aug 2022 20:29:08 +0100
Subject: [PATCH 13/34] =?UTF-8?q?=E2=80=9Cisomorphic=E2=80=9D=20=E2=86=92?=
 =?UTF-8?q?=20=E2=80=9Cfunctionally=20equivalent=E2=80=9D?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 src/atomics/relaxed.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md
index 557cab4f..e3898858 100644
--- a/src/atomics/relaxed.md
+++ b/src/atomics/relaxed.md
@@ -101,8 +101,8 @@ it could end up reading the garbage data “in-between” `0` and `1` (also UB).
 
 > **NOTE:** This description of why both sides are needed to be atomic
 > operations, while neat and intuitive, is not strictly correct: in reality the
-> answer is simply “because the spec says so”. However, it is isomorphic to the
-> real rules, so it can aid in understanding.
+> answer is simply “because the spec says so”. However, it is functionally
+> equivalent to the real rules, so it can aid in understanding.

From 52d5d13a3dacc36d972bc9cd98adfc2b7b88bc8e Mon Sep 17 00:00:00 2001
From: SabrinaJewson
Date: Sun, 28 Aug 2022 20:39:03 +0100
Subject: [PATCH 14/34] =?UTF-8?q?Define=20the=20term=20=E2=80=9Crace=20con?=
 =?UTF-8?q?dition=E2=80=9D?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 src/atomics/relaxed.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md
index e3898858..f0637612 100644
--- a/src/atomics/relaxed.md
+++ b/src/atomics/relaxed.md
@@ -137,10 +137,21 @@ Thread 1 COUNTER Thread 2
 └───┘
 ```
 
-Technically, I believe it is _possible_ to implement this kind of thing with
-just loads and stores, if you try hard enough and use several atomics. But
-luckily, you don’t have to because there also exists another kind of operation,
-the read-modify-write, which is specifically suited to this purpose.
+This is known as a **race condition** — a logic error in a program caused by a
+specific unintended execution of concurrent code. Note that this is distinct
+from a _data race_: while a data race is caused by two threads performing
+unsynchronized operations at the same time and is always undefined behaviour,
+race conditions are totally OK and defined behaviour from the AM’s perspective,
+but are only harmful because the programmer didn’t expect them to be possible.
+You can think of the distinction between the two as analogous to the difference
+between indexing out-of-bounds and indexing in-bounds, but to the wrong element:
+both are bugs, but only one is universally a bug, and the other is merely a
+logic problem.
+
+Technically, I believe it is _possible_ to solve this problem with just loads
+and stores, if you try hard enough and use several atomics. But luckily, you
+don’t have to because there also exists another kind of operation, the
+read-modify-write, which is specifically suited to this purpose.
 
 A read-modify-write operation (shortened to RMW) is a special kind of atomic
 operation that reads, changes and writes back a value _in one step_. 
This means From 40b06fef66dfe5fe356684eb5f2727b47e999bcf Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:43:24 +0100 Subject: [PATCH 15/34] Add note about duplication of `1` in M.O. --- src/atomics/relaxed.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index f0637612..b65e0fd0 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -122,7 +122,10 @@ pub fn get_id() -> u64 { ``` But then calling that function from multiple threads opens you up to an -execution like below that results in two threads obtaining the same ID: +execution like below that results in two threads obtaining the same ID (note +that the duplication of `1` in the modification order is intentional; even if +two values are the same, they always get separate entries in the order if they +were caused by different accesses): ```text Thread 1 COUNTER Thread 2 From 390754b28b3babec33a84004ebc8d0f57afd76de Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:50:27 +0100 Subject: [PATCH 16/34] Explain the ABA problem --- src/atomics/relaxed.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index b65e0fd0..cc2563d9 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -228,8 +228,14 @@ atomically — there is no chance for a race condition. > \* It’s not quite the same, because `compare_exchange` can suffer from ABA > problems in which it will see a later value in the modification order that -> just happened to be same and succeed. However, in this code values can never -> be reused so we don’t have to worry about that. +> just happened to be same and succeed. For example, if the modification order +> contained `1, 2, 1` and a thread loaded the first `1`, +> `compare_exchange(1, 3)` could succeed in replacing either the first or second +> `1`, giving either `1, 3, 2, 1` or `1, 2, 1, 3`. +> +> For some algorithms, this is problematic and needs to be taken into account +> with additional checks; however for us, values can never be reused so we don’t +> have to worry about it. In our case, we can simply replace the store with a compare exchange of the old value and itself plus one (returning `None` instead if the addition overflowed, From 493c671b6913c95db51c2afbb648b56240168678 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 20:57:54 +0100 Subject: [PATCH 17/34] =?UTF-8?q?Dispel=20the=20myth=20that=20RMWs=20?= =?UTF-8?q?=E2=80=9Csee=20the=20latest=20value=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/relaxed.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md index cc2563d9..1ee18193 100644 --- a/src/atomics/relaxed.md +++ b/src/atomics/relaxed.md @@ -163,6 +163,15 @@ between the read and the write; it happens as a single operation. I would also like to point out that this is true of **all** atomic orderings, since a common misconception is that the `Relaxed` ordering somehow negates this guarantee. +> Another common confusion about RMWs is that they are guaranteed to “see the +> latest value” of an atomic, which I believe came from a misinterpretation of +> the C++ specification and was later spread by rumour. Of course, this makes no +> sense, since atomics have no latest value due to the lack of the concept of +> time. 
The original statement in the specification was actually just specifying +> that atomic RMWs are atomic: they only consider the directly previous value in +> the modification order and not any value before it, and gave no additional +> guarantee. + There are many different RMW operations to choose from, but the one most appropriate for this use case is `fetch_add`, which adds a number to the atomic, as well as returns the old value. So our code can be rewritten as this: @@ -221,10 +230,10 @@ _only if_ it occurs directly after the value we loaded”. And luckily for us, there exists a function that does exactly\* this: `compare_exchange`. `compare_exchange` is a bit like a store, but instead of unconditionally storing -the value, it will first check the previous value in the modification order to -see whether it is what we expect, and if not it will simply tell us that and not -make any changes. It is an RMW operation, so all of this happens fully -atomically — there is no chance for a race condition. +the value, it will first check the value directly before the `compare_exchange` +in the modification order to see whether it is what we expect, and if not it +will simply tell us that and not make any changes. It is an RMW operation, so +all of this happens fully atomically — there is no chance for a race condition. > \* It’s not quite the same, because `compare_exchange` can suffer from ABA > problems in which it will see a later value in the modification order that From dc6a9421941d6b80f07b6884ce1c5cfdaaecaf44 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 21:04:12 +0100 Subject: [PATCH 18/34] Explain the C++20 release sequence changes --- src/atomics/acquire-release.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index 1ee7a4a0..c2cb9928 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -213,7 +213,13 @@ the `load` and everything sequenced after it. There is one more rule required for these to be useful, and that is _release sequences_: after a release store is performed on an atomic, happens-before arrows will connect together each subsequent value of the atomic as long as the -new value is caused by an RMW and not just a plain store. +new value is caused by an RMW and not just a plain store (this means any +subsequent normal store, no matter the ordering, will end the sequence). + +> In the C++11 memory model, any subsequent store by the same thread that +> performed the original `Release` store would also contribute to the release +> sequence. However, this was removed in C++20 for simplicity and better +> optimizations and so **must not** be relied upon. With those rules in mind, converting Thread 1’s second store to use a `Release` ordering as well as converting Thread 2’s CAS to use an `Acquire` ordering From b3c2e626a371d15fc157f0a6b4b6e36b818c0484 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 21:40:07 +0100 Subject: [PATCH 19/34] Explain the Abstract Machine --- src/atomics/atomics.md | 7 ++--- src/atomics/multithread.md | 58 +++++++++++++++++++++++++++++++------- 2 files changed, 51 insertions(+), 14 deletions(-) diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md index c28310a5..12b2131b 100644 --- a/src/atomics/atomics.md +++ b/src/atomics/atomics.md @@ -29,10 +29,9 @@ three main factors at play here: 3. 
and the hardware, which is ready to unleash a wrath of inconsistent chaos on your program at a moment's notice. -The C++ memory model is fundamentally about trying to bridge the gap between -these three, allowing users to write code for a logical and consistent Abstract -Machine (AM for short) while the compiler and hardware deal with the madness -underneath that makes it run fast. +The memory model is fundamentally about trying to bridge the gap between these +three, allowing users to write the algorithms they want while the compiler and +hardware perform the arcane magic necessary to make them run fast. ### Compiler Reordering diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index fa061ed5..403888b3 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -1,15 +1,50 @@ # Multithreaded Execution -The first important thing to understand about C++20 atomics is that **the -abstract machine has no concept of time**. You might expect there to be a single -global ordering of events across the program where each happens at the same time -or one after the other, but under the abstract model no such ordering exists; -instead, a possible execution of the program must be treated as a single event -that happens instantaneously — there is never any such thing as “now”, or a -“latest value”, and using that terminology will only lead you to more confusion. -(Of course, in reality there does exist a concept of time, but you must keep in -mind that you’re not programming for the hardware, you’re programming for the -AM.) +When you write Rust code to run on your computer, it may surprise you but you’re +not actually writing Rust code to run on your computer — instead, you’re writing +Rust code to run on the _abstract machine_ (or AM for short). The abstract +machine, to be contrasted with the physical machine, is an abstract +representation of a theoretical computer: it doesn’t actually exist _per se_, +but the combination of a compiler, target architecture and target operating +system is capable of emulating a subset of its possible behaviours. + +The Abstract Machine has a few properties that are essential to understand: +1. It is architecture and OS-independent. The Abstract Machine doesn’t care + whether you’re on x86_64 or iOS or a Nintendo 3DS, the rules are the same + for everyone. This enables you to write code without having to think about + what the underlying system does or how it does it, as long as you obey the + Abstract Machine’s rules you know you’ll be fine. +1. It is the lowest common denominator of all supported computer systems. This + means it is allowed to result in executions no sane computer would actually + generate in real life. It is also purposefully built with forward + compatibility in mind, giving compilers the opportunity to make better and + more aggressive optimizations in the future. As a result, it can be quite + hard to test code, especially if you’re on a system that exploits fewer of + the AM’s allowed semantics, so it is highly recommended to utilize tools + that intentionally produce these executions like [Loom] and [Miri]. +1. Its model is highly formalized and not representative of what goes on + underneath. 
Because C++ needs to be defined by a formal specification and + not just hand-wavy rules about “this is what allowed and this is what + isn’t”, the Abstract Machine defines things in a very mathematical and, + well, _abstract_, way; instead of saying things like “the compiler is + allowed to do X” it will find a way to define the system such that the + compiler’s ability to do X simply follows as a natural consequence. This + makes it very elegant and keeps the mathematicians happy, but you should + keep in mind that this is not how computers actually function, it is merely + a representation of it. + +With that out of the way, let’s look into how the C++20 Abstract Machine is +actually defined. + +The first important thing to understand is that **the abstract machine has no +concept of time**. You might expect there to be a single global ordering of +events across the program where each happens at the same time or one after the +other, but under the abstract model no such ordering exists; instead, a possible +execution of the program must be treated as a single event that happens +instantaneously. There is never any such thing as “now”, or a “latest value”, +and using that terminology will only lead you to more confusion. Of course, in +reality there does exist a concept of time, but you must keep in mind that +you’re not programming for the hardware, you’re programming for the AM. However, while no global ordering of operations exists _between_ threads, there does exist a single total ordering _within_ each thread, which is known as its @@ -224,3 +259,6 @@ explanation ever of the most basic intuitive semantics of programming — and you’d be absolutely right. But it’s essential to grasp these fundamentals, because once you have this model in mind, the extension into multiple threads and the complicated semantics of real atomics becomes completely natural. + +[Loom]: https://docs.rs/loom +[Miri]: https://github.com/rust-lang/miri From a9eb1f69baf6bf270729a1cce557ea6bb47813c5 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 21:53:38 +0100 Subject: [PATCH 20/34] Improve the explanations of coherence --- src/atomics/multithread.md | 47 ++++++++++++++++++++++++++++---------- 1 file changed, 35 insertions(+), 12 deletions(-) diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index 403888b3..de821139 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -129,12 +129,37 @@ Thread 1 data Thread 2 ``` That is, both threads read the same value of `0` from `data`, with no relative -ordering between them. This is the simple case, for when the data doesn’t ever -change — but that’s no fun, so let’s add some mutability in the mix (we’ll also -return to a single thread, just to keep things simple). +ordering between them. -Consider this code, which we’re going to attempt to draw a diagram for like -above: +That’s reads done, so we’ll look at the other kind of data access next: writes. +We’ll also return to a single thread for now, just to keep things simple. + +```rust +let mut data = 0; +data = 1; +``` + +Here, we have a single variable that the main thread writes to once — this means +that in its lifetime, it holds two values, first `0`, and then `1`. +Diagrammatically, this code’s execution can be represented like so: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌╌┐ │ 0 │ +╰───────╯ ├╌╌╌┼╌╌╌╌┤ + └╌╌╌┼╌╌╌╌┤ + │ 1 │ + └────┘ +``` + +Note the use of dashed padding in between the values of `data`’s column. 
Those +spaces won’t ever contain a value, but they’re used to represent an +unsynchronized (non-atomic) write — it is garbage data and attempting to read it +would result in a data race. + +Now let’s put all of our knowledge thus far together, and make a program both +that reads _and_ writes data — woah, scary! ```rust let mut data = 0; @@ -164,13 +189,10 @@ some boxes: ╰───────╯ └────┘ ``` -Note the use of dashed padding in between the values of `data`’s column. Those -spaces won’t ever contain a value, but they’re used to represent an -unsynchronized (non-atomic) write — it is garbage data and attempting to read it -would result in a data race. +We know all of those lines need to be joined _somewhere_, but we don’t quite +know _where_ yet. This is where we need to bring in our first rule, a rule that +universally governs all accesses to every location in memory: -To solve this puzzle, we first need to bring in a new rule that governs all -memory accesses to a particular location: > From the point at which the access occurs, find every other point that can be > reached by following the reverse direction of arrows, then for each one of > those, take a single step across every line that connects to the relevant @@ -199,7 +221,8 @@ value in `data`. Therefore its diagram would look something like this: However, that second line breaks the rule we just established! Following up the arrows from the third operation in Thread 1, we reach the first operation, and from there we can take a single step to reach the space in between the `2` and -the `1`, which excludes the this access from writing any value above that point. +the `1`, which excludes the third access from writing any value above that point +— including the `2` that it is currently writing! So evidently, this execution is no good. We can therefore conclude that the only possible execution of this program is the other one, in which the `1` appears From 8068390a901e928582e5ccd6f0fafbea37b28256 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 22:00:52 +0100 Subject: [PATCH 21/34] Show the final correct execution in mutex example --- src/atomics/acquire-release.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index c2cb9928..77c4ac7d 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -283,15 +283,18 @@ but since the abstract machine has no concept of time, it’s just a valid UB as any other. Luckily, we’ve already solved this problem once, so it easy to solve again: just -like before, we’ll have the CAS become acquire and the store become release. +like before, we’ll have the CAS become acquire and the store become release, and +then we can use the second coherence rule from before to follow _forward_ the +arrow from the `guard` bubble all the way to the `+= 1;`, determining that it is +only possible for that read to see `0` as its value, as in the execution below. 
```text Thread 1 locked data Thread 2 ╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ -│ cas ←───┐ │ false │┌──│ 0 │────→ cas │ -╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯ -╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮ -│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │ +│ cas ←───┐ │ false │┌──│ 0 ├╌┐──→ cas │ +╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ┊ ╰───╥───╯ +╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ┊ ╭───⇓───╮ +│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ └─╌┤ guard │ ╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯ ╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮ │ store ├─┐ │ ┌───────↙────────────┤ store │ @@ -305,10 +308,6 @@ Thread 1 locked data Thread 2 └───────┘ ``` -We can now use the second coherence rule from before to follow _forward_ the -arrow from the `guard` bubble all the way to the `+= 1;`, determining that it is -only possible for that read to see `0` as its value. - This leads us to the proper memory orderings for any mutex (and other locks like RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So let’s go back to and update our original mutex definition with this knowledge. From 3c76e35449b9b6c765c4e08016a610c3bfa65781 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 28 Aug 2022 22:18:47 +0100 Subject: [PATCH 22/34] Add a more formal explanation of happens-before --- src/atomics/acquire-release.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index 77c4ac7d..ef630a39 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -210,6 +210,13 @@ possible execution, Thread 1’s `store` synchronizes-with Thread 2’s `load`, which causes that `store` and everything sequenced before it to happen-before the `load` and everything sequenced after it. +> More formally, we can say that A happens-before B if any of the following +> conditions are true: +> 1. A is sequenced-before B (i.e. A occurs before B on the same thread) +> 2. A synchronizes-with B (i.e. A is a `Release` operation and B is an +> `Acquire` operation that reads the value written by A) +> 3. A happens-before X, and X happens-before B (transitivity) + There is one more rule required for these to be useful, and that is _release sequences_: after a release store is performed on an atomic, happens-before arrows will connect together each subsequent value of the atomic as long as the From d4f8f47439cc09d30d07374b6977825d98d356b2 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Mon, 29 Aug 2022 16:51:36 +0100 Subject: [PATCH 23/34] Write about acquire and release fences --- src/atomics/fences.md | 201 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 201 insertions(+) diff --git a/src/atomics/fences.md b/src/atomics/fences.md index 04445512..76d5a371 100644 --- a/src/atomics/fences.md +++ b/src/atomics/fences.md @@ -1 +1,202 @@ # Fences + +As well as loads, stores, and RMWs, there is one more kind of atomic operation +to be aware of: fences. Fences can be triggered by the +[`core::sync::atomic::fence`] function, which accepts a single ordering +parameter and returns nothing. They don’t do anything on their own, but can be +thought of as events that strengthen the ordering of nearby atomic operations. + +## Acquire fences + +The most common kind of fence is an _acquire fence_, which can be triggered in +three different ways: +1. `atomic::fence(atomic::Ordering::Acquire)` +1. `atomic::fence(atomic::Ordering::AcqRel)` +1. 
`atomic::fence(atomic::Ordering::SeqCst)` + +An acquire fence retroactively makes every single non-`Acquire` operation that +was sequenced-before it act like an `Acquire` operation that occurred at the +fence — in other words, it causes every prior `Release`d value that was +previously loaded on the thread to synchronize-with the fence. For example, the +following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); + +// t_2 +let value = X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +Can result in two possible executions: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ + t_1 X t_2 ┃ t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ ┃ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ ┃ │ store ├─┐ │ 0 ├───┤ load │ +╰───────╯ │ └───┘ │ ╰───╥───╯ ┃ ╰───────╯ │ └───┘ ╰───╥───╯ + └─↘───┐ │ ╭───⇓───╮ ┃ └─↘───┐ ╭───⇓───╮ + │ 1 ├─┘┌→ fence │ ┃ │ 1 │ │ fence │ + └───┴──┘╰───────╯ ┃ └───┘ ╰───────╯ +``` + +In the first execution, `t_1`’s store synchronizes-with and therefore +happens-before `t_2`’s fence due to the prior load, but note that it does _not_ +happen-before `t_2`’s load. + +Acquire fences work on any number of atomics, and on release sequences too. A +more complex example is as follows: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); +static Y: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); +X.fetch_add(1, atomic::Ordering::Relaxed); + +// t_2 +Y.store(1, atomic::Ordering::Release); + +// t_3 +let x = X.load(atomic::Ordering::Relaxed); +let y = Y.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +This can result in an execution like so: + +``` + t_1 X t_3 Y t_2 +╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ │ 0 │ ┌─┤ store │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ ┌───↙─┘ +│ rmw ├─┐ │ 1 │ │ │ load ├───┤ 1 │ +╰───────╯ │ └─┬─┘ │ ╰───╥───╯ ┌─┴───┘ + └─┬─↓─┐ │ ╭───⇓───╮ │ + │ 2 ├─┘┌→ fence ←─┘ + └───┴──┘╰───────╯ +``` + +There are two common scenarios in which acquire fences are used: +1. When an `Acquire` ordering is only necessary when a specific value is loaded. + For example, you may only wish to acquire when an `initialized` boolean is + `true`, since otherwise you won’t be reading the shared state at all. In + this case, you can load with a `Relaxed` ordering and then issue an + `Acquire` fence afterward only if that condition is met, which can aid in + performance sometimes (since the acquire operation is avoided when + `initialized == false`). +2. When several `Acquire` operations on different locations need to be performed + in a row, but individually each operation doesn’t need `Acquire` ordering; + it is often faster to perform all the loads as `Relaxed` first and use a + single `Acquire` fence at the end then it is to make each one separately use + `Acquire`. + +## Release fences + +Release fences are the natural complement to acquire fences, and they similarly +can be triggered in three different ways: +1. `atomic::fence(atomic::Ordering::Release)` +1. `atomic::fence(atomic::Ordering::AcqRel)` +1. 
`atomic::fence(atomic::Ordering::SeqCst)` + +Release fences convert every subsequent atomic access in the same thread into a +release operation that has its arrow starting from the fence — in other words, +every `Acquire` operation that sees a value that was written by the fence’s +thread after the release fence will synchronize-with the release fence. For +example, the following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Acquire); +``` + +Can result in this execution: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─→ load │ +╰───╥───╯ │ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ +│ store ├───┤ 1 ├─┘ +╰───────╯ └───┘ +``` + +As well as it being possible for a release fence to synchronize-with an acquire +load (fence–atomic synchronization) and a release store to synchronize-with an +acquire fence (atomic–fence synchronization), it is also possible for release +fences to synchronize with acquire fences (fence–fence synchronization). In this +code snippet, only fences and `Relaxed` operations are used to establish a +happens-before relation (in some executions): + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +The execution with the relation looks like this: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─┤ load │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ +│ store ├───┤ 1 ├─┘┌→ fence │ +╰───────╯ └───┴──┘╰───────╯ +``` + +Like with acquire fences, release fences are commonly used to optimize over a +series of atomic stores that don’t individually need to be `Release`, since it’s +often faster to put a single release fence at the start and use `Relaxed` from +that point on than it is to use `Release` every time. + +## `AcqRel` fences + +`AcqRel` fences are just the combined behaviour of an `Acquire` fence and a +`Release` fence in one operation. There isn’t much special to note about them, +other than that they behave more like an acquire fence followed by a release +fence than the other way around, which is useful to know in situations like the +following: + +```text + t_1 X t_2 Y t_3 +╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮ +│ A │ │ 0 │ ┌─┤ load │ │ 0 │ ┌─→ load │ +╰───╥───╯ └───┘ │ ╰───╥───╯ └───┘ │ ╰───╥───╯ +╭───⇓───╮ ┌─↘───┐ │ ╭───⇓───╮┌──↘───┐ │ ╭───⇓───╮ +│ store ├─┘ │ 1 ├─┘┌→ fence ├┘┌─┤ 1 ├─┘ │ B │ +╰───────╯ └───┴──┘╰───╥───╯ │ └───┘ ╰───────╯ + ╭───⇓───╮ │ + │ store ├─┘ + ╰───────╯ +``` + +Here, A happens-before B, which is singularly due to the `AcqRel` fence’s +ability to “carry over” happens-before relations within itself. + +[`core::sync::atomic::fence`]: https://doc.rust-lang.org/stable/core/sync/atomic/fn.fence.html From 5e27ed5c0307932e936ea60eb455f36b4a3f7098 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 4 Sep 2022 16:30:14 +0100 Subject: [PATCH 24/34] Improve the `SeqCst` explanation --- src/atomics/seqcst.md | 97 ++++++++++++++++++++++++++++++++----------- 1 file changed, 72 insertions(+), 25 deletions(-) diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index 9024dac8..de677cd0 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -106,17 +106,65 @@ do). 
It is in contrast to modification orders, which are similarly total but only scoped to a single atomic rather than the whole program. Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed -in the next section), _S_ is primarily controlled by the happens-before -relations in a program: this means that if an action _A_ happens-before an -action _B_, it is also guaranteed to appear before _B_ in _S_. Other than that -restriction, _S_ is unspecified and will be chosen arbitrarily during execution. +later on), _S_ is primarily controlled by the happens-before relations in a +program: this means that if an action _A_ happens-before an action _B_, it is +also guaranteed to appear before _B_ in _S_. Other than that restriction, _S_ is +unspecified and will be chosen arbitrarily during execution. Once a particular _S_ has been established, every atomic’s modification order is -then guaranteed to be consistent with it — this means that a `SeqCst` load will -never see a value that has been overwritten by a write that occurred before it -in _S_, or a value that has been written by a write that occured after it in -_S_ (note that a `Relaxed`/`Acquire` load however might, since there is no -“before” or “after” as it is not in _S_ in the first place). +then guaranteed to be consistent with it, so a `SeqCst` load will never see a +value that has been overwritten by a write that occurred before it in _S_, or a +value that has been written by a write that occured after it in _S_ (note that a +`Relaxed`/`Acquire` load however might, since there is no “before” or “after” as +it is not in _S_ in the first place). + +More formally, this guarantee can be described with _coherence orderings_, a +relation which expresses which of two operations appears before the other in an +atomic’s modification order. It is said that an operation _A_ is +_coherence-ordered-before_ another operation _B_ if any of the following +conditions are met: +1. _A_ is a store or RMW, _B_ is a store or RMW, and _A_ appears before _B_ in + the modification order. +1. _A_ is a store or RMW, _B_ is a load, and _B_ reads the value stored by _A_. +1. _A_ is a load, _B_ is a store or RMW, and _A_ takes its value from a place in + the modification order that appears before _B_. +1. _A_ is coherence-ordered-before a different operation _X_, and _X_ is + coherence-ordered-before _B_ (the basic transitivity property). + +The following diagram gives examples for the main three rules (in each case _A_ +is coherence-ordered-before _B_): + +```text + Rule 1 ┃ Rule 2 ┃ Rule 3 + ┃ ┃ +╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌───┐ ╭───╮ +│ A ├─┘ │ │ ┌─┤ B │ ┃ │ A ├─┘ │ ├───┤ B │ ┃ │ A ├───┤ │ ┌─┤ B │ +╰───╯ └───┘ │ ╰───╯ ┃ ╰───╯ └───┘ ╰───╯ ┃ ╰───╯ └───┘ │ ╰───╯ + ┌───┬─┘ ┃ ┃ ┌───┬─┘ + │ │ ┃ ┃ │ │ + └───┘ ┃ ┃ └───┘ +``` + +The only important thing to note is that for two loads of the same value in the +modification order, neither is coherence-ordered-before the other, as in the +following example where _A_ has no coherence ordering relation to _B_: + +```text +╭───╮ ┌───┐ ╭───╮ +│ A ├───┤ ├───┤ B │ +╰───╯ └───┘ ╰───╯ +``` + +With this terminology applied, we can use a more precise definition of +`SeqCst`’s guarantee: for two `SeqCst` operations on the same atomic _A_ and +_B_, where _A_ precedes _B_ in _S_, either _A_ must be coherence-ordered-before +_B_ or they must both be loads that see the same value in the modification +order. 
Effectively, this one rule ensures that _S_’s order “propagates” +throughout all the atomics of the program — you can imagine each operation in +_S_ as storing a snapshot of the world, so that every subsequent operation is +consistent with it. + +## Applying `SeqCst` So, looking back at our program, let’s consider how we could use `SeqCst` to make that execution invalid. As a refresher, here’s the framework for every @@ -137,9 +185,10 @@ become `SeqCst`, because they need to be aware of the total ordering that determines whether `X` or `Y` becomes `true` first. And secondly, we need to establish that ordering in the first place, and that needs to be done by making sure that there is always one operation in _S_ that both sees one of the atomics -as `true` and precedes both final loads (the final loads themselves don’t work -for this since although they “know” that their corresponding atomic is `true` -they don’t interact with it directly so _S_ doesn’t care). +as `true` and precedes both final loads in _S_, so that the coherence ordering +guarantee will apply (the final loads themselves don’t work for this since +although they “know” that their corresponding atomic is `true` they don’t +interact with it directly so _S_ doesn’t care). There are two operations in the program that could fulfill the first condition, should they be made `SeqCst`: the stores of `true` and the first loads. However, @@ -207,9 +256,9 @@ executions of this program: 1. `c` loads `X` (gives `true`) 1. `c` loads `Y` (required to be `true`) -All the places were the load is requied to give `true` were caused by a -preceding load in _S_ of the same atomic which saw `true`, because otherwise _S_ -would be inconsistent with the atomic’s modification order and that is +All the places where the load was required to give `true` were caused by a +preceding load in _S_ of the same atomic which saw `true` — otherwise, the load +would be coherence-ordered-before a load which precedes it in _S_, and that is impossible. ## The mixed-`SeqCst` special case @@ -250,10 +299,10 @@ strongly happen-before C. But this is all highly theoretical at the moment, so let’s make an example to show how that rule can actually affect the execution of code. So, if C were to -precede A in _S_ then that means in the modification order of any atomic they -both access, C would have to come before A. Let’s say then that C loads from `x` -(the atomic that A has to access), it may load the value that came before A if -it were to precede A in _S_: +precede A in _S_ (and they are not both loads) then that means C is always +coherence-ordered-before A. Let’s say then that C loads from `x` (the atomic +that A has to access), it may load the value that came before A if it were to +precede A in _S_: ```text t_1 x t_2 @@ -265,9 +314,9 @@ it were to precede A in _S_: └───┘ ╰─────╯ ``` -Ah wait no, that doesn’t work because coherence still mandates that `1` is the -only value that can be loaded. In fact, once `1` is loaded _S_’s required -consistency with modification orders means that A _is_ required to precede C in +Ah wait no, that doesn’t work because regular coherence still mandates that `1` +is the only value that can be loaded. In fact, once `1` is loaded _S_’s required +consistency with coherence orderings means that A _is_ required to precede C in _S_ after all. So somehow, to observe this difference we need to have a _different_ `SeqCst` @@ -386,6 +435,4 @@ would make atomics significantly slower. 
So instead, in C++20 they simply encoded it into the specification. Generally however, this rule is so complex it’s best to just avoid it entirely -by never mixing `SeqCst` and non-`SeqCst` on a single atomic in the first place -— or even better, just avoiding `SeqCst` entirely and using a stronger ordering -instead that has less complex semantics and fewer gotchas. +by never mixing `SeqCst` and non-`SeqCst` on a single atomic in the first place. From 805070e0f6ba448039dea5bf2df6b94e28dadeae Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 4 Sep 2022 17:24:04 +0100 Subject: [PATCH 25/34] Write about `SeqCst` fences --- src/atomics/fences.md | 54 +++++++++++++++++++++++++++++++++++++++++++ src/atomics/seqcst.md | 13 ++++++++--- 2 files changed, 64 insertions(+), 3 deletions(-) diff --git a/src/atomics/fences.md b/src/atomics/fences.md index 76d5a371..d2196aec 100644 --- a/src/atomics/fences.md +++ b/src/atomics/fences.md @@ -199,4 +199,58 @@ following: Here, A happens-before B, which is singularly due to the `AcqRel` fence’s ability to “carry over” happens-before relations within itself. +## `SeqCst` fences + +`SeqCst` fences are the strongest kind of fence. They first of all inherit the +behaviour from an `AcqRel` fence, meaning they have both acquire and release +semantics at the same time, but being `SeqCst` operations they also participate +in _S_. Just as with all other `SeqCst` operations, their placement in _S_ is +primarily determined by strongly happens-before relations (including the +[mixed-`SeqCst` caveat] that comes with it), which then gives additional +guarantees to your code. + +Namely, the power of `SeqCst` fences can be summarized in three points: + +* Everything that happens-before a `SeqCst` fence is not coherence-ordered-after + any `SeqCst` operation that the fence precedes in _S_. +* Everything that happens-after a `SeqCst` fence is not coherence-ordered-before + any `SeqCst` operation that the fence succeeds in _S_. +* Everything that happens-before a `SeqCst` fence X is not + coherence-ordered-after anything that happens-after another `SeqCst` fence + Y, if X preceeds Y in _S_. + +> In C++11, the above three statements were similar, except they only talked +> about what was sequenced-before and sequenced-after the `SeqCst` fences; C++20 +> strengthened this to also include happens-before, because in practice this +> theoretical optimization was not being exploited by anybody. However do note +> that as of the time of writing, [Miri only implements the old, weaker +> semantics][miri scfix] and so you may see false positives when testing with +> it. + +The “motivating use-case” for `SeqCst` demonstrated in the `SeqCst` chapter can +also be rewritten to use exclusively `SeqCst` fences and `Relaxed` operations, +by inserting fences in between the loads in threads `c` and `d`: + +```text + a static X c d static Y b +╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ +╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ + └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ + │ true ├─┘ │ *fence* │ │ *fence* │ └─┤ true │ + └───────┘ ╰────╥────╯ ╰────╥────╯ └───────┘ + ╭────⇓────╮ ╭────⇓────╮ + │ load Y ├─? ?─┤ load X │ + ╰─────────╯ ╰─────────╯ +``` + +There are two executions to consider here, depending on which way round the +fences appear in _S_. 
Should `c`’s fence appear first, the fence–fence `SeqCst` +guarantee tells us that `c`’s load of `X` is not coherence-ordered-after `d`’s +load of `X`, which forbids `d`’s load of `X` from seeing the value `false`. The +same logic can be applied should the fences appear the other way around, proving +that at least one thread must load `true` in the end. + [`core::sync::atomic::fence`]: https://doc.rust-lang.org/stable/core/sync/atomic/fn.fence.html +[mixed-`SeqCst` caveat]: seqcst.md#the-mixed-seqcst-special-case +[miri scfix]: https://github.com/rust-lang/miri/issues/2301 diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index de677cd0..ee039e4d 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -155,11 +155,18 @@ following example where _A_ has no coherence ordering relation to _B_: ╰───╯ └───┘ ╰───╯ ``` +Because of this, “_A_ is coherence-ordered-before _B_” is subtly different from +“_A_ is not coherence-ordered-after _B_”; only the latter phrase includes the +above situation, and is synonymous with “either _A_ is coherence-ordered-before +_B_ or _A_ and _B_ are both loads, and see the same value in the atomic’s +modification order”. “Not coherence-ordered-after” is generally a more useful +relation than “coherence-ordered-before”, and so it’s important to understand +what it means. + With this terminology applied, we can use a more precise definition of `SeqCst`’s guarantee: for two `SeqCst` operations on the same atomic _A_ and -_B_, where _A_ precedes _B_ in _S_, either _A_ must be coherence-ordered-before -_B_ or they must both be loads that see the same value in the modification -order. Effectively, this one rule ensures that _S_’s order “propagates” +_B_, where _A_ precedes _B_ in _S_, _A_ is not coherence-ordered-after _B_. +Effectively, this one rule ensures that _S_’s order “propagates” throughout all the atomics of the program — you can imagine each operation in _S_ as storing a snapshot of the world, so that every subsequent operation is consistent with it. From c19184a94a789258ebab022288ba2aba1660ceff Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 4 Sep 2022 17:35:19 +0100 Subject: [PATCH 26/34] Fix Unicode art incorrectly interpreted as Rust code --- src/atomics/fences.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/atomics/fences.md b/src/atomics/fences.md index d2196aec..9de55d11 100644 --- a/src/atomics/fences.md +++ b/src/atomics/fences.md @@ -73,7 +73,7 @@ atomic::fence(atomic::Ordering::Acquire); This can result in an execution like so: -``` +```text t_1 X t_3 Y t_2 ╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮ │ store ├─┐ │ 0 │ ┌─┤ load │ │ 0 │ ┌─┤ store │ From 2384caab54d63060327d875b08a7821e463cd51c Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 4 Sep 2022 17:52:16 +0100 Subject: [PATCH 27/34] =?UTF-8?q?Define=20=E2=80=9Cunsequenced=E2=80=9D=20?= =?UTF-8?q?early=20on?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit And be consistent with “sequenced before” versus “sequenced-before” to always use the hyphen. 
--- src/atomics/acquire-release.md | 6 +++--- src/atomics/multithread.md | 20 ++++++++++++-------- src/atomics/seqcst.md | 10 +++++----- 3 files changed, 20 insertions(+), 16 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index ef630a39..6bad64a5 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -196,7 +196,7 @@ Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 These arrows are a new kind of arrow we haven’t seen yet; they are known as _happens-before_ (or happens-after) relations and are represented as thin arrows -(→) on these diagrams. They are weaker than the _sequenced before_ +(→) on these diagrams. They are weaker than the _sequenced-before_ double-arrows (⇒) that occur inside a single thread, but can still be used with the coherence rules to determine which values of a memory location are valid to read. @@ -207,8 +207,8 @@ operation) we say that the release operation _synchronized-with_ the acquire operation, which in doing so establishes that the release operation _happens-before_ the acquire operation. Therefore, we can say that in the first possible execution, Thread 1’s `store` synchronizes-with Thread 2’s `load`, -which causes that `store` and everything sequenced before it to happen-before -the `load` and everything sequenced after it. +which causes that `store` and everything sequenced-before it to happen-before +the `load` and everything sequenced-after it. > More formally, we can say that A happens-before B if any of the following > conditions are true: diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index de821139..6c1a2871 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -67,8 +67,8 @@ its sequence during one possible execution can be visualized like so: ``` That double arrow in between the two boxes (`⇒`) represents that the second -statement is _sequenced after_ the first (and similarly the first statement is -_sequenced before_ the second). This is the strongest kind of ordering guarantee +statement is _sequenced-after_ the first (and similarly the first statement is +_sequenced-before_ the second). This is the strongest kind of ordering guarantee between any two operations, and only comes about when those two operations happen one after the other and on the same thread. @@ -96,10 +96,14 @@ sequence: ╰───────────────╯ ╰─────────────────╯ ``` -Note that this is **not** a representation of multiple things that _could_ -happen at runtime — instead, this diagram describes exactly what _did_ happen -when the program ran once. This distinction is key, because it highlights that -even the lowest-level representation of a program’s execution does not have +We can say that the prints of `A` and `B` are _unsequenced_ with regard to the +prints of `01` and `02` that occur in the second thread, since they have no +sequenced-before arrows connecting the boxes together. + +Note that these diagrams are **not** a representation of multiple things that +_could_ happen at runtime — instead, this diagram describes exactly what _did_ +happen when the program ran once. This distinction is key, because it highlights +that even the lowest-level representation of a program’s execution does not have a global ordering between threads; those two disconnected chains are all there is. @@ -128,8 +132,8 @@ Thread 1 data Thread 2 ╰──────╯ └────┘ ╰──────╯ ``` -That is, both threads read the same value of `0` from `data`, with no relative -ordering between them. 
+That is, both threads read the same value of `0` from `data`, and the two +operations are unsequenced — they have no relative ordering between them. That’s reads done, so we’ll look at the other kind of data access next: writes. We’ll also return to a single thread for now, just to keep things simple. diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index ee039e4d..4f427ffb 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -229,7 +229,7 @@ assert!(c || d); ``` As there are four `SeqCst` operations with a partial order between two pairs in -them (caused by the sequenced before relation), there are six possible +them (caused by the sequenced-before relation), there are six possible executions of this program: - All of `c`’s loads precede `d`’s loads: @@ -277,7 +277,7 @@ subtly-defined subset of happens-before relations. In particular, it excludes two situations: 1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation - B which is sequenced before another `SeqCst` operation C. Here, despite the + B which is sequenced-before another `SeqCst` operation C. Here, despite the fact that A happens-before C, A does not _strongly_ happen-before C and so is there not guaranteed to precede C in _S_. 2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel` @@ -298,9 +298,9 @@ asterisks: ``` A happens-before, but does not strongly happen-before, C — and anything -sequenced after C will have the same treatment (unless more synchronization is +sequenced-after C will have the same treatment (unless more synchronization is used). This means that C is actually allowed to _precede_ A in _S_, despite -conceptually occuring after it. However, anything sequenced before A, because +conceptually occuring after it. However, anything sequenced-before A, because there is at least one sequence on either side of the synchronization, will strongly happen-before C. @@ -333,7 +333,7 @@ C and A) but C also doesn’t happen-before it (to avoid coherence getting in th way) — and to do that, all we have to do is have C appear before a `SeqCst` operation D in the modification order of another atomic, but have D be a store so as to avoid C synchronizing with it, and then our desired load E can simply -be sequenced after D (this will carry over the “precedes in _S_” guarantee, but +be sequenced-after D (this will carry over the “precedes in _S_” guarantee, but does not restore the happens-after relation to C since that was already dropped by having D be a store). From d9dabf4653255641634523e1e25e9e614e0f7b09 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 4 Sep 2022 18:23:56 +0100 Subject: [PATCH 28/34] Note that a release fence followed by multiple stores is not necessarily faster than many release stores --- src/atomics/fences.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/src/atomics/fences.md b/src/atomics/fences.md index 9de55d11..210234b4 100644 --- a/src/atomics/fences.md +++ b/src/atomics/fences.md @@ -170,10 +170,11 @@ The execution with the relation looks like this: ╰───────╯ └───┴──┘╰───────╯ ``` -Like with acquire fences, release fences are commonly used to optimize over a -series of atomic stores that don’t individually need to be `Release`, since it’s -often faster to put a single release fence at the start and use `Relaxed` from -that point on than it is to use `Release` every time. 
+Like with acquire fences, release fences can be used to optimize over a series +of atomic stores that don’t individually need to be `Release`, since in some +conditions and on some architectures it’s faster to put a single release fence +at the start and use `Relaxed` from that point on than it is to use `Release` +every time. ## `AcqRel` fences From 09c428ee7bd8bd4d5e72b1f912430668f202d86a Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sat, 1 Oct 2022 09:28:14 +0100 Subject: [PATCH 29/34] Remove the signals section for now --- src/SUMMARY.md | 1 - src/atomics/signals.md | 3 --- 2 files changed, 4 deletions(-) delete mode 100644 src/atomics/signals.md diff --git a/src/SUMMARY.md b/src/SUMMARY.md index e767c8e5..01304c5e 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -47,7 +47,6 @@ * [Acquire and Release](./atomics/acquire-release.md) * [SeqCst](./atomics/seqcst.md) * [Fences](./atomics/fences.md) - * [Signals](./atomics/signals.md) * [Implementing Vec](./vec/vec.md) * [Layout](./vec/vec-layout.md) * [Allocating](./vec/vec-alloc.md) diff --git a/src/atomics/signals.md b/src/atomics/signals.md deleted file mode 100644 index 4e539928..00000000 --- a/src/atomics/signals.md +++ /dev/null @@ -1,3 +0,0 @@ -# Signals - -(and compiler fences) From ff32f702029e4e127e6c4b78d4d77a280386ba1b Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 4 Nov 2022 10:52:13 +0000 Subject: [PATCH 30/34] Fix CI --- book.toml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/book.toml b/book.toml index a2011c61..dc7b49a7 100644 --- a/book.toml +++ b/book.toml @@ -31,5 +31,8 @@ git-repository-url = "https://github.com/rust-lang/nomicon" "./arc-layout.html" = "./arc-mutex/arc-layout.html" "./arc.html" = "./arc-mutex/arc.html" +# Atomics chapter +"./atomics.html" = "./atomics/atomics.html" + [rust] edition = "2018" From af524a598cbed769da8e74c10965fbc2efd8df70 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 12 Jan 2024 12:08:20 +0000 Subject: [PATCH 31/34] Fix typos --- src/atomics/acquire-release.md | 6 +++--- src/atomics/atomics.md | 2 +- src/atomics/multithread.md | 4 ++-- src/atomics/seqcst.md | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md index 6bad64a5..6d6f0b95 100644 --- a/src/atomics/acquire-release.md +++ b/src/atomics/acquire-release.md @@ -26,7 +26,7 @@ impl Mutex { ``` Now for the lock function. We need to use an RMW here, since we need to both -check whether it is locked and lock it if is not in a single atomic step; this +check whether it is locked and lock it if it isn’t in a single atomic step; this can be most simply done with a `compare_exchange` (unlike before, it doesn’t need to be in a loop this time). For the ordering, we’ll just use `Relaxed` since we don’t know of any others yet. @@ -158,7 +158,7 @@ case that doesn’t seem to be enough, since even if atomics were used it still would have the _option_ of reading `0` instead of `1`, and really if we want our mutex to be sane, it should only be able to read `1`. 
-So it seems that want we _want_ is to be able to apply the coherence rules from +So it seems that what we _want_ is to be able to apply the coherence rules from before to completely rule out zero from the set of the possible values — if we were able to draw a large arrow from the Thread 1’s `+= 1;` to Thread 2’s `guard`, then we could trivially then use the rule to rule out `0` as a value @@ -256,7 +256,7 @@ We now can trace back along the reverse direction of arrows from the `guard` bubble to the `+= 1` bubble; we have established that Thread 2’s load happens-after the `+= 1` side effect, because Thread 2’s CAS synchronizes-with Thread 1’s store. This both avoids the data race and gives the guarantee that -`1` will be always read by Thread 2 (as long as locks after Thread 1, of +`1` will be always read by Thread 2 (as long as it locks after Thread 1, of course). However, that is not the only execution of the program possible. Even with this diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md index 12b2131b..71194e5e 100644 --- a/src/atomics/atomics.md +++ b/src/atomics/atomics.md @@ -87,7 +87,7 @@ For instance, say we convince the compiler to emit this logic: ```text initial state: x = 0, y = 1 -THREAD 1 THREAD2 +THREAD 1 THREAD 2 y = 3; if x == 1 { x = 1; y *= 2; } diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md index 6c1a2871..4a4ce3d6 100644 --- a/src/atomics/multithread.md +++ b/src/atomics/multithread.md @@ -24,7 +24,7 @@ The Abstract Machine has a few properties that are essential to understand: that intentionally produce these executions like [Loom] and [Miri]. 1. Its model is highly formalized and not representative of what goes on underneath. Because C++ needs to be defined by a formal specification and - not just hand-wavy rules about “this is what allowed and this is what + not just hand-wavy rules about “this is what is allowed and this is what isn’t”, the Abstract Machine defines things in a very mathematical and, well, _abstract_, way; instead of saying things like “the compiler is allowed to do X” it will find a way to define the system such that the @@ -247,7 +247,7 @@ above the `2`: Now to sort out the read operation in the middle. We can use the same rule as before to trace up to the first write and rule out us reading either the `0` -value or the garbage that exists between it and `1`, but how to we choose +value or the garbage that exists between it and `1`, but how do we choose between the `1` and the `2`? Well, as it turns out there is a complement to the rule we already defined which gives us the exact answer we need: diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index 4f427ffb..e49d1871 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -279,7 +279,7 @@ two situations: 1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation B which is sequenced-before another `SeqCst` operation C. Here, despite the fact that A happens-before C, A does not _strongly_ happen-before C and so is - there not guaranteed to precede C in _S_. + not guaranteed to precede C in _S_. 2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel` operation B, which synchronizes-with another `SeqCst` operation C. Similarly, despite the fact that A happens-before C, A might not precede C in _S_. 
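A minimal sketch of situation 1, assuming two hypothetical statics `FLAG` and
`DATA`: the `SeqCst` store is A, the `Acquire` load that reads `true` is B, and
the `SeqCst` store sequenced-after it is C.

```rust
use std::sync::atomic::{AtomicBool, Ordering::{Acquire, SeqCst}};
use std::thread;

static FLAG: AtomicBool = AtomicBool::new(false);
static DATA: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        // A: a `SeqCst` store.
        FLAG.store(true, SeqCst);
    });
    let t2 = thread::spawn(|| {
        // B: an `Acquire` load; once it reads `true`, A synchronizes-with it.
        while !FLAG.load(Acquire) {}
        // C: a `SeqCst` operation sequenced-after B. A happens-before C, but
        // does not strongly happen-before C, so A is not guaranteed to
        // precede C in S.
        DATA.store(true, SeqCst);
    });
    t1.join().unwrap();
    t2.join().unwrap();
}
```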
From f3277bfbc957a509c67d50a5220214c618e149b3 Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 12 Jan 2024 12:19:16 +0000 Subject: [PATCH 32/34] Mention that Rust atomics correspond to `atomic_ref` --- src/atomics/atomics.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md index 71194e5e..fc3acb48 100644 --- a/src/atomics/atomics.md +++ b/src/atomics/atomics.md @@ -13,9 +13,11 @@ received some bugfixes since then.) Trying to fully explain the model in this book is fairly hopeless. It's defined in terms of madness-inducing causality graphs that require a full book to properly understand in a practical way. If you want all the nitty-gritty -details, you should check out the [C++ specification][C++-model]. -Still, we'll try to cover the basics and some of the problems Rust developers -face. +details, you should check out the [C++ specification][C++-model] — +note that Rust atomics correspond to C++’s `atomic_ref`, since Rust allows +accessing atomics via non-atomic operations when it is safe to do so. +In this section we aim to give an informal overview of the topic to cover the +common problems that Rust developers face. ## Motivation From 6d16ea57a3d34b3627001f2e3b91500c218ee4dc Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Fri, 12 Jan 2024 12:24:29 +0000 Subject: [PATCH 33/34] =?UTF-8?q?Explain=20the=20terms=20=E2=80=9Cstrongly?= =?UTF-8?q?/weakly-ordered=20hardware=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/atomics/atomics.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md index fc3acb48..979f102b 100644 --- a/src/atomics/atomics.md +++ b/src/atomics/atomics.md @@ -105,10 +105,12 @@ However there's a third potential state that the hardware enables: * `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) It's worth noting that different kinds of CPU provide different guarantees. It -is common to separate hardware into two categories: strongly-ordered and weakly-ordered. -Most notably x86/64 provides strong ordering guarantees, while ARM -provides weak ordering guarantees. This has two consequences for concurrent -programming: +is common to separate hardware into two categories: strongly-ordered and +weakly-ordered, where strongly-ordered hardware implements weak orderings like +`Relaxed` using strong orderings like `Acquire`, while weakly-ordered hardware +makes use of the optimization potential that weak orderings like `Relaxed` give. +Most notably, x86/64 provides strong ordering guarantees, while ARM provides +weak ordering guarantees. This has two consequences for concurrent programming: * Asking for stronger guarantees on strongly-ordered hardware may be cheap or even free because they already provide strong guarantees unconditionally. 
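As a rough illustration of the `atomic_ref` point above, here is a sketch
(using only stable `std` APIs) of the same value being accessed non-atomically
while one thread has exclusive access to it, and atomically while it is shared:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::thread;

fn main() {
    let mut counter = AtomicU32::new(0);

    // Exclusive access (`&mut AtomicU32`): no other thread can observe the
    // value, so it can be written with a plain non-atomic store.
    *counter.get_mut() = 5;

    // Shared access: every access must now be an atomic operation.
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                counter.fetch_add(1, Relaxed);
            });
        }
    });

    // Exclusive access again: read the final value back non-atomically.
    assert_eq!(counter.into_inner(), 9);
}
```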
From b139a3ce581c7340830131a27de0d527912f129e Mon Sep 17 00:00:00 2001 From: SabrinaJewson Date: Sun, 10 Mar 2024 11:21:51 +0000 Subject: [PATCH 34/34] Simplify SeqCst demonstration, and remove incorrect claim --- src/atomics/fences.md | 28 +++---- src/atomics/seqcst.md | 171 +++++++++++++++++++----------------------- 2 files changed, 93 insertions(+), 106 deletions(-) diff --git a/src/atomics/fences.md b/src/atomics/fences.md index 210234b4..6bc08c4f 100644 --- a/src/atomics/fences.md +++ b/src/atomics/fences.md @@ -230,25 +230,25 @@ Namely, the power of `SeqCst` fences can be summarized in three points: The “motivating use-case” for `SeqCst` demonstrated in the `SeqCst` chapter can also be rewritten to use exclusively `SeqCst` fences and `Relaxed` operations, -by inserting fences in between the loads in threads `c` and `d`: +by inserting fences in between the operations in the two threads: ```text - a static X c d static Y b -╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ -│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ -╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ - └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ - │ true ├─┘ │ *fence* │ │ *fence* │ └─┤ true │ - └───────┘ ╰────╥────╯ ╰────╥────╯ └───────┘ - ╭────⇓────╮ ╭────⇓────╮ - │ load Y ├─? ?─┤ load X │ - ╰─────────╯ ╰─────────╯ + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ *fence* │ │ true │ │ true │ │ *fence* │ +╰────╥────╯ └───────┘ └───────┘ ╰────╥────╯ +╭────⇓────╮ ╭────⇓────╮ +│ load Y ├─? ?─┤ load X │ +╰─────────╯ ╰─────────╯ ``` There are two executions to consider here, depending on which way round the -fences appear in _S_. Should `c`’s fence appear first, the fence–fence `SeqCst` -guarantee tells us that `c`’s load of `X` is not coherence-ordered-after `d`’s -load of `X`, which forbids `d`’s load of `X` from seeing the value `false`. The +fences appear in _S_. Should `a`’s fence appear first, the fence–fence `SeqCst` +guarantee tells us that `b`’s load of `X` is not coherence-ordered-after `a`’s +store of `X`, which forbids `b`’s load of `X` from seeing the value `false`. The same logic can be applied should the fences appear the other way around, proving that at least one thread must load `true` in the end. diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md index e49d1871..38e3d1ab 100644 --- a/src/atomics/seqcst.md +++ b/src/atomics/seqcst.md @@ -12,10 +12,8 @@ behind it are, and it gets really ugly really fast as soon as you try to mix it with any other ordering. To understand `SeqCst`, we first have to understand the problem it exists to -solve. The first complexity is that this problem can only be observed in the -presence of at least four different threads _and_ two separate atomic variables; -anything less and it’s not possible to notice a difference. The common example -used to show where weaker orderings produce counterintuitive results is this: +solve. 
A simple example used to show where weaker orderings produce +counterintuitive results is this: ```rust # use std::sync::atomic::{self, AtomicBool}; @@ -29,56 +27,48 @@ const ORDERING: atomic::Ordering = atomic::Ordering::Relaxed; static X: AtomicBool = AtomicBool::new(false); static Y: AtomicBool = AtomicBool::new(false); -let a = thread::spawn(|| { X.store(true, ORDERING) }); -let b = thread::spawn(|| { Y.store(true, ORDERING) }); -let c = thread::spawn(|| { while !X.load(ORDERING) {} Y.load(ORDERING) }); -let d = thread::spawn(|| { while !Y.load(ORDERING) {} X.load(ORDERING) }); +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); let a = a.join().unwrap(); let b = b.join().unwrap(); -let c = c.join().unwrap(); -let d = d.join().unwrap(); # return; // This assert is allowed to fail. -assert!(c || d); +assert!(a || b); ``` The basic setup of this code, for all of its possible executions, looks like this: ```text - a static X c d static Y b -╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ -│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ -╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ - └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ - │ true ├─┘ │ load Y ├─? ?─┤ load X │ └─┤ true │ - └───────┘ ╰─────────╯ ╰─────────╯ └───────┘ + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ ``` -In other words, `a` and `b` are guaranteed to, at some point, store `true` into -`X` and `Y` respectively, and `c` and `d` are guaranteed to, at some point, load -those values of `true` from `X` and `Y` (there could also be an arbitrary number -of loads of `false` by `c` and `d`, but they’ve been omitted since they don’t -actually affect the execution at all). The question now is when `c` and `d` load -from `Y` and `X` respectively, is it possible for them _both_ to load `false`? +In other words, `a` and `b` are guaranteed to store `true` into `X` and `Y` +respectively, and then attempt to load from the other thread’s atomic. The +question now is: is it possible for them _both_ to load `false`? And looking at this diagram, there’s absolutely no reason why not. 
There isn’t -even a single arrow connecting the left and right hand sides so far, so the load -has no coherence-based restrictions on which value it is allowed to pick — and -this goes for both sides equally, so we could end up with an execution like -this: +even a single arrow connecting the left and right hand sides so far, so the +loads have no coherence-based restrictions on which values they are allowed to +pick, and we could end up with an execution like this: ```text - a static X c d static Y b -╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ -│ store X ├─┐ │ false ├┐ ┌┤ load X │ │ load Y ├┐ ┌┤ false │ ┌─┤ store Y │ -╰─────────╯ │ └───────┘│ │╰────╥────╯ ╰────╥────╯│ │└───────┘ │ ╰─────────╯ - └─┬───────┐└─│─────║──────┐┌─────║─────│─┘┌───────┬─┘ - │ true ├──┘╭────⇓────╮┌─┘╭────⇓────╮└──┤ true │ - └───────┘ │ load Y ├┘└─┤ load X │ └───────┘ - ╰─────────╯ ╰─────────╯ + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├┐ │ false ├─┐┌┤ false │ ┌┤ store Y │ +╰────╥────╯│ └───────┘┌─┘└───────┘ │╰────╥────╯ + ║ │ ┌─────────┘└───────────┐│ ║ +╭────⇓────╮└─│┬───────┐ ┌───────┬─│┘╭────⇓────╮ +│ load Y ├──┘│ true │ │ true │ └─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ ``` Which results in a failed assert. This execution is brought about because the @@ -178,16 +168,16 @@ make that execution invalid. As a refresher, here’s the framework for every possible execution of the program: ```text - a static X c d static Y b -╭─────────╮ ┌───────┐ ╭─────────╮ ╭─────────╮ ┌───────┐ ╭─────────╮ -│ store X ├─┐ │ false │ ┌─┤ load X │ │ load Y ├─┐ │ false │ ┌─┤ store Y │ -╰─────────╯ │ └───────┘ │ ╰────╥────╯ ╰────╥────╯ │ └───────┘ │ ╰─────────╯ - └─┬───────┐ │ ╭────⇓────╮ ╭────⇓────╮ │ ┌───────┬─┘ - │ true ├─┘ │ load Y ├─? ?─┤ load X │ └─┤ true │ - └───────┘ ╰─────────╯ ╰─────────╯ └───────┘ + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ ``` -First of all, both the final loads (`c` and `d`’s second operations) need to +First of all, both the final loads (`a` and `b`’s second operations) need to become `SeqCst`, because they need to be aware of the total ordering that determines whether `X` or `Y` becomes `true` first. And secondly, we need to establish that ordering in the first place, and that needs to be done by making @@ -195,77 +185,74 @@ sure that there is always one operation in _S_ that both sees one of the atomics as `true` and precedes both final loads in _S_, so that the coherence ordering guarantee will apply (the final loads themselves don’t work for this since although they “know” that their corresponding atomic is `true` they don’t -interact with it directly so _S_ doesn’t care). - -There are two operations in the program that could fulfill the first condition, -should they be made `SeqCst`: the stores of `true` and the first loads. However, -the second condition ends up ruling out using the stores, since in order to make -sure that they precede the final loads in _S_ it would be necessary to have the -first loads be `SeqCst` anyway (due to the mixed-`SeqCst` special case detailed -later), so in the end we can just leave them as `Relaxed`. +interact with it directly so _S_ doesn’t care) — for this, we must set both +stores to use the `SeqCst` ordering. 
This leaves us with the correct version of the above program, which is guaranteed to never panic: ```rust -# use std::sync::atomic::{AtomicBool, Ordering::{Relaxed, SeqCst}}; +# use std::sync::atomic::{self, AtomicBool}; use std::thread; +const ORDERING: atomic::Ordering = atomic::Ordering::SeqCst; + static X: AtomicBool = AtomicBool::new(false); static Y: AtomicBool = AtomicBool::new(false); -let a = thread::spawn(|| { X.store(true, Relaxed) }); -let b = thread::spawn(|| { Y.store(true, Relaxed) }); -let c = thread::spawn(|| { while !X.load(SeqCst) {} Y.load(SeqCst) }); -let d = thread::spawn(|| { while !Y.load(SeqCst) {} X.load(SeqCst) }); +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); let a = a.join().unwrap(); let b = b.join().unwrap(); -let c = c.join().unwrap(); -let d = d.join().unwrap(); +# return; // This assert is **not** allowed to fail. -assert!(c || d); +assert!(a || b); ``` As there are four `SeqCst` operations with a partial order between two pairs in them (caused by the sequenced-before relation), there are six possible executions of this program: -- All of `c`’s loads precede `d`’s loads: - 1. `c` loads `X` (gives `true`) - 1. `c` loads `Y` (gives either `false` or `true`) - 1. `d` loads `Y` (gives `true`) - 1. `d` loads `X` (required to be `true`) -- Both initial loads precede both final loads: - 1. `c` loads `X` (gives `true`) - 1. `d` loads `Y` (gives `true`) - 1. `c` loads `Y` (required to be `true`) - 1. `d` loads `X` (required to be `true`) -- As above, but the final loads occur in a different order: - 1. `c` loads `X` (gives `true`) - 1. `d` loads `Y` (gives `true`) - 1. `d` loads `X` (required to be `true`) - 1. `c` loads `Y` (required to be `true`) -- As before, but the initial loads occur in a different order: - 1. `d` loads `Y` (gives `true`) - 1. `c` loads `X` (gives `true`) - 1. `c` loads `Y` (required to be `true`) - 1. `d` loads `X` (required to be `true`) -- As above, but the final loads occur in a different order: - 1. `d` loads `Y` (gives `true`) - 1. `c` loads `X` (gives `true`) - 1. `d` loads `X` (required to be `true`) - 1. `c` loads `Y` (required to be `true`) -- All of `d`’s loads precede `c`’s loads: - 1. `d` loads `Y` (gives `true`) - 1. `d` loads `X` (gives either `false` or `true`) - 1. `c` loads `X` (gives `true`) - 1. `c` loads `Y` (required to be `true`) +- All of `a`’s operations precede `b`’s operations: + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (gives `false`) + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) +- All of `b`’s operations precede `a`’s operations: + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (gives `false`) + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `a`’s load precedes `b`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `a` loads `Y` (required to give `true`) + 1. `b` loads `X` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `b`’s load precedes `a`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `b`’s store precedes `a`’s and `a`’s load precedes `b`’s: + 1. `b` stores `true` into `Y` + 1. `a` stores `true` to `X` + 1. `a` loads `Y` (required to give `true`) + 1. 
`b` loads `X` (required to give `true`)
+- The stores precede the loads,
+  `b`’s store precedes `a`’s and `b`’s load precedes `a`’s:
+  1. `b` stores `true` into `Y`
+  1. `a` stores `true` to `X`
+  1. `b` loads `X` (required to give `true`)
+  1. `a` loads `Y` (required to give `true`)
 
 All the places where the load was required to give `true` were caused by a
-preceding load in _S_ of the same atomic which saw `true` — otherwise, the load
-would be coherence-ordered-before a load which precedes it in _S_, and that is
+preceding store of `true` to the same atomic in _S_ — otherwise, the load would
+be coherence-ordered-before a store which precedes it in _S_, and that is
 impossible.
 
 ## The mixed-`SeqCst` special case