diff --git a/book.toml b/book.toml index 693aca4a..cf98b4e6 100644 --- a/book.toml +++ b/book.toml @@ -31,5 +31,8 @@ git-repository-url = "https://github.com/rust-lang/nomicon" "./arc-layout.html" = "./arc-mutex/arc-layout.html" "./arc.html" = "./arc-mutex/arc.html" +# Atomics chapter +"./atomics.html" = "./atomics/atomics.html" + [rust] edition = "2021" diff --git a/src/SUMMARY.md b/src/SUMMARY.md index f1d15a71..01304c5e 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -41,7 +41,12 @@ * [Concurrency](concurrency.md) * [Races](races.md) * [Send and Sync](send-and-sync.md) - * [Atomics](atomics.md) + * [Atomics](./atomics/atomics.md) + * [Multithreaded Execution](./atomics/multithread.md) + * [Relaxed](./atomics/relaxed.md) + * [Acquire and Release](./atomics/acquire-release.md) + * [SeqCst](./atomics/seqcst.md) + * [Fences](./atomics/fences.md) * [Implementing Vec](./vec/vec.md) * [Layout](./vec/vec-layout.md) * [Allocating](./vec/vec-alloc.md) diff --git a/src/arc-mutex/arc-clone.md b/src/arc-mutex/arc-clone.md index 1adc6c9e..29cb5c77 100644 --- a/src/arc-mutex/arc-clone.md +++ b/src/arc-mutex/arc-clone.md @@ -28,7 +28,7 @@ happens-before relationship but is atomic. When `Drop`ping the Arc, however, we'll need to atomically synchronize when decrementing the reference count. This is described more in [the section on the `Drop` implementation for `Arc`](arc-drop.md). For more information on atomic relationships and Relaxed -ordering, see [the section on atomics](../atomics.md). +ordering, see [the section on atomics](../atomics/atomics.md). Thus, the code becomes this: diff --git a/src/atomics.md b/src/atomics.md deleted file mode 100644 index 72a2d56f..00000000 --- a/src/atomics.md +++ /dev/null @@ -1,239 +0,0 @@ -# Atomics - -Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not -due to this model being particularly excellent or easy to understand. 
Indeed, -this model is quite complex and known to have [several flaws][C11-busted]. -Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad -at modeling atomics. At very least, we can benefit from existing tooling and -research around the C/C++ memory model. -(You'll often see this model referred to as "C/C++11" or just "C11". C just copies -the C++ memory model; and C++11 was the first version of the model but it has -received some bugfixes since then.) - -Trying to fully explain the model in this book is fairly hopeless. It's defined -in terms of madness-inducing causality graphs that require a full book to -properly understand in a practical way. If you want all the nitty-gritty -details, you should check out the [C++ specification][C++-model]. -Still, we'll try to cover the basics and some of the problems Rust developers -face. - -The C++ memory model is fundamentally about trying to bridge the gap between the -semantics we want, the optimizations compilers want, and the inconsistent chaos -our hardware wants. *We* would like to just write programs and have them do -exactly what we said but, you know, fast. Wouldn't that be great? - -## Compiler Reordering - -Compilers fundamentally want to be able to do all sorts of complicated -transformations to reduce data dependencies and eliminate dead code. In -particular, they may radically change the actual order of events, or make events -never occur! If we write something like: - - -```rust,ignore -x = 1; -y = 3; -x = 2; -``` - -The compiler may conclude that it would be best if your program did: - - -```rust,ignore -x = 2; -y = 3; -``` - -This has inverted the order of events and completely eliminated one event. -From a single-threaded perspective this is completely unobservable: after all -the statements have executed we are in exactly the same state. But if our -program is multi-threaded, we may have been relying on `x` to actually be -assigned to 1 before `y` was assigned. 
We would like the compiler to be -able to make these kinds of optimizations, because they can seriously improve -performance. On the other hand, we'd also like to be able to depend on our -program *doing the thing we said*. - -## Hardware Reordering - -On the other hand, even if the compiler totally understood what we wanted and -respected our wishes, our hardware might instead get us in trouble. Trouble -comes from CPUs in the form of memory hierarchies. There is indeed a global -shared memory space somewhere in your hardware, but from the perspective of each -CPU core it is *so very far away* and *so very slow*. Each CPU would rather work -with its local cache of the data and only go through all the anguish of -talking to shared memory only when it doesn't actually have that memory in -cache. - -After all, that's the whole point of the cache, right? If every read from the -cache had to run back to shared memory to double check that it hadn't changed, -what would the point be? The end result is that the hardware doesn't guarantee -that events that occur in some order on *one* thread, occur in the same -order on *another* thread. To guarantee this, we must issue special instructions -to the CPU telling it to be a bit less smart. - -For instance, say we convince the compiler to emit this logic: - -```text -initial state: x = 0, y = 1 - -THREAD 1 THREAD 2 -y = 3; if x == 1 { -x = 1; y *= 2; - } -``` - -Ideally this program has 2 possible final states: - -* `y = 3`: (thread 2 did the check before thread 1 completed) -* `y = 6`: (thread 2 did the check after thread 1 completed) - -However there's a third potential state that the hardware enables: - -* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) - -It's worth noting that different kinds of CPU provide different guarantees. It -is common to separate hardware into two categories: strongly-ordered and weakly-ordered. 
-Most notably x86/64 provides strong ordering guarantees, while ARM -provides weak ordering guarantees. This has two consequences for concurrent -programming: - -* Asking for stronger guarantees on strongly-ordered hardware may be cheap or - even free because they already provide strong guarantees unconditionally. - Weaker guarantees may only yield performance wins on weakly-ordered hardware. - -* Asking for guarantees that are too weak on strongly-ordered hardware is - more likely to *happen* to work, even though your program is strictly - incorrect. If possible, concurrent algorithms should be tested on - weakly-ordered hardware. - -## Data Accesses - -The C++ memory model attempts to bridge the gap by allowing us to talk about the -*causality* of our program. Generally, this is by establishing a *happens -before* relationship between parts of the program and the threads that are -running them. This gives the hardware and compiler room to optimize the program -more aggressively where a strict happens-before relationship isn't established, -but forces them to be more careful where one is established. The way we -communicate these relationships are through *data accesses* and *atomic -accesses*. - -Data accesses are the bread-and-butter of the programming world. They are -fundamentally unsynchronized and compilers are free to aggressively optimize -them. In particular, data accesses are free to be reordered by the compiler on -the assumption that the program is single-threaded. The hardware is also free to -propagate the changes made in data accesses to other threads as lazily and -inconsistently as it wants. Most critically, data accesses are how data races -happen. Data accesses are very friendly to the hardware and compiler, but as -we've seen they offer *awful* semantics to try to write synchronized code with. -Actually, that's too weak. 
- -**It is literally impossible to write correct synchronized code using only data -accesses.** - -Atomic accesses are how we tell the hardware and compiler that our program is -multi-threaded. Each atomic access can be marked with an *ordering* that -specifies what kind of relationship it establishes with other accesses. In -practice, this boils down to telling the compiler and hardware certain things -they *can't* do. For the compiler, this largely revolves around re-ordering of -instructions. For the hardware, this largely revolves around how writes are -propagated to other threads. The set of orderings Rust exposes are: - -* Sequentially Consistent (SeqCst) -* Release -* Acquire -* Relaxed - -(Note: We explicitly do not expose the C++ *consume* ordering) - -TODO: negative reasoning vs positive reasoning? TODO: "can't forget to -synchronize" - -## Sequentially Consistent - -Sequentially Consistent is the most powerful of all, implying the restrictions -of all other orderings. Intuitively, a sequentially consistent operation -cannot be reordered: all accesses on one thread that happen before and after a -SeqCst access stay before and after it. A data-race-free program that uses -only sequentially consistent atomics and data accesses has the very nice -property that there is a single global execution of the program's instructions -that all threads agree on. This execution is also particularly nice to reason -about: it's just an interleaving of each thread's individual executions. This -does not hold if you start using the weaker atomic orderings. - -The relative developer-friendliness of sequential consistency doesn't come for -free. Even on strongly-ordered platforms sequential consistency involves -emitting memory fences. - -In practice, sequential consistency is rarely necessary for program correctness. -However sequential consistency is definitely the right choice if you're not -confident about the other memory orders. 
Having your program run a bit slower -than it needs to is certainly better than it running incorrectly! It's also -mechanically trivial to downgrade atomic operations to have a weaker -consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of -course, proving that this transformation is *correct* is a whole other matter. - -## Acquire-Release - -Acquire and Release are largely intended to be paired. Their names hint at their -use case: they're perfectly suited for acquiring and releasing locks, and -ensuring that critical sections don't overlap. - -Intuitively, an acquire access ensures that every access after it stays after -it. However operations that occur before an acquire are free to be reordered to -occur after it. Similarly, a release access ensures that every access before it -stays before it. However operations that occur after a release are free to be -reordered to occur before it. - -When thread A releases a location in memory and then thread B subsequently -acquires *the same* location in memory, causality is established. Every write -(including non-atomic and relaxed atomic writes) that happened before A's -release will be observed by B after its acquisition. However no causality is -established with any other threads. Similarly, no causality is established -if A and B access *different* locations in memory. - -Basic use of release-acquire is therefore simple: you acquire a location of -memory to begin the critical section, and then release that location to end it. -For instance, a simple spinlock might look like: - -```rust -use std::sync::Arc; -use std::sync::atomic::{AtomicBool, Ordering}; -use std::thread; - -fn main() { - let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?" - - // ... distribute lock to threads somehow ... - - // Try to acquire the lock by setting it to true - while lock.compare_and_swap(false, true, Ordering::Acquire) { } - // broke out of the loop, so we successfully acquired the lock! 
- - // ... scary data accesses ... - - // ok we're done, release the lock - lock.store(false, Ordering::Release); -} -``` - -On strongly-ordered platforms most accesses have release or acquire semantics, -making release and acquire often totally free. This is not the case on -weakly-ordered platforms. - -## Relaxed - -Relaxed accesses are the absolute weakest. They can be freely re-ordered and -provide no happens-before relationship. Still, relaxed operations are still -atomic. That is, they don't count as data accesses and any read-modify-write -operations done to them occur atomically. Relaxed operations are appropriate for -things that you definitely want to happen, but don't particularly otherwise care -about. For instance, incrementing a counter can be safely done by multiple -threads using a relaxed `fetch_add` if you're not using the counter to -synchronize any other accesses. - -There's rarely a benefit in making an operation relaxed on strongly-ordered -platforms, since they usually provide release-acquire semantics anyway. However -relaxed operations can be cheaper on weakly-ordered platforms. - -[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf -[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order diff --git a/src/atomics/acquire-release.md b/src/atomics/acquire-release.md new file mode 100644 index 00000000..6d6f0b95 --- /dev/null +++ b/src/atomics/acquire-release.md @@ -0,0 +1,354 @@ +# Acquire and Release + +Next, we’re going to try and implement one of the simplest concurrent utilities +possible — a mutex, but without support for waiting (since that’s not really +related to what we’re doing now). It will hold both an atomic flag that +indicates whether it is locked or not, and the protected data itself. 
In code
+this translates to:
+
+```rust
+use std::cell::UnsafeCell;
+use std::sync::atomic::AtomicBool;
+
+pub struct Mutex<T> {
+    locked: AtomicBool,
+    data: UnsafeCell<T>,
+}
+
+impl<T> Mutex<T> {
+    pub const fn new(data: T) -> Self {
+        Self {
+            locked: AtomicBool::new(false),
+            data: UnsafeCell::new(data),
+        }
+    }
+}
+```
+
+Now for the lock function. We need to use an RMW here, since we need to both
+check whether it is locked and lock it if it isn’t in a single atomic step; this
+can be most simply done with a `compare_exchange` (unlike before, it doesn’t
+need to be in a loop this time). For the ordering, we’ll just use `Relaxed`
+since we don’t know of any others yet.
+
+```rust
+# use std::cell::UnsafeCell;
+# use std::sync::atomic::{self, AtomicBool};
+# pub struct Mutex<T> {
+#     locked: AtomicBool,
+#     data: UnsafeCell<T>,
+# }
+impl<T> Mutex<T> {
+    pub fn lock(&self) -> Option<Guard<'_, T>> {
+        match self.locked.compare_exchange(
+            false,
+            true,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        ) {
+            Ok(_) => Some(Guard(self)),
+            Err(_) => None,
+        }
+    }
+}
+
+pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
+// Deref impl omitted…
+```
+
+We also need to implement `Drop` for `Guard` to make sure the lock on the mutex
+is released once the guard is destroyed. Again we’re just using the `Relaxed`
+ordering.
+
+```rust
+# use std::cell::UnsafeCell;
+# use std::sync::atomic::{self, AtomicBool};
+# pub struct Mutex<T> {
+#     locked: AtomicBool,
+#     data: UnsafeCell<T>,
+# }
+# pub struct Guard<'mutex, T>(&'mutex Mutex<T>);
+impl<T> Drop for Guard<'_, T> {
+    fn drop(&mut self) {
+        self.0.locked.store(false, atomic::Ordering::Relaxed);
+    }
+}
+```
+
+Great! In normal operation, then, this primitive should allow unique access
+to the data of the mutex to be transferred across different threads.
Usual usage +could look like this: + +```rust,ignore +// Initial state +let mutex = Mutex::new(0); +// Thread 1 +if let Some(guard) = mutex.lock() { + *guard += 1; +} +// Thread 2 +if let Some(guard) = mutex.lock() { + println!("{}", *guard); +} +``` + +Now, there are many possible executions of this code. For example, Thread 2 (the +reader thread) could lock the mutex first, and Thread 1 (the writer thread) +could fail to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 ├╌┐ ┌─┤ cas │ +╰───────╯ │ └────────┘ └───┘ ┊ │ ╰───╥───╯ + │ ┌────────┬───────┼─┘ ╭───⇓───╮ + └─┤ true │ └╌╌╌┤ guard │ + └────────┘ ╰───╥───╯ + ┌────────┬─────────┐ ╭───⇓───╮ + │ false │ └─┤ store │ + └────────┘ ╰───────╯ +``` + +Or potentially Thread _1_ could lock the mutex first, and Thread _2_ could fail +to lock it: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌────────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ ┌─│ 0 │───┤ cas │ +╰───╥───╯ │ └────────┘ │┌┼╌╌╌┤ ╰───────╯ +╭───⇓───╮ └─┬────────┐ │├┼╌╌╌┤ +│ += 1; ├╌┐ │ true ├─┘┊│ 1 │ +╰───╥───╯ ┊ └────────┘ ┊└───┘ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ +│ store ├───┬────────┐ +╰───────╯ │ false │ + └────────┘ +``` + +But the interesting case comes in when Thread 1 successfully locks and unlocks +the mutex, and then Thread 2 locks it. 
Let’s draw that one out too:
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌────────┐ ┌───┐ ╭───────╮
+│ cas ├─┐ │ false │ │ 0 │ ┌───┤ cas │
+╰───╥───╯ │ └────────┘ ┌┼╌╌╌┤ │ ╰───╥───╯
+╭───⇓───╮ └─┬────────┐ ├┼╌╌╌┤ │ ╭───⇓───╮
+│ += 1; ├╌┐ │ true │ ┊│ 1 │ │ ?╌┤ guard │
+╰───╥───╯ ┊ └────────┘ ┊└───┘ │ ╰───╥───╯
+╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮
+│ store ├───┬────────┐ │ ┌─┤ store │
+╰───────╯ │ false │ │ │ ╰───────╯
+ └────────┘ │ │
+ ┌────────┬─────────┘ │
+ │ true │ │
+ └────────┘ │
+ ┌────────┬───────────┘
+ │ false │
+ └────────┘
+```
+
+Look at the second operation Thread 2 performs (the read of `data`), for which
+we haven’t yet joined the line. Where should it connect to? Well actually, it
+has multiple options…wait, we’ve seen this before! It’s a data race!
+
+That’s not good. Last time the solution was to use atomics instead — but in this
+case that doesn’t seem to be enough, since even if atomics were used it still
+would have the _option_ of reading `0` instead of `1`, and really if we want our
+mutex to be sane, it should only be able to read `1`.
+
+So it seems that what we _want_ is to be able to apply the coherence rules from
+before to completely rule out zero from the set of the possible values — if we
+were able to draw a large arrow from Thread 1’s `+= 1;` to Thread 2’s
+`guard`, then we could then trivially use the rule to rule out `0` as a value
+that could be read.
+
+This is where the `Acquire` and `Release` orderings come in. Informally put, a
+_release store_ will cause an arrow instead of a line to be drawn from the
+operation to the destination; and similarly an _acquire load_ will cause an
+arrow to be drawn from the destination to the operation.
To give a useless +example that illustrates this, for the given program: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +// Initial state +let a = AtomicU32::new(0); +// Thread 1 +a.store(1, atomic::Ordering::Release); +// Thread 2 +a.load(atomic::Ordering::Acquire); +``` + +The two possible executions look like this: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ +Thread 1 a Thread 2 ┃ Thread 1 a Thread 2 +╭───────╮ ┌───┐ ╭──────╮ ┃ ╭───────╮ ┌───┐ ╭──────╮ +│ store ├─┐ │ 0 │ ┌─→ load │ ┃ │ store ├─┐ │ 0 ├───→ load │ +╰───────╯ │ └───┘ │ ╰──────╯ ┃ ╰───────╯ │ └───┘ ╰──────╯ + └─↘───┐ │ ┃ └─↘───┐ + │ 1 ├─┘ ┃ │ 1 │ + └───┘ ┃ └───┘ +``` + +These arrows are a new kind of arrow we haven’t seen yet; they are known as +_happens-before_ (or happens-after) relations and are represented as thin arrows +(→) on these diagrams. They are weaker than the _sequenced-before_ +double-arrows (⇒) that occur inside a single thread, but can still be used with +the coherence rules to determine which values of a memory location are valid to +read. + +When a happens-before arrow stores a data value to an atomic (via a release +operation) which is then loaded by another happens-before arrow (via an acquire +operation) we say that the release operation _synchronized-with_ the acquire +operation, which in doing so establishes that the release operation +_happens-before_ the acquire operation. Therefore, we can say that in the first +possible execution, Thread 1’s `store` synchronizes-with Thread 2’s `load`, +which causes that `store` and everything sequenced-before it to happen-before +the `load` and everything sequenced-after it. + +> More formally, we can say that A happens-before B if any of the following +> conditions are true: +> 1. A is sequenced-before B (i.e. A occurs before B on the same thread) +> 2. A synchronizes-with B (i.e. A is a `Release` operation and B is an +> `Acquire` operation that reads the value written by A) +> 3. 
A happens-before X, and X happens-before B (transitivity) + +There is one more rule required for these to be useful, and that is _release +sequences_: after a release store is performed on an atomic, happens-before +arrows will connect together each subsequent value of the atomic as long as the +new value is caused by an RMW and not just a plain store (this means any +subsequent normal store, no matter the ordering, will end the sequence). + +> In the C++11 memory model, any subsequent store by the same thread that +> performed the original `Release` store would also contribute to the release +> sequence. However, this was removed in C++20 for simplicity and better +> optimizations and so **must not** be relied upon. + +With those rules in mind, converting Thread 1’s second store to use a `Release` +ordering as well as converting Thread 2’s CAS to use an `Acquire` ordering +allows us to effectively draw that arrow we needed before: + +```text +Thread 1 locked data Thread 2 +╭───────╮ ┌───────┐ ┌───┐ ╭───────╮ +│ cas ├─┐ │ false │ │ 0 │ ┌───→ cas │ +╰───╥───╯ │ └───────┘ ┌┼╌╌╌┤ │ ╰───╥───╯ +╭───⇓───╮ └─┬───────┐ ├┼╌╌╌┤ │ ╭───⇓───╮ +│ += 1; ├╌┐ │ true │ ┊│ 1 ├╌│╌╌╌┤ guard │ +╰───╥───╯ ┊ └───────┘ ┊└───┘ │ ╰───╥───╯ +╭───⇓───╮ └╌╌╌╌╌╌╌╌╌╌╌╌┘ │ ╭───⇓───╮ +│ store ├───↘───────┐ │ ┌─┤ store │ +╰───────╯ │ false │ │ │ ╰───────╯ + └───┬───┘ │ │ + ┌───↓───┬─────────┘ │ + │ true │ │ + └───────┘ │ + ┌───────┬───────────┘ + │ false │ + └───────┘ +``` + +We now can trace back along the reverse direction of arrows from the `guard` +bubble to the `+= 1` bubble; we have established that Thread 2’s load +happens-after the `+= 1` side effect, because Thread 2’s CAS synchronizes-with +Thread 1’s store. This both avoids the data race and gives the guarantee that +`1` will be always read by Thread 2 (as long as it locks after Thread 1, of +course). + +However, that is not the only execution of the program possible. 
Even with this
+setup, there is another execution that can also cause UB: if Thread 2 locks the
+mutex before Thread 1 does.
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
+│ cas ├───┐ │ false │┌──│ 0 │────→ cas │
+╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ╰───╥───╯
+╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ╭───⇓───╮
+│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ ?╌┤ guard │
+╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
+╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
+│ store ├─┐ │ ┌───────┬────────────┤ store │
+╰───────╯ │ │ │ false │ ╰───────╯
+ │ │ └───────┘
+ │ └─┬───────┐
+ │ │ true │
+ │ └───────┘
+ └───↘───────┐
+ │ false │
+ └───────┘
+```
+
+Once again `guard` has multiple options for values to read. This one’s a bit
+more counterintuitive than the previous one, since it requires “travelling
+forward in time” to understand why the `1` is even there in the first place —
+but since the abstract machine has no concept of time, it’s just as valid a
+source of UB as any other execution.
+
+Luckily, we’ve already solved this problem once, so it’s easy to solve again:
+just like before, we’ll have the CAS become acquire and the store become
+release, and then we can use the second coherence rule from before to follow
+_forward_ the arrow from the `guard` bubble all the way to the `+= 1;`,
+determining that it is only possible for that read to see `0` as its value, as
+in the execution below.
+
+```text
+Thread 1 locked data Thread 2
+╭───────╮ ┌───────┐ ┌───┐ ╭───────╮
+│ cas ←───┐ │ false │┌──│ 0 ├╌┐──→ cas │
+╰───╥───╯ │ └───────┘│ ┌┼╌╌╌┤ ┊ ╰───╥───╯
+╭───⇓───╮ │ ┌───────┬┘ ├┼╌╌╌┤ ┊ ╭───⇓───╮
+│ += 1; ├╌┐ │ │ true │ ┊│ 1 │ └─╌┤ guard │
+╰───╥───╯ ┊ │ └───────┘ ┊└───┘ ╰───╥───╯
+╭───⇓───╮ └╌│╌╌╌╌╌╌╌╌╌╌╌╌┘ ╭───⇓───╮
+│ store ├─┐ │ ┌───────↙────────────┤ store │
+╰───────╯ │ │ │ false │ ╰───────╯
+ │ │ └───┬───┘
+ │ └─┬───↓───┐
+ │ │ true │
+ │ └───────┘
+ └───↘───────┐
+ │ false │
+ └───────┘
+```
+
+This leads us to the proper memory orderings for any mutex (and other locks like
+RW locks too, even): use `Acquire` to lock it, and `Release` to unlock it. So
+let’s go back and update our original mutex definition with this knowledge.
+
+But wait, `compare_exchange` takes two ordering parameters, not just one! That’s
+right — it also takes a second one to apply when the exchange fails (in our case,
+when the mutex is already locked). But we don’t need an `Acquire` here, since in
+that case we won’t be reading from the `data` value anyway, so we’ll just stick
+with `Relaxed`.
+
+```rust,ignore
+impl<T> Mutex<T> {
+    pub fn lock(&self) -> Option<Guard<'_, T>> {
+        match self.locked.compare_exchange(
+            false,
+            true,
+            atomic::Ordering::Acquire,
+            atomic::Ordering::Relaxed,
+        ) {
+            Ok(_) => Some(Guard(self)),
+            Err(_) => None,
+        }
+    }
+}
+
+impl<T> Drop for Guard<'_, T> {
+    fn drop(&mut self) {
+        self.0.locked.store(false, atomic::Ordering::Release);
+    }
+}
+```
+
+Note that similarly to how atomic operations only make sense when paired with
+other atomic operations on the same locations, `Acquire` only makes sense when
+paired with `Release` and vice versa. That is, both an `Acquire` with no
+corresponding `Release` and a `Release` with no corresponding `Acquire` are
+useless, since the arrows will be unable to connect to anything.
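To see these orderings doing real work, here is a self-contained sketch of the mutex developed above, filled out with the `Deref`/`DerefMut` impls that the listing omitted and a `Sync` impl so it can actually be shared between threads (both are our additions, not part of the chapter's listing). Four threads each take the lock a thousand times and increment the protected counter:

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

pub struct Mutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// Safety: the Acquire/Release lock protocol ensures exclusive access to `data`.
unsafe impl<T: Send> Sync for Mutex<T> {}

impl<T> Mutex<T> {
    pub const fn new(data: T) -> Self {
        Self { locked: AtomicBool::new(false), data: UnsafeCell::new(data) }
    }

    pub fn lock(&self) -> Option<Guard<'_, T>> {
        match self.locked.compare_exchange(
            false,
            true,
            Ordering::Acquire, // success: synchronize with the previous unlock
            Ordering::Relaxed, // failure: we won't touch `data`, Relaxed is fine
        ) {
            Ok(_) => Some(Guard(self)),
            Err(_) => None,
        }
    }
}

pub struct Guard<'mutex, T>(&'mutex Mutex<T>);

impl<T> Deref for Guard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.0.data.get() }
    }
}

impl<T> DerefMut for Guard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.0.data.get() }
    }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        // Release: publish our writes to whoever acquires the lock next.
        self.0.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let mutex = Arc::new(Mutex::new(0_u32));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let mutex = Arc::clone(&mutex);
            thread::spawn(move || {
                for _ in 0..1000 {
                    // Spin until the lock is free.
                    loop {
                        if let Some(mut guard) = mutex.lock() {
                            *guard += 1;
                            break;
                        }
                        std::hint::spin_loop();
                    }
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*mutex.lock().unwrap(), 4000);
}
```

If the unlock were `Relaxed` instead of `Release`, the increments themselves would race exactly as described in the diagrams above.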
diff --git a/src/atomics/atomics.md b/src/atomics/atomics.md new file mode 100644 index 00000000..979f102b --- /dev/null +++ b/src/atomics/atomics.md @@ -0,0 +1,125 @@ +# Atomics + +Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not +due to this model being particularly excellent or easy to understand. Indeed, +this model is quite complex and known to have [several flaws][C11-busted]. +Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad +at modeling atomics. At very least, we can benefit from existing tooling and +research around the C/C++ memory model. +(You'll often see this model referred to as "C/C++11" or just "C11". C just copies +the C++ memory model; and C++11 was the first version of the model but it has +received some bugfixes since then.) + +Trying to fully explain the model in this book is fairly hopeless. It's defined +in terms of madness-inducing causality graphs that require a full book to +properly understand in a practical way. If you want all the nitty-gritty +details, you should check out the [C++ specification][C++-model] — +note that Rust atomics correspond to C++’s `atomic_ref`, since Rust allows +accessing atomics via non-atomic operations when it is safe to do so. +In this section we aim to give an informal overview of the topic to cover the +common problems that Rust developers face. + +## Motivation + +The C++ memory model is very large and confusing with lots of seemingly +arbitrary design decisions. To understand the motivation behind this, it can +help to look at what got us in this situation in the first place. There are +three main factors at play here: + +1. Users of the language, who want fast, cross-platform code; +2. compilers, who want to optimize code to make it fast; +3. and the hardware, which is ready to unleash a wrath of inconsistent chaos on + your program at a moment's notice. 
+
+The memory model is fundamentally about trying to bridge the gap between these
+three, allowing users to write the algorithms they want while the compiler and
+hardware perform the arcane magic necessary to make them run fast.
+
+### Compiler Reordering
+
+Compilers fundamentally want to be able to do all sorts of complicated
+transformations to reduce data dependencies and eliminate dead code. In
+particular, they may radically change the actual order of events, or make events
+never occur! If we write something like:
+
+
+```rust,ignore
+x = 1;
+y = 3;
+x = 2;
+```
+
+The compiler may conclude that it would be best if your program did:
+
+
+```rust,ignore
+x = 2;
+y = 3;
+```
+
+This has inverted the order of events and completely eliminated one event.
+From a single-threaded perspective this is completely unobservable: after all
+the statements have executed we are in exactly the same state. But if our
+program is multi-threaded, we may have been relying on `x` to actually be
+assigned to 1 before `y` was assigned. We would like the compiler to be
+able to make these kinds of optimizations, because they can seriously improve
+performance. On the other hand, we'd also like to be able to depend on our
+program *doing the thing we said*.
+
+### Hardware Reordering
+
+On the other hand, even if the compiler totally understood what we wanted and
+respected our wishes, our hardware might instead get us in trouble. Trouble
+comes from CPUs in the form of memory hierarchies. There is indeed a global
+shared memory space somewhere in your hardware, but from the perspective of each
+CPU core it is *so very far away* and *so very slow*. Each CPU would rather work
+with its local cache of the data and only go through all the anguish of
+talking to shared memory when it doesn't actually have that memory in
+cache.
+
+After all, that's the whole point of the cache, right?
If every read from the +cache had to run back to shared memory to double check that it hadn't changed, +what would the point be? The end result is that the hardware doesn't guarantee +that events that occur in some order on *one* thread, occur in the same +order on *another* thread. To guarantee this, we must issue special instructions +to the CPU telling it to be a bit less smart. + +For instance, say we convince the compiler to emit this logic: + +```text +initial state: x = 0, y = 1 + +THREAD 1 THREAD 2 +y = 3; if x == 1 { +x = 1; y *= 2; + } +``` + +Ideally this program has 2 possible final states: + +* `y = 3`: (thread 2 did the check before thread 1 completed) +* `y = 6`: (thread 2 did the check after thread 1 completed) + +However there's a third potential state that the hardware enables: + +* `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) + +It's worth noting that different kinds of CPU provide different guarantees. It +is common to separate hardware into two categories: strongly-ordered and +weakly-ordered, where strongly-ordered hardware implements weak orderings like +`Relaxed` using strong orderings like `Acquire`, while weakly-ordered hardware +makes use of the optimization potential that weak orderings like `Relaxed` give. +Most notably, x86/64 provides strong ordering guarantees, while ARM provides +weak ordering guarantees. This has two consequences for concurrent programming: + +* Asking for stronger guarantees on strongly-ordered hardware may be cheap or + even free because they already provide strong guarantees unconditionally. + Weaker guarantees may only yield performance wins on weakly-ordered hardware. + +* Asking for guarantees that are too weak on strongly-ordered hardware is + more likely to *happen* to work, even though your program is strictly + incorrect. If possible, concurrent algorithms should be tested on + weakly-ordered hardware. 
+ +[C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf +[C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order diff --git a/src/atomics/fences.md b/src/atomics/fences.md new file mode 100644 index 00000000..6bc08c4f --- /dev/null +++ b/src/atomics/fences.md @@ -0,0 +1,257 @@ +# Fences + +As well as loads, stores, and RMWs, there is one more kind of atomic operation +to be aware of: fences. Fences can be triggered by the +[`core::sync::atomic::fence`] function, which accepts a single ordering +parameter and returns nothing. They don’t do anything on their own, but can be +thought of as events that strengthen the ordering of nearby atomic operations. + +## Acquire fences + +The most common kind of fence is an _acquire fence_, which can be triggered in +three different ways: +1. `atomic::fence(atomic::Ordering::Acquire)` +1. `atomic::fence(atomic::Ordering::AcqRel)` +1. `atomic::fence(atomic::Ordering::SeqCst)` + +An acquire fence retroactively makes every single non-`Acquire` operation that +was sequenced-before it act like an `Acquire` operation that occurred at the +fence — in other words, it causes every prior `Release`d value that was +previously loaded on the thread to synchronize-with the fence. 
For example, the +following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); + +// t_2 +let value = X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +Can result in two possible executions: + +```text + Possible Execution 1 ┃ Possible Execution 2 + ┃ + t_1 X t_2 ┃ t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ ┃ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ ┃ │ store ├─┐ │ 0 ├───┤ load │ +╰───────╯ │ └───┘ │ ╰───╥───╯ ┃ ╰───────╯ │ └───┘ ╰───╥───╯ + └─↘───┐ │ ╭───⇓───╮ ┃ └─↘───┐ ╭───⇓───╮ + │ 1 ├─┘┌→ fence │ ┃ │ 1 │ │ fence │ + └───┴──┘╰───────╯ ┃ └───┘ ╰───────╯ +``` + +In the first execution, `t_1`’s store synchronizes-with and therefore +happens-before `t_2`’s fence due to the prior load, but note that it does _not_ +happen-before `t_2`’s load. + +Acquire fences work on any number of atomics, and on release sequences too. A +more complex example is as follows: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); +static Y: AtomicU32 = AtomicU32::new(0); + +// t_1 +X.store(1, atomic::Ordering::Release); +X.fetch_add(1, atomic::Ordering::Relaxed); + +// t_2 +Y.store(1, atomic::Ordering::Release); + +// t_3 +let x = X.load(atomic::Ordering::Relaxed); +let y = Y.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +This can result in an execution like so: + +```text + t_1 X t_3 Y t_2 +╭───────╮ ┌───┐ ╭───────╮ ┌───┐ ╭───────╮ +│ store ├─┐ │ 0 │ ┌─┤ load │ │ 0 │ ┌─┤ store │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ ┌───↙─┘ +│ rmw ├─┐ │ 1 │ │ │ load ├───┤ 1 │ +╰───────╯ │ └─┬─┘ │ ╰───╥───╯ ┌─┴───┘ + └─┬─↓─┐ │ ╭───⇓───╮ │ + │ 2 ├─┘┌→ fence ←─┘ + └───┴──┘╰───────╯ +``` + +There are two common scenarios in which acquire fences are used: +1. When an `Acquire` ordering is only necessary when a specific value is loaded. 
+   For example, you may only wish to acquire when an `initialized` boolean is
+   `true`, since otherwise you won’t be reading the shared state at all. In
+   this case, you can load with a `Relaxed` ordering and then issue an
+   `Acquire` fence afterward only if that condition is met, which can aid in
+   performance sometimes (since the acquire operation is avoided when
+   `initialized == false`).
+2. When several `Acquire` operations on different locations need to be performed
+   in a row, but individually each operation doesn’t need `Acquire` ordering;
+   it is often faster to perform all the loads as `Relaxed` first and use a
+   single `Acquire` fence at the end than it is to make each one separately use
+   `Acquire`.
+
+## Release fences
+
+Release fences are the natural complement to acquire fences, and they similarly
+can be triggered in three different ways:
+1. `atomic::fence(atomic::Ordering::Release)`
+1. `atomic::fence(atomic::Ordering::AcqRel)`
+1. `atomic::fence(atomic::Ordering::SeqCst)`
+
+Release fences convert every subsequent atomic access in the same thread into a
+release operation that has its arrow starting from the fence — in other words,
+every `Acquire` operation that sees a value that was written by the fence’s
+thread after the release fence will synchronize-with the release fence. 

For +example, the following code: + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Acquire); +``` + +Can result in this execution: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─→ load │ +╰───╥───╯ │ └───┘ │ ╰───────╯ +╭───⇓───╮ └─↘───┐ │ +│ store ├───┤ 1 ├─┘ +╰───────╯ └───┘ +``` + +As well as it being possible for a release fence to synchronize-with an acquire +load (fence–atomic synchronization) and a release store to synchronize-with an +acquire fence (atomic–fence synchronization), it is also possible for release +fences to synchronize with acquire fences (fence–fence synchronization). In this +code snippet, only fences and `Relaxed` operations are used to establish a +happens-before relation (in some executions): + +```rust +# use std::sync::atomic::{self, AtomicU32}; +static X: AtomicU32 = AtomicU32::new(0); + +// t_1 +atomic::fence(atomic::Ordering::Release); +X.store(1, atomic::Ordering::Relaxed); + +// t_2 +X.load(atomic::Ordering::Relaxed); +atomic::fence(atomic::Ordering::Acquire); +``` + +The execution with the relation looks like this: + +```text + t_1 X t_2 +╭───────╮ ┌───┐ ╭───────╮ +│ fence ├─┐ │ 0 │ ┌─┤ load │ +╰───╥───╯ │ └───┘ │ ╰───╥───╯ +╭───⇓───╮ └─↘───┐ │ ╭───⇓───╮ +│ store ├───┤ 1 ├─┘┌→ fence │ +╰───────╯ └───┴──┘╰───────╯ +``` + +Like with acquire fences, release fences can be used to optimize over a series +of atomic stores that don’t individually need to be `Release`, since in some +conditions and on some architectures it’s faster to put a single release fence +at the start and use `Relaxed` from that point on than it is to use `Release` +every time. + +## `AcqRel` fences + +`AcqRel` fences are just the combined behaviour of an `Acquire` fence and a +`Release` fence in one operation. 
There isn’t much special to note about them,
+other than that they behave more like an acquire fence followed by a release
+fence than the other way around, which is useful to know in situations like the
+following:
+
+```text
+   t_1        X        t_2        Y        t_3
+╭───────╮   ┌───┐   ╭───────╮   ┌───┐   ╭───────╮
+│   A   │   │ 0 │ ┌─┤ load  │   │ 0 │ ┌─→ load  │
+╰───╥───╯   └───┘ │ ╰───╥───╯   └───┘ │ ╰───╥───╯
+╭───⇓───╮ ┌─↘───┐ │ ╭───⇓───╮┌──↘───┐ │ ╭───⇓───╮
+│ store ├─┘ │ 1 ├─┘┌→ fence ├┘┌─┤ 1 ├─┘ │   B   │
+╰───────╯   └───┴──┘╰───╥───╯ │ └───┘   ╰───────╯
+                    ╭───⇓───╮ │
+                    │ store ├─┘
+                    ╰───────╯
+```
+
+Here, A happens-before B, which is singularly due to the `AcqRel` fence’s
+ability to “carry over” happens-before relations within itself.
+
+## `SeqCst` fences
+
+`SeqCst` fences are the strongest kind of fence. They first of all inherit the
+behaviour from an `AcqRel` fence, meaning they have both acquire and release
+semantics at the same time, but being `SeqCst` operations they also participate
+in _S_. Just as with all other `SeqCst` operations, their placement in _S_ is
+primarily determined by strongly happens-before relations (including the
+[mixed-`SeqCst` caveat] that comes with it), which then gives additional
+guarantees to your code.
+
+Namely, the power of `SeqCst` fences can be summarized in three points:
+
+* Everything that happens-before a `SeqCst` fence is not coherence-ordered-after
+  any `SeqCst` operation that the fence precedes in _S_.
+* Everything that happens-after a `SeqCst` fence is not coherence-ordered-before
+  any `SeqCst` operation that the fence succeeds in _S_.
+* Everything that happens-before a `SeqCst` fence X is not
+  coherence-ordered-after anything that happens-after another `SeqCst` fence
+  Y, if X precedes Y in _S_. 
+ +> In C++11, the above three statements were similar, except they only talked +> about what was sequenced-before and sequenced-after the `SeqCst` fences; C++20 +> strengthened this to also include happens-before, because in practice this +> theoretical optimization was not being exploited by anybody. However do note +> that as of the time of writing, [Miri only implements the old, weaker +> semantics][miri scfix] and so you may see false positives when testing with +> it. + +The “motivating use-case” for `SeqCst` demonstrated in the `SeqCst` chapter can +also be rewritten to use exclusively `SeqCst` fences and `Relaxed` operations, +by inserting fences in between the operations in the two threads: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ *fence* │ │ true │ │ true │ │ *fence* │ +╰────╥────╯ └───────┘ └───────┘ ╰────╥────╯ +╭────⇓────╮ ╭────⇓────╮ +│ load Y ├─? ?─┤ load X │ +╰─────────╯ ╰─────────╯ +``` + +There are two executions to consider here, depending on which way round the +fences appear in _S_. Should `a`’s fence appear first, the fence–fence `SeqCst` +guarantee tells us that `b`’s load of `X` is not coherence-ordered-after `a`’s +store of `X`, which forbids `b`’s load of `X` from seeing the value `false`. The +same logic can be applied should the fences appear the other way around, proving +that at least one thread must load `true` in the end. 
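As a sketch (the thread structure and variable names here are our own, not part of the chapter), the fence-only version can be written out in Rust like so; whichever way round the fences land in _S_, at least one of the two `Relaxed` loads must return `true`:

```rust
use std::sync::atomic::{self, AtomicBool};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    // Only `Relaxed` accesses, with a `SeqCst` fence between the store and
    // the load on each thread.
    let a = thread::spawn(|| {
        X.store(true, atomic::Ordering::Relaxed);
        atomic::fence(atomic::Ordering::SeqCst);
        Y.load(atomic::Ordering::Relaxed)
    });
    let b = thread::spawn(|| {
        Y.store(true, atomic::Ordering::Relaxed);
        atomic::fence(atomic::Ordering::SeqCst);
        X.load(atomic::Ordering::Relaxed)
    });
    let a_loaded_y = a.join().unwrap();
    let b_loaded_x = b.join().unwrap();
    // The fence–fence `SeqCst` guarantee forbids both loads seeing `false`.
    assert!(a_loaded_y || b_loaded_x);
}
```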
+ +[`core::sync::atomic::fence`]: https://doc.rust-lang.org/stable/core/sync/atomic/fn.fence.html +[mixed-`SeqCst` caveat]: seqcst.md#the-mixed-seqcst-special-case +[miri scfix]: https://github.com/rust-lang/miri/issues/2301 diff --git a/src/atomics/multithread.md b/src/atomics/multithread.md new file mode 100644 index 00000000..4a4ce3d6 --- /dev/null +++ b/src/atomics/multithread.md @@ -0,0 +1,291 @@ +# Multithreaded Execution + +When you write Rust code to run on your computer, it may surprise you but you’re +not actually writing Rust code to run on your computer — instead, you’re writing +Rust code to run on the _abstract machine_ (or AM for short). The abstract +machine, to be contrasted with the physical machine, is an abstract +representation of a theoretical computer: it doesn’t actually exist _per se_, +but the combination of a compiler, target architecture and target operating +system is capable of emulating a subset of its possible behaviours. + +The Abstract Machine has a few properties that are essential to understand: +1. It is architecture and OS-independent. The Abstract Machine doesn’t care + whether you’re on x86_64 or iOS or a Nintendo 3DS, the rules are the same + for everyone. This enables you to write code without having to think about + what the underlying system does or how it does it, as long as you obey the + Abstract Machine’s rules you know you’ll be fine. +1. It is the lowest common denominator of all supported computer systems. This + means it is allowed to result in executions no sane computer would actually + generate in real life. It is also purposefully built with forward + compatibility in mind, giving compilers the opportunity to make better and + more aggressive optimizations in the future. 
As a result, it can be quite + hard to test code, especially if you’re on a system that exploits fewer of + the AM’s allowed semantics, so it is highly recommended to utilize tools + that intentionally produce these executions like [Loom] and [Miri]. +1. Its model is highly formalized and not representative of what goes on + underneath. Because C++ needs to be defined by a formal specification and + not just hand-wavy rules about “this is what is allowed and this is what + isn’t”, the Abstract Machine defines things in a very mathematical and, + well, _abstract_, way; instead of saying things like “the compiler is + allowed to do X” it will find a way to define the system such that the + compiler’s ability to do X simply follows as a natural consequence. This + makes it very elegant and keeps the mathematicians happy, but you should + keep in mind that this is not how computers actually function, it is merely + a representation of it. + +With that out of the way, let’s look into how the C++20 Abstract Machine is +actually defined. + +The first important thing to understand is that **the abstract machine has no +concept of time**. You might expect there to be a single global ordering of +events across the program where each happens at the same time or one after the +other, but under the abstract model no such ordering exists; instead, a possible +execution of the program must be treated as a single event that happens +instantaneously. There is never any such thing as “now”, or a “latest value”, +and using that terminology will only lead you to more confusion. Of course, in +reality there does exist a concept of time, but you must keep in mind that +you’re not programming for the hardware, you’re programming for the AM. + +However, while no global ordering of operations exists _between_ threads, there +does exist a single total ordering _within_ each thread, which is known as its +_sequence_. 
For example, given this simple Rust program: + +```rust +println!("A"); +println!("B"); +``` + +its sequence during one possible execution can be visualized like so: + +```text +╭───────────────╮ +│ println!("A") │ +╰───────╥───────╯ +╭───────⇓───────╮ +│ println!("B") │ +╰───────────────╯ +``` + +That double arrow in between the two boxes (`⇒`) represents that the second +statement is _sequenced-after_ the first (and similarly the first statement is +_sequenced-before_ the second). This is the strongest kind of ordering guarantee +between any two operations, and only comes about when those two operations +happen one after the other and on the same thread. + +If we add a second thread to the mix: + +```rust +// Thread 1: +println!("A"); +println!("B"); +// Thread 2: +eprintln!("01"); +eprintln!("02"); +``` + +it will simply coexist in parallel, with each thread getting its own independent +sequence: + +```text + Thread 1 Thread 2 +╭───────────────╮ ╭─────────────────╮ +│ println!("A") │ │ eprintln!("01") │ +╰───────╥───────╯ ╰────────╥────────╯ +╭───────⇓───────╮ ╭────────⇓────────╮ +│ println!("B") │ │ eprintln!("02") │ +╰───────────────╯ ╰─────────────────╯ +``` + +We can say that the prints of `A` and `B` are _unsequenced_ with regard to the +prints of `01` and `02` that occur in the second thread, since they have no +sequenced-before arrows connecting the boxes together. + +Note that these diagrams are **not** a representation of multiple things that +_could_ happen at runtime — instead, this diagram describes exactly what _did_ +happen when the program ran once. This distinction is key, because it highlights +that even the lowest-level representation of a program’s execution does not have +a global ordering between threads; those two disconnected chains are all there +is. + +Now let’s make things more interesting by introducing some shared data, and have +both threads read it. 
+
+```rust
+// Initial state
+let data = 0;
+// Thread 1:
+println!("{data}");
+// Thread 2:
+eprintln!("{data}");
+```
+
+Each memory location, similarly to threads, can be shown as another column on
+our diagram, but holding values instead of instructions, and each access (read
+or write) manifests as a line from the instruction that performed the access to
+the associated value in the column. So this code can produce (and is in fact
+guaranteed to produce) the following execution:
+
+```text
+Thread 1     data     Thread 2
+╭──────╮    ┌────┐    ╭──────╮
+│ data ├╌╌╌╌┤ 0  ├╌╌╌╌┤ data │
+╰──────╯    └────┘    ╰──────╯
+```
+
+That is, both threads read the same value of `0` from `data`, and the two
+operations are unsequenced — they have no relative ordering between them.
+
+That’s reads done, so we’ll look at the other kind of data access next: writes.
+We’ll also return to a single thread for now, just to keep things simple.
+
+```rust
+let mut data = 0;
+data = 1;
+```
+
+Here, we have a single variable that the main thread writes to once — this means
+that in its lifetime, it holds two values, first `0`, and then `1`.
+Diagrammatically, this code’s execution can be represented like so:
+
+```text
+ Thread 1        data
+╭───────╮       ┌────┐
+│  = 1  ├╌╌╌┐   │ 0  │
+╰───────╯   ├╌╌╌┼╌╌╌╌┤
+            └╌╌╌┼╌╌╌╌┤
+                │ 1  │
+                └────┘
+```
+
+Note the use of dashed padding in between the values of `data`’s column. Those
+spaces won’t ever contain a value, but they’re used to represent an
+unsynchronized (non-atomic) write — it is garbage data and attempting to read it
+would result in a data race.
+
+Now let’s put all of our knowledge thus far together, and make a program that
+both reads _and_ writes data — woah, scary!
+
+```rust
+let mut data = 0;
+data = 1;
+println!("{data}");
+data = 2;
+```
+
+Working out executions of code like this is rather like solving a Sudoku puzzle:
+you must first lay out all the facts that you know, and then fill in the blanks
+with logical reasoning. 
The initial information we’ve been given is both the +initial value of `data` and the sequential order of Thread 1; we also know that +over its lifetime, `data` takes on a total of three different values that were +caused by two different non-atomic writes. This allows us to start drawing out +some boxes: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌? │ 0 │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ data ├╌? │ ? │ +╰───╥───╯ ?╌┼╌╌╌╌┤ +╭───⇓───╮ ?╌┼╌╌╌╌┤ +│ = 2 ├╌? │ ? │ +╰───────╯ └────┘ +``` + +We know all of those lines need to be joined _somewhere_, but we don’t quite +know _where_ yet. This is where we need to bring in our first rule, a rule that +universally governs all accesses to every location in memory: + +> From the point at which the access occurs, find every other point that can be +> reached by following the reverse direction of arrows, then for each one of +> those, take a single step across every line that connects to the relevant +> memory location. **It is not allowed for the access to read or write any value +> that appears above any one of these points**. + +In our case, there are two potential executions: one, where the first write +corresponds to the first value in `data`, and two, where the first write +corresponds to the second value in `data`. Considering the second case for a +moment, it would also force the second write to correspond to the first +value in `data`. Therefore its diagram would look something like this: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌┐ │ 0 │ +╰───╥───╯ ┊ ┌╌╌┼╌╌╌╌┤ +╭───⇓───╮ ┊ ├╌╌┼╌╌╌╌┤ +│ data ├╌?┊ ┊ │ 2 │ +╰───╥───╯ ├╌┼╌╌┼╌╌╌╌┤ +╭───⇓───╮ └╌┼╌╌┼╌╌╌╌┤ +│ = 2 ├╌╌╌╌┘ │ 1 │ +╰───────╯ └────┘ +``` + +However, that second line breaks the rule we just established! 
Following up the +arrows from the third operation in Thread 1, we reach the first operation, and +from there we can take a single step to reach the space in between the `2` and +the `1`, which excludes the third access from writing any value above that point +— including the `2` that it is currently writing! + +So evidently, this execution is no good. We can therefore conclude that the only +possible execution of this program is the other one, in which the `1` appears +above the `2`: + +```text + Thread 1 data +╭───────╮ ┌────┐ +│ = 1 ├╌╌┐ │ 0 │ +╰───╥───╯ ├╌╌┼╌╌╌╌┤ +╭───⇓───╮ └╌╌┼╌╌╌╌┤ +│ data ├╌? │ 1 │ +╰───╥───╯ ┌╌╌┼╌╌╌╌┤ +╭───⇓───╮ ├╌╌┼╌╌╌╌┤ +│ = 2 ├╌╌┘ │ 2 │ +╰───────╯ └────┘ +``` + +Now to sort out the read operation in the middle. We can use the same rule as +before to trace up to the first write and rule out us reading either the `0` +value or the garbage that exists between it and `1`, but how do we choose +between the `1` and the `2`? Well, as it turns out there is a complement to the +rule we already defined which gives us the exact answer we need: + +> From the point at which the access occurs, find every other point that can be +> reached by following the _forward_ direction of arrows, then for each one of +> those, take a single step across every line that connects to the relevant +> memory location. **It is not allowed for the access to read or write any value +> that appears below any one of these points**. + +Using this rule, we can follow the arrow downwards and then across and finally +rule out `2` as well as the garbage before it. 
This leaves us with exactly _one_
+value that the read operation can return, and exactly one possible execution
+guaranteed by the Abstract Machine:
+
+```text
+ Thread 1      data
+╭───────╮     ┌────┐
+│  = 1  ├╌╌┐  │ 0  │
+╰───╥───╯  ├╌╌┼╌╌╌╌┤
+╭───⇓───╮  └╌╌┼╌╌╌╌┤
+│ data  ├╌╌╌╌╌┤ 1  │
+╰───╥───╯  ┌╌╌┼╌╌╌╌┤
+╭───⇓───╮  ├╌╌┼╌╌╌╌┤
+│  = 2  ├╌╌┘  │ 2  │
+╰───────╯     └────┘
+```
+
+These two rules combined make up the more generalized rule known as _coherence_,
+which is put in place to guarantee that a thread will never see a value earlier
+than the last one it read, or later than one it will write in the future.
+Coherence is basically required for any program to act in a sane way, so luckily
+the C++20 standard guarantees it as one of its most fundamental principles.
+
+You might be thinking that all this has been is the longest, most convoluted
+explanation ever of the most basic intuitive semantics of programming — and
+you’d be absolutely right. But it’s essential to grasp these fundamentals,
+because once you have this model in mind, the extension into multiple threads
+and the complicated semantics of real atomics becomes completely natural.
+
+[Loom]: https://docs.rs/loom
+[Miri]: https://github.com/rust-lang/miri
diff --git a/src/atomics/relaxed.md b/src/atomics/relaxed.md
new file mode 100644
index 00000000..1ee18193
--- /dev/null
+++ b/src/atomics/relaxed.md
@@ -0,0 +1,452 @@
+# Relaxed
+
+Now we’ve got single-threaded mutation semantics out of the way, we can try
+reintroducing a second thread. We’ll have one thread perform a write to the
+memory location, and a second thread read from it, like so:
+
+```rust
+// Initial state
+let mut data = 0;
+// Thread 1:
+data = 1;
+// Thread 2:
+println!("{data}");
+```
+
+Of course, any Rust programmer will immediately tell you that this code doesn’t
+compile, and indeed it definitely does not, and for good reason. But suspend
+your disbelief for a moment, and imagine what would happen if it did. 
Let’s draw +a diagram, leaving out the reading lines for now: + +```text +Thread 1 data Thread 2 +╭───────╮ ┌────┐ ╭───────╮ +│ = 1 ├╌┐ │ 0 │ ?╌┤ data │ +╰───────╯ ├╌┼╌╌╌╌┤ ╰───────╯ + └╌┼╌╌╌╌┤ + │ 1 │ + └────┘ +``` + +Unfortunately, coherence doesn’t help us in finding out where Thread 2’s line +joins up to, since there are no arrows connecting that operation to anything and +therefore we can’t immediately rule any values out. As a result, we end up +facing a situation we haven’t faced before: there is _more than one_ potential +value for Thread 2 to read. + +And this is where we encounter the big limitation with unsynchronized data +accesses: the price we pay for their speed and optimization capability is that +this situation is considered **Undefined Behavior**. For an unsynchronized read +to be acceptable, there has to be _exactly one_ potential value for it to read, +and when there are multiple like in this situation it is considered a data race. + +So what can we do about this? Well, two things need to be changed. First of all, +Thread 1 has to use an atomic store instead of an unsynchronized write, and +secondly Thread 2 has to use an atomic load instead of an unsynchronized read. +You’ll also notice that all the atomic functions accept one (and sometimes two) +parameters of `atomic::Ordering`s — we’ll explore the details of the differences +between them later, but for now we’ll use `Relaxed` because it is by far the +simplest of the lot. 
+
+```rust
+# use std::sync::atomic::{self, AtomicU32};
+// Initial state
+let data = AtomicU32::new(0);
+// Thread 1:
+data.store(1, atomic::Ordering::Relaxed);
+// Thread 2:
+data.load(atomic::Ordering::Relaxed);
+```
+
+The use of the atomic store provides one additional ability in comparison to an
+unsynchronized store, and that is that there is no “in-between” state between
+the old and new values — instead, it immediately updates, resulting in a diagram
+that looks a bit more like this:
+
+```text
+Thread 1     data
+╭───────╮   ┌────┐
+│  = 1  ├─┐ │ 0  │
+╰───────╯ │ └────┘
+          └─┬────┐
+            │ 1  │
+            └────┘
+```
+
+We have now established a _modification order_ for `data`: a total, ordered list
+of distinct, separated values that it takes over its lifetime.
+
+On the loading side, we also obtain one additional ability: when there are
+multiple possible values to choose from in the modification order, instead of it
+triggering UB, exactly one (but it is unspecified which) value is chosen. This
+means that there are now _two_ potential executions of our program, with no way
+for us to control which one occurs:
+
+```text
+     Possible Execution 1      ┃      Possible Execution 2
+                               ┃
+Thread 1     data    Thread 2  ┃ Thread 1     data    Thread 2
+╭───────╮   ┌────┐   ╭───────╮ ┃ ╭───────╮   ┌────┐   ╭───────╮
+│ store ├─┐ │ 0  ├───┤ load  │ ┃ │ store ├─┐ │ 0  │ ┌─┤ load  │
+╰───────╯ │ └────┘   ╰───────╯ ┃ ╰───────╯ │ └────┘ │ ╰───────╯
+          └─┬────┐             ┃           └─┬────┐ │
+            │ 1  │             ┃             │ 1  ├─┘
+            └────┘             ┃             └────┘
+```
+
+Note that **both sides must be atomic to avoid the data race**: if only the
+writing side used atomic operations, the reading side would still have multiple
+values to choose from (UB), and if only the reading side used atomic operations
+it could end up reading the garbage data “in-between” `0` and `1` (also UB).
+
+> **NOTE:** This description of why both sides need to be atomic
+> operations, while neat and intuitive, is not strictly correct: in reality the
+> answer is simply “because the spec says so”. 
However, it is functionally
+> equivalent to the real rules, so it can aid in understanding.
+
+## Read-modify-write operations
+
+Loads and stores are pretty neat in avoiding data races, but you can’t get very
+far with them. For example, suppose you wanted to implement a global shared
+counter that can be used to assign unique IDs to objects. Naïvely, you might try
+to write code like this:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> u64 {
+    let value = COUNTER.load(atomic::Ordering::Relaxed);
+    COUNTER.store(value + 1, atomic::Ordering::Relaxed);
+    value
+}
+```
+
+But then calling that function from multiple threads opens you up to an
+execution like the one below, which results in two threads obtaining the same
+ID (note that the duplication of `1` in the modification order is intentional;
+even if two values are the same, they always get separate entries in the order
+if they were caused by different accesses):
+
+```text
+Thread 1   COUNTER   Thread 2
+╭───────╮   ┌───┐   ╭───────╮
+│ load  ├───┤ 0 ├───┤ load  │
+╰───╥───╯   └───┘   ╰───╥───╯
+╭───⇓───╮ ┌─┬───┐   ╭───⇓───╮
+│ store ├─┘ │ 1 │ ┌─┤ store │
+╰───────╯   └───┘ │ ╰───────╯
+            ┌───┬─┘
+            │ 1 │
+            └───┘
+```
+
+This is known as a **race condition** — a logic error in a program caused by a
+specific unintended execution of concurrent code. Note that this is distinct
+from a _data race_: while a data race is caused by two threads performing
+unsynchronized operations at the same time and is always undefined behaviour,
+race conditions are totally OK and defined behaviour from the AM’s perspective,
+but are only harmful because the programmer didn’t expect it to be possible. You
+can think of the distinction between the two as analogous to the difference
+between indexing out-of-bounds and indexing in-bounds, but to the wrong element:
+both are bugs, but only one is universally a bug, and the other is merely a
+logic problem.
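The unlucky interleaving in the diagram above can even be replayed by hand on a single thread (the sequencing below is artificial, and exists purely to illustrate the logic error — there is no data race and no undefined behaviour anywhere in it):

```rust
use std::sync::atomic::{self, AtomicU64};

fn main() {
    let counter = AtomicU64::new(0);
    // Both “threads” load before either stores — the racy interleaving
    // from the diagram, performed sequentially.
    let a = counter.load(atomic::Ordering::Relaxed);
    let b = counter.load(atomic::Ordering::Relaxed);
    counter.store(a + 1, atomic::Ordering::Relaxed);
    counter.store(b + 1, atomic::Ordering::Relaxed);
    // Both calls handed out the same ID: a race condition, not a data race.
    assert_eq!(a, b);
}
```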
+ +Technically, I believe it is _possible_ to solve this problem with just loads +and stores, if you try hard enough and use several atomics. But luckily, you +don’t have to because there also exists another kind of operation, the +read-modify-write, which is specifically suited to this purpose. + +A read-modify-write operation (shortened to RMW) is a special kind of atomic +operation that reads, changes and writes back a value _in one step_. This means +that there are guaranteed to exist no other values in the modification order in +between the read and the write; it happens as a single operation. I would also +like to point out that this is true of **all** atomic orderings, since a common +misconception is that the `Relaxed` ordering somehow negates this guarantee. + +> Another common confusion about RMWs is that they are guaranteed to “see the +> latest value” of an atomic, which I believe came from a misinterpretation of +> the C++ specification and was later spread by rumour. Of course, this makes no +> sense, since atomics have no latest value due to the lack of the concept of +> time. The original statement in the specification was actually just specifying +> that atomic RMWs are atomic: they only consider the directly previous value in +> the modification order and not any value before it, and gave no additional +> guarantee. + +There are many different RMW operations to choose from, but the one most +appropriate for this use case is `fetch_add`, which adds a number to the atomic, +as well as returns the old value. So our code can be rewritten as this: + +```rust +# use std::sync::atomic::{self, AtomicU64}; +static COUNTER: AtomicU64 = AtomicU64::new(0); +pub fn get_id() -> u64 { + COUNTER.fetch_add(1, atomic::Ordering::Relaxed) +} +``` + +And then, no matter how many threads there are, that race condition from earlier +can never occur. 
Executions will have to look more like this:
+
+```text
+   Thread 1      COUNTER       Thread 2
+╭───────────╮     ┌───┐     ╭───────────╮
+│ fetch_add ├─┐   │ 0 │   ┌─┤ fetch_add │
+╰───────────╯ │   └───┘   │ ╰───────────╯
+              └─┬───┐     │
+                │ 1 │     │
+                └───┘     │
+                    ┌───┬─┘
+                    │ 2 │
+                    └───┘
+```
+
+There is one problem with this code, however: if `get_id()` is
+called over 18 446 744 073 709 551 615 times, the counter will overflow and it
+will start generating duplicate IDs. Of course, this won’t feasibly happen, but
+it can be problematic if you need to _prove_ that it can’t happen (e.g. for
+safety purposes) or you’re using a smaller integer type like `u32`.
+
+So we’re going to modify this function so that instead of returning a plain
+`u64` it returns an `Option<u64>`, where `None` is used to indicate that an
+overflow occurred and no more IDs could be generated. Additionally, it’s not
+enough to just return `None` once, because if there are multiple threads
+involved they will not see that result if it just occurs on a single thread —
+instead, it needs to continue to return `None` _until the end of time_ (or,
+well, this execution of the program).
+
+That means we have to do away with `fetch_add`, because `fetch_add` will always
+overflow and there’s no `checked_fetch_add` equivalent. We’ll return to our racy
+algorithm for a minute, this time thinking more about what went wrong. The steps
+look something like this:
+
+1. Load a value of the atomic
+1. Perform the checked add, propagating `None`
+1. Store the new value in the atomic
+
+The problem here is that the store does not necessarily occur directly after the
+load in the atomic’s modification order, and that leads to the races. What we
+need is some way to say, “add this new value to the modification order, but
+_only if_ it occurs directly after the value we loaded”. And luckily for us,
+there exists a function that does exactly\* this: `compare_exchange`.
+
+`compare_exchange` is a bit like a store, but instead of unconditionally storing
+the value, it will first check the value directly before the `compare_exchange`
+in the modification order to see whether it is what we expect, and if not it
+will simply tell us that and not make any changes. It is an RMW operation, so
+all of this happens fully atomically — there is no chance for a race condition.
+
+> \* It’s not quite the same, because `compare_exchange` can suffer from ABA
+> problems in which it will see a later value in the modification order that
+> just happened to be the same and succeed. For example, if the modification
+> order contained `1, 2, 1` and a thread loaded the first `1`,
+> `compare_exchange(1, 3)` could succeed in replacing either the first or second
+> `1`, giving either `1, 3, 2, 1` or `1, 2, 1, 3`.
+>
+> For some algorithms, this is problematic and needs to be taken into account
+> with additional checks; however for us, values can never be reused so we don’t
+> have to worry about it.
+
+In our case, we can simply replace the store with a compare exchange of the old
+value and itself plus one (returning `None` instead if the addition overflowed,
+to prevent overflowing the atomic). Should the `compare_exchange` fail, we know
+that some other thread inserted a value in the modification order after the
+value we loaded. This isn’t really a problem — we can just try again and again
+until we succeed, and `compare_exchange` is even nice enough to give us the
+updated value so we don’t have to load again. Also note that after we’ve updated
+our value of the atomic, we’re guaranteed to never see the old value again, by
+the coherence rules from the previous chapter.
+
+So here’s how it looks with these changes applied:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    // Load the counter’s initial value from some place in the modification
+    // order (it doesn’t matter where, because the compare exchange makes sure
+    // that our new value appears directly after it).
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        // Attempt to add one to the atomic.
+        let res = COUNTER.compare_exchange(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        );
+        // Check what happened…
+        match res {
+            // If there was no value in between the value we loaded and our
+            // newly written value in the modification order, the compare
+            // exchange succeeded and so we are done.
+            Ok(_) => break,
+
+            // Otherwise, there was a value in between and so we need to retry
+            // the addition and continue looping.
+            Err(updated_value) => value = updated_value,
+        }
+    }
+    Some(value)
+}
+```
+
+This `compare_exchange` loop enables the algorithm to succeed even under
+contention; it will simply try again (and again and again). In the below
+execution, Thread 1 gets raced to storing its value of `1` to the counter, but
+that’s okay because it will just add `1` to the `1`, making `2`, and retry the
+compare exchange with that, eventually resulting in a unique ID.
+
+```text
+Thread 1       COUNTER      Thread 2
+╭───────╮      ┌───┐       ╭───────╮
+│ load  ├──────┤ 0 ├───────┤ load  │
+╰───╥───╯      └───┘       ╰───╥───╯
+╭───⇓───╮    ┌───┬─┐       ╭───⇓───╮
+│  cas  ├────┤ 1 │ └───────┤  cas  │
+╰───╥───╯    └───┘         ╰───────╯
+╭───⇓───╮  ┌─┬───┐
+│  cas  ├──┘ │ 2 │
+╰───────╯    └───┘
+```
+
+> `compare_exchange` is abbreviated to CAS here (which stands for
+> compare-and-swap), since that is the more general name for the operation.
It
+> is not to be confused with `compare_and_swap`, a deprecated method on Rust
+> atomics that performs the same task as `compare_exchange` but has an inferior
+> design in some ways.
+
+There are two additional improvements we can make here. First, because our
+algorithm occurs in a loop, it is actually perfectly fine for the CAS to fail
+even when there wasn’t a value inserted in the modification order in between,
+since we’ll just run it again. This allows us to switch out our call to
+`compare_exchange` with a call to the weaker `compare_exchange_weak`, which,
+unlike the former function, is allowed to _spuriously_ (i.e. randomly, from the
+programmer’s perspective) fail. This often results in better performance on
+architectures like ARM, since their `compare_exchange` is really just a loop
+around the underlying `compare_exchange_weak`. x86\_64 however will see no
+difference in performance.
+
+The second improvement is that this pattern is so common that the standard
+library even provides a helper function for it, called `fetch_update`. It
+implements the boilerplate `load`-`loop`-`match` parts for us, so all we have to
+do is provide the closure that calls `checked_add(1)` and it will all just work.
+This leads us to our final code for this example:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    COUNTER.fetch_update(
+        atomic::Ordering::Relaxed,
+        atomic::Ordering::Relaxed,
+        |value| value.checked_add(1),
+    )
+    .ok()
+}
+```
+
+These CAS loops are the absolute bread and butter of concurrent programming;
+they’re absolutely everywhere and essential to know about. Every other RMW
+operation on atomics can be (and often is, if the hardware doesn’t have a more
+efficient implementation) implemented via a CAS loop. This is why CAS is seen
+as the canonical example of an RMW — it’s pretty much the most fundamental
+operation you can get on atomics.
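+
+For illustration, here’s the earlier loop with `compare_exchange_weak`
+substituted in — a sketch only, since `fetch_update` already does effectively
+this for us. A spurious failure simply takes the `Err` arm with an unchanged
+value and retries, so the result is the same:
+
+```rust
+# use std::sync::atomic::{self, AtomicU64};
+static COUNTER: AtomicU64 = AtomicU64::new(0);
+pub fn get_id() -> Option<u64> {
+    let mut value = COUNTER.load(atomic::Ordering::Relaxed);
+    loop {
+        match COUNTER.compare_exchange_weak(
+            value,
+            value.checked_add(1)?,
+            atomic::Ordering::Relaxed,
+            atomic::Ordering::Relaxed,
+        ) {
+            // Our increment made it in; `value` is our unique ID.
+            Ok(_) => return Some(value),
+            // Either another thread beat us to it, or the CAS failed
+            // spuriously — in both cases, retry with the current value.
+            Err(updated_value) => value = updated_value,
+        }
+    }
+}
+```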
+
+I’d also like to briefly bring attention to the atomic orderings used in this
+section. They were mostly glossed over, but we were exclusively using `Relaxed`,
+and that’s because for something as simple as a global ID counter, _you never
+need more than `Relaxed`_. The more complex cases which we’ll look at later
+definitely do need stronger orderings, but as a general rule, if:
+
+- you only have one atomic, and
+- you have no other related pieces of data
+
+`Relaxed` is more than sufficient.
+
+## “Out-of-thin-air” values
+
+One peculiar consequence of the semantics of `Relaxed` operations is that it is
+theoretically possible for values to come into existence “out-of-thin-air”
+(commonly abbreviated to OOTA) — that is, a value could appear despite not ever
+being calculated anywhere in code. In particular, consider this setup:
+
+```rust
+# use std::sync::atomic::{self, AtomicU32};
+let x = AtomicU32::new(0);
+let y = AtomicU32::new(0);
+
+// Thread 1:
+let r1 = y.load(atomic::Ordering::Relaxed);
+x.store(r1, atomic::Ordering::Relaxed);
+
+// Thread 2:
+let r2 = x.load(atomic::Ordering::Relaxed);
+y.store(r2, atomic::Ordering::Relaxed);
+```
+
+When starting to draw a diagram for a possible execution of this program, we
+have to first lay out the basic facts that we know:
+
+- `x` and `y` both start out as zero
+- Thread 1 performs a load of `y` followed by a store of `x`
+- Thread 2 performs a load of `x` followed by a store of `y`
+- Each of `x` and `y` takes on exactly two values in its lifetime
+
+Then we can start to construct boxes:
+
+```text
+Thread 1      x       y     Thread 2
+╭───────╮   ┌───┐   ┌───┐   ╭───────╮
+│ load  ├─┐ │ 0 │   │ 0 │ ┌─┤ load  │
+╰───╥───╯ │ └───┘   └───┘ │ ╰───╥───╯
+    ║     │   ?───────────┘     ║
+╭───⇓───╮ └───────────?     ╭───⇓───╮
+│ store ├───┬───┐   ┌───┬───┤ store │
+╰───────╯   │ ? │   │ ? │   ╰───────╯
+            └───┘   └───┘
+```
+
+At this point, if either of those lines were to connect to the higher box then
+the execution would be simple: that thread would forward the value to its lower
+box, which the other thread would then either read, or load the same value
+(zero) from the box above it, and we’d end up with zero in both atomics. But
+what if they were to connect downwards? Then we’d end up with an execution that
+looks like this:
+
+```text
+Thread 1      x       y     Thread 2
+╭───────╮   ┌───┐   ┌───┐   ╭───────╮
+│ load  ├─┐ │ 0 │   │ 0 │ ┌─┤ load  │
+╰───╥───╯ │ └───┘   └───┘ │ ╰───╥───╯
+    ║     │   ┌───────────┘     ║
+╭───⇓───╮ └───┼───────┐     ╭───⇓───╮
+│ store ├───┬─┴─┐   ┌─┴─┬───┤ store │
+╰───────╯   │ ? │   │ ? │   ╰───────╯
+            └───┘   └───┘
+```
+
+But hang on — it’s not fully resolved yet, we still haven’t put in a value in
+those lower question marks. So what value should it be? Well, the second value
+of `x` is just copied from the second value of `y`, so we just have to find the
+value of that — but the second value of `y` is itself copied from the second
+value of `x`! This means that we can actually put any value we like in that box,
+including `0` or `42`, and the logic will check out perfectly fine — meaning if
+this program were to execute in this fashion, it would end up reading a value
+produced out of thin air!
+
+Now, if we were to strictly follow the rules we’ve laid out thus far, then this
+would be a totally valid thing to happen. But luckily, the authors of the C++
+specification have recognized this as a problem, and as such refined the
+semantics of `Relaxed` to implement a thorough, logically sound, mathematically
+proven formal model that prevents it, that’s just too complex and technical to
+explain here—
+
+> No “out-of-thin-air” values can be computed that circularly depend on their
+> own computations.
+
+Just kidding. Turns out, it’s a *really* difficult problem to solve, and to my
+knowledge even now there is no known formal way to express how to prevent it.
So
+in the specification they just kind of hand-wave and say that it shouldn’t
+happen, and that the above program must always give zero in both atomics,
+despite the theoretical execution that could result in something else. Well, it
+generally works in practice so I can’t complain — it’s just a very interesting
+detail to know about.
diff --git a/src/atomics/seqcst.md b/src/atomics/seqcst.md
new file mode 100644
index 00000000..38e3d1ab
--- /dev/null
+++ b/src/atomics/seqcst.md
@@ -0,0 +1,432 @@
+# SeqCst
+
+`SeqCst` is probably the most interesting ordering, because it is simultaneously
+the simplest and most complex atomic memory ordering in existence. It’s
+simple, because if you only use `SeqCst` everywhere then you can kind of
+maybe pretend like the Abstract Machine has a concept of time; phrases like
+“latest value” make sense, the program can be thought of as a set of steps that
+interleave, there is a universal “now” and “before” and wouldn’t that be nice?
+But it’s also the most complex, because as soon as you look under the hood you
+realize just how incredibly convoluted and hard to follow the actual rules
+behind it are, and it gets really ugly really fast as soon as you try to mix it
+with any other ordering.
+
+To understand `SeqCst`, we first have to understand the problem it exists to
+solve. A simple example used to show where weaker orderings produce
+counterintuitive results is this:
+
+```rust
+# use std::sync::atomic::{self, AtomicBool};
+use std::thread;
+
+// Set this to Relaxed, Acquire, Release, AcqRel, doesn’t matter — the result is
+// the same (modulo panics caused by attempting acquire stores or release
+// loads).
+const ORDERING: atomic::Ordering = atomic::Ordering::Relaxed; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); + +# return; +// This assert is allowed to fail. +assert!(a || b); +``` + +The basic setup of this code, for all of its possible executions, looks like +this: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +In other words, `a` and `b` are guaranteed to store `true` into `X` and `Y` +respectively, and then attempt to load from the other thread’s atomic. The +question now is: is it possible for them _both_ to load `false`? + +And looking at this diagram, there’s absolutely no reason why not. There isn’t +even a single arrow connecting the left and right hand sides so far, so the +loads have no coherence-based restrictions on which values they are allowed to +pick, and we could end up with an execution like this: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├┐ │ false ├─┐┌┤ false │ ┌┤ store Y │ +╰────╥────╯│ └───────┘┌─┘└───────┘ │╰────╥────╯ + ║ │ ┌─────────┘└───────────┐│ ║ +╭────⇓────╮└─│┬───────┐ ┌───────┬─│┘╭────⇓────╮ +│ load Y ├──┘│ true │ │ true │ └─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +Which results in a failed assert. This execution is brought about because the +model of separate modification orders means that there is no relative ordering +between `X` and `Y` being changed, and so each thread is allowed to “see” either +order. 
However, some algorithms will require a globally agreed-upon ordering,
+and this is where `SeqCst` can come in useful.
+
+This ordering, first and foremost, inherits the guarantees from all the other
+orderings — it is an acquire operation for loads, a release operation for stores
+and an acquire-release operation for RMWs. In addition to this, it gives some
+guarantees unique to `SeqCst` about what values it is allowed to load. Note that
+these guarantees are not about preventing data races: unless you have some
+unrelated code that triggers a data race given an unexpected condition, using
+`SeqCst` can only protect you from race conditions, because its guarantees only
+apply to other `SeqCst` operations rather than all data accesses.
+
+## S
+
+`SeqCst` is fundamentally about _S_, which is the global ordering of all
+`SeqCst` operations in an execution of the program. It is consistent between
+every atomic and every thread, and all stores, fences and RMWs that use a
+sequentially consistent ordering have a place in it (but no other operations
+do). This is in contrast to modification orders, which are similarly total but
+only scoped to a single atomic rather than the whole program.
+
+Other than an edge case involving `SeqCst` mixed with weaker orderings (detailed
+later on), _S_ is primarily controlled by the happens-before relations in a
+program: this means that if an action _A_ happens-before an action _B_, it is
+also guaranteed to appear before _B_ in _S_. Other than that restriction, _S_ is
+unspecified and will be chosen arbitrarily during execution.
+
+Once a particular _S_ has been established, every atomic’s modification order is
+then guaranteed to be consistent with it, so a `SeqCst` load will never see a
+value that has been overwritten by a write that occurred before it in _S_, or a
+value that has been written by a write that occurred after it in _S_ (note that
+a `Relaxed`/`Acquire` load however might, since there is no “before” or “after”
+as it is not in _S_ in the first place).
+
+More formally, this guarantee can be described with _coherence orderings_, a
+relation which expresses which of two operations appears before the other in an
+atomic’s modification order. It is said that an operation _A_ is
+_coherence-ordered-before_ another operation _B_ if any of the following
+conditions are met:
+
+1. _A_ is a store or RMW, _B_ is a store or RMW, and _A_ appears before _B_ in
+   the modification order.
+1. _A_ is a store or RMW, _B_ is a load, and _B_ reads the value stored by _A_.
+1. _A_ is a load, _B_ is a store or RMW, and _A_ takes its value from a place in
+   the modification order that appears before _B_.
+1. _A_ is coherence-ordered-before a different operation _X_, and _X_ is
+   coherence-ordered-before _B_ (the basic transitivity property).
+ +The following diagram gives examples for the main three rules (in each case _A_ +is coherence-ordered-before _B_): + +```text + Rule 1 ┃ Rule 2 ┃ Rule 3 + ┃ ┃ +╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌─┬───┐ ╭───╮ ┃ ╭───╮ ┌───┐ ╭───╮ +│ A ├─┘ │ │ ┌─┤ B │ ┃ │ A ├─┘ │ ├───┤ B │ ┃ │ A ├───┤ │ ┌─┤ B │ +╰───╯ └───┘ │ ╰───╯ ┃ ╰───╯ └───┘ ╰───╯ ┃ ╰───╯ └───┘ │ ╰───╯ + ┌───┬─┘ ┃ ┃ ┌───┬─┘ + │ │ ┃ ┃ │ │ + └───┘ ┃ ┃ └───┘ +``` + +The only important thing to note is that for two loads of the same value in the +modification order, neither is coherence-ordered-before the other, as in the +following example where _A_ has no coherence ordering relation to _B_: + +```text +╭───╮ ┌───┐ ╭───╮ +│ A ├───┤ ├───┤ B │ +╰───╯ └───┘ ╰───╯ +``` + +Because of this, “_A_ is coherence-ordered-before _B_” is subtly different from +“_A_ is not coherence-ordered-after _B_”; only the latter phrase includes the +above situation, and is synonymous with “either _A_ is coherence-ordered-before +_B_ or _A_ and _B_ are both loads, and see the same value in the atomic’s +modification order”. “Not coherence-ordered-after” is generally a more useful +relation than “coherence-ordered-before”, and so it’s important to understand +what it means. + +With this terminology applied, we can use a more precise definition of +`SeqCst`’s guarantee: for two `SeqCst` operations on the same atomic _A_ and +_B_, where _A_ precedes _B_ in _S_, _A_ is not coherence-ordered-after _B_. +Effectively, this one rule ensures that _S_’s order “propagates” +throughout all the atomics of the program — you can imagine each operation in +_S_ as storing a snapshot of the world, so that every subsequent operation is +consistent with it. + +## Applying `SeqCst` + +So, looking back at our program, let’s consider how we could use `SeqCst` to +make that execution invalid. 
As a refresher, here’s the framework for every +possible execution of the program: + +```text + a static X static Y b +╭─────────╮ ┌───────┐ ┌───────┐ ╭─────────╮ +│ store X ├─┐ │ false │ │ false │ ┌─┤ store Y │ +╰────╥────╯ │ └───────┘ └───────┘ │ ╰────╥────╯ +╭────⇓────╮ └─┬───────┐ ┌───────┬─┘ ╭────⇓────╮ +│ load Y ├─? │ true │ │ true │ ?─┤ load X │ +╰─────────╯ └───────┘ └───────┘ ╰─────────╯ +``` + +First of all, both the final loads (`a` and `b`’s second operations) need to +become `SeqCst`, because they need to be aware of the total ordering that +determines whether `X` or `Y` becomes `true` first. And secondly, we need to +establish that ordering in the first place, and that needs to be done by making +sure that there is always one operation in _S_ that both sees one of the atomics +as `true` and precedes both final loads in _S_, so that the coherence ordering +guarantee will apply (the final loads themselves don’t work for this since +although they “know” that their corresponding atomic is `true` they don’t +interact with it directly so _S_ doesn’t care) — for this, we must set both +stores to use the `SeqCst` ordering. + +This leaves us with the correct version of the above program, which is +guaranteed to never panic: + +```rust +# use std::sync::atomic::{self, AtomicBool}; +use std::thread; + +const ORDERING: atomic::Ordering = atomic::Ordering::SeqCst; + +static X: AtomicBool = AtomicBool::new(false); +static Y: AtomicBool = AtomicBool::new(false); + +let a = thread::spawn(|| { X.store(true, ORDERING); Y.load(ORDERING) }); +let b = thread::spawn(|| { Y.store(true, ORDERING); X.load(ORDERING) }); + +let a = a.join().unwrap(); +let b = b.join().unwrap(); + +# return; +// This assert is **not** allowed to fail. 
+assert!(a || b); +``` + +As there are four `SeqCst` operations with a partial order between two pairs in +them (caused by the sequenced-before relation), there are six possible +executions of this program: + +- All of `a`’s operations precede `b`’s operations: + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (gives `false`) + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) +- All of `b`’s operations precede `a`’s operations: + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (gives `false`) + 1. `a` stores `true` into `X` + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `a`’s load precedes `b`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `a` loads `Y` (required to give `true`) + 1. `b` loads `X` (required to give `true`) +- The stores precede the loads, + `a`’s store precedes `b`’s and `b`’s load precedes `a`’s: + 1. `a` stores `true` to `X` + 1. `b` stores `true` into `Y` + 1. `b` loads `X` (required to give `true`) + 1. `a` loads `Y` (required to give `true`) +- The stores precede the loads, + `b`’s store precedes `a`’s and `a`’s load precedes `b`’s: + 1. `b` stores `true` into `Y` + 1. `a` stores `true` to `X` + 1. `a` loads `Y` (required to give `true`) + 1. `b` loads `X` (required to give `true`) +- The stores precede the loads, + `b`’s store precedes `a`’s and `b`’s load precedes `a`’s: + 1. `b` stores `true` into `Y` + 1. `a` stores `true` to `X` + 1. `b` loads `X` (required to give `true`) + 1. `a` loads `Y` (required to give `true`) + +All the places where the load was required to give `true` were caused by a +preceding store in _S_ of the same atomic of `true` — otherwise, the load would +be coherence-ordered-before a store which precedes it in _S_, and that is +impossible. 
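+
+As an informal sanity check (not a proof — a test that fails to observe a
+violation demonstrates nothing), the `SeqCst` version can be exercised
+repeatedly with fresh atomics each round; `thread::scope` is used here so the
+closures can borrow the locals:
+
+```rust
+use std::sync::atomic::{AtomicBool, Ordering::SeqCst};
+use std::thread;
+
+// One round of the store-buffering test with fresh atomics.
+fn run_round() -> (bool, bool) {
+    let x = AtomicBool::new(false);
+    let y = AtomicBool::new(false);
+    thread::scope(|s| {
+        let a = s.spawn(|| { x.store(true, SeqCst); y.load(SeqCst) });
+        let b = s.spawn(|| { y.store(true, SeqCst); x.load(SeqCst) });
+        (a.join().unwrap(), b.join().unwrap())
+    })
+}
+
+// With SeqCst everywhere, at least one load must see `true` every time.
+for _ in 0..1000 {
+    let (a, b) = run_round();
+    assert!(a || b);
+}
+```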
+
+## The mixed-`SeqCst` special case
+
+As I’ve been alluding to for a while, I wasn’t being totally truthful when I
+said that _S_ is consistent with happens-before relations — in reality, it is
+only consistent with _strongly happens-before_ relations, which are a subtly
+defined subset of happens-before relations. In particular, it excludes two
+situations:
+
+1. The `SeqCst` operation A synchronizes-with an `Acquire` or `AcqRel` operation
+   B which is sequenced-before another `SeqCst` operation C. Here, despite the
+   fact that A happens-before C, A does not _strongly_ happen-before C and so is
+   not guaranteed to precede C in _S_.
+2. The `SeqCst` operation A is sequenced-before the `Release` or `AcqRel`
+   operation B, which synchronizes-with another `SeqCst` operation C. Similarly,
+   despite the fact that A happens-before C, A might not precede C in _S_.
+
+The first situation is illustrated below, with `SeqCst` accesses represented
+with asterisks:
+
+```text
+ t_1        x       t_2
+╭─────╮  ┌─↘───┐  ╭─────╮
+│ *A* ├──┘ │ 1 ├───→  B  │
+╰─────╯    └───┘  ╰──╥──╯
+                  ╭──⇓──╮
+                  │ *C* │
+                  ╰─────╯
+```
+
+A happens-before, but does not strongly happen-before, C — and anything
+sequenced-after C will have the same treatment (unless more synchronization is
+used). This means that C is actually allowed to _precede_ A in _S_, despite
+conceptually occurring after it. However, anything sequenced-before A, because
+there is at least one sequence on either side of the synchronization, will
+strongly happen-before C.
+
+But this is all highly theoretical at the moment, so let’s make an example to
+show how that rule can actually affect the execution of code. So, if C were to
+precede A in _S_ (and they are not both loads) then that means C is always
+coherence-ordered-before A.
Let’s say then that C loads from `x` (the atomic +that A has to access), it may load the value that came before A if it were to +precede A in _S_: + +```text + t_1 x t_2 +╭─────╮ ┌───┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├─┐┌→ B │ +╰─────╯ │ └───┘ ││╰──╥──╯ + └─↘───┐┌─┘╭──⇓──╮ + │ 1 ├┘└─→ *C* │ + └───┘ ╰─────╯ +``` + +Ah wait no, that doesn’t work because regular coherence still mandates that `1` +is the only value that can be loaded. In fact, once `1` is loaded _S_’s required +consistency with coherence orderings means that A _is_ required to precede C in +_S_ after all. + +So somehow, to observe this difference we need to have a _different_ `SeqCst` +operation, let’s call it E, be the one that loads from `x`, where C is +guaranteed to precede it in _S_ (so we can observe the “weird” state in between +C and A) but C also doesn’t happen-before it (to avoid coherence getting in the +way) — and to do that, all we have to do is have C appear before a `SeqCst` +operation D in the modification order of another atomic, but have D be a store +so as to avoid C synchronizing with it, and then our desired load E can simply +be sequenced-after D (this will carry over the “precedes in _S_” guarantee, but +does not restore the happens-after relation to C since that was already dropped +by having D be a store). + +In diagram form, that looks like this: + +```text + t_1 x t_2 helper t_3 +╭─────╮ ┌───┐ ╭─────╮ ┌─────┐ ╭─────╮ +│ *A* ├─┐ │ 0 ├┐┌─→ B │ ┌─┤ 0 │ ┌─┤ *D* │ +╰─────╯ │ └───┘││ ╰──╥──╯ │ └─────┘ │ ╰──╥──╯ + │ └│────║────│─────────│┐ ║ + └─↘───┐ │ ╭──⇓──╮ │ ┌─────↙─┘│╭──⇓──╮ + │ 1 ├─┘ │ *C* ←─┘ │ 1 │ └→ *E* │ + └───┘ ╰─────╯ └─────┘ ╰─────╯ + +S = C → D → E → A +``` + +C is guaranteed to precede D in _S_, and D is guaranteed to precede E, but +because this exception means that A is _not_ guaranteed to precede C, it is +totally possible for it to come at the end, resulting in the surprising but +totally valid outcome of E loading `0` from `x`. 
In code, this can be expressed +as the following code _not_ being guaranteed to panic: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Acquire, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_1 +X.store(1, SeqCst); // A + +// thread_2 +assert_eq!(X.load(Acquire), 1); // B +assert_eq!(HELPER.load(SeqCst), 0); // C + +// thread_3 +HELPER.store(1, SeqCst); // D +assert_eq!(X.load(SeqCst), 0); // E +``` + +The second situation listed above has very similar consequences. Its abstract +form is the following execution in which A is not guaranteed to precede C in +_S_, despite A happening-before C: + +```text + t_1 x t_2 +╭─────╮ ┌─↘───┐ ╭─────╮ +│ *A* │ │ │ 0 ├───→ *C* │ +╰──╥──╯ │ └───┘ ╰─────╯ +╭──⇓──╮ │ +│ B ├─┘ +╰─────╯ +``` + +Similarly to before, we can’t just have A access `x` to show why A not +necessarily preceding C in _S_ matters; instead, we have to introduce a second +atomic and third thread to break the happens-before chain first. And finally, a +single relaxed load F at the end is added just to prove that the weird execution +actually happened (leaving `x` as 2 instead of 1). + +```text + t_3 helper t_1 x t_2 +╭─────╮ ┌─────┐ ╭─────╮ ┌───┐ ╭─────╮ +│ *D* ├┐┌─┤ 0 │ ┌─┤ *A* │ │ 0 │ ┌─→ *C* │ +╰──╥──╯││ └─────┘ │ ╰──╥──╯ └───┘ │ ╰──╥──╯ + ║ └│─────────│────║─────┐ │ ║ +╭──⇓──╮ │ ┌─────↙─┘ ╭──⇓──╮ ┌─↘───┐ │ ╭──⇓──╮ +│ *E* ←─┘ │ 1 │ │ B ├─┘││ 1 ├─┘┌┤ F │ +╰─────╯ └─────┘ ╰─────╯ │└───┘ │╰─────╯ + └↘───┐ │ + │ 2 ├──┘ + └───┘ +S = C → D → E → A +``` + +This execution mandates both C preceding A in _S_ and A happening-before C, +something that is only possible through these two mixed-`SeqCst` special +exceptions. 
It can be expressed in code as well: + +```rust +# use std::sync::atomic::{AtomicU8, Ordering::{Release, Relaxed, SeqCst}}; +# return; +static X: AtomicU8 = AtomicU8::new(0); +static HELPER: AtomicU8 = AtomicU8::new(0); + +// thread_3 +X.store(2, SeqCst); // D +assert_eq!(HELPER.load(SeqCst), 0); // E + +// thread_1 +HELPER.store(1, SeqCst); // A +X.store(1, Release); // B + +// thread_2 +assert_eq!(X.load(SeqCst), 1); // C +assert_eq!(X.load(Relaxed), 2); // F +``` + +If this seems ridiculously specific and obscure, that’s because it is. +Originally, back in C++11, this special case didn’t exist — but then six years +later it was discovered that in practice atomics on Power, Nvidia GPUs and +sometimes ARMv7 _would_ have this special case, and fixing the implementations +would make atomics significantly slower. So instead, in C++20 they simply +encoded it into the specification. + +Generally however, this rule is so complex it’s best to just avoid it entirely +by never mixing `SeqCst` and non-`SeqCst` on a single atomic in the first place.