diff --git a/blog/feed.rss b/blog/feed.rss index c8934a2..0cf0cf4 100644 --- a/blog/feed.rss +++ b/blog/feed.rss @@ -7,11 +7,28 @@ Alisa Sireneva, CC BY me@purplesyringa.moe (Alisa Sireneva) me@purplesyringa.moe (Alisa Sireneva) - Thu, 12 Dec 2024 14:50:16 GMT + Wed, 18 Dec 2024 21:58:51 GMT https://www.rssboard.org/rss-specification 60 + + The RAM myth + https://purplesyringa.moe/blog/./the-ram-myth/ + The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster; if it doesn’t, there’s nothing we can do. +Most likely, you believe that code like this is the fastest way to shard data: +groups = [[] for _ in range(n_groups)] +for element in elements: +groups[element.group].append(element) + +Indeed, it’s linear (i.e. asymptotically optimal), and we have to access random indices anyway, so cache isn’t going to help us in any case. +In reality, this is leaving a lot of performance on the table, and certain asymptotically slower algorithms can perform sharding significantly faster on large input. They are mostly used by on-disk databases, but, surprisingly, they are useful even for in-RAM data. + me@purplesyringa.moe (Alisa Sireneva) + + https://purplesyringa.moe/blog/./the-ram-myth/ + Thu, 19 Dec 2024 00:00:00 GMT + + Thoughts on Rust hashing https://purplesyringa.moe/blog/./thoughts-on-rust-hashing/ diff --git a/blog/index.html b/blog/index.html index b872334..a2cac68 100644 --- a/blog/index.html +++ b/blog/index.html @@ -1,4 +1,7 @@ -purplesyringa's blog

Subscribe to RSS

Thoughts on Rust hashing

Reddit IRLO

In languages like Python, Java, or C++, values are hashed by calling a “hash me” method on them, implemented by the type author. This fixed-size hash is then immediately used by the hash table or what have you. This design suffers from some obvious problems, like:

How do you hash an integer? If you use a no-op hasher (booo), DoS attacks on hash tables are inevitable. If you hash it thoroughly, consumers that only cache hashes to optimize equality checks lose out on performance.

Keep reading

Any Python program fits in 24 characters*

* If you don’t take whitespace into account.

My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that’s exactly what we’ll abuse to encode any Python program in 24 bytes, ignoring whitespace.

Keep reading

The Rust Trademark Policy is still harmful

Reddit

Four days ago, the Rust Foundation released a new draft of the Rust Language Trademark Policy. The previous draft caused division within the community several years ago, prompting its retraction with the aim of creating a new, milder version.

Well, that failed. While certain issues were addressed (thank you, we appreciate it!), the new version remains excessively restrictive and, in my opinion, will harm both the Rust community as a whole and compiler and crate developers. While I expect the stricter rules to not be enforced in practice, I don’t want to constantly feel like I’m under threat while contributing to the Rust ecosystem, and this is exactly what it would feel like if this draft is finalized.

Below are some of my core objections to the draft.

Keep reading

Bringing faster exceptions to Rust

Reddit

Three months ago, I wrote about why you might want to use panics for error handling. Even though it’s a catchy title, panics are hardly suited for this goal, even if you try to hack around with macros and libraries. The real star is the unwinding mechanism, which powers panics. This post is the first in a series exploring what unwinding is, how to speed it up, and how it can benefit Rust and C++ programmers.

Keep reading

We built the best "Bad Apple!!" in Minecraft

Hacker News

Demoscene is the art of pushing computers to perform tasks they weren’t designed to handle. One recurring theme in demoscene is the shadow-art animation “Bad Apple!!”. We’ve played it on the Commodore 64, Vectrex (a unique game console utilizing only vector graphics), Impulse Tracker, and even exploited Super Mario Bros. to play it.

But how about Bad Apple!!.. in Minecraft?

Keep reading

Minecraft compares arrays in cubic time

Telegram

Collisions in games are detected with heavyweight algorithms. As an example, try to imagine how complicated this is for just two arbitrarily rotated cubes in space. They can touch along two edges, with a vertex against a face, or in some even more intricate way.

In Minecraft, all hitbox geometry is axis-aligned, i.e. nothing is ever tilted. This greatly simplifies collision detection.

I would write this simply. Since a block’s hitbox is a union of several axis-aligned boxes, you can store it exactly like that: as a list of 6-element tuples. In the vast majority of cases this list is very short. For ordinary cubes its length is 1, for glass panes it can reach 2, the anvil, good heavens, consists of 3 elements, and walls can have as many as 4. To check two hitboxes for intersection, it’s enough to go over the pairs of boxes from the two hitboxes (at most 16 pairs, it seems). For axis-aligned boxes, the test is trivial.

But I’m not the one who wrote Minecraft JE, so its implementation is different.

Keep reading

WebP: The WebPage compression format

Hacker News Reddit Lobsters Russian

I want to provide a smooth experience to my site visitors, so I work on accessibility and ensure it works without JavaScript enabled. I care about page load time because some pages contain large illustrations, so I minify my HTML.

But one thing makes turning my blog light as a feather a pain in the ass.

Keep reading

Division is hard, but it doesn't have to be

Reddit

Developers don’t usually divide numbers all the time, but hashmaps often need to compute remainders modulo a prime. Hashmaps are really common, so fast division is useful.

For instance, rolling hashes might compute u128 % u64 with a fixed divisor. Compilers just drop the ball here:

fn modulo(n: u128) -> u64 {
+purplesyringa's blog

Subscribe to RSS

The RAM myth

The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster; if it doesn’t, there’s nothing we can do.

Most likely, you believe that code like this is the fastest way to shard data:

groups = [[] for _ in range(n_groups)]
+for element in elements:
+    groups[element.group].append(element)
+

Indeed, it’s linear (i.e. asymptotically optimal), and we have to access random indices anyway, so cache isn’t going to help us in any case.

In reality, this is leaving a lot of performance on the table, and certain asymptotically slower algorithms can perform sharding significantly faster on large input. They are mostly used by on-disk databases, but, surprisingly, they are useful even for in-RAM data.

Keep reading

Thoughts on Rust hashing

Reddit IRLO

In languages like Python, Java, or C++, values are hashed by calling a “hash me” method on them, implemented by the type author. This fixed-size hash is then immediately used by the hash table or what have you. This design suffers from some obvious problems, like:

How do you hash an integer? If you use a no-op hasher (booo), DoS attacks on hash tables are inevitable. If you hash it thoroughly, consumers that only cache hashes to optimize equality checks lose out on performance.
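
To make the pattern concrete, here is what the “hash me” method looks like in Python (an illustration, not code from the post); note that CPython’s built-in hash of a small integer is essentially a no-op, e.g. hash(5) == 5:

class UserId:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # The type author decides the hash; the hash table uses it as-is.
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, UserId) and self.value == other.value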

Keep reading

Any Python program fits in 24 characters*

* If you don’t take whitespace into account.

My friend challenged me to find the shortest solution to a certain Leetcode-style problem in Python. They were generous enough to let me use whitespace for free, so that the code stays readable. So that’s exactly what we’ll abuse to encode any Python program in 24 bytes, ignoring whitespace.

Keep reading

The Rust Trademark Policy is still harmful

Reddit

Four days ago, the Rust Foundation released a new draft of the Rust Language Trademark Policy. The previous draft caused division within the community several years ago, prompting its retraction with the aim of creating a new, milder version.

Well, that failed. While certain issues were addressed (thank you, we appreciate it!), the new version remains excessively restrictive and, in my opinion, will harm both the Rust community as a whole and compiler and crate developers. While I expect the stricter rules to not be enforced in practice, I don’t want to constantly feel like I’m under threat while contributing to the Rust ecosystem, and this is exactly what it would feel like if this draft is finalized.

Below are some of my core objections to the draft.

Keep reading

Bringing faster exceptions to Rust

Reddit

Three months ago, I wrote about why you might want to use panics for error handling. Even though it’s a catchy title, panics are hardly suited for this goal, even if you try to hack around with macros and libraries. The real star is the unwinding mechanism, which powers panics. This post is the first in a series exploring what unwinding is, how to speed it up, and how it can benefit Rust and C++ programmers.

Keep reading

We built the best "Bad Apple!!" in Minecraft

Hacker News

Demoscene is the art of pushing computers to perform tasks they weren’t designed to handle. One recurring theme in demoscene is the shadow-art animation “Bad Apple!!”. We’ve played it on the Commodore 64, Vectrex (a unique game console utilizing only vector graphics), Impulse Tracker, and even exploited Super Mario Bros. to play it.

But how about Bad Apple!!.. in Minecraft?

Keep reading

Minecraft compares arrays in cubic time

Telegram

Collisions in games are detected with heavyweight algorithms. As an example, try to imagine how complicated this is for just two arbitrarily rotated cubes in space. They can touch along two edges, with a vertex against a face, or in some even more intricate way.

In Minecraft, all hitbox geometry is axis-aligned, i.e. nothing is ever tilted. This greatly simplifies collision detection.

I would write this simply. Since a block’s hitbox is a union of several axis-aligned boxes, you can store it exactly like that: as a list of 6-element tuples. In the vast majority of cases this list is very short. For ordinary cubes its length is 1, for glass panes it can reach 2, the anvil, good heavens, consists of 3 elements, and walls can have as many as 4. To check two hitboxes for intersection, it’s enough to go over the pairs of boxes from the two hitboxes (at most 16 pairs, it seems). For axis-aligned boxes, the test is trivial.
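
A minimal sketch of that check (illustrative Python, not Minecraft’s actual code), with each box stored as an (x1, y1, z1, x2, y2, z2) tuple:

def boxes_intersect(a, b):
    # Axis-aligned boxes overlap iff they overlap along every axis.
    ax1, ay1, az1, ax2, ay2, az2 = a
    bx1, by1, bz1, bx2, by2, bz2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2 and az1 < bz2 and bz1 < az2

def hitboxes_collide(hitbox_a, hitbox_b):
    # A hitbox is a short list of boxes; check every pair (at most 16 of them).
    return any(boxes_intersect(a, b) for a in hitbox_a for b in hitbox_b)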

But I’m not the one who wrote Minecraft JE, so its implementation is different.

Keep reading

WebP: The WebPage compression format

Hacker News Reddit Lobsters Russian

I want to provide a smooth experience to my site visitors, so I work on accessibility and ensure it works without JavaScript enabled. I care about page load time because some pages contain large illustrations, so I minify my HTML.

But one thing makes turning my blog light as a feather a pain in the ass.

Keep reading

Division is hard, but it doesn't have to be

Reddit

Developers don’t usually divide numbers all the time, but hashmaps often need to compute remainders modulo a prime. Hashmaps are really common, so fast division is useful.

For instance, rolling hashes might compute u128 % u64 with a fixed divisor. Compilers just drop the ball here:

fn modulo(n: u128) -> u64 {
     (n % 0xffffffffffffffc5) as u64
 }
 
modulo:
diff --git a/blog/the-ram-myth/benchmark.rs b/blog/the-ram-myth/benchmark.rs
new file mode 100644
index 0000000..9d57647
--- /dev/null
+++ b/blog/the-ram-myth/benchmark.rs
@@ -0,0 +1,163 @@
+use core::mem::MaybeUninit;
+use criterion::{
+    BenchmarkId, Criterion, SamplingMode, Throughput, {criterion_group, criterion_main},
+};
+use fixed_slice_vec::FixedSliceVec;
+use std::time::Duration;
+use wyrand::WyRand;
+
+// const CUTOFF: usize = 50_000;
+const CUTOFF: usize = 200_000;
+// const CUTOFF: usize = 1_000_000;
+
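+// The straightforward baseline: count group sizes, compute prefix sums, then scatter all
+// elements into a single buffer and hand each contiguous group to the callback.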
+#[inline(never)]
+fn fallback(
+    elements: impl Iterator<Item = u64> + Clone,
+    elements_len: usize,
+    key: &mut impl FnMut(u64) -> usize,
+    key_bitness: u32,
+    callback: &mut impl FnMut(&mut dyn Iterator<Item = u64>),
+) {
+    let n_groups = 1 << key_bitness;
+
+    let mut counts: Vec<usize> = vec![0; n_groups];
+    for element in elements.clone() {
+        counts[key(element) & (n_groups - 1)] += 1;
+    }
+
+    let mut group_ptrs: Vec<usize> = vec![0; n_groups];
+    for i in 1..n_groups {
+        group_ptrs[i] = group_ptrs[i - 1] + counts[i - 1];
+    }
+
+    let mut buffer = vec![MaybeUninit::uninit(); elements_len];
+    for element in elements {
+        let group_ptr = &mut group_ptrs[key(element) & ((1 << key_bitness) - 1)];
+        buffer[*group_ptr].write(element);
+        *group_ptr += 1;
+    }
+
+    let mut end_ptr = 0;
+    for i in 0..n_groups {
+        let start_ptr = end_ptr;
+        end_ptr += counts[i];
+        if counts[i] > 0 {
+            assert_eq!(end_ptr, group_ptrs[i]); // safety check for initialization!
+            let group = &buffer[start_ptr..end_ptr];
+            let group = unsafe { &*(group as *const [MaybeUninit<u64>] as *const [u64]) };
+            callback(&mut group.iter().copied());
+        }
+    }
+}
+
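+// Per-group storage: a fixed reserved chunk plus an overflow vector for elements that don't fit.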
+struct Bucket<'buffer, T> {
+    reserved: FixedSliceVec<'buffer, T>,
+    overflow: Vec<T>,
+}
+
+impl<'buffer, T> Bucket<'buffer, T> {
+    fn new(reserved: FixedSliceVec<'buffer, T>) -> Self {
+        Self {
+            reserved,
+            overflow: Vec::new(),
+        }
+    }
+
+    fn push(&mut self, element: T) {
+        if let Err(element) = self.reserved.try_push(element) {
+            self.overflow.push(element.0);
+        }
+    }
+
+    fn len(&self) -> usize {
+        self.reserved.len() + self.overflow.len()
+    }
+
+    fn iter(&self) -> core::iter::Chain<core::slice::Iter<'_, T>, core::slice::Iter<'_, T>> {
+        self.reserved.iter().chain(self.overflow.iter())
+    }
+}
+
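+// Recursive MSD radix grouping: split by the top `BITS` of the remaining key, then recurse
+// into each bucket, switching to the straightforward algorithm below the cutoff.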
+pub fn radix_sort(
+    elements: impl Iterator<Item = u64> + Clone,
+    elements_len: usize,
+    key: &mut impl FnMut(u64) -> usize,
+    key_bitness: u32,
+    callback: &mut impl FnMut(&mut dyn Iterator<Item = u64>),
+) {
+    // The step at which `key` is consumed. `2 ** BITS` buckets are allocated.
+    const BITS: u32 = 8;
+
+    if elements_len <= CUTOFF || key_bitness <= BITS {
+        fallback(elements, elements_len, key, key_bitness, callback);
+        return;
+    }
+
+    let shift = key_bitness - BITS;
+
+    let reserved_capacity = (elements_len >> BITS).max(1); // 0 breaks `chunks_mut`
+
+    // Partitioning a single allocation is more efficient than allocating multiple times
+    let mut buffer = vec![MaybeUninit::uninit(); reserved_capacity << BITS];
+    let mut reserved = buffer.chunks_mut(reserved_capacity);
+    let mut buckets: [Bucket<'_, u64>; 1 << BITS] = core::array::from_fn(|_| {
+        Bucket::new(FixedSliceVec::new(reserved.next().unwrap_or(&mut [])))
+    });
+
+    for element in elements {
+        buckets[(key(element) >> shift) & ((1 << BITS) - 1)].push(element);
+    }
+
+    for bucket in buckets {
+        radix_sort(
+            bucket.iter().copied(),
+            bucket.len(),
+            key,
+            key_bitness - BITS,
+            callback,
+        );
+    }
+}
+
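+// Runs a grouping implementation with a multiplicative-hash key and sums per-group minimums.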
+macro_rules! run {
+    ($fn:ident, $input:expr, $n:expr, $m:expr) => {{
+        let mut total = 0;
+        $fn(
+            $input,
+            $n,
+            &mut |element| (element.wrapping_mul(0x9a08c0ebcf5bc11b) >> (64 - $m)) as usize,
+            $m,
+            &mut |group| {
+                total += group.min().unwrap();
+            },
+        );
+        total
+    }};
+}
+
+fn bench_grouping(c: &mut Criterion) {
+    let mut group = c.benchmark_group("grouping");
+    group
+        .warm_up_time(Duration::from_secs(1))
+        .measurement_time(Duration::from_secs(1))
+        .sampling_mode(SamplingMode::Flat);
+    for shift in 0..10 {
+        let n = 80000usize << shift;
+        let m = 13 + shift;
+
+        let mut rng = WyRand::new(0x9a08c0ebcf5bc11b);
+        let input = (0..n).map(move |_| rng.rand());
+
+        group.throughput(Throughput::Elements(n as u64));
+        group.bench_with_input(BenchmarkId::new("old", n), &m, |b, &m| {
+            b.iter(|| run!(fallback, input.clone(), n, m));
+        });
+        group.bench_with_input(BenchmarkId::new("new", n), &m, |b, &m| {
+            b.iter(|| run!(radix_sort, input.clone(), n, m));
+        });
+    }
+    group.finish();
+}
+
+criterion_group!(benches, bench_grouping);
+criterion_main!(benches);
diff --git a/blog/the-ram-myth/benchmark.svg b/blog/the-ram-myth/benchmark.svg
new file mode 100644
index 0000000..42900f9
--- /dev/null
+++ b/blog/the-ram-myth/benchmark.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/blog/the-ram-myth/improvement.svg b/blog/the-ram-myth/improvement.svg
new file mode 100644
index 0000000..d9bc175
--- /dev/null
+++ b/blog/the-ram-myth/improvement.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/blog/the-ram-myth/index.html b/blog/the-ram-myth/index.html
new file mode 100644
index 0000000..c3f0a37
--- /dev/null
+++ b/blog/the-ram-myth/index.html
@@ -0,0 +1,95 @@
+The RAM myth | purplesyringa's blog

The RAM myth

The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster; if it doesn’t, there’s nothing we can do.

Most likely, you believe that code like this is the fastest way to shard data:

groups = [[] for _ in range(n_groups)]
+for element in elements:
+    groups[element.group].append(element)
+

Indeed, it’s linear (i.e. asymptotically optimal), and we have to access random indices anyway, so cache isn’t going to help us in any case.

In reality, this is leaving a lot of performance on the table, and certain asymptotically slower algorithms can perform sharding significantly faster on large input. They are mostly used by on-disk databases, but, surprisingly, they are useful even for in-RAM data.

Solution

The algorithm from above has Θ(n) cache misses on random input. The only way to reduce this number is to make the memory accesses more ordered. If you can ensure the elements are ordered by group, that’s great. If you can’t, you can still sort the accesses before the for loop:

elements.sort(key = lambda element: element.group)
+

Sorting costs some time, but in return it removes cache misses from the for loop entirely. If the data is large enough that it doesn’t fit in cache, this is a net win. As a bonus, creating individual lists can be replaced with a group-by operation:

elements.sort(key = lambda element: element.group)
+groups = [
+    group_elements
+    for _, group_elements
+    in itertools.groupby(elements, key = lambda element: element.group)
+]
+

There are many cache-aware sorting algorithms, but as indices are just integers, radix sort works best here. Among off-the-shelf implementations, radsort worked best for me in Rust.

Speedups

This is already better than the straightforward algorithm on large data, but there are many tricks to make it faster.

Generators

Sorting APIs try to make it seem like the data is sorted in-place, even when that’s not the case. This requires sorted data to be explicitly written to memory in a particular format. But if we only need to iterate over groups, generators or callbacks help avoid this:

# Assuming 32-bit indices
+def radix_sort(elements, bits = 32):
+    # Base case -- nothing to sort or all indices are equal
+    if len(elements) <= 1 or bits <= 0:
+        yield from elements
+        return
+
+    # Split by most significant byte of index we haven't seen yet
+    buckets = [[] for _ in range(256)]
+    for element in elements:
+        buckets[(element.index >> max(0, bits - 8)) & 0xff].append(element)
+
+    # Sort buckets recursively
+    for bucket in buckets:
+        yield from radix_sort(bucket, bits - 8)
+

We can even remove the groupby step by yielding individual groups:

# Base case -- nothing to sort or all indices are equal
+if bits <= 0:
+    if elements:
+        # Group!
+        yield elements
+    return
+

Reallocs

The next problem with this code is constantly reallocating the bucket arrays on append. This invokes memcpy more often than necessary and is bad for cache. A common fix is to compute sizes beforehand:

def get_bucket(element):
+    return (element.index >> max(0, bits - 8)) & 0xff
+
+sizes = Counter(map(get_bucket, elements))
+
+# Python can't really reserve space for lists, but pretend `reserve` did that anyway. In C++, this
+# is `std::vector::reserve`. In Rust, it's `Vec::with_capacity`.
+buckets = [reserve(sizes[i]) for i in range(256)]
+for element in elements:
+    buckets[get_bucket(element)].append(element)
+

This, however, requires two iterations, and ideally we’d keep the code single-pass. If the index is random, we can have our cake and eat it too: estimate the size of each bucket as len(elements) / 256 and reserve that much space. There will be some leftovers if we underestimate, which we’ll store in a small separate overflow list:

class Bucket:
+    reserved: list
+    leftovers: list
+
+    def __init__(self, capacity):
+        self.reserved = reserve(capacity) # pseudocode
+        self.leftovers = []
+
+    def append(self, element):
+        if len(self.reserved) < self.reserved.capacity(): # pseudocode
+            self.reserved.append(element)
+        else:
+            self.leftovers.append(element)
+
+    def __len__(self):
+        return len(self.reserved) + len(self.leftovers)
+
+    def __iter__(self):
+        yield from self.reserved
+        yield from self.leftovers
+

The probability distribution plays ball here: on large input, only a tiny percentage of the elements overflow into leftovers, so the memory overhead is pretty small, reallocations on pushing into leftovers are fast, and bucketing (and iterating over a bucket) is cache-friendly.
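
As a sanity check of that claim, here is a hypothetical simulation (not from the original post) that throws n random elements into 256 buckets with len(elements) / 256 reserved slots each and measures how many spill into leftovers:

import random

def overflow_fraction(n, n_buckets = 256):
    capacity = n // n_buckets
    fill = [0] * n_buckets
    overflowed = 0
    for _ in range(n):
        bucket = random.randrange(n_buckets)
        if fill[bucket] < capacity:
            fill[bucket] += 1
        else:
            overflowed += 1
    return overflowed / n

# For n = 1_000_000, this typically comes out below 1%.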

Partitioning

One simple micro-optimization is to allocate once and split the returned memory into chunks instead of invoking malloc (or creating vectors) multiple times. Allocations are pretty slow, and this is a cheap way to reduce the effect.
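
The Rust benchmark in this post does exactly this, splitting one buffer with chunks_mut. In the pseudocode style used above, the idea looks roughly like this (from_storage is a hypothetical constructor for the Bucket class that wraps preallocated storage):

capacity = max(1, len(elements) // 256)
backing = reserve(capacity * 256)  # one allocation instead of 256 (pseudocode `reserve`)
buckets = [
    Bucket.from_storage(backing[i * capacity : (i + 1) * capacity])  # hypothetical constructor
    for i in range(256)
]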

Base case

Switching to the straightforward algorithm on small inputs increases performance, as the effects of 𝒪(n log n) code are more pronounced there. However, as radix_sort is recursive, we can perform this check on every level of recursion, scoring a win even on large data:

# Base case -- small enough to use a straightforward algorithm
+if len(elements) <= CUTOFF or bits <= 8:
+    counts = [0] * 256
+    for element in elements:
+        counts[element.index & 0xff] += 1
+
+    groups = [[] for _ in range(256)]
+    for element in elements:
+        groups[get_bucket(element)].append(element)
+
+    for group in groups:
+        if group:
+            yield group
+    return
+

The optimal CUTOFF is heavily machine-dependent. It depends on the relative speed of cache levels and RAM, as well as cache size and data types. For 64-bit integers, I’ve seen machines where the optimal value was 50k, 200k, and 1M. The best way to determine it is to benchmark at runtime – an acceptable solution for long-running software, like databases.
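
A hypothetical calibration sketch (not part of the original benchmark): time the full grouping on a representative sample with each candidate cutoff and keep the fastest.

import time

def pick_cutoff(sample_elements, candidates = (50_000, 200_000, 1_000_000)):
    best_cutoff, best_time = candidates[0], float("inf")
    for cutoff in candidates:
        start = time.perf_counter()
        # Assumes radix_sort accepts the cutoff as a parameter (in the post it is a constant).
        for group in radix_sort(sample_elements, cutoff = cutoff):
            pass  # drain the generator so the work actually happens
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cutoff, best_time = cutoff, elapsed
    return best_cutoff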

Benchmark

Setup

Here’s a small benchmark.

The input data is an array of random 64-bit integers. We want to group them by a simple multiplicative hash and perform a simple analysis on the buckets – say, compute the sum of minimums among buckets. (In reality, you’d consume the buckets with some other cache-friendly algorithm down the pipeline.)
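
For reference, here is the grouping key modeled in the post’s pseudocode style; the multiplicative constant and the top-bits shift are the ones used in benchmark.rs:

def bucket_of(x, m):
    # Multiplicative hash: take the top m bits of a wrapping 64-bit multiply.
    return ((x * 0x9a08c0ebcf5bc11b) & (2**64 - 1)) >> (64 - m)

def analyze(groups):
    # The "simple analysis": sum of per-group minimums.
    return sum(min(group) for group in groups)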

We’ll compare two implementations:

  1. The straightforward algorithm with optimized allocations.
  2. Radix sort-based grouping, with all optimizations from above and the optimal cutoff.

The average group size is 10.

The code is available on GitHub.

Results

The relative efficiency of the optimized algorithm grows as the data gets larger. Both the straightforward algorithm and the optimized one eventually settle at a fixed throughput. Depending on the machine, the improvement can be anywhere between 2.5× and 9× in the limit.

The results are (A, Y, M indicate different devices):

Grouping performance is benchmarked on three devices on element counts from 80k to 40M (with power-of-two steps). The cutoffs are 50k for A, 200k for Y, and 1M for M. On all three devices, the throughput of the two algorithms is equivalent up to the cutoff, with radix sort getting faster and faster above the cutoff. The throughput of the straightforward algorithm degrades faster, while radix sort is way more stable.

Performance improvement of the new algorithm on different element counts, as measured on three devices. On A, the improvement slowly increases from 1x to 3x up to 5M elements and then quickly rises to 8x at 40M elements. On Y and M, the improvement is not so drastic, slowly but surely rising to 2.5x -- 3x.

Conclusion

Is it worth it? If you absolutely need performance and sharding is a large part of your pipeline, by all means, use this. For example, I use this to find a collision-free hash on a given dataset. But just like with any optimization, you need to consider if increasing the code complexity is worth the hassle.

At the very least, if you work with big data, this trick is good to keep in mind.

Here’s another takeaway. Everyone knows that, when working with on-disk data, you shouldn’t just map it to memory and run typical in-memory algorithms. It’s possible, but the performance is going to be bad. The lesson is that this applies to RAM and cache too: if you’ve got more than, say, 32 MiB of data, you need to seriously consider partitioning your data or switching to external memory algorithms.

Made with my own bare hands (why.)

\ No newline at end of file diff --git a/blog/the-ram-myth/index.md b/blog/the-ram-myth/index.md new file mode 100644 index 0000000..cc79afb --- /dev/null +++ b/blog/the-ram-myth/index.md @@ -0,0 +1,209 @@ +--- +title: The RAM myth +time: December 19, 2024 +intro: | + The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it's going to be processed faster; if it doesn't, there's nothing we can do. + + Most likely, you believe that code like this is the fastest way to shard data: + + ```python + groups = [[] for _ in range(n_groups)] + for element in elements: + groups[element.group].append(element) + ``` + + Indeed, it's linear (i.e. asymptotically optimal), and we have to access random indices anyway, so cache isn't going to help us in any case. + + In reality, this is leaving a lot of performance on the table, and certain *asymptotically slower* algorithms can perform sharding significantly faster on large input. They are mostly used by on-disk databases, but, surprisingly, they are useful even for in-RAM data. +--- + +The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it's going to be processed faster; if it doesn't, there's nothing we can do. + +Most likely, you believe that code like this is the fastest way to shard data: + +```python +groups = [[] for _ in range(n_groups)] +for element in elements: + groups[element.group].append(element) +``` + +Indeed, it's linear (i.e. asymptotically optimal), and we have to access random indices anyway, so cache isn't going to help us in any case. + +In reality, this is leaving a lot of performance on the table, and certain *asymptotically slower* algorithms can perform sharding significantly faster on large input. They are mostly used by on-disk databases, but, surprisingly, they are useful even for in-RAM data. + + +### Solution + +The algorithm from above has $\Theta(n)$ cache misses on random input. The only way to reduce this number is to make the memory accesses more ordered. If you can ensure the elements are ordered by `group`, that's great. If you can't, you can still sort the accesses before the `for` loop: + +```python +elements.sort(key = lambda element: element.group) +``` + +Sorting costs some time, but in return, removes cache misses from the `for` loop entirely. If the data is large enough that it doesn't fit in cache, this is a net win. As a bonus, creating individual lists can be replaced with a group-by operation: + +```python +elements.sort(key = lambda element: element.group) +groups = [ + group_elements + for _, group_elements + in itertools.groupby(elements, key = lambda element: element.group) +] +``` + +There's many cache-aware sorting algorithms, but as indices are just integers, radix sort works best here. Among off-the-shelf implementations, [radsort](https://docs.rs/radsort/latest/radsort/) worked the best for me in Rust. + + +## Speedups + +### + +This is already better than the straightforward algorithm on large data, but there's many tricks to make it faster. + + +### Generators + +Sorting APIs try to make it seem like the data is sorted in-place, even when that's not the case. This requires sorted data to be explicitly written to memory in a particular format. 
But if we only need to iterate over groups, generators or callbacks help avoid this: + +```python +# Assuming 32-bit indices +def radix_sort(elements, bits = 32): + # Base case -- nothing to sort or all indices are equal + if len(elements) <= 1 or bits <= 0: + yield from elements + return + + # Split by most significant byte of index we haven't seen yet + buckets = [[] for _ in range(256)] + for element in elements: + buckets[(element.index >> max(0, bits - 8)) & 0xff].append(element) + + # Sort buckets recursively + for bucket in buckets: + yield from radix_sort(bucket, bits - 8) +``` + +We can even remove the `groupby` step by yielding individual groups: + +```python +# Base case -- nothing to sort or all indices are equal +if bits <= 0: + if elements: + # Group! + yield elements + return +``` + + +### Reallocs + +The next problem with this code is constantly reallocating the `bucket` arrays on `append`. This invokes `memcpy` more often than necessary and is bad for cache. A common fix is to compute sizes beforehand: + +```python +def get_bucket(element): + return (element.index >> max(0, bits - 8)) & 0xff + +sizes = Counter(map(get_bucket, elements)) + +# Python can't really reserve space for lists, but pretend `reserve` did that anyway. In C++, this +# is `std::vector::reserve`. In Rust, it's `Vec::with_capacity`. +buckets = [reserve(sizes[i]) for i in range(256)] +for element in elements: + buckets[get_bucket(element)].append(element) +``` + +This, however, requires two iterations, and ideally we'd keep the code single-pass. If the index is random, we can have our cake and eat it too: *estimate* the size of each bucket as `len(elements) / 256` and reserve that much space. There's going to be some leftovers if we underestimate, which we'll store in a small separate storage: + +```python +class Bucket: + reserved: list + leftovers: list + + def __init__(self, capacity): + self.reserved = reserve(capacity) # pseudocode + self.leftovers = [] + + def append(self, element): + if len(self.reserved) < self.reserved.capacity(): # pseudocode + self.reserved.append(element) + else: + self.leftovers.append(element) + + def __len__(self): + return len(self.reserved) + len(self.leftovers) + + def __iter__(self): + yield from self.reserved + yield from self.leftovers +``` + +The probability distribution plays ball here: on large input, only a tiny percentage of the elements overflow into `leftovers`, so the memory overhead is pretty small, reallocations on pushing into `leftovers` are fast, and bucketing (and iterating over a bucket) is cache-friendly. + + +### Partitioning + +One simple micro-optimization is to allocate once and split the returned memory into chunks instead of invoking `malloc` (or creating vectors) multiple times. Allocations are pretty slow, and this is a cheap way to reduce the effect. + + +### Base case + +Switching to the straightforward algorithm on small inputs increases performance, as the effects of $\mathcal{O}(n \log n)$ code are more pronounced there. 
However, as `radix_sort` is recursive, we can perform this check on every level of recursion, scoring a win even on large data: + +```python +# Base case -- small enough to use a straightforward algorithm +if len(elements) <= CUTOFF or bits <= 8: + counts = [0] * 256 + for element in elements: + counts[element.index & 0xff] += 1 + + groups = [[] for _ in range(256)] + for element in elements: + groups[get_bucket(element)].append(element) + + for group in groups: + if group: + yield group + return +``` + +The optimal `CUTOFF` is heavily machine-dependent. It depends on the relative speed of cache levels and RAM, as well as cache size and data types. For 64-bit integers, I've seen machines where the optimal value was `50k`, `200k`, and `1M`. The best way to determine it is to benchmark in runtime -- an acceptable solution for long-running software, like databases. + + +## Benchmark + +### Setup + +Here's a small benchmark. + +The input data is an array of random 64-bit integers. We want to group them by a simple multiplicative hash and perform a simple analysis on the buckets -- say, compute the sum of minimums among buckets. (In reality, you'd consume the buckets with some other cache-friendly algorithm down the pipeline.) + +We'll compare two implementations: + +1. The straightforward algorithm with optimized allocations. +2. Radix sort-based grouping, with all optimizations from above and the optimal cutoff. + +The average group size is $10$. + +The code is available on [GitHub](https://github.com/purplesyringa/site/blob/master/blog/the-ram-myth/benchmark.rs). + + +### Results + +The relative efficiency of the optimized algorithm grows as the data gets larger. Both the straightforward algorithm and the optimized one eventually settle at a fixed throughput. Depending on the machine, the improvement can be anywhere between $2.5 \times$ and $9 \times$ in the limit. + + + +The results are (`A`, `Y`, `M` indicate different devices): + +![Grouping performance is benchmarked on three devices on element counts from 80k to 40M (with power-of-two steps). The cutoffs are 50k for A, 200k for Y, and 1M for M. On all three device, the throughput of the two algorithms is equivalent up to the cutoff, with radix sort getting faster and faster above the cutoff. The throughput of the straightforward algorithm degrades faster, while radix sort is way more stable.](./benchmark.svg) + +![Performance improvement of the new algorithm on different element counts, as measured on three devices. On A, the improvement slowly increases from 1x to 3x up to 5M elements and then quickly raises to 8x at 40M elements. On Y and M, the improvement is not so drastic, slowly but surely raising to 2.5x -- 3x.](./improvement.svg) + + +### Conclusion + +Is it worth it? If you absolutely need performance and sharding is a large part of your pipeline, by all means, use this. For example, I use this to find a collision-free hash on a given dataset. But just like with any optimization, you need to consider if increasing the code complexity is worth the hassle. + +At the very least, if you work with big data, this trick is good to keep in mind. + +Here's another takeaway lesson. Everyone knows that, when working with on-disk data, you shouldn't just map it to memory and run typical in-memory algorithms. It's *possible*, but the performance are going to be bad. 
The take-away lesson here is that this applies to RAM and cache too: if you've got more than, say, $32$ MiB of data, you need to seriously consider partitioning your data or switching to external memory algorithms. diff --git a/blog/the-ram-myth/og.png b/blog/the-ram-myth/og.png new file mode 100644 index 0000000..5e5a030 Binary files /dev/null and b/blog/the-ram-myth/og.png differ