---
title: Thoughts on Rust hashing
time: December 11, 2024
intro: |
  In languages like Python, Java, or C++, values are hashed by calling a "hash me" method on them, implemented by the type author. This fixed-size hash is then immediately used by the hash table or what have you. This design suffers from some obvious problems, like:

  How do you hash an integer? If you use a no-op hasher (booo), DoS attacks on hash tables are inevitable. If you hash it thoroughly, consumers that only cache hashes to optimize equality checks lose out on performance.
---

### Intro

In languages like Python, Java, or C++, values are hashed by calling a "hash me" method on them, implemented by the type author. This fixed-size hash is then immediately used by the hash table or what have you. This design suffers from some obvious problems, like:

How do you hash an integer? If you use a no-op hasher (booo), DoS attacks on hash tables are inevitable. If you hash it thoroughly, consumers that only cache hashes to optimize equality checks lose out on performance.

How do you mix hashes? You can:

- Leave that to the users. Everyone will then invent their own terrible mixers, like `x ^ y`. Indeed, both arguments are pseudo-random, what could possibly go wrong?
- Provide a good-enough mixer for most use cases, like `a * x + y`. Cue CVEs because people used `mix(x, mix(y, z))` instead of `mix(mix(x, y), z)` -- there's a worked example right after this list.
- Provide a quality mixer, missing out on performance in common simple cases.
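
To see why the nesting order matters (this worked example is mine, not taken from any particular CVE): with the simple mixer $\mathrm{mix}(x, y) = a \cdot x + y$, the "wrong" nesting collapses into a symmetric expression,

$$
\mathrm{mix}(x, \mathrm{mix}(y, z)) = a \cdot x + a \cdot y + z,
$$

so swapping $x$ and $y$ collides for every choice of $a$, while the "right" nesting $\mathrm{mix}(\mathrm{mix}(x, y), z) = a^2 \cdot x + a \cdot y + z$ keeps a distinct weight per position.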

What if the input data is already random? Then you're just wasting cycles.

What guarantees do you provide regarding the hash values?

- Do you require the avalanche effect? Your hash is suboptimal even for simple power-of-two-sized hash tables.
- Do you require a half-avalanche effect instead? Congrats, you broke either those or prime-sized hash tables.
- Do you require the hash table to perform finalization manually? Using strings as keys is now suboptimal, because a non-finalized hash of a string is already of good enough quality.

Is your hash function seeded?

- If not, hi DoS.
- If yes, but you reuse the same seed between different hash tables, [your tables are now quadratic](https://accidentallyquadratic.tumblr.com/post/153545455987/rust-hash-iteration-reinsertion).
- If the seed is explicitly passed to each hasher, how do you ensure different hashers don't accidentally cancel out?


### In Rust

Rust learnt from these mistakes by splitting the responsibilities:

- Objects implement the `Hash` trait, allowing them to write underlying data into a `Hasher`.
- Hashers implement the `Hasher` trait, which hashes the data written by `Hash` objects.

Objects turn the structured data into a stream of integers; hashers turn the stream into a numeric hash.

On paper, this is a good solution:

- Hashing an integer is as simple as sending the integer to the hasher. Consumers can choose hashers that provide the necessary guarantees.
- Users don't have to mix hashes. Hashers can do that optimally.
- If the data is known to be random, a fast simple hasher can be used without changing the `Hash` implementation.
- Different hash tables can use different hashers, efficiently providing only as much avalanche as necessary.
- The hasher can be seeded per-table. Only the hasher has access to the seed, so safely using the seed during mixing is easy.
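
To make this division of labour concrete, here's a minimal sketch (the `FnvHasher` type, its constant, and the `UserId` key are mine, purely for illustration): the key only describes its data through `Hash`, the table-side `Hasher` decides how to mix it, and the two are wired together with `BuildHasherDefault`.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hash, Hasher};

// A toy FNV-1a-style hasher: `Hash` impls feed it bytes, it decides how to mix them.
#[derive(Default)]
struct FnvHasher(u64);

impl Hasher for FnvHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x100_0000_01b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// The key only describes its structure; it never decides how mixing happens.
#[derive(Hash, PartialEq, Eq)]
struct UserId {
    shard: u16,
    id: u64,
}

fn main() {
    // The table picks the hasher; a different table could pick a different one.
    let mut map: HashMap<UserId, &str, BuildHasherDefault<FnvHasher>> = HashMap::default();
    map.insert(UserId { shard: 3, id: 17 }, "alice");
    assert_eq!(map.get(&UserId { shard: 3, id: 17 }), Some(&"alice"));
}
```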

Surely this enables optimal and performant hashing in practice, right?


### No

Let's take a look at the `Hasher` API:

```rust
pub trait Hasher {
    // Required methods
    fn finish(&self) -> u64;
    fn write(&mut self, bytes: &[u8]);

    // Provided methods
    fn write_u8(&mut self, i: u8) { ... }
    fn write_u16(&mut self, i: u16) { ... }
    fn write_u32(&mut self, i: u32) { ... }
    fn write_u64(&mut self, i: u64) { ... }
    fn write_u128(&mut self, i: u128) { ... }
    fn write_usize(&mut self, i: usize) { ... }
    fn write_i8(&mut self, i: i8) { ... }
    fn write_i16(&mut self, i: i16) { ... }
    fn write_i32(&mut self, i: i32) { ... }
    fn write_i64(&mut self, i: i64) { ... }
    fn write_i128(&mut self, i: i128) { ... }
    fn write_isize(&mut self, i: isize) { ... }
    fn write_length_prefix(&mut self, len: usize) { ... }
    fn write_str(&mut self, s: &str) { ... }
}
```

This API is tuned to *streaming hashes*, like the polynomial hash and its various knock-offs. But just like in encryption, hashing is block-wise these days.

Block hashes have some internal state that iteratively "absorbs" input blocks of a fixed length. When the data runs out, the last block is padded with the length and absorbed as a fixed-length block too. A finalization step then reduces the internal state to 64 bits (or more, depending on the use case).
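
As a sketch of that shape (the names, the 32-byte block size, and the constant below are mine; the point is the structure, not the quality of the mixing):

```rust
struct BlockHash {
    state: [u64; 4],  // internal state, wider than the final output
    total_len: u64,   // counts every input byte for the final length padding
}

impl BlockHash {
    // Absorb one full 32-byte block into the state.
    fn absorb(&mut self, block: &[u8; 32]) {
        for (lane, chunk) in self.state.iter_mut().zip(block.chunks_exact(8)) {
            let word = u64::from_le_bytes(chunk.try_into().unwrap());
            // Placeholder mixing; a real hash does something much stronger here.
            *lane = (*lane ^ word).wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(23);
        }
        self.total_len += 32;
    }

    // Pad the trailing partial block with the length, absorb it, then finalize.
    fn finish(mut self, tail: &[u8]) -> u64 {
        assert!(tail.len() <= 24); // a real hash would spill into one more block instead
        let mut last = [0u8; 32];
        last[..tail.len()].copy_from_slice(tail);
        self.total_len += tail.len() as u64;
        last[24..].copy_from_slice(&self.total_len.to_le_bytes());
        self.absorb(&last);
        // Finalization: fold the wide state down to 64 bits.
        self.state
            .iter()
            .fold(0, |acc, &lane| (acc ^ lane).wrapping_mul(0x9E37_79B9_7F4A_7C15))
    }
}
```

The important part is the shape: fixed-width absorb steps, one length-padded tail block, one finalization step at the end.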

That's how SHA-2 and many other cryptographic hashes work, but you might be surprised to know that the top hashes [in the SMHasher list](https://gitlab.com/fwojcik/smhasher3/-/tree/main/results) all use the same approach.

The block-wise design is objectively superior to streaming. Consuming as much data as possible at once reduces the amortized avalanche cost, enabling safer hash functions at greater speed than streaming hashes can achieve. Block-wise hashes have a lower latency, as the latency is accumulated per-block, not per-stream-input.


## Block hash support in Rust

### :ferrisClueless:

The `Hasher` API makes no effort to suit block hashes. The hasher is not informed of the length of the data or of its structure. It must *always* be able to absorb just one more `u8`, bro, I promise. There are only two ways to deal with this:

- Either you pad all individual inputs, even very short ones, to the full block width,
- Or you accumulate a block and occasionally flush it to the underlying block hasher.

Let's see what's wrong with these approaches.


### Padding

Let's consider a very simple block-wise hash:

```rust
fn absorb(state: &mut u64, block: &[u8; 8]) {
    let block = u64::from_ne_bytes(*block);
    *state = state.wrapping_mul(K).wrapping_add(block);
}
```

This is just a multiplicative hash with some suitable odd constant `K`, not unlike FNV-1, but consuming $8$ bytes at a time instead of $1$.

Now what happens if you try to hash two 32-bit integers with this hash? With padding, that will compile to two multiplications even though one would work. This halves throughput and increases latency.
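
Spelled out against the `absorb` above (a sketch; `write_u32_padded` and `write_u32_pair` are hypothetical helpers, not part of any real `Hasher`):

```rust
// Padding: every write becomes its own zero-padded 8-byte block.
fn write_u32_padded(state: &mut u64, x: u32) {
    let mut block = [0u8; 8];
    block[..4].copy_from_slice(&x.to_ne_bytes());
    absorb(state, &block); // one multiplication per u32
}

// What we actually want for a (u32, u32) key: pack both halves into one block.
fn write_u32_pair(state: &mut u64, a: u32, b: u32) {
    let mut block = [0u8; 8];
    block[..4].copy_from_slice(&a.to_ne_bytes());
    block[4..].copy_from_slice(&b.to_ne_bytes());
    absorb(state, &block); // one multiplication for the whole key
}
```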

Practical hashes use much larger blocks. `rapidhash` has a $24$-byte state and can absorb $48$ bytes at once. `ahash` has a $48$-byte state and absorbs $64$-byte blocks. `meowhash` has a $128$-byte state and absorbs $256$ bytes. (I only selected these particular hashes because I'm familiar with their kernels; others have similar designs.)

These are some of the fastest non-cryptographic hashes in the world. Do you really want to nuke their performance by padding $8$-byte inputs to $48$, $64$, or $256$ bytes? Probably not.


### Chains

Okay, but what if we cheated and modified the hash functions to absorb small data somewhat more efficiently than absorbing a full block?

Say, the `rapidhash` kernel is effectively *this*:

```rust
fn absorb(state: &mut [u64; 3], seed: &[u64; 3], block: &[u64; 6]) {
    for i in 0..3 {
        state[i] = mix(block[i] ^ state[i], block[i + 3] ^ seed[i]);
    }
}
```

That's three independent iterations, so *surely* we can absorb a smaller 64-bit block like this instead:

```rust
fn absorb_64bit(state: &mut [u64; 3], seed: &[u64; 3], block: u64) {
    state[0] = mix(block ^ state[0], seed[0]);
}
```

Surely this is going to reduce the $6 \times$ slowdown to at least something like $2 \times$, right?

Why does `rapidhash` even use three independent chains in the first place? That's right, latency!

`mix` has a $5$ tick latency on modern x86 processors, but a throughput of $1$. Chain independence allows a $16$-byte block to be consumed without waiting for the previous $16$ bytes to be mixed in. We just threw this optimization out.
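
For reference, the `mix` in question is the usual folded full-width multiplication (this is the well-known wyhash/rapidhash-style kernel, written from memory rather than copied from either crate):

```rust
// Widen to 128 bits, multiply, then XOR the low and high halves together.
// One multiplication plus a couple of cheap ops, hence the ~5-cycle latency
// and ~1/cycle throughput mentioned above.
fn mix(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    (product as u64) ^ ((product >> 64) as u64)
}
```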


### Accumulation

Okay, so padding is a terrible idea. Can we accumulate a buffer instead? How many hashes I had to scroll through in SMHasher before I found *one* Rust implementation that took this approach is a warning bell.

[The implementation I found](https://docs.rs/farmhash/1.1.5/src/farmhash/lib.rs.html#92-110), of course, stores a `Vec<u8>` and passes it to the underlying hasher in `finish`. I believe I don't need to explain why allocating inside a hash function is not the brightest idea.

Let's consider [another implementation](https://docs.rs/highway/1.2.0/src/highway/portable.rs.html#272-288) that stores a fixed-size buffer instead. Huh, that's a lot of `if`s and `for`s. I wonder what Godbolt will say about this. Let's try something very simple:

```rust expansible
struct StreamingHasher {
    block_hasher: BlockHasher,
    buffer: [u8; 8],
    length: usize,
}

impl StreamingHasher {
    fn write(&mut self, input: &[u8]) {
        // If the input fits in the free space in the buffer, just copy it.
        let rest = unsafe { self.buffer.get_unchecked_mut(self.length..) };
        if input.len() < rest.len() {
            rest[..input.len()].copy_from_slice(input);
            self.length += input.len();
            return;
        }

        // Otherwise, copy whatever fits and hash the chunk.
        let (head, tail) = input.split_at(rest.len());
        rest.copy_from_slice(head);
        self.block_hasher.feed(self.buffer);

        // Split the rest of the input into blocks and hash them individually, move the last one
        // to the buffer.
        let chunks = tail.array_chunks();
        let remainder = chunks.remainder();
        self.buffer[..remainder.len()].copy_from_slice(remainder);
        self.length = remainder.len();
        for chunk in chunks {
            self.block_hasher.feed(*chunk);
        }
    }
}
```

Surely this will compile to good code? :ferrisClueless:

Here's what writing 1 (one) byte into this hasher compiles to:

```x86asm expansible
write_u8:
    push r15
    push r14
    push r13
    push r12
    push rbx
    sub rsp, 16
    mov rbx, rdi
    mov byte ptr [rsp + 15], sil
    mov r14, qword ptr [rdi + 16]
    add rdi, r14
    add rdi, 8
    cmp r14, 6
    ja .LBB0_2
    mov byte ptr [rdi], sil
    mov r14, qword ptr [rbx + 16]
    inc r14
    jmp .LBB0_3
.LBB0_2:
    lea r15, [rbx + 8]
    mov edx, 8
    sub rdx, r14
    lea r12, [rsp + rdx]
    add r12, 15
    add r14, -7
    lea rsi, [rsp + 15]
    mov r13, qword ptr [rip + memcpy@GOTPCREL]
    call r13
    movabs rax, 5512829513697402577
    imul rax, qword ptr [rbx]
    add rax, qword ptr [rbx + 8]
    mov qword ptr [rbx], rax
    mov rsi, r14
    and rsi, -8
    add rsi, r12
    mov rdi, r15
    mov rdx, r14
    call r13
.LBB0_3:
    mov qword ptr [rbx + 16], r14
    add rsp, 16
    pop rbx
    pop r12
    pop r13
    pop r14
    pop r15
    ret
```

Waow, what happened? That's right, `copy_from_slice` did! LLVM *cannot* compile a variable-length copy into anything other than `memcpy`. Did you write a loop with a guaranteed bound on the iteration count by hand? Too bad, that goes in the `memcpy` hole.
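
For example, even a hand-rolled loop in the spirit of the paragraph above, with the count visibly capped below the buffer size, typically gets recognized as a copy idiom and lowered right back to a `memcpy` call (`copy_tail` is a hypothetical helper, not code from the post):

```rust
// `len` can never exceed 7 here, but LLVM's loop-idiom recognition still sees
// "copy `len` bytes" and usually emits a call to memcpy for it.
fn copy_tail(dst: &mut [u8; 8], src: &[u8]) {
    let len = src.len().min(7);
    for i in 0..len {
        dst[i] = src[i];
    }
}
```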


### SipHasher

So crates in the wild do this wrong. How does the built-in Rust hasher handle this? [It conveniently doesn't define `write_*`](https://github.com/rust-lang/rust/pull/69152) -- by design, because this important optimization leads to a small increase in compile time. Riiiiiight.

The `siphasher` *crate*, though, optimizes the short-length `memcpy` with [bitwise operations](https://docs.rs/siphasher/latest/src/siphasher/sip.rs.html#330-354). Let's try it out:

```rust expansible
fn write(&mut self, input: u64, input_len: usize) {
    assert!(input_len <= 8);
    if input_len != 8 {
        assert!(input >> (8 * input_len) == 0);
    }

    // Consume as many inputs as fit.
    let old_length = self.length;
    self.buffer |= input << (8 * self.length);
    self.length += input_len;

    // On overflow, feed the buffer to the block hasher and initialize the buffer with the tail.
    if self.length > 8 {
        self.block_hasher.feed(self.buffer);
        self.buffer = input >> (8 * (8 - old_length));
        self.length -= 8;
    }
}
```

```x86asm expansible
write_u8:
    mov rax, qword ptr [rdi + 16]
    movzx ecx, sil
    lea edx, [8*rax]
    lea rsi, [rax + 1]
    shlx rdx, rcx, rdx
    or rdx, qword ptr [rdi + 8]
    mov qword ptr [rdi + 8], rdx
    mov qword ptr [rdi + 16], rsi
    cmp rsi, 9
    jb .LBB0_2
    movabs rsi, 5512829513697402577
    imul rsi, qword ptr [rdi]
    add rsi, rdx
    mov edx, eax
    add rax, -7
    neg dl
    mov qword ptr [rdi], rsi
    shl dl, 3
    shrx rcx, rcx, rdx
    mov qword ptr [rdi + 8], rcx
    mov qword ptr [rdi + 16], rax
.LBB0_2:
    ret
```

This is kind of better? Now let's try hashing `(u8, u8)` like Rust would do:

```rust
#[no_mangle]
fn write_u8_pair(&mut self, pair: (u8, u8)) {
    self.write(pair.0 as u64, 1);
    self.write(pair.1 as u64, 1);
}
```

```x86asm expansible
write_u8_pair:
    mov r8, qword ptr [rdi + 16]
    movzx r9d, sil
    movabs rax, 5512829513697402577
    lea ecx, [8*r8]
    shlx rsi, r9, rcx
    or rsi, qword ptr [rdi + 8]
    lea rcx, [r8 + 1]
    cmp rcx, 9
    jb .LBB1_2
    mov rcx, qword ptr [rdi]
    imul rcx, rax
    add rcx, rsi
    mov qword ptr [rdi], rcx
    mov ecx, r8d
    add r8, -7
    neg cl
    shl cl, 3
    shrx rsi, r9, rcx
    mov rcx, r8
.LBB1_2:
    lea r8d, [8*rcx]
    movzx edx, dl
    shlx r8, rdx, r8
    or r8, rsi
    lea rsi, [rcx + 1]
    mov qword ptr [rdi + 8], r8
    mov qword ptr [rdi + 16], rsi
    cmp rcx, 8
    jb .LBB1_4
    imul rax, qword ptr [rdi]
    add rax, r8
    mov qword ptr [rdi], rax
    mov eax, ecx
    add rcx, -7
    neg al
    shl al, 3
    shrx rax, rdx, rax
    mov qword ptr [rdi + 8], rax
    mov qword ptr [rdi + 16], rcx
.LBB1_4:
    ret
```

Waow. So elegant. What went wrong?

In retrospect, the reason is obvious. The two writes can "tear" if the first write fills the buffer to the end. The optimizer does not realize the writes can be combined, so we're left with this monstrosity.

More generally, the problem is that `write_*` methods cannot predict the current state of the buffer, so the branches and variable-index accesses cannot be optimized out. And if `write`s forced the state to a fixed one? Well, that's equivalent to padding the data to a full block. Eugh.


### Inlining

Okay, but hear me out, *surely* the state can be predicted if the `hash` and `write_*` calls are inlined? Here:

```rust
fn hash_u8_pair(pair: (u8, u8)) -> u64 {
    let mut hasher = Self::new();
    hasher.write_u8(pair.0);
    hasher.write_u8(pair.1);
    hasher.finish()
}
```

```x86asm
hash_u8_pair:
    movzx eax, sil
    movzx ecx, dil
    shl eax, 8
    or eax, ecx
    ret
```

That's a nice argument, but let me introduce to you: variable-length collections. `Vec<T>` is hashed by writing the length and then hashing the elements one by one. Even if the element hashing is somehow vectorized (it's not, LLVM is a dumdum), nothing *after* this variable-length collection can be hashed efficiently.
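
In other words, the slice (and hence `Vec`) hashing path boils down to this shape (my paraphrase, not the exact libcore source; std uses an internal length-prefix write, approximated here with `write_usize`):

```rust
use std::hash::{Hash, Hasher};

// The hasher learns the length only as yet another opaque write, and then
// receives the elements strictly one at a time.
fn hash_slice_like_std<T: Hash, H: Hasher>(data: &[T], state: &mut H) {
    state.write_usize(data.len());
    for element in data {
        element.hash(state);
    }
}
```

Everything written after that length depends on a runtime value the optimizer cannot see through, which is exactly the problem.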


### std

*Surely* someone thought of this problem before? C'mere, take a look at how slices of integers [are hashed](https://doc.rust-lang.org/src/core/hash/mod.rs.html#818-827):

```rust
#[inline]
fn hash_slice<H: Hasher>(data: &[$ty], state: &mut H) {
    let newlen = mem::size_of_val(data);
    let ptr = data.as_ptr() as *const u8;
    // SAFETY: `ptr` is valid and aligned, as this macro is only used
    // for numeric primitives which have no padding. The new slice only
    // spans across `data` and is never mutated, and its total size is the
    // same as the original `data` so it can't be over `isize::MAX`.
    state.write(unsafe { slice::from_raw_parts(ptr, newlen) })
}
```

So that's good.

Meanwhile, newtypes are crying in the corner: `#[derive(Hash)]` understandably does not apply this optimization to them (nor to structs with multiple fields, nor to tuples), so even today the built-in hasher ends up with $2.5 \times$ worse code than it could have -- code that *also* takes way more space in your instruction cache than necessary.
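
That is, the derive expands to roughly this (a paraphrase, not the literal macro output), so `[NewType]` goes down the generic element-by-element path instead of one bulk `write`:

```rust
use std::hash::{Hash, Hasher};

struct NewType(i32);

// Roughly what #[derive(Hash)] generates: hash each field in declaration order.
impl Hash for NewType {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.hash(state); // one `write_i32` per element of a `&[NewType]`
    }
}
```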


## It can't be that bad

### It can

Shall we benchmark some code?

```rust
use std::any::type_name;
use std::hash::{DefaultHasher, Hash, Hasher};
use std::time::Instant;

fn time<T: Hash>(obj: T) {
    let start = Instant::now();
    let mut hasher = DefaultHasher::new();
    obj.hash(&mut hasher);
    let h = hasher.finish();
    println!("{}: {:?} (-> {h:0x})", type_name::<T>(), start.elapsed());
}

#[derive(Hash)]
struct NewType(i32);

fn main() {
    let n = 100000000;
    time((0..n).collect::<Vec<i32>>());
    time((0..n).map(NewType).collect::<Vec<NewType>>());
}
```

Hashing `[i32]` transmutes the slice into `[u8]` and performs a single `write` call, while hashing `[NewType]` hashes the elements one by one. This benchmark thus measures the cost of individual calls. Note also that we hash almost $400$ MiB of memory. This doesn't fit in cache, which might *hide* some inefficiencies. I'm feeling generous.

```
alloc::vec::Vec<i32>: 117.756736ms (-> 1984796e743a33f5)
alloc::vec::Vec<NewType>: 469.774204ms (-> 1984796e743a33f5)
```

~~Huh, literally 1984.~~

We get $4 \times$ slower code, even though it computes the exact same hash. Let's try the `siphasher` crate:

```
alloc::vec::Vec<i32>: 196.330253ms (-> 95697c476562afec)
alloc::vec::Vec<NewType>: 243.031408ms (-> 95697c476562afec)
```

That's better, though admittedly a $25\%$ difference is still eugh. But keep in mind that this is a *cryptographic* hash, which takes *a lot* of time to hash a block. This difference will be exacerbated on non-cryptographic hashes.

`rapidhash`:

```
alloc::vec::Vec<i32>: 54.224434ms (-> 1908e25736ad8479)
alloc::vec::Vec<NewType>: 278.101368ms (-> 949efa02155c336a)
```

`ahash`:

```
alloc::vec::Vec<i32>: 56.262629ms (-> 217325aa736f75a8)
alloc::vec::Vec<NewType>: 177.900032ms (-> 4ae6133ab0e0fe9f)
```

`highway`:

```
alloc::vec::Vec<i32>: 53.843217ms (-> f2e68b031ff10c02)
alloc::vec::Vec<NewType>: 547.520541ms (-> f2e68b031ff10c02)
```

That's not good. Note that all hashers have about the same performance on `Vec<i32>`. That's about the speed of RAM. For small arrays that fit in cache, the difference is even more prominent. (I didn't verify this, but I am the smartest person in the room and thus am obviously right.)


## My goal

### (Kinda)

What I really want is a general-purpose hash that's good for most practical purposes and kinda DoS-resistant but not necessarily cryptographic. It needs to perform fast on short inputs, so it can't be a "real" block hash, but rather something close to `rapidhash`.

We want each absorption step to look like the `rapidhash` kernel above, with $a$ the running state, $x$ and $y$ two input words, and $C$ a seed-derived constant:

$$
\mathrm{consume}(a, x, y) = \mathrm{mix}(x \oplus a, y \oplus C).
$$

Right, Rust doesn't support this: the `Hasher` is fed one integer at a time and never gets to pair inputs up into $(x, y)$. Okay, let's try another relatively well-known scheme that might be easier to implement. It's parallel, surely that'll help?

To hash a $64$-bit word sequence $(x_1, \dots, x_{2n})$, we compute

$$
\mathrm{mix}(x_1 \oplus a_1, x_2 \oplus a_2) + \dots + \mathrm{mix}(x_{2n - 1} \oplus a_{2n - 1}, x_{2n} \oplus a_{2n}),
$$

where $(a_1, \dots, a_{2n})$ is random data (possibly generated from the seed once), and

$$
\mathrm{mix}(x, y) = (x \cdot y \bmod 2^{64}) \oplus (x \cdot y \mathop{div} 2^{64}).
$$

This is a combination of certain well-known primitives. The problem here is that $a_i$ needs to be precomputed beforehand. This is not a problem for fixed-length keys, like structs of integers -- something often used in, say, `rustc`.

Unfortunately, Rust forces each hasher to handle *all* possible inputs, including inputs of different lengths, so this scheme can't work. The hasher isn't even parametrized by the type of the hashed object. Four neatly laid out 64-bit integers that can easily be mixed together with just two full-width multiplications? Nah, `write_u64` goes brrrrrrrrrrrr-
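
For a concrete picture of what the fixed-length scheme buys (my sketch; the constants merely stand in for the per-position random words $a_i$): a key of four 64-bit fields reduces to two full-width multiplications and an addition.

```rust
// Per-position random words a_i, generated once from the seed in a real setup;
// these particular constants are made up for the example.
const A: [u64; 4] = [
    0x243F_6A88_85A3_08D3,
    0x1319_8A2E_0370_7344,
    0xA409_3822_299F_31D0,
    0x082E_FA98_EC4E_6C89,
];

// The same folded multiplication as before.
fn mix(x: u64, y: u64) -> u64 {
    let p = (x as u128) * (y as u128);
    (p as u64) ^ ((p >> 64) as u64)
}

// A fixed-shape key of four words needs exactly two full-width multiplications.
fn hash_key(key: [u64; 4]) -> u64 {
    mix(key[0] ^ A[0], key[1] ^ A[1]).wrapping_add(mix(key[2] ^ A[2], key[3] ^ A[3]))
}
```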


### Stop bitching

I'd been designing fast hash-based data structures for several months before I realized they are almost unusable because of these design decisions. *Surely* something that isn't a problem in C++ and Python won't be a problem in Rust, I thought. I deserve a little bitching, okay?


### Actually how

The obvious way forward is to bring the structure of the data back into the picture. If the hasher knew it's hashing fixed-size data, it could use the $a_i$ approach. If the hasher knew it's hashing an array, it could vectorize the computation of individual hashes. If the hasher knew the types of the fields in the structure it's hashing, it could prevent tearing, or perhaps merge small fields into 64-bit blocks efficiently. Alas, the hasher is clueless...

In my opinion, `Hasher` and `Hash` are the wrong abstraction. Instead of the `Hash` driving the `Hasher` ~~insane~~, it should be the other way round: `Hash` providing introspection facilities and `Hasher` navigating the hashed objects recursively.
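
The design genuinely doesn't exist yet, so the following is purely my own strawman of what "`Hash` describes, `Hasher` drives" could look like; every name in it is invented:

```rust
// A strawman of the inverted control flow, nothing more: the type describes
// its own structure, and the hasher decides what to do with that description.
trait Describe {
    fn describe<V: HashVisitor>(&self, visitor: &mut V);
}

trait HashVisitor {
    // Fixed-shape data: the visitor sees the whole shape at once, so it can
    // precompute per-position keys, pack small fields, or vectorize.
    fn fixed_words<const N: usize>(&mut self, words: [u64; N]);
    // Variable-length data is still announced as such, length and all.
    fn variable_bytes(&mut self, bytes: &[u8]);
}

struct Key {
    shard: u64,
    id: u64,
}

impl Describe for Key {
    fn describe<V: HashVisitor>(&self, visitor: &mut V) {
        // Two words, known at compile time -- two multiplications away from a hash.
        visitor.fixed_words([self.shard, self.id]);
    }
}
```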

What this API should look like, and whether it can be shoehorned into the existing interfaces, remains to be seen. I have not started work on the design yet, and perhaps this article is a bit premature, but I'd love to hear your thoughts on how I missed something really obvious (or, indeed, on how Rust is fast enough and no one cares).