Compare to strength_reduce

purplesyringa · Aug 24, 2024 · b3aa955
1 parent c25bd00
commit b3aa955

Showing 2 changed files with 12 additions and 9 deletions.
diff --git a/blog/division-is-hard-but-it-does-not-have-to-be/index.html b/blog/division-is-hard-but-it-does-not-have-to-be/index.html
@@ -53,7 +53,7 @@
     <span class=hljs-keyword>lea</span>     <span class=hljs-built_in>rcx</span>, [<span class=hljs-built_in>rax</span> + <span class=hljs-number>59</span>]
     <span class=hljs-keyword>cmovb</span>   <span class=hljs-built_in>rax</span>, <span class=hljs-built_in>rcx</span>
     <span class=hljs-keyword>ret</span>
-</code></pre><hr><p>Oh, and it’s not like hard-coding <eq><math><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>59</mn></mrow></math></eq> was necessary. Two iterations suffice for any divisor <eq><math><mrow><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><msup><mn>2</mn><mn>32</mn></msup><mo>+</mo></mrow><mrow><mn>1</mn></mrow></math></eq>. Need more primes? Choose away, there’s a lot of them in the <eq><math><msup><mn>2</mn><mn>32</mn></msup></math></eq>-long region.<p>Need a smaller divisor? Three iterations work for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>6981461082631</mn></mrow></math></eq> (42.667 bits compared to 32 for two iterations), four for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>281472113362716</mn></mrow></math></eq> (48 bits). Sounds like a lot? That’s still better than <code>__umodti3</code>.<p>And this method works for division too, not just modulo:<pre><code class=language-rust><span class=hljs-keyword>fn</span> <span class="hljs-title function_">divide</span>(<span class=hljs-keyword>mut</span> n: <span class=hljs-type>u128</span>) <span class=hljs-punctuation>-></span> <span class=hljs-type>u128</span> {
+</code></pre><hr><p>Oh, and it’s not like hard-coding <eq><math><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>59</mn></mrow></math></eq> was necessary. Two iterations suffice for any divisor <eq><math><mrow><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><msup><mn>2</mn><mn>32</mn></msup><mo>+</mo></mrow><mrow><mn>1</mn></mrow></math></eq>. Need more primes? Choose away, there’s a lot of them in the <eq><math><msup><mn>2</mn><mn>32</mn></msup></math></eq>-long region.<p>Need a smaller divisor? Three iterations work for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>6981461082631</mn></mrow></math></eq> (42.667 bits compared to 32 for two iterations), four for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>281472113362716</mn></mrow></math></eq> (48 bits). Sounds like a lot? That’s still better than <code>__umodti3</code>. Sure, it’s not universal, but still covers important usecases.<p>And this method works for division too, not just modulo:<pre><code class=language-rust><span class=hljs-keyword>fn</span> <span class="hljs-title function_">divide</span>(<span class=hljs-keyword>mut</span> n: <span class=hljs-type>u128</span>) <span class=hljs-punctuation>-></span> <span class=hljs-type>u128</span> {
     <span class=hljs-keyword>let</span> <span class=hljs-keyword>mut </span><span class=hljs-variable>quotient</span> = n >> <span class=hljs-number>64</span>;
     n = n % (<span class=hljs-number>1</span> << <span class=hljs-number>64</span>) + (n >> <span class=hljs-number>64</span>) * <span class=hljs-number>59</span>;
     quotient += n >> <span class=hljs-number>64</span>;
@@ -135,7 +135,7 @@
     }
     quotient
 }
-</code></pre><div class=table-wrapper><table><thead><tr><th>Test<th>Time/iteration (ns)<th>Speedup<tbody><tr><td><code>modulo_naive</code><td>25.421<td>(base)<tr><td><code>modulo_optimized</code><td>2.6755<td>9.5x<tr><td><code>reduce</code><td>2.2016<td>11.5x<tr><td><code>divide_naive</code><td>25.366<td>(base)<tr><td><code>divide_optimized</code><td>2.8677<td>8.8x</table></div><p class=next-group><span class=side-header><span>So what?</span></span>In all honesty, this is not immediately useful when applied to rolling hashes. <code>reduce</code> is still a little slower than two <code>u64 % u32</code> computations, so if calculating the hash modulo two 32-bit primes rather than one 64-bit prime suffices for you, do that. Still, if you need the best guaranteed collision rate as fast as possible, this is the way.<p>It’s a free optimization for compilers to perform too. It’s quite possible that I’m not just unfamiliar with practical applications. Also, hey, it’s one more trick you might be able to apply elsewhere now that you’ve seen it.</div></section><footer><div class=viewport-container><h2>Made with my own bare hands (why.)</h2></div></footer><script>window.addEventListener("keydown", e => {
+</code></pre><p>I’m also going to compare to the <a href=https://lib.rs/strength_reduce/><code>strength_reduce</code> crate</a> to simulate the same optimizations that compilers perform with <code>u64 % u32</code>. I’m compiling with <code>-C target-cpu=native</code>.<div class=table-wrapper><table><thead><tr><th>Test<th>Time/iteration (ns)<th>Speedup<tbody><tr><td><code>modulo_naive</code><td>25.440<td>(base)<tr><td><code>modulo_strength_reduce</code><td>4.9672<td>5.1x<tr><td><code>modulo_optimized</code><td>2.5847<td>9.8x<tr><td><code>reduce</code><td>2.1746<td>11.7x<tr><td><code>divide_naive</code><td>25.460<td>(base)<tr><td><code>divide_strength_reduce</code><td>5.4451<td>4.7x<tr><td><code>divide_optimized</code><td>2.7730<td>9.2x</table></div><p class=next-group><span class=side-header><span>So what?</span></span>In all honesty, this is not immediately useful when applied to rolling hashes. <code>reduce</code> is still a little slower than two <code>u64 % u32</code> computations, so if calculating the hash modulo two 32-bit primes rather than one 64-bit prime suffices for you, do that. Still, if you need the best guaranteed collision rate as fast as possible, this is the way.<p>It’s a free optimization for compilers to perform too. It’s quite possible that I’m not just unfamiliar with practical applications. Also, hey, it’s one more trick you might be able to apply elsewhere now that you’ve seen it.</div></section><footer><div class=viewport-container><h2>Made with my own bare hands (why.)</h2></div></footer><script>window.addEventListener("keydown", e => {
 				if (e.ctrlKey && e.key === "Enter") {
 					window.open("https://github.com/purplesyringa/site/edit/master/blog/division-is-hard-but-it-does-not-have-to-be/index.md", "_blank");
 				}

diff --git a/blog/division-is-hard-but-it-does-not-have-to-be/index.md b/blog/division-is-hard-but-it-does-not-have-to-be/index.md
@@ -105,7 +105,7 @@ modulo:
 
 Oh, and it's not like hard-coding $2^{64} - 59$ was necessary. Two iterations suffice for any divisor $\ge 2^{64} - 2^{32} + 1$. Need more primes? Choose away, there's a lot of them in the $2^{32}$-long region.
 
-Need a smaller divisor? Three iterations work for $n \ge 2^{64} - 6981461082631$ (42.667 bits compared to 32 for two iterations), four for $n \ge 2^{64} - 281472113362716$ (48 bits). Sounds like a lot? That's still better than `__umodti3`.
+Need a smaller divisor? Three iterations work for $n \ge 2^{64} - 6981461082631$ (42.667 bits compared to 32 for two iterations), four for $n \ge 2^{64} - 281472113362716$ (48 bits). Sounds like a lot? That's still better than `__umodti3`. Sure, it's not universal, but still covers important usecases.
 
 And this method works for division too, not just modulo:
 
@@ -236,14 +236,17 @@ fn divide_optimized(mut n: u128) -> u128 {
 }
 ```
 
+I'm also going to compare to the [`strength_reduce` crate](https://lib.rs/strength_reduce/) to simulate the same optimizations that compilers perform with `u64 % u32`. I'm compiling with `-C target-cpu=native`.
+
 |Test                    |Time/iteration (ns)|Speedup                  |
 |------------------------|-------------------|-------------------------|
-|`modulo_naive`          |25.421             |(base)                   |
-|`modulo_optimized`      |2.6755             |9.5x                     |
-|`reduce`                |2.2016             |11.5x                    |
-|`divide_naive`          |25.366             |(base)                   |
-|`divide_optimized`      |2.8677             |8.8x                     |
-
+|`modulo_naive`          |25.440             |(base)                   |
+|`modulo_strength_reduce`|4.9672             |5.1x                     |
+|`modulo_optimized`      |2.5847             |9.8x                     |
+|`reduce`                |2.1746             |11.7x                    |
+|`divide_naive`          |25.460             |(base)                   |
+|`divide_strength_reduce`|5.4451             |4.7x                     |
+|`divide_optimized`      |2.7730             |9.2x                     |
 
 
 ### So what?
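
For readers of this diff who don't have the full post at hand, here is a minimal sketch of the folding idea the benchmarked functions rely on, assuming the divisor $2^{64} - 59$ from the post. The names `modulo_folded`, `modulo_naive`, `C`, `P`, and `MASK` are hypothetical and are not taken from the post's code; this illustrates the technique and is not the author's exact implementation.

```rust
// Hypothetical constants: the divisor is P = 2^64 - 59, as in the post.
const C: u128 = 59;
const P: u128 = (1u128 << 64) - C;
const MASK: u128 = (1u128 << 64) - 1;

/// Reference implementation; for u128 this typically lowers to a call to `__umodti3`.
fn modulo_naive(n: u128) -> u64 {
    (n % P) as u64
}

/// Folding sketch: since 2^64 ≡ 59 (mod P), hi * 2^64 + lo ≡ hi * 59 + lo (mod P).
fn modulo_folded(mut n: u128) -> u64 {
    // First fold: the result is below 60 * 2^64, so the new high half is at most 59.
    n = (n & MASK) + (n >> 64) * C;
    // Second fold: the result is below 2^64 + 59 * 59, which is less than 2 * P.
    n = (n & MASK) + (n >> 64) * C;
    // A single conditional subtraction completes the reduction.
    if n >= P {
        n -= P;
    }
    n as u64
}
```

For example, `modulo_folded(1 << 64)` returns 59, because $2^{64} = P + 59$. The two folds correspond to the "two iterations" mentioned in the changed paragraph.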