Compare to strength_reduce
purplesyringa committed Aug 24, 2024
1 parent c25bd00 commit b3aa955
Showing 2 changed files with 12 additions and 9 deletions.
4 changes: 2 additions & 2 deletions blog/division-is-hard-but-it-does-not-have-to-be/index.html
@@ -53,7 +53,7 @@
<span class=hljs-keyword>lea</span> <span class=hljs-built_in>rcx</span>, [<span class=hljs-built_in>rax</span> + <span class=hljs-number>59</span>]
<span class=hljs-keyword>cmovb</span> <span class=hljs-built_in>rax</span>, <span class=hljs-built_in>rcx</span>
<span class=hljs-keyword>ret</span>
-</code></pre><hr><p>Oh, and it’s not like hard-coding <eq><math><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>59</mn></mrow></math></eq> was necessary. Two iterations suffice for any divisor <eq><math><mrow><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><msup><mn>2</mn><mn>32</mn></msup><mo>+</mo></mrow><mrow><mn>1</mn></mrow></math></eq>. Need more primes? Choose away, there’s a lot of them in the <eq><math><msup><mn>2</mn><mn>32</mn></msup></math></eq>-long region.<p>Need a smaller divisor? Three iterations work for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>6981461082631</mn></mrow></math></eq> (42.667 bits compared to 32 for two iterations), four for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>281472113362716</mn></mrow></math></eq> (48 bits). Sounds like a lot? That’s still better than <code>__umodti3</code>.<p>And this method works for division too, not just modulo:<pre><code class=language-rust><span class=hljs-keyword>fn</span> <span class="hljs-title function_">divide</span>(<span class=hljs-keyword>mut</span> n: <span class=hljs-type>u128</span>) <span class=hljs-punctuation>-></span> <span class=hljs-type>u128</span> {
+</code></pre><hr><p>Oh, and it’s not like hard-coding <eq><math><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>59</mn></mrow></math></eq> was necessary. Two iterations suffice for any divisor <eq><math><mrow><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><msup><mn>2</mn><mn>32</mn></msup><mo>+</mo></mrow><mrow><mn>1</mn></mrow></math></eq>. Need more primes? Choose away, there’s a lot of them in the <eq><math><msup><mn>2</mn><mn>32</mn></msup></math></eq>-long region.<p>Need a smaller divisor? Three iterations work for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>6981461082631</mn></mrow></math></eq> (42.667 bits compared to 32 for two iterations), four for <eq><math><mrow><mi>n</mi><mo>≥</mo></mrow><mrow><msup><mn>2</mn><mn>64</mn></msup><mo>−</mo></mrow><mrow><mn>281472113362716</mn></mrow></math></eq> (48 bits). Sounds like a lot? That’s still better than <code>__umodti3</code>. Sure, it’s not universal, but it still covers important use cases.<p>And this method works for division too, not just modulo:<pre><code class=language-rust><span class=hljs-keyword>fn</span> <span class="hljs-title function_">divide</span>(<span class=hljs-keyword>mut</span> n: <span class=hljs-type>u128</span>) <span class=hljs-punctuation>-></span> <span class=hljs-type>u128</span> {
<span class=hljs-keyword>let</span> <span class=hljs-keyword>mut </span><span class=hljs-variable>quotient</span> = n >> <span class=hljs-number>64</span>;
n = n % (<span class=hljs-number>1</span> << <span class=hljs-number>64</span>) + (n >> <span class=hljs-number>64</span>) * <span class=hljs-number>59</span>;
quotient += n >> <span class=hljs-number>64</span>;
@@ -135,7 +135,7 @@
}
quotient
}
-</code></pre><div class=table-wrapper><table><thead><tr><th>Test<th>Time/iteration (ns)<th>Speedup<tbody><tr><td><code>modulo_naive</code><td>25.421<td>(base)<tr><td><code>modulo_optimized</code><td>2.6755<td>9.5x<tr><td><code>reduce</code><td>2.2016<td>11.5x<tr><td><code>divide_naive</code><td>25.366<td>(base)<tr><td><code>divide_optimized</code><td>2.8677<td>8.8x</table></div><p class=next-group><span class=side-header><span>So what?</span></span>In all honesty, this is not immediately useful when applied to rolling hashes. <code>reduce</code> is still a little slower than two <code>u64 % u32</code> computations, so if calculating the hash modulo two 32-bit primes rather than one 64-bit prime suffices for you, do that. Still, if you need the best guaranteed collision rate as fast as possible, this is the way.<p>It’s a free optimization for compilers to perform too. It’s quite possible that I’m not just unfamiliar with practical applications. Also, hey, it’s one more trick you might be able to apply elsewhere now that you’ve seen it.</div></section><footer><div class=viewport-container><h2>Made with my own bare hands (why.)</h2></div></footer><script>window.addEventListener("keydown", e => {
+</code></pre><p>I’m also going to compare to the <a href=https://lib.rs/strength_reduce/><code>strength_reduce</code> crate</a> to simulate the same optimizations that compilers perform with <code>u64 % u32</code>. I’m compiling with <code>-C target-cpu=native</code>.<div class=table-wrapper><table><thead><tr><th>Test<th>Time/iteration (ns)<th>Speedup<tbody><tr><td><code>modulo_naive</code><td>25.440<td>(base)<tr><td><code>modulo_strength_reduce</code><td>4.9672<td>5.1x<tr><td><code>modulo_optimized</code><td>2.5847<td>9.8x<tr><td><code>reduce</code><td>2.1746<td>11.7x<tr><td><code>divide_naive</code><td>25.460<td>(base)<tr><td><code>divide_strength_reduce</code><td>5.4451<td>4.7x<tr><td><code>divide_optimized</code><td>2.7730<td>9.2x</table></div><p class=next-group><span class=side-header><span>So what?</span></span>In all honesty, this is not immediately useful when applied to rolling hashes. <code>reduce</code> is still a little slower than two <code>u64 % u32</code> computations, so if calculating the hash modulo two 32-bit primes rather than one 64-bit prime suffices for you, do that. Still, if you need the best guaranteed collision rate as fast as possible, this is the way.<p>It’s a free optimization for compilers to perform too. It’s quite possible that I’m not just unfamiliar with practical applications. Also, hey, it’s one more trick you might be able to apply elsewhere now that you’ve seen it.</div></section><footer><div class=viewport-container><h2>Made with my own bare hands (why.)</h2></div></footer><script>window.addEventListener("keydown", e => {
if (e.ctrlKey && e.key === "Enter") {
window.open("https://github.com/purplesyringa/site/edit/master/blog/division-is-hard-but-it-does-not-have-to-be/index.md", "_blank");
}
17 changes: 10 additions & 7 deletions blog/division-is-hard-but-it-does-not-have-to-be/index.md
@@ -105,7 +105,7 @@ modulo:

Oh, and it's not like hard-coding $2^{64} - 59$ was necessary. Two iterations suffice for any divisor $\ge 2^{64} - 2^{32} + 1$. Need more primes? Choose away, there's a lot of them in the $2^{32}$-long region.

-Need a smaller divisor? Three iterations work for $n \ge 2^{64} - 6981461082631$ (42.667 bits compared to 32 for two iterations), four for $n \ge 2^{64} - 281472113362716$ (48 bits). Sounds like a lot? That's still better than `__umodti3`.
+Need a smaller divisor? Three iterations work for $n \ge 2^{64} - 6981461082631$ (42.667 bits compared to 32 for two iterations), four for $n \ge 2^{64} - 281472113362716$ (48 bits). Sounds like a lot? That's still better than `__umodti3`. Sure, it's not universal, but it still covers important use cases.
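
To make the two-iteration claim concrete, here is a minimal sketch that generalizes the reduction from $2^{64} - 59$ to any divisor $2^{64} - c$ with $1 \le c < 2^{32}$. The name `modulo_near_pow64` and the parameter `c` are made up for illustration, and the final comparison is done in `u128` for clarity rather than speed, so this is not the post's optimized code.

```rust
/// Computes `n % p` for `p = 2^64 - c`, assuming `1 <= c < 2^32`.
/// `modulo_near_pow64` and `c` are illustrative names, not taken from the post.
fn modulo_near_pow64(n: u128, c: u64) -> u64 {
    assert!(c >= 1 && c < (1u64 << 32), "two folds are only enough for c < 2^32");
    let c = u128::from(c);
    let p = (1u128 << 64) - c;
    let low_mask = (1u128 << 64) - 1;

    // 2^64 ≡ c (mod p), so hi * 2^64 + lo ≡ hi * c + lo (mod p).
    // First fold: the result is below 2^96, so its high part fits in 32 bits.
    let n = (n & low_mask) + (n >> 64) * c;
    // Second fold: at most (2^64 - 1) + (2^32 - 1)^2 = 2^65 - 2^33, which is below 2p.
    let n = (n & low_mask) + (n >> 64) * c;

    // The value is now in [0, 2p), so one conditional subtraction finishes the job.
    let r = if n >= p { n - p } else { n };
    r as u64
}
```

For instance, `modulo_near_pow64(n, 59)` should agree with `n % ((1 << 64) - 59)` for every `u128` input.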

And this method works for division too, not just modulo:

@@ -236,14 +236,17 @@ fn divide_optimized(mut n: u128) -> u128 {
}
```
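
The same quotient bookkeeping extends to other divisors of the form $2^{64} - c$. A hedged sketch, assuming $1 \le c < 2^{32}$ so that two folds are enough; `divmod_near_pow64` is a hypothetical helper, not one of the benchmarked functions:

```rust
/// Returns `(n / p, n % p)` for `p = 2^64 - c`, assuming `1 <= c < 2^32`.
/// Hypothetical helper for illustration only.
fn divmod_near_pow64(mut n: u128, c: u64) -> (u128, u64) {
    assert!(c >= 1 && c < (1u64 << 32), "two folds are only enough for c < 2^32");
    let c = u128::from(c);
    let p = (1u128 << 64) - c;
    let low_mask = (1u128 << 64) - 1;
    let mut quotient = 0u128;

    // hi * 2^64 + lo = hi * p + (hi * c + lo):
    // each fold moves `hi` whole copies of p out of n and into the quotient.
    for _ in 0..2 {
        let hi = n >> 64;
        quotient += hi;
        n = (n & low_mask) + hi * c;
    }

    // After two folds n < 2p, so at most one more copy of p fits.
    if n >= p {
        quotient += 1;
        n -= p;
    }
    (quotient, n as u64)
}
```

The invariant `original_n == quotient * p + n` holds after every step, which is why the remainder falls out for free alongside the quotient.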

+I'm also going to compare to the [`strength_reduce` crate](https://lib.rs/strength_reduce/) to simulate the same optimizations that compilers perform with `u64 % u32`. I'm compiling with `-C target-cpu=native`.
+
|Test |Time/iteration (ns)|Speedup |
|------------------------|-------------------|-------------------------|
-|`modulo_naive` |25.421 |(base) |
-|`modulo_optimized` |2.6755 |9.5x |
-|`reduce` |2.2016 |11.5x |
-|`divide_naive` |25.366 |(base) |
-|`divide_optimized` |2.8677 |8.8x |

+|`modulo_naive` |25.440 |(base) |
+|`modulo_strength_reduce`|4.9672 |5.1x |
+|`modulo_optimized` |2.5847 |9.8x |
+|`reduce` |2.1746 |11.7x |
+|`divide_naive` |25.460 |(base) |
+|`divide_strength_reduce`|5.4451 |4.7x |
+|`divide_optimized` |2.7730 |9.2x |


### So what?