Optimize GF(2^16) #48
We might also explore GF(2^12), handled with either 4096-entry log and exp tables, one 4KB 64 x 64 multiplication table for GF(2^6) plus an irreducible polynomial of degree 2, or one 256-byte 16 x 16 multiplication table for GF(2^4) plus an irreducible polynomial of degree 3. We'd encode 3 bytes from chunks into two elements of GF(2^12), meaning either four polynomial coefficients over GF(2^6) or six polynomial coefficients over GF(2^4) (see the packing sketch below).
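Purely as an illustration of that packing (the function name and layout are hypothetical, nothing from the library), three bytes split evenly into two 12-bit elements:

```rust
// Hypothetical packing: 3 bytes = 24 bits = two 12-bit GF(2^12) elements,
// each returned in the low 12 bits of a u16.
fn pack_gf12(bytes: [u8; 3]) -> (u16, u16) {
    let a = ((bytes[0] as u16) << 4) | (bytes[1] >> 4) as u16;
    let b = (((bytes[1] & 0x0f) as u16) << 8) | bytes[2] as u16;
    (a, b)
}
```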
We'll need to measure the cache miss rate first before jumping to any conclusions. But the change does seem to reduce cache misses for the library when built in pure Rust, so I'm in favour of investigating in this direction. Also, if you know anyone with the required expertise, their contribution is very welcome.
I just saw the binary-field FFT trick paper http://www.math.clemson.edu/~sgao/papers/GM10.pdf via https://vitalik.ca/general/2019/05/12/fft.html. I think the 2016 article https://arxiv.org/pdf/1503.05761.pdf provides more of the theoretical background, especially via many more references. In that post, there are Python examples by @vbuterin that show quite a dramatic speed-up even around 2^10. I've no idea if this remains true in optimized Rust code, but 2^16 gives considerable catch-up time.
I happened to come across this topic, and thought I'd drop my thoughts here in case you're interested.
Typical L1 data cache on CPUs is 16-64KB, and L2 is typically 128-1024KB. A 64KB table won't fit in L1 cache, but usually will in L2. Because it's such a small table, and is cache efficient, it's unlikely you'll be limited by cache size. Rather, you'll be limited by L1 read/write throughput. Many CPUs can issue 2 reads and 1 write per cycle, so a naive lookup-and-store approach (like the first loop sketched below) already sits at this limit. Because of this, imposing more lookups will likely reduce performance; it's also the reason why a log/exp lookup strategy usually performs poorly. You can try to bypass this limit by aggregating reads/writes across two bytes, as in the second loop below:
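Here is a minimal Rust sketch of both loops, assuming a precomputed 256 x 256 MUL_TABLE and a fixed coefficient c; the function names are purely illustrative, not the library's API:

```rust
// Naive per-byte loop: one input read, one table read, one write per byte.
// With a 2-read/1-write per-cycle budget this caps out near 1 byte/cycle.
fn mul_slice_naive(table: &[[u8; 256]; 256], c: u8, input: &[u8], output: &mut [u8]) {
    let row = &table[c as usize];
    for (o, &i) in output.iter_mut().zip(input) {
        *o = row[i as usize];
    }
}

// Aggregated variant: two table reads feed a single 16-bit store, halving
// the number of write operations per byte processed.
fn mul_slice_paired(table: &[[u8; 256]; 256], c: u8, input: &[u8], output: &mut [u8]) {
    let row = &table[c as usize];
    for (o, i) in output.chunks_exact_mut(2).zip(input.chunks_exact(2)) {
        let pair = u16::from_le_bytes([row[i[0] as usize], row[i[1] as usize]]);
        o.copy_from_slice(&pair.to_le_bytes());
    }
}
```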
The idea works because the read/write limit is based on the number of operations, not bytes, so reading 2 bytes (or 16 bytes, for that matter) takes exactly the same amount of resources as reading 1 byte. You may notice by now that, to exceed this limit of 2 bytes/clock, you need to be able to process more than a byte at a time, which requires a completely different approach: SIMD.
The use of a 16-entry lookup is special, because a 16-entry byte table is exactly what a 128-bit SIMD register can hold, which allows the use of vectorized 'byte shuffle' instructions to effectively perform lookups without going to memory/cache. If interested, here's a list of fast techniques for performing multiplication.
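For concreteness, here is a hedged sketch of the in-register lookup on x86_64 using SSSE3's byte shuffle (PSHUFB). It assumes two preloaded 16-entry tables with tbl_lo[x] = c*x and tbl_hi[x] = c*(x << 4) in GF(2^8); the split works because GF(2^8) multiplication is linear over the two nibbles:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Multiplies 16 GF(2^8) bytes by a fixed constant c at once, using
/// in-register 16-entry lookups instead of memory loads. The tables
/// tbl_lo[x] = c*x and tbl_hi[x] = c*(x << 4) are precomputed elsewhere.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn gf8_mul_const_16(tbl_lo: __m128i, tbl_hi: __m128i, data: __m128i) -> __m128i {
    let mask = _mm_set1_epi8(0x0f);
    let lo = _mm_and_si128(data, mask);                    // low nibbles
    let hi = _mm_and_si128(_mm_srli_epi64(data, 4), mask); // high nibbles
    // Each PSHUFB performs sixteen parallel 16-entry table lookups,
    // entirely in registers; the results combine by linearity via xor.
    _mm_xor_si128(_mm_shuffle_epi8(tbl_lo, lo), _mm_shuffle_epi8(tbl_hi, hi))
}
```

Once the two table registers are loaded, the per-16-byte cost is a handful of register-only operations with no data-cache traffic for the lookups themselves.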
Thanks for the resources @animetosho! I had been wondering whether there would be a good starting article to reference when I start looking into this in a few weeks, so your detailed response was a very pleasant surprise.
Sorry for the long downtime guys, I've finally got a little breathing room, and will try to wrap most of the issues up in the next 2-3 weeks.
I think #80 covers the arithmetic side of this, but we should open a separate issue for the FFT side.
I doubt this 256 x 256 MUL_TABLE approach for GF(2^8) provides optimal performance, because the 64KB table eats lots of cache. We should ask someone who knows CPU caches better. Assuming cache does cause the poor performance though, I see two main approaches:
In GF(2^8), we implement a*b = exp(log a ++ log b) using two 256-byte tables, where ++ is ordinary integer addition (taken mod 255). In GF(2^16) = GF(2^8)[x]/I(x), we implement

(a1 x + a0) (b1 x + b0) = a1 b1 x^2 + (a1 b0 + a0 b1) x + a0 b0

but directly apply the reduction x^2 mod I(x).
In this, we compute log four times and exp five times, so nine lookups into 512 bytes of tables, and do seven additions.
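A minimal sketch of that tower multiply, assuming a hypothetical irreducible I(x) = x^2 + x + T over GF(2^8), so the x^2 term folds back with a single extra constant multiply; the table layout and log_t parameter are illustrative, not the library's actual parameters:

```rust
/// Sketch only: `log` maps nonzero GF(2^8) elements to discrete logs, and
/// `exp` repeats the 255-entry exponential cycle twice so that index sums
/// need no reduction. `log_t` is log of the constant T from I(x).
/// Zero halves are ignored for clarity; real code must special-case log(0).
fn gf16_mul(log: &[u8; 256], exp: &[u8; 512], log_t: usize, a: u16, b: u16) -> u16 {
    let (la1, la0) = (log[(a >> 8) as usize] as usize, log[(a & 0xff) as usize] as usize);
    let (lb1, lb0) = (log[(b >> 8) as usize] as usize, log[(b & 0xff) as usize] as usize);
    // Four logs above, five exps below: the nine lookups counted in the text.
    let hh = (la1 + lb1) % 255; // log of the x^2 coefficient a1*b1
    // Reduce with x^2 = x + T: the x^2 term contributes to both output bytes.
    let hi = exp[la1 + lb0] ^ exp[la0 + lb1] ^ exp[hh];
    let lo = exp[la0 + lb0] ^ exp[hh + log_t];
    ((hi as u16) << 8) | lo as u16
}
```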
In gf-complete, there are numerous implementations, but their paper mostly talks about doing GF(2^16) = GF(2^4)[x]/I(x) with I(x) irreducible of degree four. This gives them one 16 x 16 multiplication table from which they do 16 lookups, while everything else is bitwise xors (+), ands, and shifts.
We might imagine 16 lookups from such a small table to be faster than 9 lookups from a bigger table plus additions. We must still reduce a degree-six polynomial by the degree-four polynomial I(x), which sounds like 8 more lookups, but maybe choosing I(x) well reduces this; the sketch below shows the unreduced product this reduction would act on.
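A sketch of the unreduced product under this representation, assuming a hypothetical 256-byte mul4 table for GF(2^4); the folding of coefficients x^4..x^6 back down mod I(x) is deliberately elided, since its cost depends on the choice of I(x):

```rust
// 16 table lookups (4 x 4 partial products) build the unreduced degree-6
// product of two GF(2^16) elements viewed as degree-3 polynomials over
// GF(2^4); everything besides the lookups is xors, ands, and shifts.
fn gf16_mul_nibbles(mul4: &[[u8; 16]; 16], a: u16, b: u16) -> [u8; 7] {
    let nib = |v: u16, i: usize| ((v >> (4 * i)) & 0xf) as usize;
    let mut prod = [0u8; 7]; // coefficients of x^0 .. x^6
    for i in 0..4 {
        for j in 0..4 {
            prod[i + j] ^= mul4[nib(a, i)][nib(b, j)];
        }
    }
    prod // still needs reduction mod I(x) down to four coefficients
}
```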
I suspect the fastest approach might be treating this as some quotient of GF(2^4)[x][y], so that we cheaply map back into GF(2^8)[x]/I(x) and actually reduce there. I would not be overly surprised if this massive slew of bitwise operations beat the twice-as-large table and seven additions.