Optimize GF(2^16) #48

Closed
burdges opened this issue May 29, 2019 · 7 comments

Comments

@burdges
Collaborator

burdges commented May 29, 2019

I doubt this 256 x 256 MUL_TABLE approach for GF(2^8) provides optimal performance, because the 64KB table eats lots of cache. We should ask someone who knows CPU caches better. Assuming cache causes the poor performance though, I see two main approaches:

In GF(2^8), we implement a*b = exp(log a ++ log b) using two 256-byte tables, where ++ is integer addition (taken mod 255 for the exp look-up). In GF(2^16) = GF(2^8)[x]/I(x), we could implement

(a_1 x + a_0) (b_1 x + b_0)
 = a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0
 = exp(log a_1 ++ log b_1) x^2
   + (exp(log a_1 ++ log b_0) + exp(log a_0 ++ log b_1)) x
   + exp(log a_0 ++ log b_0)

but instead directly apply the reduction x^2 = 2x + 128 given by I(x) = x^2 + 2x + 128:

(a_1 x + a_0) (b_1 x + b_0)
 = (2 a_1 b_1 + a_1 b_0 + a_0 b_1) x + (128 a_1 b_1 + a_0 b_0)
 = (exp(log 2 ++ log a_1 ++ log b_1) + exp(log a_1 ++ log b_0) + exp(log a_0 ++ log b_1)) x
   + exp(log 128 ++ log a_1 ++ log b_1) + exp(log a_0 ++ log b_0)

In this, we compute log four times and exp five times, so nine look-ups into 512 bytes of tables, plus seven integer additions and three xors.
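For concreteness, a minimal Rust sketch of this nine-look-up multiply, assuming I(x) = x^2 + 2x + 128 and the common 0x11d field polynomial for GF(2^8) (under which 2 generates the multiplicative group, so log 2 = 1 and log 128 = 7); all names here are illustrative, not this crate's API. A real implementation would hoist the four log look-ups rather than repeating them inside the closure:

fn build_tables() -> ([u8; 256], [u8; 255]) {
    // log/exp tables for GF(2^8) with field polynomial 0x11d and generator 2
    let (mut log, mut exp) = ([0u8; 256], [0u8; 255]);
    let mut x: u16 = 1;
    for i in 0..255 {
        exp[i] = x as u8;
        log[x as usize] = i as u8;
        x <<= 1;
        if x & 0x100 != 0 {
            x ^= 0x11d; // reduce modulo x^8 + x^4 + x^3 + x^2 + 1
        }
    }
    (log, exp)
}

fn gf16_mul(log: &[u8; 256], exp: &[u8; 255], a: u16, b: u16) -> u16 {
    // exp(lc ++ log x ++ log y), where lc is the precomputed log of a constant
    let mul = |lc: usize, x: u8, y: u8| -> u8 {
        if x == 0 || y == 0 {
            return 0;
        }
        exp[(lc + log[x as usize] as usize + log[y as usize] as usize) % 255]
    };
    let (a1, a0) = ((a >> 8) as u8, a as u8);
    let (b1, b0) = ((b >> 8) as u8, b as u8);
    const LOG_2: usize = 1; // log 2 in the 0x11d field
    const LOG_128: usize = 7; // log 128 in the 0x11d field
    // x coefficient: 2 a1 b1 + a1 b0 + a0 b1; constant term: 128 a1 b1 + a0 b0
    let hi = mul(LOG_2, a1, b1) ^ mul(0, a1, b0) ^ mul(0, a0, b1);
    let lo = mul(LOG_128, a1, b1) ^ mul(0, a0, b0);
    ((hi as u16) << 8) | lo as u16
}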

In gf-complete, there are numerous implementations, but their paper mostly talks about doing GF(2^16) = GF(2^4)[x]/I(x) with I(x) irreducible of degree four. That gives them one 16 x 16 multiplication table from which they do 16 look-ups; everything else is bitwise xors (+), ands, and shifts.

(a_3 x^3 + a_2 x^2 + a_1 x + a_0) (b_3 x^3 + b_2 x^2 + b_1 x + b_0)
 = a_3 b_3 x^6
   + (a_3 b_2 + a_2 b_3) x^5
   + (a_3 b_1 + a_2 b_2 + a_1 b_3) x^4
   + (a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3) x^3
   + (a_2 b_0 + a_1 b_1 + a_0 b_2) x^2
   + (a_1 b_0 + a_0 b_1) x
   + a_0 b_0

We might imagine 16 look-ups from such a small table to be faster than 9 look-ups from a bigger table plus additions. We must still reduce a degree-six polynomial by the degree-four polynomial I(x), which sounds like 8 more look-ups, but maybe choosing I(x) well reduces this.
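As a sketch of that 16-look-up schoolbook product in Rust (the table name and layout are hypothetical, and the degree-six result still needs reduction modulo I(x)):

fn poly_mul_gf4(mul4: &[[u8; 16]; 16], a: [u8; 4], b: [u8; 4]) -> [u8; 7] {
    // mul4 is an assumed 16 x 16 table of GF(2^4) products (256 bytes);
    // a and b hold the four GF(2^4) coefficients of each GF(2^16) element
    let mut c = [0u8; 7];
    for i in 0..4 {
        for j in 0..4 {
            // one table look-up per coefficient pair: 16 in total
            c[i + j] ^= mul4[a[i] as usize][b[j] as usize];
        }
    }
    c // unreduced degree-six product, still to be reduced modulo I(x)
}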

I suspect the fastest approach might be treating this as some quotient of GF(2^4)[x][y], so that we cheaply map back into GF(2^8)[x]/I(x) and actually reduce there. I would not be overly surprised if this massive slew of bitwise operations beat the twice-larger table and seven additions.

@burdges
Collaborator Author

burdges commented May 29, 2019

We might also explore GF(2^12), handled with either 4096-entry log and exp tables, one 4KB 64 x 64 multiplication table for GF(2^6) plus an irreducible polynomial of degree 2, or one 256-byte 16 x 16 multiplication table for GF(2^4) plus an irreducible polynomial of degree 3. We'd encode 3 bytes from chunks into two elements of GF(2^12), meaning either four polynomial coefficients over GF(2^6) or six polynomial coefficients over GF(2^4).
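A small sketch of that packing in Rust (the nibble order here is a free choice; names are illustrative):

fn pack3(bytes: [u8; 3]) -> (u16, u16) {
    // 3 bytes = 24 bits = two 12-bit GF(2^12) elements
    let lo = (bytes[0] as u16) | (((bytes[1] & 0x0f) as u16) << 8);
    let hi = ((bytes[1] >> 4) as u16) | ((bytes[2] as u16) << 4);
    (lo, hi)
}

fn unpack3(lo: u16, hi: u16) -> [u8; 3] {
    [
        lo as u8,
        (((lo >> 8) as u8) & 0x0f) | (((hi as u8) & 0x0f) << 4),
        (hi >> 4) as u8,
    ]
}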

@darrenldl
Contributor

> I doubt this 256 x 256 MUL_TABLE approach for GF(2^8) provides optimal performance because 64KB eats lots of cache. We should ask someone who knows CPU caches better.

We'll need to measure the cache miss rate before jumping to any conclusions anyway. But the change does seem to reduce cache misses for the library when built in pure Rust, so I'm in favour of investigating in this direction.

Also, if you know anyone with the expertise you have in mind, their contribution is very welcome.

@burdges
Collaborator Author

burdges commented Jun 2, 2019

I just saw the binary-field FFT trick paper http://www.math.clemson.edu/~sgao/papers/GM10.pdf via https://vitalik.ca/general/2019/05/12/fft.html. I think the 2016 article https://arxiv.org/pdf/1503.05761.pdf provides more of the theoretical background, especially via its many references.

In that post, there are Python examples by @vbuterin that show quite a dramatic speed-up even around 2^10. I've no idea whether this remains true in optimized Rust code, but 2^16 gives the FFT considerable room to catch up.

@animetosho

I happened to come across this topic, and thought I'd drop my thoughts here in case you're interested.

> because 64KB eats lots of cache

Typical L1 data caches on CPUs are 16-64KB, and L2 is typically 128-1024KB. 64KB won't fit in L1 cache, but usually will in L2.
However, that's not so important, as it's the access pattern which ultimately matters. When multiplying a region by a constant, you're only looking at a 256-byte slice of the table, which easily fits in L1 cache, so look-ups remain cache-efficient despite the seemingly large 64KB table.

Because it's such a small working set, and is cache-efficient, it's unlikely you'll be limited by cache size. Rather, you'll be limited by L1 read/write throughput. Many CPUs can issue 2 reads and 1 write per cycle, so a naive lookup-and-store approach like output[i] = table[input[i]] will probably hit the write limit. This means the approach can never exceed 1 byte per cycle, so a 3GHz processor could only give a theoretical maximum of 3GB/s when writing 1 byte at a time.
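In Rust, that naive per-byte loop might look like the following sketch (names are illustrative, not this crate's code):

fn mul_slice_naive(table: &[u8; 256], input: &[u8], output: &mut [u8]) {
    // table is the 256-byte MUL_TABLE row for the chosen constant;
    // each iteration issues one L1 read and one 1-byte store
    for (out, &byte) in output.iter_mut().zip(input) {
        *out = table[byte as usize];
    }
}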

Because of this, imposing more lookups will likely reduce performance. It's also the reason why a log/exp lookup strategy usually performs poorly.

You can try to bypass this limit by aggregating reads/writes across two bytes, for example:

let input_tmp = input_as_16bit[i];
output_as_16bit[i] = (table[(input_tmp & 0xff) as usize] as u16)
    | ((table[(input_tmp >> 8) as usize] as u16) << 8);

The idea works because the read/write limit is based on the number of operations, not bytes, so reading 2 bytes (or 16 bytes, for that matter) takes exactly the same amount of resources as reading 1 byte.
However, you will eventually hit the read limit of 2 reads/clock. Because each lookup is done on a byte, the theoretical maximum speed here is 2 bytes/clock. It's still up to twice as fast as the naive method, though.

You may notice by now that, to exceed this limit of 2 bytes/clock, you need to be able to process more than a byte at a time, which requires a completely different approach: SIMD.

> their paper mostly talks about doing GF(2^16) = GF(2^4)[x]/I(x) with I(x) irreducible of degree four, which gives them one 16 x 16 multiplication table from which they do 16 look ups

The use of a 16-entry lookup is special, because that's what a 128-bit SIMD register can hold, which allows the use of vectorized 'byte shuffle' instructions to effectively perform lookups without going to memory/cache.
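As an illustration, here is a sketch of that shuffle-based lookup in Rust, multiplying 16 GF(2^8) bytes by a constant at once (assuming x86-64 with SSSE3; the nibble product tables and all names are hypothetical):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// lo_products[n] = c * n and hi_products[n] = c * (n << 4) in GF(2^8),
// precomputed for the constant c and loaded into two 128-bit registers
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn mul_16_bytes(lo_products: __m128i, hi_products: __m128i, data: __m128i) -> __m128i {
    let mask = _mm_set1_epi8(0x0f);
    let lo = _mm_and_si128(data, mask); // low nibble of each byte
    let hi = _mm_and_si128(_mm_srli_epi64::<4>(data), mask); // high nibble
    // each shuffle does 16 parallel 4-bit table look-ups in registers;
    // c*a = c*lo(a) ^ c*hi(a) because multiplication is GF(2)-linear
    _mm_xor_si128(
        _mm_shuffle_epi8(lo_products, lo),
        _mm_shuffle_epi8(hi_products, hi),
    )
}

This processes 16 bytes with a handful of register operations, which is how the 16-entry tables avoid memory look-ups entirely.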

If interested, here's a list of fast techniques for performing multiplication.

@darrenldl
Contributor

Thanks for the resources, @animetosho! I had been wondering whether there would be a good starting article to reference when I start looking into this in a few weeks; your detailed response was a very pleasant surprise.

@darrenldl
Contributor

Sorry for the long downtime, guys. I've finally got a little breathing room, and will try to wrap up most of the issues in the next 2-3 weeks.

darrenldl added this to the 4.0.2 (?) release milestone Sep 17, 2019
@burdges
Collaborator Author

burdges commented Feb 4, 2021

I think #80 covers the arithmetic side of this, but we should open a separate issue for the FFT side.
