Optimize GF(2^16) #48

Closed
burdges opened this issue May 29, 2019 · 7 comments

Comments

@burdges
Collaborator

burdges commented May 29, 2019

I doubt this 256 x 256 MUL_TABLE approach for GF(2^8) provides optimal performance, because the 64KB table eats lots of cache. We should ask someone who knows CPU caches better. Assuming cache causes the poor performance though, I see two main approaches:

In GF(2^8), we implement a*b = exp(log a ++ log b) using two 256-byte tables, where ++ is integer addition (taken mod 255 for the exp look-up). In GF(2^16) = GF(2^8)[x]/I(x), we could implement

(a_1 x + a_0) (b_1 x + b_0)
 = a_1 b_1 x^2 + (a_1 b_0 + a_0 b_1) x + a_0 b_0
 = exp(log a_1 ++ log b_1) x^2
   + (exp(log a_1 ++ log b_0) + exp(log a_0 ++ log b_1)) x
   + exp(log a_0 ++ log b_0)

but instead directly apply the reduction x^2 = 2x + 128 given by I(x) = x^2 + 2x + 128:

(a_1 x + a_0) (b_1 x + b_0)
 = (2 a_1 b_1 + a_1 b_0 + a_0 b_1) x + (128 a_1 b_1 + a_0 b_0)
 = (exp(log 2 ++ log a_1 ++ log b_1) + exp(log a_1 ++ log b_0) + exp(log a_0 ++ log b_1)) x
   + exp(log 128 ++ log a_1 ++ log b_1) + exp(log a_0 ++ log b_0)

In this, we compute log four times and exp five times, so nine look-ups into 512 bytes of tables, plus seven integer additions and three xors.
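For concreteness, a minimal Rust sketch of this nine-look-up multiply, assuming I(x) = x^2 + 2x + 128 and the common 0x11d field polynomial for GF(2^8) (under which 2 generates the multiplicative group, so log 2 = 1 and log 128 = 7); all names here are illustrative, not this crate's API. A real implementation would hoist the four log look-ups rather than repeating them inside the closure:

fn build_tables() -> ([u8; 256], [u8; 255]) {
    // log/exp tables for GF(2^8) with field polynomial 0x11d and generator 2
    let (mut log, mut exp) = ([0u8; 256], [0u8; 255]);
    let mut x: u16 = 1;
    for i in 0..255 {
        exp[i] = x as u8;
        log[x as usize] = i as u8;
        x <<= 1;
        if x & 0x100 != 0 {
            x ^= 0x11d; // reduce modulo x^8 + x^4 + x^3 + x^2 + 1
        }
    }
    (log, exp)
}

fn gf16_mul(log: &[u8; 256], exp: &[u8; 255], a: u16, b: u16) -> u16 {
    // exp(lc ++ log x ++ log y), where lc is the precomputed log of a constant
    let mul = |lc: usize, x: u8, y: u8| -> u8 {
        if x == 0 || y == 0 {
            return 0;
        }
        exp[(lc + log[x as usize] as usize + log[y as usize] as usize) % 255]
    };
    let (a1, a0) = ((a >> 8) as u8, a as u8);
    let (b1, b0) = ((b >> 8) as u8, b as u8);
    const LOG_2: usize = 1; // log 2 in the 0x11d field
    const LOG_128: usize = 7; // log 128 in the 0x11d field
    // x coefficient: 2 a1 b1 + a1 b0 + a0 b1; constant term: 128 a1 b1 + a0 b0
    let hi = mul(LOG_2, a1, b1) ^ mul(0, a1, b0) ^ mul(0, a0, b1);
    let lo = mul(LOG_128, a1, b1) ^ mul(0, a0, b0);
    ((hi as u16) << 8) | lo as u16
}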

In gf-complete, there are numerous implementations, but their paper mostly talks about doing GF(2^16) = GF(2^4)[x]/I(x) with I(x) irreducible of degree four. That gives them one 16 x 16 multiplication table from which they do 16 look-ups; everything else is bitwise xors (+), ands, and shifts.

(a_3 x^3 + a_2 x^2 + a_1 x + a_0) (b_3 x^3 + b_2 x^2 + b_1 x + b_0)
 = a_3 b_3 x^6
   + (a_3 b_2 + a_2 b_3) x^5
   + (a_3 b_1 + a_2 b_2 + a_1 b_3) x^4
   + (a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3) x^3
   + (a_2 b_0 + a_1 b_1 + a_0 b_2) x^2
   + (a_1 b_0 + a_0 b_1) x
   + a_0 b_0

We might imagine 16 look-ups from such a small table to be faster than 9 look-ups from a bigger table plus additions. We must still reduce a degree-six polynomial by the degree-four polynomial I(x), which sounds like 8 more look-ups, but maybe choosing I(x) well reduces this.
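As a sketch of that 16-look-up schoolbook product in Rust (the table name and layout are hypothetical, and the degree-six result still needs reduction modulo I(x)):

fn poly_mul_gf4(mul4: &[[u8; 16]; 16], a: [u8; 4], b: [u8; 4]) -> [u8; 7] {
    // mul4 is an assumed 16 x 16 table of GF(2^4) products (256 bytes);
    // a and b hold the four GF(2^4) coefficients of each GF(2^16) element
    let mut c = [0u8; 7];
    for i in 0..4 {
        for j in 0..4 {
            // one table look-up per coefficient pair: 16 in total
            c[i + j] ^= mul4[a[i] as usize][b[j] as usize];
        }
    }
    c // unreduced degree-six product, still to be reduced modulo I(x)
}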

I suspect the fastest approach might be treating this as some quotient of GF(2^4)[x][y], so that we cheaply map back into GF(2^8)[x]/I(x) and actually reduce there. I would not be overly surprised if this massive slew of bitwise operations beat the twice-larger table and seven additions.

@burdges
Collaborator Author

burdges commented May 29, 2019

We might also explore GF(2^12), handled with either 4096-entry log and exp tables, one 4KB 64 x 64 multiplication table for GF(2^6) plus an irreducible polynomial of degree 2, or one 256-byte 16 x 16 multiplication table for GF(2^4) plus an irreducible polynomial of degree 3. We'd encode 3 bytes from chunks into two elements of GF(2^12), meaning either four polynomial coefficients over GF(2^6) or six polynomial coefficients over GF(2^4).
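A small sketch of that packing in Rust (the nibble order here is a free choice; names are illustrative):

fn pack3(bytes: [u8; 3]) -> (u16, u16) {
    // 3 bytes = 24 bits = two 12-bit GF(2^12) elements
    let lo = (bytes[0] as u16) | (((bytes[1] & 0x0f) as u16) << 8);
    let hi = ((bytes[1] >> 4) as u16) | ((bytes[2] as u16) << 4);
    (lo, hi)
}

fn unpack3(lo: u16, hi: u16) -> [u8; 3] {
    [
        lo as u8,
        (((lo >> 8) as u8) & 0x0f) | (((hi as u8) & 0x0f) << 4),
        (hi >> 4) as u8,
    ]
}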

@darrenldl
Contributor

> I doubt this 256 x 256 MUL_TABLE approach for GF(2^8) provides optimal performance because 64KB eats lots of cache. We should ask someone who knows CPU caches better.

We'll need to measure the cache miss rate before jumping to any conclusions anyway. But the change does seem to reduce cache misses for the library when built in pure Rust, so I'm in favour of investigating in this direction.

Also, if you know anyone with the expertise you have in mind, their contribution is very welcome.

@burdges
Collaborator Author

burdges commented Jun 2, 2019

I just saw the binary-field FFT trick paper http://www.math.clemson.edu/~sgao/papers/GM10.pdf via https://vitalik.ca/general/2019/05/12/fft.html. I think the 2016 article https://arxiv.org/pdf/1503.05761.pdf provides more of the theoretical background, especially via its many references.

In that post, there are Python examples by @vbuterin that show quite a dramatic speed-up even around 2^10. I've no idea whether this remains true in optimized Rust code, but 2^16 gives the FFT considerable room to catch up.

@animetosho

I happened to come across this topic, and thought I'd drop my thoughts here in case you're interested.

> because 64KB eats lots of cache

Typical L1 data caches on CPUs are 16-64KB, and L2 is typically 128-1024KB. 64KB won't fit in L1 cache, but usually will in L2.
However, that's not so important, as it's the access pattern which ultimately matters. When multiplying a region by a constant, you're only looking at a 256-byte slice of the table, which easily fits in L1 cache, so look-ups remain cache-efficient despite the seemingly large 64KB table.

Because it's such a small working set, and is cache-efficient, it's unlikely you'll be limited by cache size. Rather, you'll be limited by L1 read/write throughput. Many CPUs can issue 2 reads and 1 write per cycle, so a naive lookup-and-store approach like output[i] = table[input[i]] will probably hit the write limit. This means the approach can never exceed 1 byte per cycle, so a 3GHz processor could only give a theoretical maximum of 3GB/s when writing 1 byte at a time.
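In Rust, that naive per-byte loop might look like the following sketch (names are illustrative, not this crate's code):

fn mul_slice_naive(table: &[u8; 256], input: &[u8], output: &mut [u8]) {
    // table is the 256-byte MUL_TABLE row for the chosen constant;
    // each iteration issues one L1 read and one 1-byte store
    for (out, &byte) in output.iter_mut().zip(input) {
        *out = table[byte as usize];
    }
}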

Because of this, imposing more lookups will likely reduce performance. It's also the reason why a log/exp lookup strategy usually performs poorly.

You can try to bypass this limit by aggregating reads/writes across two bytes, for example:

let input_tmp = input_as_16bit[i];
output_as_16bit[i] = (table[(input_tmp & 0xff) as usize] as u16)
    | ((table[(input_tmp >> 8) as usize] as u16) << 8);

The idea works because the read/write limit is based on the number of operations, not bytes, so reading 2 bytes (or 16 bytes, for that matter) takes exactly the same amount of resources as reading 1 byte.
However, you will eventually hit the read limit of 2 reads/clock. Because each lookup is done on a byte, the theoretical maximum speed here is 2 bytes/clock. It's still up to twice as fast as the naive method, though.

You may notice by now that, to exceed this limit of 2 bytes/clock, you need to be able to process more than a byte at a time, which requires a completely different approach: SIMD.

> their paper mostly talks about doing GF(2^16) = GF(2^4)[x]/I(x) with I(x) irreducible of degree four, which gives them one 16 x 16 multiplication table from which they do 16 look ups

The use of a 16-entry lookup is special, because that's what a 128-bit SIMD register can hold, which allows the use of vectorized 'byte shuffle' instructions to effectively perform lookups without going to memory/cache.
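As an illustration, here is a sketch of that shuffle-based lookup in Rust, multiplying 16 GF(2^8) bytes by a constant at once (assuming x86-64 with SSSE3; the nibble product tables and all names are hypothetical):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// lo_products[n] = c * n and hi_products[n] = c * (n << 4) in GF(2^8),
// precomputed for the constant c and loaded into two 128-bit registers
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn mul_16_bytes(lo_products: __m128i, hi_products: __m128i, data: __m128i) -> __m128i {
    let mask = _mm_set1_epi8(0x0f);
    let lo = _mm_and_si128(data, mask); // low nibble of each byte
    let hi = _mm_and_si128(_mm_srli_epi64::<4>(data), mask); // high nibble
    // each shuffle does 16 parallel 4-bit table look-ups in registers;
    // c*a = c*lo(a) ^ c*hi(a) because multiplication is GF(2)-linear
    _mm_xor_si128(
        _mm_shuffle_epi8(lo_products, lo),
        _mm_shuffle_epi8(hi_products, hi),
    )
}

This processes 16 bytes with a handful of register operations, which is how the 16-entry tables avoid memory look-ups entirely.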

If interested, here's a list of fast techniques for performing multiplication.

@darrenldl
Contributor

Thanks for the resources, @animetosho! I had been wondering whether there would be a good starting article to reference when I start looking into this in a few weeks; your detailed response was a very pleasant surprise.

@darrenldl
Contributor

Sorry for the long downtime, guys. I've finally got a little breathing room, and will try to wrap up most of the issues in the next 2-3 weeks.

darrenldl added this to the 4.0.2 (?) release milestone Sep 17, 2019
@burdges
Collaborator Author

burdges commented Feb 4, 2021

I think #80 covers the arithmetic side of this, but we should open a separate issue for the FFT side.
