Optimize select_bit #52


@jmr jmr commented Aug 21, 2023

  1. AArch64 byte-wise popcount optimization

ARM NEON has a byte-wise popcount instruction, which helps to optimize
select_bit and PopCount::count. Use it for AArch64 (64-bit ARM).

15% speedup for Rank1, 4% for Select0 and 3% for Select1.
(60% for PopCount::count itself.)
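
As a rough illustration of the idea (not the code in this PR), the sketch below uses the NEON `vcnt_u8` intrinsic on AArch64 and a portable bit-twiddling fallback elsewhere. The function names `bytewise_popcount` and `popcount64` are hypothetical and chosen only for this example.

```cpp
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: byte k of the result holds the number of set bits
// in byte k of x.
inline std::uint64_t bytewise_popcount(std::uint64_t x) {
#if defined(__aarch64__)
  // vcnt_u8 counts the set bits of each of the 8 bytes in one instruction.
  return vget_lane_u64(vreinterpret_u64_u8(vcnt_u8(vcreate_u8(x))), 0);
#else
  // Portable fallback: classic SWAR reduction to per-byte counts.
  x = x - ((x >> 1) & 0x5555555555555555ULL);
  x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
  return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
#endif
}

// Hypothetical helper: total number of set bits in x.
inline unsigned popcount64(std::uint64_t x) {
#if defined(__aarch64__)
  // vaddv_u8 sums the 8 per-byte counts horizontally.
  return vaddv_u8(vcnt_u8(vcreate_u8(x)));
#else
  // Multiplying by 0x0101... accumulates all byte counts into the top byte.
  return static_cast<unsigned>(
      (bytewise_popcount(x) * 0x0101010101010101ULL) >> 56);
#endif
}
```

The per-byte counts are the useful part for select: a select routine can work from them directly instead of recomputing them.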

  2. Byte-serial 32-bit version

This gives a 9% speedup on select0 and 7% on select1.
(Tested on Pixel 3 in armeabi-v7a mode.)

This is likely because the branches of this unrolled linear
search are more predictable than the binary search that was
used previously.
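
A minimal sketch of what a byte-serial select can look like, assuming a single 32-bit word and a 0-based bit index. The name `select_in_word32` and the final bit-by-bit scan are illustrative assumptions, not the implementation in this PR.

```cpp
#include <cstdint>

// Hypothetical helper: position of the i-th (0-based) set bit of x.
// Precondition: x has more than i set bits.
inline unsigned select_in_word32(unsigned i, std::uint32_t x) {
  // Per-byte popcounts: byte k of counts = number of set bits in byte k of x.
  std::uint32_t counts = x - ((x >> 1) & 0x55555555u);
  counts = (counts & 0x33333333u) + ((counts >> 2) & 0x33333333u);
  counts = (counts + (counts >> 4)) & 0x0F0F0F0Fu;

  // Linear search over the bytes, lowest first. The trip count is fixed and
  // small, so a compiler can unroll it; if the bit is not in bytes 0..2 it
  // must be in byte 3.
  unsigned pos = 0;
  for (int k = 0; k < 3; ++k) {
    const unsigned c = (counts >> pos) & 0xFFu;
    if (i < c) break;  // the target bit lives in this byte
    i -= c;            // skip the set bits of this byte
    pos += 8;
  }

  // Locate the i-th set bit inside the selected byte.
  std::uint32_t byte = (x >> pos) & 0xFFu;
  for (; i > 0; --i) byte &= byte - 1;  // clear the i lowest set bits
  unsigned bit = 0;
  while (bit < 7 && ((byte >> bit) & 1u) == 0) ++bit;
  return pos + bit;
}
```

Each comparison in the scan depends only on a running count, which is the property the explanation above credits for the more predictable branches.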

  3. Use lookup table

Instead of computing (counts | MASK_80) - ((i + 1) * MASK_01),
we pre-compute a lookup table

PREFIX_SUM_OVERFLOW[i] = (0x80 - (i + 1)) * MASK_01 = (0x7F - i) * MASK_01

then use counts + PREFIX_SUM_OVERFLOW[i].

This uses a UInt64[64] lookup table (0.5 KiB). The trick is from:
Gog, Simon and Matthias Petri. “Optimized succinct data structures for
massive data.” Software: Practice and Experience 44 (2014): 1287-1314.

https://www.semanticscholar.org/paper/Optimized-succinct-data-structures-for-massive-data-Gog-Petri/c7e7f02f441ebcc0aeffdcad2964185926551ec3

This gives a 2-3% speedup for BitVector::select0/select1.
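
The equivalence can be checked with a small sketch. It assumes that each byte of `counts` is a prefix sum of byte popcounts and therefore at most 64, so `counts | MASK_80` equals `counts + MASK_80` and neither form carries or borrows across bytes. The struct wrapper and the function names `overflow_direct` and `overflow_table` are hypothetical; only `MASK_01`, `MASK_80` and `PREFIX_SUM_OVERFLOW` come from the description above.

```cpp
#include <cstdint>

constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;
constexpr std::uint64_t MASK_80 = 0x8080808080808080ULL;

// Hypothetical wrapper so the 64-entry table can be built at compile time.
struct PrefixSumOverflowTable {
  std::uint64_t v[64];
  constexpr PrefixSumOverflowTable() : v() {
    // PREFIX_SUM_OVERFLOW[i] = (0x7F - i) * MASK_01
    for (unsigned i = 0; i < 64; ++i) v[i] = (0x7FULL - i) * MASK_01;
  }
};
constexpr PrefixSumOverflowTable PREFIX_SUM_OVERFLOW{};

// Original form: byte k of the result has its high bit set
// iff byte k of counts is greater than i.
inline std::uint64_t overflow_direct(std::uint64_t counts, unsigned i) {
  return (counts | MASK_80) - ((i + 1) * MASK_01);
}

// Table form: one load and one add instead of an OR, a multiply and a subtract.
inline std::uint64_t overflow_table(std::uint64_t counts, unsigned i) {
  return counts + PREFIX_SUM_OVERFLOW.v[i];
}
```

For example, with `counts = 0x4038302820181008` (the prefix sums for an all-ones word) and `i = 8`, both forms set the high bit in every byte except the lowest, which says the bit with index 8 is in byte 1.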

jmr added 3 commits August 21, 2023 17:21
jmr commented Aug 21, 2023

@glebm
