Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix undefined behaviour of char bit shifting when combining classic i… #19

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

Zhicheng-Liu
Copy link

…ndices

Previously when combining multiple classic indices into a single classic
index, the contents of source indices are read in as char. During the
interleaving process, depending on the current position of the destination
index, both left and right shifts on the next char could be performed.

However, there are a few undefined behaviours that could affect the results
depending on the platform:

  1. The signedness of a char is an undefined behaviour. Hence when
    bit shifting, the usual arithmetic conversion performed on the char is
    undefined. The char could be promoted to either signed int or unsigned
    int.
  2. If the char is treated as signed int, the bit shifting (both left and
    right) is also undefined in pre-c++20 standards. The behaviour is platform
    dependent.

This change fixes the issue by declare the contents read from source
indices as unsigned char.

giang-nghg and others added 7 commits March 30, 2021 22:13
* Expose classic_combine function

* Add exposition code

* Fix syntax error

* Expose calc_signature_size
* Expose classic_combine function

* Add exposition code

* Fix syntax error

* Expose calc_signature_size

* Expose classic combine as cmd instead
…ndices

Previously when combining multiple classic indices into a single classic
index, the contents of source indices are read in as `char`. During the
interleaving process, depending on the current position of the destination
index, both left and right shifts on the next char could be performed.

However, there are a few undefined behaviours that could affect the results
depending on the platform:
1. The signedness of a `char` is an undefined behaviour. Hence when
bit shifting, the usual arithmetic conversion performed on the char is
undefined. The char could be promoted to either signed int or unsigned
int.
2. If the char is treated as signed int, the bit shifting (both left and
right) is also undefined in pre-c++20 standards. The behaviour is platform
dependent.

This change fixes the issue by declare the contents read from source
indices as `unsigned char`.
@Zhicheng-Liu
Copy link
Author

@bingmann and @leoisl, could you please review this pull request?

…ingmann#5)

When combining classic indices, for each batch the combinations of
rows from each constituent index are written to an output block.
The output block is reused for next batch.

As we use bitwise OR operation to combine rows from the constituent
indices, the output block should be reset to all 0s before being
reused. Otherwise, previous set bits will be carried over to next
batch and accumulating false positives till the end of the batch
processing loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants