Merge pull request #282 from Modalities/fix_num_bytes_per_token_power_of_2

bug fix: Enforce power of 2 number of bytes per token
le1nux authored Dec 16, 2024
2 parents 8fea73a + 373c99a commit 7cd60e2
Showing 3 changed files with 32 additions and 5 deletions.
23 changes: 20 additions & 3 deletions CHANGELOG_DEV.md
@@ -121,7 +121,7 @@ We can also now configure the encoding used for reading the documents. If encodi
* None


-## PR #280 Bugfix: the number of bytes per token were wrongly calculated
+## PR #280 Bug fix: the number of bytes per token were wrongly calculated

This PR fixes the bytes per token calculation.
Generally, we estimate how many bytes are needed to encode the full range of the vocabulary.
@@ -131,9 +131,26 @@ The calculation was wrong but coincidentally correct for the GPT2 tokenizer.



-## PR #281: The char-based index is not always consistent with the byte-based index.
+## PR #281: Bug fix: The char-based index is not always consistent with the byte-based index.

The first character of the string "ø This is..." is written on disc as two bytes, namely \xc3\xb8, when encoded as utf-8.
Therefore, the byte-based index has one more byte/char than the char-based index.

-For consistency, we don't consider any char-based indexes anymore and always refer to byte-based indexes.
+For consistency, we don't consider any char-based indexes anymore and always refer to byte-based indexes.
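
As a quick, plain-Python illustration of the byte/char mismatch described above (not part of the diff):

```python
text = "ø This is..."

print(len(text))                  # 12 characters
print(len(text.encode("utf-8")))  # 13 bytes, because "ø" is stored as the two bytes \xc3\xb8
print(text.encode("utf-8")[:2])   # b'\xc3\xb8'
```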


## PR #282: Bug fix: Enforce power of 2 number of bytes per token


Previously, the number of bytes per token was calculated as `math.ceil(log_2(vocab_size)/8)`, leading to values between 1 and 4 bytes.
However, the dataset implementation only supports 1, 2 and 4 bytes per token, as defined here

https://github.com/Modalities/modalities/blob/0483362abac93e45850e56adaea7921e96836d59/src/modalities/dataloader/dataset.py#L202-L206

and

https://github.com/Modalities/modalities/blob/0483362abac93e45850e56adaea7921e96836d59/src/modalities/dataloader/dataset.py#L233-L234

I added a switch case that maps to the respective byte sizes when packing the data.

This adds some inefficiency, as a vocabulary size > 65536 already requires 4 bytes per token, effectively doubling the storage requirements.
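
For illustration, here is a standalone sketch (not the repository code) comparing the old formula with the new power-of-two mapping for a few vocabulary sizes:

```python
import math

def old_num_bytes(vocab_size: int) -> int:
    # previous behaviour: any width between 1 and 4 bytes could be produced
    return math.ceil(math.log2(vocab_size) / 8)

def new_num_bytes(vocab_size: int) -> int:
    # new behaviour: round up to a supported width of 1, 2 or 4 bytes
    num_bytes = math.ceil(math.log2(vocab_size) / 8)
    if num_bytes <= 2:
        return num_bytes
    if num_bytes <= 4:
        return 4
    raise ValueError("Currently only support token byte sizes of 1, 2, and 4.")

for vocab_size in (256, 65536, 65537, 10_000_000):
    print(vocab_size, old_num_bytes(vocab_size), new_num_bytes(vocab_size))
# 256        -> 1, 1
# 65536      -> 2, 2
# 65537      -> 3, 4  (the storage overhead mentioned above)
# 10_000_000 -> 3, 4
```
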
12 changes: 11 additions & 1 deletion src/modalities/dataloader/create_packed_data.py
@@ -85,7 +85,17 @@ def _get_required_num_of_bytes_to_repr(int_to_get_repr: int) -> int:
        Returns:
            int: The number of bytes required to represent the integer.
        """
-        return math.ceil(math.log2(int_to_get_repr) / 8)
+        # we currently only support token sizes of 1, 2 and 4 bytes, as implemented here:
+        # https://github.com/Modalities/modalities/blob/fix_char_bytes_indexation_mismatch/src/modalities/dataloader/dataset.py#L202
+        num_bytes = math.ceil(math.log2(int_to_get_repr) / 8)
+        if num_bytes == 1:
+            return 1
+        elif num_bytes == 2:
+            return 2
+        elif num_bytes <= 4:
+            return 4
+        else:
+            raise ValueError("Currently only support token byte sizes of 1, 2, and 4.")

    def _encoded_token_to_bytes(self, encoded_token: int) -> bytes:
        """
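
One plausible reason only widths of 1, 2 and 4 bytes are supported on the reading side is that fixed-width integer dtypes exist only for those sizes; the following is a hypothetical sketch with made-up names (`TOKEN_DTYPES`, `tokens_from_buffer`), not the code at the linked dataset.py lines:

```python
import numpy as np

# Hypothetical mapping from token byte width to a fixed-width little-endian dtype.
# A 3-byte width would have no matching dtype, hence the power-of-2 restriction.
TOKEN_DTYPES = {1: np.dtype("u1"), 2: np.dtype("<u2"), 4: np.dtype("<u4")}

def tokens_from_buffer(raw: bytes, token_size_in_bytes: int) -> np.ndarray:
    return np.frombuffer(raw, dtype=TOKEN_DTYPES[token_size_in_bytes])

# Three token ids packed with 4 bytes each (little-endian):
raw = b"".join(token.to_bytes(4, "little") for token in (5, 70000, 1))
print(tokens_from_buffer(raw, 4))  # [    5 70000     1]
```
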
2 changes: 1 addition & 1 deletion tests/dataloader/test_packed_dataset.py
@@ -263,7 +263,7 @@ def test_continuously_packed_index(token_size_in_bytes: int, block_size: int, to

@pytest.mark.parametrize(
    "vocab_size, expected_num_bytes",
-    [(254, 1), (255, 1), (256, 1), (257, 2), (65534, 2), (65535, 2), (65536, 2), (65537, 3)],
+    [(254, 1), (255, 1), (256, 1), (257, 2), (65534, 2), (65535, 2), (65536, 2), (65537, 4), (65538, 4), (10000000, 4)],
)
def test__get_required_num_of_bytes_to_repr(vocab_size: int, expected_num_bytes: int):
    num_bytes = PackedDataGenerator._get_required_num_of_bytes_to_repr(int_to_get_repr=vocab_size)
