Merge pull request #282 from Modalities/fix_num_bytes_per_token_power_of_2

bug fix: Enforce power of 2 number of bytes per token
le1nux authored Dec 16, 2024
2 parents 8fea73a + 373c99a commit 7cd60e2
Showing 3 changed files with 32 additions and 5 deletions.
23 changes: 20 additions & 3 deletions CHANGELOG_DEV.md
@@ -121,7 +121,7 @@ We can also now configure the encoding used for reading the documents. If encodi
* None


-## PR #280 Bugfix: the number of bytes per token were wrongly calculated
+## PR #280 Bug fix: the number of bytes per token were wrongly calculated

This PR fixes the bytes per token calculation.
Generally, we estimate how many bytes are needed to encode the full range of the vocabulary.
@@ -131,9 +131,26 @@ The calculation was wrong but coincidentally correct for the GPT2 tokenizer.



-## PR #281: The char-based index is not always consistent with the byte-based index.
+## PR #281: Bug fix: The char-based index is not always consistent with the byte-based index.

The first character of the string "ø This is..." is written on disc as two bytes, namely \xc3\xb8, when encoded as utf-8.
Therefore, the byte-based index has one more byte/char than the char-based index.

-For consistency, we don't consider any char-based indexes anymore and always refer to byte-based indexes.
+For consistency, we don't consider any char-based indexes anymore and always refer to byte-based indexes.
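
As a quick, plain-Python illustration of the byte/char mismatch described above (not part of the diff):

```python
text = "ø This is..."

print(len(text))                  # 12 characters
print(len(text.encode("utf-8")))  # 13 bytes, because "ø" is stored as the two bytes \xc3\xb8
print(text.encode("utf-8")[:2])   # b'\xc3\xb8'
```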


## PR #282: Bug fix: Enforce power of 2 number of bytes per token


Previously, the number of bytes per token was calculated as `math.ceil(log_2(vocab_size)/8)`, leading to values between 1 and 4 bytes.
However, the dataset implementation only supports 1, 2 and 4 bytes per token, as defined here

https://github.com/Modalities/modalities/blob/0483362abac93e45850e56adaea7921e96836d59/src/modalities/dataloader/dataset.py#L202-L206

and

https://github.com/Modalities/modalities/blob/0483362abac93e45850e56adaea7921e96836d59/src/modalities/dataloader/dataset.py#L233-L234

I added a switch case that maps to the respective byte sizes when packing the data.

This adds some inefficiency, as a vocabulary size > 65536 already requires 4 bytes per token, effectively doubling the storage requirements.
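
For illustration, here is a standalone sketch (not the repository code) comparing the old formula with the new power-of-two mapping for a few vocabulary sizes:

```python
import math

def old_num_bytes(vocab_size: int) -> int:
    # previous behaviour: any width between 1 and 4 bytes could be produced
    return math.ceil(math.log2(vocab_size) / 8)

def new_num_bytes(vocab_size: int) -> int:
    # new behaviour: round up to a supported width of 1, 2 or 4 bytes
    num_bytes = math.ceil(math.log2(vocab_size) / 8)
    if num_bytes <= 2:
        return num_bytes
    if num_bytes <= 4:
        return 4
    raise ValueError("Currently only support token byte sizes of 1, 2, and 4.")

for vocab_size in (256, 65536, 65537, 10_000_000):
    print(vocab_size, old_num_bytes(vocab_size), new_num_bytes(vocab_size))
# 256        -> 1, 1
# 65536      -> 2, 2
# 65537      -> 3, 4  (the storage overhead mentioned above)
# 10_000_000 -> 3, 4
```
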
12 changes: 11 additions & 1 deletion src/modalities/dataloader/create_packed_data.py
@@ -85,7 +85,17 @@ def _get_required_num_of_bytes_to_repr(int_to_get_repr: int) -> int:
        Returns:
            int: The number of bytes required to represent the integer.
        """
-        return math.ceil(math.log2(int_to_get_repr) / 8)
+        # we currently only support token sizes of 1, 2 and 4 bytes, as implemented here:
+        # https://github.com/Modalities/modalities/blob/fix_char_bytes_indexation_mismatch/src/modalities/dataloader/dataset.py#L202
+        num_bytes = math.ceil(math.log2(int_to_get_repr) / 8)
+        if num_bytes == 1:
+            return 1
+        elif num_bytes == 2:
+            return 2
+        elif num_bytes <= 4:
+            return 4
+        else:
+            raise ValueError("Currently only support token byte sizes of 1, 2, and 4.")

    def _encoded_token_to_bytes(self, encoded_token: int) -> bytes:
        """
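
One plausible reason only widths of 1, 2 and 4 bytes are supported on the reading side is that fixed-width integer dtypes exist only for those sizes; the following is a hypothetical sketch with made-up names (`TOKEN_DTYPES`, `tokens_from_buffer`), not the code at the linked dataset.py lines:

```python
import numpy as np

# Hypothetical mapping from token byte width to a fixed-width little-endian dtype.
# A 3-byte width would have no matching dtype, hence the power-of-2 restriction.
TOKEN_DTYPES = {1: np.dtype("u1"), 2: np.dtype("<u2"), 4: np.dtype("<u4")}

def tokens_from_buffer(raw: bytes, token_size_in_bytes: int) -> np.ndarray:
    return np.frombuffer(raw, dtype=TOKEN_DTYPES[token_size_in_bytes])

# Three token ids packed with 4 bytes each (little-endian):
raw = b"".join(token.to_bytes(4, "little") for token in (5, 70000, 1))
print(tokens_from_buffer(raw, 4))  # [    5 70000     1]
```
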
2 changes: 1 addition & 1 deletion tests/dataloader/test_packed_dataset.py
@@ -263,7 +263,7 @@ def test_continuously_packed_index(token_size_in_bytes: int, block_size: int, to

@pytest.mark.parametrize(
    "vocab_size, expected_num_bytes",
-    [(254, 1), (255, 1), (256, 1), (257, 2), (65534, 2), (65535, 2), (65536, 2), (65537, 3)],
+    [(254, 1), (255, 1), (256, 1), (257, 2), (65534, 2), (65535, 2), (65536, 2), (65537, 4), (65538, 4), (10000000, 4)],
)
def test__get_required_num_of_bytes_to_repr(vocab_size: int, expected_num_bytes: int):
    num_bytes = PackedDataGenerator._get_required_num_of_bytes_to_repr(int_to_get_repr=vocab_size)
