Tokenizer tree creation is incompatible with BPE pretokenization used for Llama3. #146

Open · JoshC8C7 opened this issue on Oct 9, 2024 · 2 comments

JoshC8C7 (Contributor) commented on Oct 9, 2024

Reproduction here.

Due to Llama 3 using BPE pretokenization, some tokens (usually Unicode characters like Ł) are present in tokenizer.json (e.g. Ł is token 253) but are mapped to a sequence of byte-level tokens instead, i.e. Ł is encoded as [129, 223], AKA ['Å', 'ģ'] (with Llama-3.2-3B-Instruct).
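For concreteness, here is a minimal sketch of how the split can be observed with the transformers library (an assumption on my part; it loads the Llama-3.2-3B-Instruct tokenizer, and the values in the comments are the ones reported above rather than guaranteed outputs):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

ids = tok.encode("Ł", add_special_tokens=False)
print(ids)                             # reported above as [129, 223]
print(tok.convert_ids_to_tokens(ids))  # the byte-level token strings, e.g. ['Å', 'ģ']
print(tok.decode(ids))                 # 'Ł' once both byte-level tokens are present
```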

It seems that when generating the tokenizer tree, _build_regular_tokens_list calls decode on individual token IDs; however, calling decode on [253] (or indeed on other tokens that start with Ł) yields the replacement character (�), presumably because pretokenization represents Ł as [129, 223] instead. The loss happens in the convert_tokens_to_string step of decode: if we only decode to tokens rather than to a string, the Ł survives. Because decode goes all the way to a string, the tree builder never sees Ł, so it is not listed as a valid next character / child of the tree's root node, which leads to the error in the reproduction above.
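A tiny sketch of the mismatch the tree builder runs into (again assuming the same tokenizer; the outputs noted in the comments are as reported in this issue):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Roughly what _build_regular_tokens_list does per token id:
print(tok.decode([253]))                 # '�' -- the Ł is lost in convert_tokens_to_string
# Stopping at the token string keeps the character visible:
print(tok.convert_ids_to_tokens([253]))  # ['Ł'] per this report
```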

While you could switch to building the tree by decoding to tokens rather than strings, forcing the model to generate Ł via [253] instead of [129, 223] could degrade performance; anecdotally, the model never seems to generate 253, so forcing it into the output could throw the model off.

Instead, tokens should be added to the tree the way the model would actually generate them. In most cases (id, decode(id)), i.e. the current behaviour, achieves this, but for Ł we get (253, �), which is discarded. We want to add leaves as the model would generate them: when we see that convert_ids_to_tokens([253]) yields [Ł] and that Ł is in the pretokenization dictionary, rather than adding (253, convert_tokens_to_string([Ł])) we should add a unary node (129, "") whose child is (223, "Ł"). The same applies every time we encounter Ł or a similarly pretokenized character; a sketch of the idea follows below.
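A rough, purely illustrative sketch of that insertion (the Node and add_multi_token_char names are hypothetical; this is not LMFE's actual tree-building code):

```python
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.text = None    # text made visible once this node is reached, if any

def add_multi_token_char(root, token_ids, char):
    """E.g. add_multi_token_char(root, [129, 223], "Ł") adds (129, "") -> (223, "Ł")."""
    node = root
    for i, tid in enumerate(token_ids):
        node = node.children.setdefault(tid, Node())
        if i == len(token_ids) - 1:
            node.text = char    # the character only appears once the final byte token arrives
        elif node.text is None:
            node.text = ""      # intermediate byte token contributes no visible text yet
    return root

root = Node()
add_multi_token_char(root, [129, 223], "Ł")
print(root.children[129].text)                # ''
print(root.children[129].children[223].text)  # 'Ł'
```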

JoshC8C7 (Contributor, Author) commented on Oct 9, 2024

The more pressing implication of this is the first part: not having "Ł" among the root's children means it is also absent from the tokenizer_alphabet, so a generated "Ł" inside a string will be rejected by the StringParsingState, causing generation to terminate early and return invalid JSON. Due to pretokenization, the model can generate Ł: _apply_new_tokens handles the empty string produced when decoding 129 just fine, but then when 223 is added and compared against the last decode, a new character is produced. This ability for an output character to consist of two tokens is at odds with the way the tree is generated: there is no way in the tree for two tokens to combine into a character that is in the alphabet. The tokenizer_alphabet should therefore contain any pretokenization inputs (characters like "Ł" that need more than one token to represent them).
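To illustrate what the streaming side sees, a small sketch that diffs successive decodes the way a streaming parser roughly would (same assumed tokenizer and the byte-level IDs reported above; the exact intermediate output depends on how the tokenizer decodes an incomplete byte sequence):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

generated = []
for tid in [129, 223]:  # the byte-level pair reported above for 'Ł'
    prev = tok.decode(generated)
    generated.append(tid)
    cur = tok.decode(generated)
    # The newly appeared text is what a parser such as StringParsingState gets to check:
    print(repr(cur[len(prev):]))
# Per this report, the first step produces nothing visible and the second produces 'Ł',
# which is then rejected because 'Ł' is missing from tokenizer_alphabet.
```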

noamgat (Owner) commented on Oct 16, 2024

There is an inherent difficulty with supporting out-of-tokenizer characters in LMFE. The LMFE approach treats each token as a node on a prefix tree of the tokenizer's characters, but these out-of-tokenizer characters require a tree in which a character is a node and the tokens are the path to that character. I have not had the chance to approach this yet, but if anyone wants to have a go at it, it would be a great addition to LM Format Enforcer!

noamgat added the enhancement and help wanted labels on Oct 16, 2024