Tokenizer tree creation is incompatible with BPE pretokenization used for Llama3. #146

Open · JoshC8C7 opened this issue on Oct 9, 2024 · 2 comments

JoshC8C7 (Contributor) commented on Oct 9, 2024

Reproduction here.

Due to Llama 3 using BPE pretokenization, some tokens (usually Unicode characters like Ł) are present in tokenizer.json (e.g. Ł is token 253) but are mapped to a sequence of byte-level tokens instead, i.e. Ł is encoded as [129, 223], AKA ['Å', 'ģ'] (with Llama-3.2-3B-Instruct).
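For concreteness, here is a minimal sketch of how the split can be observed with the transformers library (an assumption on my part; it loads the Llama-3.2-3B-Instruct tokenizer, and the values in the comments are the ones reported above rather than guaranteed outputs):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

ids = tok.encode("Ł", add_special_tokens=False)
print(ids)                             # reported above as [129, 223]
print(tok.convert_ids_to_tokens(ids))  # the byte-level token strings, e.g. ['Å', 'ģ']
print(tok.decode(ids))                 # 'Ł' once both byte-level tokens are present
```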

It seems that when generating the tokenizer tree, _build_regular_tokens_list calls decode on individual token IDs; however, calling decode on [253] (or indeed on other tokens that start with Ł) yields the replacement character (�), presumably because pretokenization represents Ł as [129, 223] instead. The loss happens in the convert_tokens_to_string step of decode: if we only decode to tokens rather than to a string, the Ł survives. Because decode goes all the way to a string, the tree builder never sees Ł, so it is not listed as a valid next character / child of the tree's root node, which leads to the error in the reproduction above.
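A tiny sketch of the mismatch the tree builder runs into (again assuming the same tokenizer; the outputs noted in the comments are as reported in this issue):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Roughly what _build_regular_tokens_list does per token id:
print(tok.decode([253]))                 # '�' -- the Ł is lost in convert_tokens_to_string
# Stopping at the token string keeps the character visible:
print(tok.convert_ids_to_tokens([253]))  # ['Ł'] per this report
```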

While you could switch to building the tree by decoding to tokens rather than strings, forcing the model to generate Ł via [253] instead of [129, 223] could degrade performance; anecdotally, the model never seems to generate 253, so forcing it into the output could throw the model off.

Instead, tokens should be added to the tree the way the model would actually generate them. In most cases (id, decode(id)), i.e. the current behaviour, achieves this, but for Ł we get (253, �), which is discarded. We want to add leaves as the model would generate them: when we see that convert_ids_to_tokens([253]) yields [Ł] and that Ł is in the pretokenization dictionary, rather than adding (253, convert_tokens_to_string([Ł])) we should add a unary node (129, "") whose child is (223, "Ł"). The same applies every time we encounter Ł or a similarly pretokenized character; a sketch of the idea follows below.
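A rough, purely illustrative sketch of that insertion (the Node and add_multi_token_char names are hypothetical; this is not LMFE's actual tree-building code):

```python
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.text = None    # text made visible once this node is reached, if any

def add_multi_token_char(root, token_ids, char):
    """E.g. add_multi_token_char(root, [129, 223], "Ł") adds (129, "") -> (223, "Ł")."""
    node = root
    for i, tid in enumerate(token_ids):
        node = node.children.setdefault(tid, Node())
        if i == len(token_ids) - 1:
            node.text = char    # the character only appears once the final byte token arrives
        elif node.text is None:
            node.text = ""      # intermediate byte token contributes no visible text yet
    return root

root = Node()
add_multi_token_char(root, [129, 223], "Ł")
print(root.children[129].text)                # ''
print(root.children[129].children[223].text)  # 'Ł'
```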

JoshC8C7 (Contributor, Author) commented on Oct 9, 2024

The more pressing implication of this is the first part: not having "Ł" among the root's children means it is also absent from the tokenizer_alphabet, so a generated "Ł" inside a string will be rejected by the StringParsingState, causing generation to terminate early and return invalid JSON. Due to pretokenization, the model can generate Ł: _apply_new_tokens handles the empty string produced when decoding 129 just fine, but then when 223 is added and compared against the last decode, a new character is produced. This ability for an output character to consist of two tokens is at odds with the way the tree is generated: there is no way in the tree for two tokens to combine into a character that is in the alphabet. The tokenizer_alphabet should therefore contain any pretokenization inputs (characters like "Ł" that need more than one token to represent them).
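To illustrate what the streaming side sees, a small sketch that diffs successive decodes the way a streaming parser roughly would (same assumed tokenizer and the byte-level IDs reported above; the exact intermediate output depends on how the tokenizer decodes an incomplete byte sequence):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

generated = []
for tid in [129, 223]:  # the byte-level pair reported above for 'Ł'
    prev = tok.decode(generated)
    generated.append(tid)
    cur = tok.decode(generated)
    # The newly appeared text is what a parser such as StringParsingState gets to check:
    print(repr(cur[len(prev):]))
# Per this report, the first step produces nothing visible and the second produces 'Ł',
# which is then rejected because 'Ł' is missing from tokenizer_alphabet.
```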

noamgat (Owner) commented on Oct 16, 2024

There is an inherent difficulty with supporting out-of-tokenizer characters in LMFE. The LMFE approach treats each token as a node on a prefix tree of the tokenizer's characters, but these out-of-tokenizer characters require a tree in which a character is a node and the tokens are the path to that character. I have not had the chance to approach this yet, but if anyone wants to have a go at it, it would be a great addition to LM Format Enforcer!

noamgat added the enhancement and help wanted labels on Oct 16, 2024