Mangled tokenization with Llama 3.1 for string sequences containing <space>'m #35938
Note: we encountered this issue first from TGI, as reported in huggingface/text-generation-inference#2927
Hey! I think this is related to `clean_up_tokenization_spaces`!
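For context, here is roughly what that decode-time cleanup does when `clean_up_tokenization_spaces` is enabled. This is a paraphrase of the library's `clean_up_tokenization` helper, not the exact source, and the replacement list may differ across versions:

```python
# Sketch of transformers' clean_up_tokenization helper (paraphrased, may vary by version).
def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

# The " 'm" rule matches inside "for 'manual'" and eats the space:
print(clean_up_tokenization("for 'manual'"))  # -> "for'manual'"
```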
Thanks for the reply @ArthurZucker! I can confirm that if I add `clean_up_tokenization_spaces=False`, the initial string is recovered intact. I'm not sure if this means this behavior is "normal" (or let's say expected, at least), and what the consequences of setting this option to `False` would be. I tested updating to v4.48.1 (actually, I forgot to report that I had already tested with v4.48.0 before, sorry), and the behavior is still the same. From the doc, it does seem that `clean_up_tokenization_spaces` controls exactly this cleanup. Is there something to improve here, or shall it be left at that?
In #31938 we should have set it to `False`.

But for some tokenizers on the Hub, it's explicitly set to `True` in the tokenizer config, which takes precedence over the default.
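A sketch of the workaround the thread converges on, for anyone hitting this in the meantime (the checkpoint name below is an assumption; any affected Llama 3.1 tokenizer applies): disable the cleanup either per call or on the tokenizer instance.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tokenizer.encode("for 'manual'", add_special_tokens=False)

# Disable the cleanup for a single decode call...
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # for 'manual'

# ...or flip the default on the instance so every decode keeps the space.
tokenizer.clean_up_tokenization_spaces = False
print(tokenizer.decode(ids))  # for 'manual'
```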
We observed that trying to tokenize/detokenize strings containing the sequence <space>'m does not give back the initial string, but "eats" the leading whitespace. For example, the string "for 'manual'" is transformed into "for'manual'".
Investigating further, we also observed issues with strings containing <space>'s, which makes us think the problem may be related to special handling of contractions such as "I'm".

System Info
transformers==4.46.2
Who can help?
I guess it's for @ArthurZucker and @itazap
Reproduction
Running:
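A minimal sketch of the reproduction (the `meta-llama/Llama-3.1-8B-Instruct` checkpoint name is an assumption; any Llama 3.1 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

# Assumed checkpoint for illustration; the issue reproduces with Llama 3.1 tokenizers.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "for 'manual'"
token_ids = tokenizer.encode(text, add_special_tokens=False)

# Round-trip: detokenize and print
print(tokenizer.decode(token_ids))
```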
prints

```
for'manual'
```

(missing whitespace before the leading ')
Expected behavior
It should output the following:

```
for 'manual'
```