Mangled tokenization with Llama 3.1 for string sequences containing <space>'m #35938
Note: we encountered this issue first from TGI, as reported in huggingface/text-generation-inference#2927
Hey! I think this is related to `clean_up_tokenization_spaces`!
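For context, here is roughly what that decode-time cleanup does when `clean_up_tokenization_spaces` is enabled. This is a paraphrase of the library's `clean_up_tokenization` helper, not the exact source, and the replacement list may differ across versions:

```python
# Sketch of transformers' clean_up_tokenization helper (paraphrased, may vary by version).
def clean_up_tokenization(out_string: str) -> str:
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

# The " 'm" rule matches inside "for 'manual'" and eats the space:
print(clean_up_tokenization("for 'manual'"))  # -> "for'manual'"
```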
Thanks for the reply @ArthurZucker! I can confirm that if I add `clean_up_tokenization_spaces=False`, the initial string is recovered intact. I'm not sure if this means this behavior is "normal" (or let's say expected, at least), and what the consequences of setting this option to `False` would be. I tested updating to v4.48.1 (actually, I forgot to report that I had already tested with v4.48.0 before, sorry), and the behavior is still the same. From the doc, it does seem that `clean_up_tokenization_spaces` controls exactly this cleanup. Is there something to improve here, or shall it be left at that?
In #31938 we should have set it to `False`.

But for some tokenizers on the Hub, it's explicitly set to `True` in the tokenizer config, which takes precedence over the default.
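A sketch of the workaround the thread converges on, for anyone hitting this in the meantime (the checkpoint name below is an assumption; any affected Llama 3.1 tokenizer applies): disable the cleanup either per call or on the tokenizer instance.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tokenizer.encode("for 'manual'", add_special_tokens=False)

# Disable the cleanup for a single decode call...
print(tokenizer.decode(ids, clean_up_tokenization_spaces=False))  # for 'manual'

# ...or flip the default on the instance so every decode keeps the space.
tokenizer.clean_up_tokenization_spaces = False
print(tokenizer.decode(ids))  # for 'manual'
```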
We observed that trying to tokenize/detokenize strings containing the sequence <space>'m does not give back the initial string, but "eats" the leading whitespace. For example, the string "for 'manual'" is transformed into "for'manual'".
Investigating further, we also observed issues with strings containing <space>'s, which makes us think the problem may be related to special handling of contractions such as "I'm".

System Info
transformers==4.46.2
Who can help?
I guess it's for @ArthurZucker and @itazap
Reproduction
Running:
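A minimal sketch of the reproduction (the `meta-llama/Llama-3.1-8B-Instruct` checkpoint name is an assumption; any Llama 3.1 tokenizer should behave the same):

```python
from transformers import AutoTokenizer

# Assumed checkpoint for illustration; the issue reproduces with Llama 3.1 tokenizers.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

text = "for 'manual'"
token_ids = tokenizer.encode(text, add_special_tokens=False)

# Round-trip: detokenize and print
print(tokenizer.decode(token_ids))
```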
prints

```
for'manual'
```

(missing whitespace before the leading ')
Expected behavior
It should output the following:

```
for 'manual'
```