Idefics: remove double BOS token #35950

Open
wants to merge 1 commit into main
Conversation

zucchini-nlp (Member) commented Jan 29, 2025

What does this PR do?

Idefics models have special tokens added to the prompt by their chat templates, but the processor adds another BOS token because add_special_tokens=True by default. This happens because, unlike with plain tokenizers, for VLMs we have to format the prompt first and let users call the processor themselves, and users usually don't disable special tokens when processing.

I guess the easiest solution is to add a check for a double BOS inside the processors, since we cannot go over all the Idefics finetunes on the Hub and ask them to fix their templates. I didn't raise a warning because it would be unactionable and would only flood the console; the long-term goal is to make this check standard for VLMs.
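
For illustration, here is a minimal sketch of what such a double-BOS check could look like inside a processor's text handling; the helper name and exact placement are hypothetical and not the actual diff in this PR:

def _drop_duplicate_bos(text: str, bos_token: str | None) -> str:
    # If the chat template already put a BOS token at the start of the prompt,
    # strip it here and let the tokenizer (with add_special_tokens=True) add
    # back the single BOS it would add anyway.
    if bos_token and text.startswith(bos_token):
        return text[len(bos_token):]
    return text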

Btw @Rocketknight1, the same thing can happen if one uses an LLM chat template for formatting only and then performs extra manipulations on the prompt before tokenizing it. Do you think we need to mention this explicitly in the docs?


Reproducer:

from transformers import AutoProcessor
from transformers.image_utils import load_image

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
print(inputs.input_ids[0, :3])  # starts with two BOS tokens: [128000, 128000, 1502, ...]
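
Until such a check exists, one possible user-side workaround is to keep the BOS from the chat template and ask the processor not to add another one. Whether add_special_tokens is forwarded to the underlying tokenizer depends on the processor and transformers version, so treat this as an assumption rather than guaranteed behavior:

# Assumption: the processor forwards add_special_tokens to its tokenizer
# (this appears to be the case for Idefics3, but verify for your version).
inputs = processor(text=prompt, images=[image1], add_special_tokens=False, return_tensors="pt")
print(inputs.input_ids[0, :3])  # should now start with a single BOS token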

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 (Member) commented

@zucchini-nlp we do mention it in the docs here, but I think checking for a double BOS token is probably a good idea. llama.cpp automatically adds a BOS token and raises a warning if the user's input already has one; it would make sense for us to have a similar warning, or to simply skip adding BOS when it is already present.

zucchini-nlp (Member, Author) commented Jan 29, 2025

@Rocketknight1 Oh cool, I didn't know we had that in the docs.

> adds BOS, and raises a warning if the user inputs a sequence that already has it,

I think this should be raised only from within apply_chat_template when tokenizing, since the processor can be called with any type of input, not necessarily chat-formatted, and we can't assume all checkpoints for a model class are instruct-trained. So I don't think we should raise a warning here, but I have #35953 where we can add one; it is blocked for now by processor standardization.
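
As a rough illustration of the kind of check apply_chat_template could do when tokenize=True (the function name and placement are hypothetical, not the code in #35953):

import logging

logger = logging.getLogger(__name__)

def maybe_warn_double_bos(rendered_prompt: str, tokenizer) -> None:
    # Warn only when the rendered template already starts with BOS and the
    # tokenizer is configured to prepend another one during tokenization.
    bos = tokenizer.bos_token
    if bos and rendered_prompt.startswith(bos) and getattr(tokenizer, "add_bos_token", True):
        logger.warning(
            "The chat template already adds a BOS token and the tokenizer will add another one; "
            "the resulting input_ids may start with a duplicated BOS."
        )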
