Idefics: remove double BOS token #35950

Open
wants to merge 1 commit into main
Conversation

zucchini-nlp (Member) commented Jan 29, 2025

What does this PR do?

Idefics models have special tokens added to the prompt by their chat templates, but the processor adds another BOS token because add_special_tokens=True by default. This happens because, unlike with plain tokenizers, for VLMs we have to format the prompt first and let users call the processor themselves, and users usually don't disable special tokens when processing.

I guess the easiest solution is to add a check for a double BOS inside the processors, since we cannot go over all the Idefics finetunes on the Hub and ask them to fix their templates. I didn't raise a warning because it would be unactionable and would only flood the console; the long-term goal is to make this check standard for VLMs.
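
For illustration, here is a minimal sketch of what such a double-BOS check could look like inside a processor's text handling; the helper name and exact placement are hypothetical and not the actual diff in this PR:

def _drop_duplicate_bos(text: str, bos_token: str | None) -> str:
    # If the chat template already put a BOS token at the start of the prompt,
    # strip it here and let the tokenizer (with add_special_tokens=True) add
    # back the single BOS it would add anyway.
    if bos_token and text.startswith(bos_token):
        return text[len(bos_token):]
    return text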

Btw @Rocketknight1, the same thing can happen if one uses an LLM chat template for formatting only and then performs extra manipulations on the prompt before tokenizing it. Do you think we need to mention this explicitly in the docs?


Reproducer:

from transformers import AutoProcessor
from transformers.image_utils import load_image

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe the image?"}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
print(inputs.input_ids[0, :3])  # starts with two BOS tokens: [128000, 128000, 1502, ...]
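
Until such a check exists, one possible user-side workaround is to keep the BOS from the chat template and ask the processor not to add another one. Whether add_special_tokens is forwarded to the underlying tokenizer depends on the processor and transformers version, so treat this as an assumption rather than guaranteed behavior:

# Assumption: the processor forwards add_special_tokens to its tokenizer
# (this appears to be the case for Idefics3, but verify for your version).
inputs = processor(text=prompt, images=[image1], add_special_tokens=False, return_tensors="pt")
print(inputs.input_ids[0, :3])  # should now start with a single BOS token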

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 (Member) commented

@zucchini-nlp we do mention it in the docs here, but I think checking for a double BOS token is probably a good idea. llama.cpp automatically adds a BOS token and raises a warning if the user's input already has one; it would make sense for us to have a similar warning, or to simply skip adding BOS when it is already present.

zucchini-nlp (Member, Author) commented Jan 29, 2025

@Rocketknight1 Oh cool, I didn't know we had that in the docs.

> adds BOS, and raises a warning if the user inputs a sequence that already has it,

I think this should be raised only from within apply_chat_template when tokenizing, since the processor can be called with any type of input, not necessarily chat-formatted, and we can't assume all checkpoints for a model class are instruct-trained. So I don't think we should raise a warning here, but I have #35953 where we can add one; it is blocked for now by processor standardization.
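
As a rough illustration of the kind of check apply_chat_template could do when tokenize=True (the function name and placement are hypothetical, not the code in #35953):

import logging

logger = logging.getLogger(__name__)

def maybe_warn_double_bos(rendered_prompt: str, tokenizer) -> None:
    # Warn only when the rendered template already starts with BOS and the
    # tokenizer is configured to prepend another one during tokenization.
    bos = tokenizer.bos_token
    if bos and rendered_prompt.startswith(bos) and getattr(tokenizer, "add_bos_token", True):
        logger.warning(
            "The chat template already adds a BOS token and the tokenizer will add another one; "
            "the resulting input_ids may start with a duplicated BOS."
        )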
