
Add: num_additional_image_tokens to models #35052

Merged · 14 commits · Jan 8, 2025

Conversation

@jp1924 (Contributor) commented Dec 3, 2024

What does this PR do?

In PR #33424, we resolved issue #34447 by adding num_additional_image_tokens to the processor.

However, the additional tokens are only considered in the processor; since they are not accounted for in the modeling code, some users are still encountering an "image token mismatch" error.

To address this problem, I have added code so that the modeling code also takes num_additional_image_tokens into account.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@jp1924 (Contributor Author) commented Dec 3, 2024

@zucchini-nlp

Shall we add a test case for num_additional_image_tokens?

Also, the model used in the test has the default CLS token, so it seems difficult to detect errors that occur in models without a CLS token.

Shouldn't we add a test without the CLS token as well?

@zucchini-nlp (Member) left a comment

@jp1924 hey! The num_additional_image_tokens in the processor code should not be added to the model. The processor already handles the slicing by looking at the vision select strategy and removing the extra CLS token if needed.

In fact, num_additional_image_tokens indicates whether other tokens are added to the image embedding while running the vision backbone. The most common case is a model that adds a CLS token, so num_additional_image_tokens = 1. If the model adds no CLS or any other token, num_additional_image_tokens = 0.

Can you share which version of transformers is failing to run generation in llava models? I have checked that main should be able to run the models with no error. The previous versions might fail for the llava-interleave model only, which was the only checkpoint with SigLIP as the backbone, and SigLIP adds no CLS token to images.
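
For context, here is a condensed sketch of the processor-side placeholder count being described (simplified, with hypothetical names; the final subtraction is the part the rest of this thread ends up revising):

def num_placeholder_tokens(height: int, width: int, patch_size: int,
                           num_additional_image_tokens: int, strategy: str) -> int:
    # Patch embeddings produced by the vision tower for the resized image.
    n = (height // patch_size) * (width // patch_size)
    # Extra tokens the backbone adds on top of the patches (e.g. a CLS token).
    n += num_additional_image_tokens
    # "default" removes the extra token(s) again, mirroring the modeling code's slicing.
    if strategy == "default":
        n -= num_additional_image_tokens
    return n

# CLIP-style tower: 336x336 image, patch 14, one CLS token, "default" -> 576 placeholders.
print(num_placeholder_tokens(336, 336, 14, 1, "default"))  # 576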

@jp1924 (Contributor Author) commented Dec 4, 2024

@zucchini-nlp Here's my transformers version and reproduction code:

import requests
from PIL import Image

from transformers import LlavaForConditionalGeneration, LlavaProcessor


IMG_TOKEN = "<|image|>"
model_path = "jp1924/koGemma2-it-KoLLaVA-9b-stage1.0"
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path),
    LlavaProcessor.from_pretrained(model_path),
)

device = "cpu"
prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True)
inputs["labels"] = inputs["input_ids"]
inputs = inputs.to(device)

outputs = model(**inputs)

transformers version: git+https://github.com/jp1924/transformers.git@f9c7e6021e9a9a9fd3fc8bb291da9451066aeb8d

This was actually an issue occurring with the original llava model, not llava-interleave. And for models that don't use CLS tokens, like SigLIP, should we run all of them in "full" mode, whether it's llava or llava-next?

@zucchini-nlp (Member)

@jp1924 ah, you are using your own model id, not the official ones. I can confirm that the official ones work correctly, but I cannot check the model you provided yet, as it is gated. I'll wait until the access request is approved, or you can also check whether your config files match the official configs.

import requests
from PIL import Image
import torch
from transformers import LlavaForConditionalGeneration, LlavaProcessor


IMG_TOKEN = "<image>"
model_path = "llava-hf/llava-1.5-7b-hf"
device = "cuda:0"
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path, device_map=device, torch_dtype="float16"),
    LlavaProcessor.from_pretrained(model_path),
)

prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True).to(model.device, torch.float16)
inputs["labels"] = inputs["input_ids"]

outputs = model(**inputs)

@jp1924 (Contributor Author) commented Dec 5, 2024

@zucchini-nlp
Ah, I see that the gate was set manually on the model hub; that was my mistake.
If you run the code again, the model will be re-downloaded.

So for models where the vision encoder automatically inserts a CLS token, vision_feature_select_strategy should be set to "default" and the processor's num_additional_image_tokens should be set to 1 for it to run without errors.

And for vision encoders like SigLIP that don't insert CLS tokens, you designed it intentionally so that vision_feature_select_strategy must be set to "full" for training to run properly, right?
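
As a rough illustration of the two setups described above (checkpoint names are only examples, and it is assumed these kwargs are forwarded into the processor config):

from transformers import LlavaProcessor

# CLIP-style vision tower that prepends a CLS token:
clip_style = LlavaProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    vision_feature_select_strategy="default",
    num_additional_image_tokens=1,
)

# SigLIP-style vision tower with no CLS token:
siglip_style = LlavaProcessor.from_pretrained(
    "llava-hf/llava-interleave-qwen-7b-hf",
    vision_feature_select_strategy="full",
    num_additional_image_tokens=0,
)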

If that's the case, I think we need to expand the documentation of vision_feature_select_strategy.
It currently just states "The feature selection strategy used to select the vision feature from the vision backbone. Can be one of 'default' or 'full'",
which doesn't really convey why "default" and "full" exist and how they affect llava.

Here's my suggestion:

The feature selection strategy used to select the vision feature from the vision backbone. Can be one of:

- 'default': Selects the vision features excluding the CLS token from the vision encoder. When using this option, num_additional_image_tokens must be set to 1.
- 'full': Selects all vision features, including the CLS token. This option must also be used for models that don't insert CLS tokens (like SigLIP) to ensure training runs properly.

What do you think?

@zucchini-nlp (Member)

@jp1924 Ah I see now, my bad. I think we should have been doing

if self.vision_feature_select_strategy == "default":
    num_image_tokens -= 1

instead of

if self.vision_feature_select_strategy == "default":
    num_image_tokens -= self.num_additional_image_tokens

in the processing code. It is wrong to assume that the "default" strategy removes all additional tokens, because the modeling code clearly removes only one token. Can you please change the PR to use the suggested fix?
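
A hypothetical worked example of the mismatch this fix addresses, assuming a no-CLS backbone producing 576 patch embeddings (384px image, patch size 16):

num_patches = (384 // 16) ** 2           # 576 patch embeddings from the vision tower
num_additional_image_tokens = 0          # a SigLIP-style tower adds no CLS token
strategy = "default"

# Modeling side: "default" always slices one feature -> 575 image embeddings.
model_side = num_patches - 1 if strategy == "default" else num_patches

# Old processor logic: subtract num_additional_image_tokens -> 576 placeholders (mismatch).
old_processor_side = num_patches + num_additional_image_tokens - num_additional_image_tokens

# Suggested fix: subtract 1 whenever the strategy is "default" -> 575 placeholders (match).
new_processor_side = num_patches + num_additional_image_tokens - 1

assert model_side == new_processor_side != old_processor_side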

@zucchini-nlp (Member)

Thanks a lot! Can you also update the video llava and llava next video processors, as they use the same code?

@jp1924 (Contributor Author) commented Dec 9, 2024

@zucchini-nlp
Resolved the conflicts. Please check everything once more!

@jp1924 jp1924 requested a review from zucchini-nlp December 9, 2024 08:23
@zucchini-nlp (Member) left a comment

Thanks a lot, LGTM! Btw, there is also the "video llava" model, which has similar processing and should be changed accordingly.

Feel free to @ ArthurZucker once video llava is modified, and the PR can be merged after his review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jp1924 (Contributor Author) commented Dec 9, 2024

@ArthurZucker, could you please review this?

@jp1924 (Contributor Author) commented Dec 18, 2024

@ArthurZucker

1 similar comment
@jp1924 (Contributor Author) commented Jan 6, 2025

@ArthurZucker

@zucchini-nlp (Member)

Since the core maintainer has a lot to review, I think we can ask for a second review from @qubvel and merge if he's happy with the changes

@zucchini-nlp zucchini-nlp requested a review from qubvel January 6, 2025 13:11
Comment on lines 163 to 165:

      ) + self.num_additional_image_tokens
      if self.vision_feature_select_strategy == "default":
-         num_image_tokens -= self.num_additional_image_tokens
+         num_image_tokens -= 1
@qubvel (Member):

L163: we add self.num_additional_image_tokens
L165: we subtract 1

Might that lead to a discrepancy?

@jp1924 (Contributor Author):

@qubvel
In the existing modeling code, selected_image_feature[:, 1:] also hardcodes the 1, so there won't be any discrepancy.

@qubvel (Member):

If I got it right, the idea is that the "default" behaviour always slices one feature, while "full" does not slice anything. Why do we need self.num_additional_image_tokens then? Would it be more correct to fix the modeling code instead, to make it consistent? @zucchini-nlp wdyt?

@zucchini-nlp (Member):

We add num_additional_image_tokens to account for ViTs with/without CLS, so currently it is either 0 or 1. Then the modeling code has two options to select the features: either crop 1 token or take the full embeddings. So the main reason we added num_additional_image_tokens in the first place was to make the processor flexible for different types of ViT and to calculate the image patch length from patch_size.
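
A minimal sketch of those two selection options, assuming a feature tensor of shape (batch, tokens, hidden) with any CLS embedding at index 0 (simplified; not the exact modeling code):

import torch

def select_image_features(image_features: torch.Tensor, strategy: str) -> torch.Tensor:
    if strategy == "default":
        # Crop exactly one leading token (the CLS embedding, when the ViT adds one).
        return image_features[:, 1:]
    if strategy == "full":
        # Keep every token, including the CLS embedding if present.
        return image_features
    raise ValueError(f"Unexpected feature selection strategy: {strategy}")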

@qubvel hmm, not sure I got what you mean by modifying the model code?

@qubvel (Member):

I'm just trying to understand why we need both num_additional_image_tokens and vision_feature_select_strategy in the processor. As far as I understand, num_additional_image_tokens is enough to compute num_image_tokens, or am I missing something?

@qubvel (Member):

Not sure I got the whole picture 🥲 but yeah, introducing two dependent parameters might be confusing. Am I right that we should only use them as follows?

  1. num_additional_image_tokens = 1 AND vision_feature_select_strategy = "default"
  2. num_additional_image_tokens = 0 AND vision_feature_select_strategy = "full"

@zucchini-nlp (Member):

I see, confusing picture here. Yes, currently I think those two are the combinations used, but other combinations should not be a problem. For example, if one wants to keep the CLS token for any reason and experiment with that:

num_additional_image_tokens = 1 AND vision_feature_select_strategy = "full"
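
Putting the combinations together, under the subtract-one fix the placeholder count would work out roughly as follows (a sketch assuming 576 patch embeddings, e.g. a 336px image with patch size 14):

num_patches = (336 // 14) ** 2  # 576 patch embeddings

def num_placeholder_tokens(num_additional_image_tokens: int, strategy: str) -> int:
    n = num_patches + num_additional_image_tokens
    return n - 1 if strategy == "default" else n

print(num_placeholder_tokens(1, "default"))  # 576 -> CLS added by the ViT, then cropped
print(num_placeholder_tokens(0, "full"))     # 576 -> no CLS, nothing cropped
print(num_placeholder_tokens(1, "full"))     # 577 -> keep the CLS embedding as an extra image token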

@qubvel (Member):

Ok, I hope I got it now 😄 Let's make sure we have some docs and comments about it, because it's not obvious.

@qubvel (Member) commented Jan 7, 2025

Maybe "default" is not the best name for it, cause I expected it to be like "remove cls token if it exists", but it looks like it is "remove first token in any case"

@zucchini-nlp (Member):

Hehe, the naming comes from the original implementation. I will update the docstring for more clarity then, in a subsequent PR, since this one is about fixing a bug.

@zucchini-nlp (Member)

Thanks everyone, rebasing on main and will merge. Apparently the problem resolved itself when I merged a related PR, but some details were removed, so I will bring them back when rebasing.

@zucchini-nlp zucchini-nlp merged commit 29e74b7 into huggingface:main Jan 8, 2025
10 checks passed
@ArthurZucker (Collaborator) left a comment

good decision to merge and sorry everyone for the late merge!

@jp1924 (Contributor Author) commented Jan 8, 2025

thank you!
