
Add: num_additional_image_tokens to models #35052

Merged · 14 commits · Jan 8, 2025

Conversation

@jp1924 (Contributor) commented Dec 3, 2024

What does this PR do?

In PR #33424, we resolved issue #34447 by adding num_additional_image_tokens to the processor.

However, the additional tokens are only considered in the processor; since they are not accounted for in the modeling code, some users are still encountering an "image token mismatch" error.

To address this problem, I have added code so that the modeling code also takes num_additional_image_tokens into account.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp

@jp1924 (Contributor Author) commented Dec 3, 2024

@zucchini-nlp

Shall we add a test case for num_additional_image_tokens?

Also, the model used in the test has the default CLS token, so it seems difficult to detect errors that occur in models without a CLS token.

Shouldn't we add a test without the CLS token as well?

@zucchini-nlp (Member) left a comment

@jp1924 hey! The num_additional_image_tokens in the processor code should not be added to the model. The processor already handles the slicing by looking at the vision select strategy and removing the extra CLS token if needed.

In fact, num_additional_image_tokens indicates whether other tokens are added to the image embedding while running the vision backbone. The most common case is a model that adds a CLS token, so num_additional_image_tokens = 1. If the model adds no CLS or any other token, num_additional_image_tokens = 0.

Can you share which version of transformers is failing to run generation in llava models? I have checked that main should be able to run the models with no error. The previous versions might fail for the llava-interleave model only, which was the only checkpoint with SigLIP as the backbone, and SigLIP adds no CLS token to images.
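
For context, here is a condensed sketch of the processor-side placeholder count being described (simplified, with hypothetical names; the final subtraction is the part the rest of this thread ends up revising):

def num_placeholder_tokens(height: int, width: int, patch_size: int,
                           num_additional_image_tokens: int, strategy: str) -> int:
    # Patch embeddings produced by the vision tower for the resized image.
    n = (height // patch_size) * (width // patch_size)
    # Extra tokens the backbone adds on top of the patches (e.g. a CLS token).
    n += num_additional_image_tokens
    # "default" removes the extra token(s) again, mirroring the modeling code's slicing.
    if strategy == "default":
        n -= num_additional_image_tokens
    return n

# CLIP-style tower: 336x336 image, patch 14, one CLS token, "default" -> 576 placeholders.
print(num_placeholder_tokens(336, 336, 14, 1, "default"))  # 576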

@jp1924 (Contributor Author) commented Dec 4, 2024

@zucchini-nlp Here's my transformers version and reproduction code:

import requests
from PIL import Image

from transformers import LlavaForConditionalGeneration, LlavaProcessor


IMG_TOKEN = "<|image|>"
model_path = "jp1924/koGemma2-it-KoLLaVA-9b-stage1.0"
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path),
    LlavaProcessor.from_pretrained(model_path),
)

device = "cpu"
prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True)
inputs["labels"] = inputs["input_ids"]
inputs = inputs.to(device)

outputs = model(**inputs)

transformers version: git+https://github.com/jp1924/transformers.git@f9c7e6021e9a9a9fd3fc8bb291da9451066aeb8d

This was actually an issue occurring with the original llava model, not llava-interleave. And for models that don't use CLS tokens, like SigLIP, should we run all of them in "full" mode, whether it's llava or llava-next?

@zucchini-nlp (Member)

@jp1924 ah, you are using your own model id, not the official ones. I can confirm that the official ones work correctly, but I cannot check the model you provided yet, as it is gated. I'll wait until the access request is approved, or you can also check whether your config files match the official configs.

import requests
from PIL import Image
import torch
from transformers import LlavaForConditionalGeneration, LlavaProcessor


IMG_TOKEN = "<image>"
model_path = "llava-hf/llava-1.5-7b-hf"
device = "cuda:0"
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path, device_map=device, torch_dtype="float16"),
    LlavaProcessor.from_pretrained(model_path),
)

prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True).to(model.device, torch.float16)
inputs["labels"] = inputs["input_ids"]

outputs = model(**inputs)

@jp1924 (Contributor Author) commented Dec 5, 2024

@zucchini-nlp
Ah, I see that the gate was set manually on the model hub; that was my mistake.
If you run the code again, the model will be re-downloaded.

So for models where the vision encoder automatically inserts a CLS token, vision_feature_select_strategy should be set to "default" and the processor's num_additional_image_tokens should be set to 1 for it to run without errors.

And for vision encoders like SigLIP that don't insert CLS tokens, you designed it intentionally so that vision_feature_select_strategy must be set to "full" for training to run properly, right?
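
As a rough illustration of the two setups described above (checkpoint names are only examples, and it is assumed these kwargs are forwarded into the processor config):

from transformers import LlavaProcessor

# CLIP-style vision tower that prepends a CLS token:
clip_style = LlavaProcessor.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    vision_feature_select_strategy="default",
    num_additional_image_tokens=1,
)

# SigLIP-style vision tower with no CLS token:
siglip_style = LlavaProcessor.from_pretrained(
    "llava-hf/llava-interleave-qwen-7b-hf",
    vision_feature_select_strategy="full",
    num_additional_image_tokens=0,
)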

If that's the case, I think we need to expand the documentation of vision_feature_select_strategy.
It currently just states "The feature selection strategy used to select the vision feature from the vision backbone. Can be one of 'default' or 'full'",
which doesn't really convey why "default" and "full" exist and how they affect llava.

Here's my suggestion:

The feature selection strategy used to select the vision feature from the vision backbone. Can be one of:

- 'default': Selects the vision features excluding the CLS token from the vision encoder. When using this option, num_additional_image_tokens must be set to 1.
- 'full': Selects all vision features, including the CLS token. This option must also be used for models that don't insert CLS tokens (like SigLIP) to ensure training runs properly.

What do you think?

@zucchini-nlp (Member)

@jp1924 Ah I see now, my bad. I think we should have been doing

if self.vision_feature_select_strategy == "default":
    num_image_tokens -= 1

instead of

if self.vision_feature_select_strategy == "default":
    num_image_tokens -= self.num_additional_image_tokens

in the processing code. It is wrong to assume that the "default" strategy removes all additional tokens, because the modeling code clearly removes only one token. Can you please change the PR to use the suggested fix?
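
A hypothetical worked example of the mismatch this fix addresses, assuming a no-CLS backbone producing 576 patch embeddings (384px image, patch size 16):

num_patches = (384 // 16) ** 2           # 576 patch embeddings from the vision tower
num_additional_image_tokens = 0          # a SigLIP-style tower adds no CLS token
strategy = "default"

# Modeling side: "default" always slices one feature -> 575 image embeddings.
model_side = num_patches - 1 if strategy == "default" else num_patches

# Old processor logic: subtract num_additional_image_tokens -> 576 placeholders (mismatch).
old_processor_side = num_patches + num_additional_image_tokens - num_additional_image_tokens

# Suggested fix: subtract 1 whenever the strategy is "default" -> 575 placeholders (match).
new_processor_side = num_patches + num_additional_image_tokens - 1

assert model_side == new_processor_side != old_processor_side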

@zucchini-nlp (Member)

Thanks a lot! Can you also update the video llava and llava next video processors, as they use the same code?

@jp1924 (Contributor Author) commented Dec 9, 2024

@zucchini-nlp
Resolved the conflicts. Please check everything once more!

@jp1924 jp1924 requested a review from zucchini-nlp December 9, 2024 08:23
@zucchini-nlp (Member) left a comment

Thanks a lot, LGTM! Btw, there is also the "video llava" model, which has similar processing and should be changed accordingly.

Feel free to @ ArthurZucker once video llava is modified, and the PR can be merged after his review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jp1924 (Contributor Author) commented Dec 9, 2024

@ArthurZucker, could you please review this?

@jp1924 (Contributor Author) commented Dec 18, 2024

@ArthurZucker

1 similar comment
@jp1924 (Contributor Author) commented Jan 6, 2025

@ArthurZucker

@zucchini-nlp (Member)

Since the core maintainer has a lot to review, I think we can ask for a second review from @qubvel and merge if he's happy with the changes

@zucchini-nlp zucchini-nlp requested a review from qubvel January 6, 2025 13:11
Comment on lines 163 to 165:

      ) + self.num_additional_image_tokens
      if self.vision_feature_select_strategy == "default":
-         num_image_tokens -= self.num_additional_image_tokens
+         num_image_tokens -= 1
@qubvel (Member):

L163: we add self.num_additional_image_tokens
L165: we subtract 1

Might that lead to a discrepancy?

@jp1924 (Contributor Author):

@qubvel
In the existing modeling code, selected_image_feature[:, 1:] also hardcodes the 1, so there won't be any discrepancy.

@qubvel (Member):

If I got it right, the idea is that the "default" behaviour always slices one feature, while "full" does not slice anything. Why do we need self.num_additional_image_tokens then? Would it be more correct to fix the modeling code instead, to make it consistent? @zucchini-nlp wdyt?

@zucchini-nlp (Member):

We add num_additional_image_tokens to account for ViTs with/without CLS, so currently it is either 0 or 1. Then the modeling code has two options to select the features: either crop 1 token or take the full embeddings. So the main reason we added num_additional_image_tokens in the first place was to make the processor flexible for different types of ViT and to calculate the image patch length from patch_size.
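
A minimal sketch of those two selection options, assuming a feature tensor of shape (batch, tokens, hidden) with any CLS embedding at index 0 (simplified; not the exact modeling code):

import torch

def select_image_features(image_features: torch.Tensor, strategy: str) -> torch.Tensor:
    if strategy == "default":
        # Crop exactly one leading token (the CLS embedding, when the ViT adds one).
        return image_features[:, 1:]
    if strategy == "full":
        # Keep every token, including the CLS embedding if present.
        return image_features
    raise ValueError(f"Unexpected feature selection strategy: {strategy}")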

@qubvel hmm, not sure I got what you mean by modifying the model code?

@qubvel (Member):

I'm just trying to understand why we need both num_additional_image_tokens and vision_feature_select_strategy in the processor. As far as I understand, num_additional_image_tokens is enough to compute num_image_tokens, or am I missing something?

@qubvel (Member):

Not sure I got the whole picture 🥲 but yeah, introducing two dependent parameters might be confusing. Am I right that we should only use them as follows?

  1. num_additional_image_tokens = 1 AND vision_feature_select_strategy = "default"
  2. num_additional_image_tokens = 0 AND vision_feature_select_strategy = "full"

@zucchini-nlp (Member):

I see, confusing picture here. Yes, currently I think those two are the combinations used, but other combinations should not be a problem. For example, if one wants to keep the CLS token for any reason and experiment with that:

num_additional_image_tokens = 1 AND vision_feature_select_strategy = "full"
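
Putting the combinations together, under the subtract-one fix the placeholder count would work out roughly as follows (a sketch assuming 576 patch embeddings, e.g. a 336px image with patch size 14):

num_patches = (336 // 14) ** 2  # 576 patch embeddings

def num_placeholder_tokens(num_additional_image_tokens: int, strategy: str) -> int:
    n = num_patches + num_additional_image_tokens
    return n - 1 if strategy == "default" else n

print(num_placeholder_tokens(1, "default"))  # 576 -> CLS added by the ViT, then cropped
print(num_placeholder_tokens(0, "full"))     # 576 -> no CLS, nothing cropped
print(num_placeholder_tokens(1, "full"))     # 577 -> keep the CLS embedding as an extra image token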

@qubvel (Member):

Ok, I hope I got it now 😄 Let's make sure we have some docs and comments about it, because it's not obvious.

@qubvel (Member) commented Jan 7, 2025

Maybe "default" is not the best name for it, cause I expected it to be like "remove cls token if it exists", but it looks like it is "remove first token in any case"

@zucchini-nlp (Member):

Hehe, the naming comes from the original implementation. I will update the docstring for more clarity then, in a subsequent PR, since this one is about fixing a bug.

@zucchini-nlp (Member)

Thanks everyone, rebasing on main and will merge. Apparently the problem resolved itself when I merged a related PR, but some details were removed, so I will bring them back when rebasing.

@zucchini-nlp zucchini-nlp merged commit 29e74b7 into huggingface:main Jan 8, 2025
10 checks passed
@ArthurZucker (Collaborator) left a comment

good decision to merge and sorry everyone for the late merge!

@jp1924 (Contributor Author) commented Jan 8, 2025

thank you!
