Add: num_additional_image_tokens to models #35052
Conversation
Shall we add a test case for this? Also, the model used in the test has the default CLS token, so it seems difficult to detect errors that occur in models without a CLS token. Shouldn't we add a test without the CLS token as well?
@jp1924 hey! The `num_additional_special_tokens` in the processor code should not be added to the model. The processor already handles the slicing by looking at the vision select strategy and removing the extra CLS token if needed.

In fact, `num_additional_special_tokens` indicates whether other tokens are added to the image embedding while running the vision backbone. The most common case is that some models add a CLS token, so `num_additional_special_tokens = 1`. If the model has no CLS or any other token, `num_additional_special_tokens = 0`.

Can you share which version of transformers is failing to run generation in llava models? I have checked that `main` should be able to run the models with no error. The previous versions might fail for the llava-interleave model only, which was the only checkpoint with Siglip as backbone, and Siglip adds no CLS to images.
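As a rough illustration of what this parameter encodes (the sizes below are typical examples, not taken from any specific config):

```python
# A CLIP-style ViT prepends a CLS token, so for one image the vision tower returns
# hidden states of shape (1, num_patches + 1, hidden_dim) -> num_additional_special_tokens = 1.
# A SigLIP-style ViT adds nothing, so it returns (1, num_patches, hidden_dim)
# -> num_additional_special_tokens = 0.
num_patches = 576  # e.g. a 336x336 image split into 14x14 patches

clip_like_seq_len = num_patches + 1    # CLS + patches
siglip_like_seq_len = num_patches      # patches only

# Under vision_feature_select_strategy == "default", the model drops the first position
# before merging the image features into the text sequence; under "full", it keeps everything.
print(clip_like_seq_len, siglip_like_seq_len)  # 577 576
```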
@zucchini-nlp Here's my transformers version and reproduction code:

```python
import requests
from PIL import Image

from transformers import LlavaForConditionalGeneration, LlavaProcessor

IMG_TOKEN = "<|image|>"
model_path = "jp1924/koGemma2-it-KoLLaVA-9b-stage1.0"

# Load the custom checkpoint and its processor
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path),
    LlavaProcessor.from_pretrained(model_path),
)

device = "cpu"
prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True)
inputs["labels"] = inputs["input_ids"]
inputs = inputs.to(device)

outputs = model(**inputs)
```

transformers_version:
@jp1924 Ah, you are using your own model id, not the official ones. I can confirm that the official ones work correctly, but I cannot check the model you provided yet as it is gated. I'll wait until the request is approved, or you can also check whether your config files match the official configs.

```python
import requests
import torch
from PIL import Image

from transformers import LlavaForConditionalGeneration, LlavaProcessor

IMG_TOKEN = "<image>"
model_path = "llava-hf/llava-1.5-7b-hf"
device = "cuda:0"

# Load the official checkpoint and its processor
model, processor = (
    LlavaForConditionalGeneration.from_pretrained(model_path, device_map=device, torch_dtype="float16"),
    LlavaProcessor.from_pretrained(model_path),
)

prompts = [
    f"USER: {IMG_TOKEN}\nWhat are the things I should be cautious about when I visit this place? What should I bring with me? ASSISTANT:",
    f"USER: {IMG_TOKEN}\nWhat is this? ASSISTANT:",
]
image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

inputs = processor(images=[image1, image2], text=prompts, return_tensors="pt", padding=True).to(model.device, torch.float16)
inputs["labels"] = inputs["input_ids"]

outputs = model(**inputs)
```
@zucchini-nlp So for models where the vision encoder automatically inserts a CLS token, [...]. And for vision encoders like SigLIP that don't insert CLS tokens, you designed it intentionally so that [...]. If that's the case, I think we need to supplement the content of [...]. Here's my suggestion:

[...]

What do you think?
@jp1924 Ah, I see now, my bad. I think we should have been doing

```python
if self.vision_feature_select_strategy == "default":
    num_image_tokens -= 1
```

instead of

```python
if self.vision_feature_select_strategy == "default":
    num_image_tokens -= self.num_additional_image_tokens
```

in the processing code. It is wrong to assume that the "default" strategy removes all additional tokens, because the modeling code is clearly removing one token only. Can you please change the PR and use the suggested fix?
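For context, a hedged sketch of the processor-side computation being discussed; names follow the snippets above, and the helper function here is illustrative rather than the actual `LlavaProcessor` code:

```python
# Sketch of the placeholder-count computation, assuming a square ViT backbone.
def expected_num_image_tokens(
    image_size: int,
    patch_size: int,
    num_additional_image_tokens: int,
    vision_feature_select_strategy: str,
) -> int:
    # Patches produced by the vision tower, plus any extra tokens it prepends (e.g. CLS).
    num_image_tokens = (image_size // patch_size) ** 2 + num_additional_image_tokens
    # The modeling code slices off exactly one token under the "default" strategy,
    # so the processor must mirror that with "-= 1" rather than "-= num_additional_image_tokens".
    if vision_feature_select_strategy == "default":
        num_image_tokens -= 1
    return num_image_tokens


# CLIP-style tower with CLS: 576 patches + 1 CLS - 1 sliced = 576
print(expected_num_image_tokens(336, 14, 1, "default"))
# SigLIP-style tower without CLS, "full" strategy: 384 // 14 = 27 -> 27 * 27 = 729 patches
print(expected_num_image_tokens(384, 14, 0, "full"))
```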
Thanks a lot! Can you also add the video llava and llava next video processors, as they use the same code?

@zucchini-nlp
Thanks a lot, LGTM! Btw, there is also the "video llava" model that has similar processing and should be changed accordingly.

Feel free to @ ArthurZucker when video llava is modified, and the PR can be merged after his review.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker, could you please review this?
Since the core maintainer has a lot to review, I think we can ask for a second review from @qubvel and merge if he's happy with the changes.
```diff
        ) + self.num_additional_image_tokens
        if self.vision_feature_select_strategy == "default":
-           num_image_tokens -= self.num_additional_image_tokens
+           num_image_tokens -= 1
```
L163: we add `self.num_additional_image_tokens`
L165: we subtract 1
Might that lead to a discrepancy?
@qubvel In the existing model code, `selected_image_feature[:, 1:]` is also hardcoded with `1`, so there won't be any discrepancy.
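For reference, a simplified paraphrase of the selection step being referred to (not a verbatim copy of the modeling code):

```python
import torch

def select_image_features(image_outputs: torch.Tensor, vision_feature_select_strategy: str) -> torch.Tensor:
    # image_outputs: (batch, seq_len, hidden) hidden states from the vision tower.
    if vision_feature_select_strategy == "default":
        # Always drops exactly one leading token (the CLS slot), regardless of how
        # many additional tokens the backbone actually added -- hence the hardcoded 1.
        return image_outputs[:, 1:]
    if vision_feature_select_strategy == "full":
        return image_outputs
    raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")
```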
If I got it right, the idea is that the "default" behaviour always slices one feature, while "full" does not slice anything. Why do we need `self.num_additional_image_tokens` then? Would it be more correct to fix the modeling code instead to make it consistent? @zucchini-nlp wdyt?
We add `num_additional_image_tokens` to account for ViTs with/without CLS, so currently it is either 0 or 1. Then the modeling code has two options to select the features: either crop 1 token or take the full embeddings. So, the main reason we added `num_additional_image_tokens` in the first place was to make the processor flexible for different types of ViT and to calculate the image patch length from `patch_size`.

@qubvel hmm, not sure I got what you mean by modifying the model code?
I'm just trying to understand why we need both `num_additional_image_tokens` and `vision_feature_select_strategy` in the processor. As far as I understand, `num_additional_image_tokens` is enough to compute `num_image_tokens`, or am I missing something?
Not sure I got the whole picture 🥲 but yeah, introducing two dependent parameters might be confusing. Am I right that we should only use them as follows?

- `num_additional_image_tokens = 1` AND `vision_feature_select_strategy = "default"`
- `num_additional_image_tokens = 0` AND `vision_feature_select_strategy = "full"`
I see, confusing picture here. Yes, currently I think those two are the combinations used, but other combinations should not be a problem. For example, if one wants to keep the CLS token for any reason and experiment with that:

- `num_additional_image_tokens = 1` AND `vision_feature_select_strategy = "full"`
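A small illustrative sketch of how the placeholder count would change across the combinations, assuming the 576-patch example from earlier (not tied to a specific checkpoint):

```python
num_patches = 576
for num_additional_image_tokens in (0, 1):
    for strategy in ("default", "full"):
        n = num_patches + num_additional_image_tokens
        if strategy == "default":
            n -= 1  # the model always slices off one token under "default"
        print(num_additional_image_tokens, strategy, n)
# 0 default 575  -> a real patch gets dropped; only sensible if that is intended
# 0 full    576  -> e.g. a SigLIP backbone used as-is
# 1 default 576  -> e.g. a CLIP backbone with its CLS token removed
# 1 full    577  -> keep the CLS token and feed it to the LM as well
```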
Ok, I hope I got it now 😄 let's ensure we have some docs and comments regarding it, because it isn't obvious.
Maybe "default" is not the best name for it, cause I expected it to be like "remove cls token if it exists", but it looks like it is "remove first token in any case"
hehe, the naming comes from the original implementation. I will update the docstring for more clarity then, in a subsequent PR, since this one is about fixing a bug.

Thanks everyone, rebasing
good decision to merge and sorry everyone for the late merge!
thank you!
What does this PR do?
In PR #33424, we resolved issue #34447 by adding `num_additional_image_tokens` to the processor. However, the additional tokens are only considered in the processor, and since they are not accounted for in the modeling code, some users are still encountering an "img token mismatch error".

To address this problem, I have added code so that the modeling code also considers `num_additional_image_tokens`.
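For illustration, a schematic of the mismatch (hypothetical numbers; not the actual modeling or processing code):

```python
# The processor writes N <image> placeholder tokens into input_ids, and the model
# produces M image embeddings to scatter into those slots. Whenever N != M, the
# forward pass fails with an image-token mismatch error.
vision_feature_select_strategy = "default"
num_patches = 576                 # patches from the vision tower
num_additional_image_tokens = 0   # e.g. a SigLIP backbone that adds no CLS token

# Placeholder count as the processor computed it before this fix
placeholders = num_patches + num_additional_image_tokens
if vision_feature_select_strategy == "default":
    placeholders -= num_additional_image_tokens  # subtracts 0, leaving 576

# Feature count as the model produces it: "default" slices off one token
image_features = num_patches + num_additional_image_tokens - 1  # 575

print(placeholders, image_features)  # 576 575 -> mismatch, generation fails
```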
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@zucchini-nlp