Dear authors,

Thank you for your work and for releasing the d-cube dataset.

I was trying to run a pre-trained OWL-ViT model (e.g., "google/owlvit-base-patch32") on the dataset and found that the following sentences yield a RuntimeError:
ID: 140, TEXT: "a person who wears a hat and holds a tennis racket on the tennis court",
ID: 146, TEXT: "the player who is ready to bat with both feet leaving the ground in the room",
ID: 253, TEXT: "a person who plays music with musical instrument surrounded by spectators on the street",
ID: 342, TEXT: "a fisher who stands on the shore and whose lower body is not submerged by water",
ID: 348, TEXT: "a person who stands on the stage for speech but don't open their mouths",
ID: 355, TEXT: "a person with a pen in one hand but not looking at the paper",
ID: 356, TEXT: "a billiard ball with no numbers or patterns on its surface on the table",
ID: 364, TEXT: "a person standing at the table of table tennis who is not waving table tennis rackets",
ID: 404, TEXT: "a water polo player who is in the water but does not hold the ball",
ID: 405, TEXT: "a barbell held by a weightlifter that has not been lifted above the head",
ID: 412, TEXT: "a person who wears a helmet and sling equipment but is not on the sling",
ID: 419, TEXT: "person who kneels on one knee and proposes but has nothing in his hand"
A typical error message is shown at the bottom. It seems that the pre-trained model uses max_position_embeddings = 16 in OwlViTTextConfig, which is not long enough to accept the descriptions above as inputs. All of the OWL-ViT checkpoints available on Hugging Face appear to use max_position_embeddings = 16. Did you encounter the same issue when running the experiments for your paper? If so, how did you handle it in the evaluation process?
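For reference, here is a minimal sketch of how the over-long descriptions can be detected ahead of time (this is just my illustration, not code from the dataset toolkit; the `descriptions` list is a subset of the IDs above):

```python
from transformers import AutoConfig, AutoProcessor

ckpt = "google/owlvit-base-patch32"
processor = AutoProcessor.from_pretrained(ckpt)
max_len = AutoConfig.from_pretrained(ckpt).text_config.max_position_embeddings  # 16

descriptions = [
    "a person who wears a hat and holds a tennis racket on the tennis court",
    "a billiard ball with no numbers or patterns on its surface on the table",
]

for text in descriptions:
    # Tokenize without padding or truncation; the CLIP tokenizer adds BOS and EOS tokens.
    n_tokens = len(processor.tokenizer(text)["input_ids"])
    if n_tokens > max_len:
        print(f"{n_tokens} tokens (> {max_len}): {text}")
```

Descriptions that tokenize to more than 16 tokens produce the size-mismatch RuntimeError shown in the traceback below.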
Thanks in advance.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[129], line 1
----> 1 results = get_prediction(processor, model, image, [text_list[0]])
Cell In[11], line 13, in get_prediction(processor, model, image, captions, cpu_only)
9 with torch.no_grad():
10 inputs = processor(text=[captions], images=image, return_tensors="pt").to(
11 device
12 )
---> 13 outputs = model(**inputs)
14 target_size = torch.Tensor([image.size[::-1]]).to(device)
15 results = processor.post_process_object_detection(
16 outputs=outputs, target_sizes=target_size, threshold=0.05
17 )
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1640, in OwlViTForObjectDetection.forward(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states, return_dict)
1637 return_dict = return_dict if return_dict is not None else self.config.return_dict
1639 # Embed images and text queries
-> 1640 query_embeds, feature_map, outputs = self.image_text_embedder(
1641 input_ids=input_ids,
1642 pixel_values=pixel_values,
1643 attention_mask=attention_mask,
1644 output_attentions=output_attentions,
1645 output_hidden_states=output_hidden_states,
1646 )
1648 # Text and vision model outputs
1649 text_outputs = outputs.text_model_output
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1385, in OwlViTForObjectDetection.image_text_embedder(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states)
1376 def image_text_embedder(
1377 self,
1378 input_ids: torch.Tensor,
(...)
1383 ) -> Tuple[torch.FloatTensor]:
1384 # Encode text and image
-> 1385 outputs = self.owlvit(
1386 pixel_values=pixel_values,
1387 input_ids=input_ids,
1388 attention_mask=attention_mask,
1389 output_attentions=output_attentions,
1390 output_hidden_states=output_hidden_states,
1391 return_dict=True,
1392 )
1394 # Get image embeddings
1395 last_hidden_state = outputs.vision_model_output[0]
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1163, in OwlViTModel.forward(self, input_ids, pixel_values, attention_mask, return_loss, output_attentions, output_hidden_states, return_base_image_embeds, return_dict)
1155 vision_outputs = self.vision_model(
1156 pixel_values=pixel_values,
1157 output_attentions=output_attentions,
1158 output_hidden_states=output_hidden_states,
1159 return_dict=return_dict,
1160 )
1162 # Get embeddings for all text queries in all batch samples
-> 1163 text_outputs = self.text_model(
1164 input_ids=input_ids,
1165 attention_mask=attention_mask,
1166 output_attentions=output_attentions,
1167 output_hidden_states=output_hidden_states,
1168 return_dict=return_dict,
1169 )
1171 text_embeds = text_outputs[1]
1172 text_embeds = self.text_projection(text_embeds)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:798, in OwlViTTextTransformer.forward(self, input_ids, attention_mask, position_ids, output_attentions, output_hidden_states, return_dict)
796 input_shape = input_ids.size()
797 input_ids = input_ids.view(-1, input_shape[-1])
--> 798 hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
800 # num_samples, seq_len = input_shape where num_samples = batch_size * num_max_text_queries
801 # OWLVIT's text model uses causal mask, prepare it here.
802 # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
803 causal_attention_mask = _create_4d_causal_attention_mask(
804 input_shape, hidden_states.dtype, device=hidden_states.device
805 )
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:332, in OwlViTTextEmbeddings.forward(self, input_ids, position_ids, inputs_embeds)
329 inputs_embeds = self.token_embedding(input_ids)
331 position_embeddings = self.position_embedding(position_ids)
--> 332 embeddings = inputs_embeds + position_embeddings
334 return embeddings
RuntimeError: The size of tensor a (18) must match the size of tensor b (16) at non-singleton dimension 1
Thanks for your interest.
Regarding your question: for OWL-ViT, we skipped these sentences with a try-except block during evaluation. The other methods we evaluated do not have this constraint on input length, so no such handling is needed for them. Simply truncating the input to the 16-token limit might be a better solution, and we will give it a try.
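A rough sketch of that skip-on-error handling (my paraphrase, not the actual evaluation code; `get_prediction` is the helper from the traceback above, and `sentences` is an illustrative list of (ID, description) pairs):

```python
predictions = {}
for sent_id, text in sentences:
    try:
        predictions[sent_id] = get_prediction(processor, model, image, [text])
    except RuntimeError:
        # The description tokenizes to more than max_position_embeddings (16) tokens,
        # so the OWL-ViT text encoder cannot embed it; skip it in the evaluation.
        print(f"Skipping ID {sent_id}: description too long for the OWL-ViT text encoder")
```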
If you have further questions, please feel free to email me.
Thank you for the clarification. So you omitted those sentences in the inter-scenario case as well?
Regards, Haruki
@HarukiNishimura-TRI Yes, I believe so, for OWL-ViT. For inference with OWL-ViT, I think it would be better to truncate the descriptions to 16 tokens and run inference on the truncated text.
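A minimal sketch of that truncation workaround, adapted from the `get_prediction` helper in the traceback; it assumes the processor forwards tokenizer keyword arguments such as `truncation` and `max_length` to the underlying CLIP tokenizer, and the image path is a placeholder:

```python
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

ckpt = "google/owlvit-base-patch32"
processor = OwlViTProcessor.from_pretrained(ckpt)
model = OwlViTForObjectDetection.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder image
captions = ["a person who wears a hat and holds a tennis racket on the tennis court"]

max_len = model.config.text_config.max_position_embeddings  # 16
inputs = processor(
    text=[captions],
    images=image,
    return_tensors="pt",
    truncation=True,     # clip over-long descriptions instead of raising a RuntimeError
    max_length=max_len,  # keep input_ids within the position embeddings
)

with torch.no_grad():
    outputs = model(**inputs)

target_size = torch.Tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_size, threshold=0.05
)
```

Note that anything past the 16th token is simply dropped, so the model never sees the tail of the longer descriptions; that is the trade-off versus skipping them entirely.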