Add granite vision docs
Replace multimodal granite refs with granite vision

Add granite vision / llava next alias

Signed-off-by: Alex-Brooks <[email protected]>
alex-jw-brooks committed Jan 23, 2025
1 parent c961662 commit bfd6166
Showing 4 changed files with 91 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -452,6 +452,8 @@
title: Granite
- local: model_doc/granitemoe
title: GraniteMoe
- local: model_doc/granitevision
title: GraniteVision
- local: model_doc/helium
title: Helium
- local: model_doc/herbert
85 changes: 85 additions & 0 deletions docs/source/en/model_doc/granitevision.md
@@ -0,0 +1,85 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Granite Vision

## Overview

The Granite Vision model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](siglip) visual encoder. It uses multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
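
These differences are visible directly in the checkpoint configuration. Below is a minimal sketch for inspecting them, assuming the pre-release model path used in the inference example further down (which may change):
```python
from transformers import LlavaNextConfig

# Pre-release placeholder path, subject to change; see https://huggingface.co/ibm-granite.
config = LlavaNextConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct-vision")

# A list of layer indices here means the corresponding vision hidden states are
# concatenated to build the image features (similar to VipLlava).
print(config.vision_feature_layer)

# A larger set of (height, width) pinpoints than the original LLaVA-NeXT models,
# which enables additional aspect ratios.
print(config.image_grid_pinpoints)
```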

Tips:
- This model is loaded into Transformers as an instance of LLaVA-NeXT. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well.

- You can apply the chat template on the tokenizer / processor in the same way; see the sketch after this block for producing the string with `apply_chat_template`. Example chat format:
```bash
"<|user|>\nWhat’s shown in this image?\n<|assistant|>\nThis image shows a red stop sign.<|end_of_text|><|user|>\nDescribe the image in more details.\n<|assistant|>\n"
```
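
A sketch of how a string in that format can be produced from a structured conversation; the exact rendered tokens (including where the image placeholder lands) come from the checkpoint's chat template, so treat this as illustrative:
```python
from transformers import LlavaNextProcessor

# Pre-release placeholder path, subject to change.
processor = LlavaNextProcessor.from_pretrained("ibm-granite/granite-3.1-2b-instruct-vision")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Describe the image in more detail."}],
    },
]

# add_generation_prompt=True appends the trailing assistant turn so the model
# continues with its answer.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```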

Sample inference:
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests

# Note: These docs were written prior to the public model release,
# and this path is subject to change.
# Please see https://huggingface.co/ibm-granite for the current model list.
model_path = "ibm-granite/granite-3.1-2b-instruct-vision"
processor = LlavaNextProcessor.from_pretrained(model_path)

model = LlavaNextForConditionalGeneration.from_pretrained(model_path).to("cuda")

# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```
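
The snippet above loads the model in full precision; depending on your hardware you may prefer half precision. A minimal variant of the loading line, assuming `accelerate` is installed for `device_map="auto"`:
```python
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path,  # same pre-release placeholder path as above
    torch_dtype=torch.float16,
    device_map="auto",
)
```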

This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944).

## LlavaNextConfig

[[autodoc]] LlavaNextConfig

## LlavaNextImageProcessor

[[autodoc]] LlavaNextImageProcessor
- preprocess

## LlavaNextProcessor

[[autodoc]] LlavaNextProcessor

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration
- forward
3 changes: 3 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -134,6 +134,7 @@
("gptsan-japanese", "GPTSanJapaneseConfig"),
("granite", "GraniteConfig"),
("granitemoe", "GraniteMoeConfig"),
("granitevision", "LlavaNextConfig"),
("graphormer", "GraphormerConfig"),
("grounding-dino", "GroundingDinoConfig"),
("groupvit", "GroupViTConfig"),
@@ -456,6 +457,7 @@
("gptsan-japanese", "GPTSAN-japanese"),
("granite", "Granite"),
("granitemoe", "GraniteMoeMoe"),
("granitevision", "LLaVA-NeXT"),
("graphormer", "Graphormer"),
("grounding-dino", "Grounding DINO"),
("groupvit", "GroupViT"),
@@ -725,6 +727,7 @@
("siglip_vision_model", "siglip"),
("chinese_clip_vision_model", "chinese_clip"),
("rt_detr_resnet", "rt_detr"),
("granitevision", "llava_next"),
]
)

2 changes: 1 addition & 1 deletion tests/models/vipllava/test_modeling_vipllava.py
@@ -273,7 +273,7 @@ def test_vision_feature_layers(self, vision_feature_layers):
"""
# NOTE: vipllava uses vision_feature_layers instead of vision_feature_layer as the
# config key. The reason is that other llava classes supported one vision feature layer
# and added support for a list of layers with multimodal granite support, while vipllava
# and added support for a list of layers with granite vision support, while vipllava
# originally supported multiple feature layers, and added support for a single layer
# for compatibility reasons.
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
