Add granite vision docs
Replace multimodal granite refs with granite vision

Add granite vision / llava next alias

Signed-off-by: Alex-Brooks <[email protected]>
alex-jw-brooks committed Jan 23, 2025
1 parent c961662 commit bfd6166
Showing 4 changed files with 91 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -452,6 +452,8 @@
title: Granite
- local: model_doc/granitemoe
title: GraniteMoe
- local: model_doc/granitevision
title: GraniteVision
- local: model_doc/helium
title: Helium
- local: model_doc/herbert
85 changes: 85 additions & 0 deletions docs/source/en/model_doc/granitevision.md
@@ -0,0 +1,85 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Granite Vision

## Overview

The Granite Vision model is a variant of [LLaVA-NeXT](llava_next), leveraging a [Granite](granite) language model alongside a [SigLIP](siglip) visual encoder. It uses multiple concatenated vision hidden states as its image features, similar to [VipLlava](vipllava). It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
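
These differences are visible directly in the checkpoint configuration. Below is a minimal sketch for inspecting them, assuming the pre-release model path used in the inference example further down (which may change):
```python
from transformers import LlavaNextConfig

# Pre-release placeholder path, subject to change; see https://huggingface.co/ibm-granite.
config = LlavaNextConfig.from_pretrained("ibm-granite/granite-3.1-2b-instruct-vision")

# A list of layer indices here means the corresponding vision hidden states are
# concatenated to build the image features (similar to VipLlava).
print(config.vision_feature_layer)

# A larger set of (height, width) pinpoints than the original LLaVA-NeXT models,
# which enables additional aspect ratios.
print(config.image_grid_pinpoints)
```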

Tips:
- This model is loaded into Transformers as an instance of LLaVA-NeXT. The usage and tips from [LLaVA-NeXT](llava_next) apply to this model as well.

- You can apply the chat template on the tokenizer / processor in the same way; see the sketch after this block for producing the string with `apply_chat_template`. Example chat format:
```bash
"<|user|>\nWhat’s shown in this image?\n<|assistant|>\nThis image shows a red stop sign.<|end_of_text|><|user|>\nDescribe the image in more details.\n<|assistant|>\n"
```
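
A sketch of how a string in that format can be produced from a structured conversation; the exact rendered tokens (including where the image placeholder lands) come from the checkpoint's chat template, so treat this as illustrative:
```python
from transformers import LlavaNextProcessor

# Pre-release placeholder path, subject to change.
processor = LlavaNextProcessor.from_pretrained("ibm-granite/granite-3.1-2b-instruct-vision")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Describe the image in more detail."}],
    },
]

# add_generation_prompt=True appends the trailing assistant turn so the model
# continues with its answer.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```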

Sample inference:
```python
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests

# Note: These docs were written prior to the public model release,
# and this path is subject to change.
# Please see https://huggingface.co/ibm-granite for the current model list.
model_path = "ibm-granite/granite-3.1-2b-instruct-vision"
processor = LlavaNextProcessor.from_pretrained(model_path)

model = LlavaNextForConditionalGeneration.from_pretrained(model_path).to("cuda")

# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What is shown in this image?"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(image, prompt, return_tensors="pt").to("cuda")

# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)

print(processor.decode(output[0], skip_special_tokens=True))
```
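
The snippet above loads the model in full precision; depending on your hardware you may prefer half precision. A minimal variant of the loading line, assuming `accelerate` is installed for `device_map="auto"`:
```python
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path,  # same pre-release placeholder path as above
    torch_dtype=torch.float16,
    device_map="auto",
)
```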

This model was contributed by [Alexander Brooks](https://huggingface.co/abrooks9944).

## LlavaNextConfig

[[autodoc]] LlavaNextConfig

## LlavaNextImageProcessor

[[autodoc]] LlavaNextImageProcessor
- preprocess

## LlavaNextProcessor

[[autodoc]] LlavaNextProcessor

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration
- forward
3 changes: 3 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -134,6 +134,7 @@
("gptsan-japanese", "GPTSanJapaneseConfig"),
("granite", "GraniteConfig"),
("granitemoe", "GraniteMoeConfig"),
("granitevision", "LlavaNextConfig"),
("graphormer", "GraphormerConfig"),
("grounding-dino", "GroundingDinoConfig"),
("groupvit", "GroupViTConfig"),
@@ -456,6 +457,7 @@
("gptsan-japanese", "GPTSAN-japanese"),
("granite", "Granite"),
("granitemoe", "GraniteMoeMoe"),
("granitevision", "LLaVA-NeXT"),
("graphormer", "Graphormer"),
("grounding-dino", "Grounding DINO"),
("groupvit", "GroupViT"),
@@ -725,6 +727,7 @@
("siglip_vision_model", "siglip"),
("chinese_clip_vision_model", "chinese_clip"),
("rt_detr_resnet", "rt_detr"),
("granitevision", "llava_next"),
]
)

2 changes: 1 addition & 1 deletion tests/models/vipllava/test_modeling_vipllava.py
@@ -273,7 +273,7 @@ def test_vision_feature_layers(self, vision_feature_layers):
"""
# NOTE: vipllava uses vision_feature_layers instead of vision_feature_layer as the
# config key. The reason is that other llava classes supported one vision feature layer
# and added support for a list of layers with multimodal granite support, while vipllava
# and added support for a list of layers with granite vision support, while vipllava
# originally supported multiple feature layers, and added support for a single layer
# for compatibility reasons.
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
