Add support for Qwen2VL #10361
Conversation
There is a model called ShowUI. It is supposed to point at an element on the image so that you can control the computer with the mouse. But with this implementation it misses, and nothing works.

import subprocess
from PIL import Image, ImageDraw


def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = (
        f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf "
        f"--mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 "
        f"-p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. "
        f"The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, "
        f"scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    )
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()
    # The model prints the answer as "[x, y]" on the last line; strip the brackets and split into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])
    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size
    # Convert the relative coordinates to absolute pixel coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)
    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs - 5, y_abs - 5), (x_abs + 5, y_abs + 5)], fill=(255, 0, 0))
    # Save the output image
    img.save(output_image_path)


# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model: https://huggingface.co/showlab/ShowUI-2B
@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.
Testing right now to get it running. If I git clone the Qwen/Qwen2-VL-2B-Instruct repo into /whatever/Qwen/Qwen2-VL-2B-Instruct/ and make a GGUF out of it with convert_hf_to_gguf.py, everything is fine and I get a Qwen-Qwen2-VL-2B-Instruct-F16.gguf. But when I try to convert the vision encoder to GGUF format with qwen2_vl_surgery.py, I can't; Python throws an error.
It just works without specifying the path to the downloaded HF model repo, though.
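A sketch of the two invocations being compared here. This assumes qwen2_vl_surgery.py takes an optional model argument and falls back to the upstream Qwen/Qwen2-VL-2B-Instruct checkpoint when it is omitted; the script location and exact argument handling may differ in the merged version.

# throws an error: pointing the script at the local clone
python3 examples/llava/qwen2_vl_surgery.py "/whatever/Qwen/Qwen2-VL-2B-Instruct/"
# works: letting the script use its default model id
python3 examples/llava/qwen2_vl_surgery.py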
Maybe there should be an error or a warning when trying to run it with unsupported backends? I was confused by the completely made-up answers when running it on Vulkan.
This indicates a bug. Unsupported operations should be automatically moved back to the CPU, and the results would be correct; only the execution would be slow. Getting wrong results is not expected and should be investigated.
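One way to sanity-check whether a wrong answer comes from a backend issue rather than the model itself is to keep everything on the CPU and compare the output. A rough sketch, assuming the CLI accepts llama.cpp's usual -ngl/--n-gpu-layers option and using placeholder file names:

# offload zero layers so the language model runs entirely on the CPU backend
./llama-qwen2vl-cli -m Qwen2-VL-2B-Instruct-F16.gguf --mmproj qwen2vl-vision.gguf --image demo.jpg -ngl 0 -p "Describe this image."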
Is it supposed to work with llama-server yet? I used the GGUF from here: https://huggingface.co/tensorblock/Qwen2-VL-7B-Instruct-GGUF/blob/main/README.md and it returns nonsense in the UI.
@husnoo Not yet. You have to compile llama.cpp and use the llama-qwen2vl-cli binary on the command line. Read the doc above for details.
After some preliminary investigation, I believe the issue is that
On M3-Max 64GB
Looks like
Q4_K_M is giving the error for me.
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR implements the Qwen2VL model as requested in #9246.
The main changes include:
- llama_context.n_pos_per_token, to support more than one position id per token
- examples/llava/qwen2vl-cli.cpp, to handle Qwen2VL data preprocess steps & prompts

TODO:
Steps to convert the model and run inference
1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:
   python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py.
3. Build llama-qwen2vl-cli in the same way you would build llama-llava-cli.
4. Run the command (it's recommended to resize the image to a resolution below 640x640, so it won't take forever to run on the CPU backend). A sketched example follows these steps.
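For step 4, a sketch of what the command might look like; the GGUF file names and the image are placeholders, and the flags are taken from the ShowUI script earlier in this thread rather than from the final documentation:

./llama-qwen2vl-cli -m Qwen2-VL-2B-Instruct-F16.gguf --mmproj qwen2vl-vision.gguf --image demo-640x480.jpg --temp 0 -p "Describe this image."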
Future work: