
Add support for Qwen2VL #10361

Merged
merged 35 commits into ggerganov:master on Dec 14, 2024

Conversation

HimariO
Contributor

@HimariO HimariO commented Nov 17, 2024

This PR implements the Qwen2VL model as requested in #9246.
The main changes include:

  • Add m-RoPE and vision RoPE modes to the existing RoPE op in the CPU and CUDA backends
  • Add llama_context.n_pos_per_token to support more than one position id per token (see the sketch after this list)
  • Add Qwen2VL llama architecture
  • Add Qwen2VL clip vision architecture
  • Add examples/llava/qwen2vl-cli.cpp to handle Qwen2VL data preprocessing steps & prompts
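
As background on why more than one position id per token is needed: Qwen2VL's m-RoPE assigns each token a (temporal, height, width) position triple rather than a single index. The snippet below is only a conceptual sketch in Python, not the code added by this PR, and it assumes a single still image appended after a text prefix; the exact indexing Qwen2VL uses (e.g. how positions continue for text after the image) differs in detail.

# Conceptual sketch (not the PR's C++ code): assign one (temporal, height, width)
# position triple per token for a text prefix followed by one still image's patch grid.
def mrope_positions(n_text_tokens: int, grid_h: int, grid_w: int):
    positions = []  # one triple per token instead of a single integer position

    # Text tokens: all three components advance together, which reduces to ordinary RoPE.
    for pos in range(n_text_tokens):
        positions.append((pos, pos, pos))

    # Image patches: the temporal component stays fixed for a still image,
    # while the height/width components follow the patch grid.
    t0 = n_text_tokens
    for h in range(grid_h):
        for w in range(grid_w):
            positions.append((t0, t0 + h, t0 + w))
    return positions

print(mrope_positions(n_text_tokens=2, grid_h=2, grid_w=2))
# [(0, 0, 0), (1, 1, 1), (2, 2, 2), (2, 2, 3), (2, 3, 2), (2, 3, 3)]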

TODO:

  • Fix CI errors caused by the linter and unit tests
  • Remove code and build configs used only for developing/debugging Qwen2VL

Steps to convert the model and run inference

  1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:

    python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
  2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py:

    PYTHONPATH=$PYTHONPATH:$(pwd)/gguf-py python3 examples/llava/qwen2_vl_surgery.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
  3. Build the llama-qwen2vl-cli in the same way you would build llama-llava-cli.

  4. Run the command (it's recommended to resize the image to a resolution below 640x640 so it doesn't take too long on the CPU backend; see the optional resizing sketch after these steps):

    ./llama-qwen2vl-cli -m qwen2-vl-decoder.gguf --mmproj qwen2vl-vision.gguf -p "Describe this image." --image "demo.jpg"
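
If your test image is large, an optional helper like the following can downscale it while preserving the aspect ratio before you pass it to --image. This is just a sketch using Pillow, not part of this PR; the file names are placeholders.

from PIL import Image

def downscale(src_path: str, dst_path: str, max_side: int = 640) -> None:
    # Shrink the image so its longest side is at most max_side.
    # thumbnail() resizes in place, keeps the aspect ratio, and never upscales.
    img = Image.open(src_path)
    img.thumbnail((max_side, max_side))
    img.save(dst_path)

downscale("demo.jpg", "demo_small.jpg")  # then pass --image "demo_small.jpg" to the command above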

Future work:

  • Add MPS, Vulkan backend support

@github-actions bot added the build, Nvidia GPU, examples, python, and ggml labels Nov 17, 2024
@HimariO HimariO marked this pull request as ready for review November 29, 2024 14:50
@barinov274

There is a model called ShowUI. It is supposed to point at an element in the image so that you can control the computer with the mouse. But with your implementation it misses the target and nothing works.
Here is the code I wrote so you can test it yourself. As the image I used a screenshot with Firefox open.

import subprocess
from PIL import Image, ImageDraw

def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf --mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 -p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()

    # Remove the square brackets and split the string into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])

    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size

    # Convert the relative coordinates to absolute coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)

    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs-5, y_abs-5), (x_abs+5, y_abs+5)], fill=(255, 0, 0))

    # Save the output image
    img.save(output_image_path)

# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model https://huggingface.co/showlab/ShowUI-2B
Here you can test how it should work. https://huggingface.co/spaces/showlab/ShowUI

@HimariO
Contributor Author

HimariO commented Dec 3, 2024

@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.

@github-actions bot added the SYCL and Kompute labels Dec 13, 2024
@ggerganov ggerganov merged commit ba1cb19 into ggerganov:master Dec 14, 2024
1 check passed
@ali0une

ali0une commented Dec 14, 2024

Testing right now to get it running.

If I git clone the Qwen/Qwen2-VL-2B-Instruct repo into /whatever/Qwen/Qwen2-VL-2B-Instruct/ and make a GGUF out of it with convert_hf_to_gguf.py, everything is fine and I get a Qwen-Qwen2-VL-2B-Instruct-F16.gguf.

But when I try to convert the vision encoder to GGUF format with qwen2_vl_surgery.py:
python examples/llava/qwen2_vl_surgery.py "/whatever/Qwen/Qwen2-VL-2B-Instruct/"

I can't; Python throws an error:

(venv) ali0une@Debian:~/compil/llama.cpp$ python examples/llava/qwen2_vl_surgery.py "/whatever/Qwen/Qwen2-VL-2B-Instruct"
model_name:  /whatever/Qwen/Qwen2-VL-2B-Instruct
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.17it/s]
Qwen2VLVisionConfig {
  "depth": 32,
  "embed_dim": 1280,
  "hidden_act": "quick_gelu",
  "hidden_size": 1536,
  "in_channels": 3,
  "in_chans": 3,
  "mlp_ratio": 4,
  "model_type": "qwen2_vl",
  "num_heads": 16,
  "patch_size": 14,
  "spatial_merge_size": 2,
  "spatial_patch_size": 14,
  "temporal_patch_size": 2,
  "transformers_version": "4.47.0"
}

[to_gguf_name] vision_model.blocks.0.norm1.weight --> v.blk.0.ln1.weight

...

[to_gguf_name] merger.mlp.2.bias --> mm.2.bias
Traceback (most recent call last):

  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/Qwen2-VL-2B-Instruct/resolve/main/processor_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 403, in cached_file
    resolved_file = hf_hub_download(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 862, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 969, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1484, in _raise_on_head_call_error
    raise head_call_error
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1376, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1296, in get_hf_file_metadata
    r = _request_wrapper(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 277, in _request_wrapper
    response = _request_wrapper(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 301, in _request_wrapper
    hf_raise_for_status(response)
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 454, in hf_raise_for_status
    raise _format(RepositoryNotFoundError, message, response) from e
huggingface_hub.errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-675de5d3-319c681e02ab26174d878b71;82815812-2512-468e-bba8-8beac819dd0c)

Repository Not Found for url: https://huggingface.co/Qwen2-VL-2B-Instruct/resolve/main/processor_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/whatever/llama.cpp/examples/llava/qwen2_vl_surgery.py", line 158, in <module>
    main(args)
  File "/whatever/llama.cpp/examples/llava/qwen2_vl_surgery.py", line 142, in main
    processor: Qwen2VLProcessor = AutoProcessor.from_pretrained(model_name)
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/transformers/models/auto/processing_auto.py", line 254, in from_pretrained
    processor_config_file = get_file_from_repo(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 557, in get_file_from_repo
    return cached_file(
  File "/whatever/llama.cpp/venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 426, in cached_file
    raise EnvironmentError(
OSError: Qwen2-VL-2B-Instruct is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

But it just works without specifying the path to the downloaded HF model repo:
(venv) ali0une@Debian:~/compil/llama.cpp$ python examples/llava/qwen2_vl_surgery.py
and I get the /whatever/llama.cpp/qwen-qwen2-vl-2b-instruct-vision.gguf

@bartowski1182
Contributor

@ali0une I found the same issue and made some minor changes to qwen2_vl_surgery.py to fix it for already-downloaded models:

#10833

The reason it works without specifying a model is that a default model id is set, which the script downloads from the Hub.
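
For illustration, a minimal sketch of the transformers behavior involved (paths are placeholders): from_pretrained loads from disk when the argument is an existing local directory, and otherwise treats it as a Hub repo id, which is why the bare name in the traceback triggered a Hub lookup.

from transformers import AutoProcessor

# An existing local directory is loaded straight from disk; no Hub request is made.
processor = AutoProcessor.from_pretrained("/whatever/Qwen/Qwen2-VL-2B-Instruct")

# A bare name like "Qwen2-VL-2B-Instruct" (no org prefix) is treated as a Hub repo id,
# which fails here with RepositoryNotFoundError, as in the traceback above.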

@stduhpf
Contributor

stduhpf commented Dec 15, 2024

Maybe there should be an error or a warning when trying to run it with unsupported backends? I was confused by the completely made-up answers when running it on Vulkan.

@ggerganov
Owner

Maybe there should be an error or a warning when trying to run it with unsupported backends? I was confused by the completely made-up answers when running it on Vulkan.

This indicates a bug. Unsupported operations should be automatically moved back to the CPU, and the results should still be correct; only the execution would be slower. Getting wrong results is not expected and should be investigated.

@husnoo

husnoo commented Dec 15, 2024

Is it supposed to work with llama-server yet?

I used the gguf from this place: https://huggingface.co/tensorblock/Qwen2-VL-7B-Instruct-GGUF/blob/main/README.md

And it returns nonsense in the UI.

@ali0une

ali0une commented Dec 15, 2024

@husnoo Not yet. You have to compile llama.cpp and use the llama-qwen2vl-cli binary on the command line. See the instructions above for details.

@HimariO
Contributor Author

HimariO commented Dec 16, 2024

Maybe there should be an error or a warning when trying to run it with unsupported backends? I was confused by the completely made-up answers when running it on Vulkan.

After some preliminary investigation, I believe the issue is that clip_model_load does not use ggml_backend_sched_reserve to schedule operations to the appropriate backend, as llama_new_context_with_model does. As a result, the entire vision model’s computation graph runs on a single backend, hitting the unsupported Vulkan RoPE op.

@chigkim

chigkim commented Dec 19, 2024

On an M3 Max with 64GB:
error: unsupported op 'IM2COL'
Other people are hitting the same error on Mac; see #9246.

@oursland

Looks like IM2COL is only implemented for f32 and f16 on Metal. I've not looked into it personally, but what is the data type of the Qwen2VL you're testing?

@bunnyfu

bunnyfu commented Dec 19, 2024

Looks like IM2COL is only implemented for f32 and f16 on Metal. I've not looked into it personally, but what is the data type of the Qwen2VL you're testing?

Q4_K_M is giving the error for me.

@oursland

As reported in #9246, this issue is resolved with #10896, which has just been merged.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* Barebone Qwen2VL LLM convertor

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, out dated func args

* update `llama_hparams`

* update to keep up stream changes

* resolve linter, test errors

* add makefile entry, update special image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixes

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix trailing whitespace

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remove old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <[email protected]>