Add support for Qwen2VL #10361
Conversation
There is a model called ShowUI. It is supposed to point at an element on the image so that you can control the computer with the mouse. But with this implementation it misses, and nothing works.

import subprocess
from PIL import Image, ImageDraw


def detect_and_mark_element(image_path, element, output_image_path):
    # Run the model to get the coordinates of the element
    command = (
        f"./llama-qwen2vl-cli -m ShowUI-2B/Qwen2-VL-2B-Instruct-F16.gguf "
        f"--mmproj ShowUI-2B/qwen2vl-vision.gguf --image \"{image_path}\" --temp 0 "
        f"-p \"<|im_start|>system\nBased on the screenshot, I give a text description and you give its corresponding location. "
        f"The coordinate represents a clickable location [x, y] for an element, which is a relative coordinate on the screenshot, "
        f"scaled from 0 to 1.<|im_end|>\n<|im_start|>user\n<|vision_start|><image><|vision_end|>{element}<|im_end|>\n<|im_start|>assistant\n\""
    )
    output = subprocess.check_output(command, shell=True)
    output = output.decode("utf-8").strip()
    # The model prints the answer as "[x, y]" on the last line; strip the brackets and split into coordinates
    coordinates = output.splitlines()[-1][1:-1].split(", ")
    x, y = float(coordinates[0]), float(coordinates[1])
    # Open the image and get its dimensions
    img = Image.open(image_path)
    width, height = img.size
    # Convert the relative coordinates to absolute pixel coordinates
    x_abs = int(x * width)
    y_abs = int(y * height)
    # Draw a red circle on the detected element
    draw = ImageDraw.Draw(img)
    draw.ellipse([(x_abs - 5, y_abs - 5), (x_abs + 5, y_abs + 5)], fill=(255, 0, 0))
    # Save the output image
    img.save(output_image_path)


# Example usage:
detect_and_mark_element("screenshot.png", "Click on the address bar", "output.png")

Here is a link to the model: https://huggingface.co/showlab/ShowUI-2B
@barinov274 While ShowUI is built on top of Qwen2VL, it employs a different image processing workflow. Therefore, I believe adding support for ShowUI should be addressed in a separate PR or issue.
Testing right now to get it running. If I git clone the Qwen/Qwen2-VL-2B-Instruct repo into /whatever/Qwen/Qwen2-VL-2B-Instruct/ and make a GGUF out of it with convert_hf_to_gguf.py, everything is fine and I get a Qwen-Qwen2-VL-2B-Instruct-F16.gguf. But when I try to convert the vision encoder to GGUF format with qwen2_vl_surgery.py, I can't; Python throws an error.
It just works without specifying the path to the downloaded HF model repo, though.
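A sketch of the two invocations being compared here. This assumes qwen2_vl_surgery.py takes an optional model argument and falls back to the upstream Qwen/Qwen2-VL-2B-Instruct checkpoint when it is omitted; the script location and exact argument handling may differ in the merged version.

# throws an error: pointing the script at the local clone
python3 examples/llava/qwen2_vl_surgery.py "/whatever/Qwen/Qwen2-VL-2B-Instruct/"
# works: letting the script use its default model id
python3 examples/llava/qwen2_vl_surgery.py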
Maybe there should be an error or a warning when trying to run it with unsupported backends? I was confused by the completely made-up answers when running it on Vulkan.
This indicates a bug. Unsupported operations should be automatically moved back to the CPU, and the results would be correct; only the execution would be slow. Getting wrong results is not expected and should be investigated.
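One way to sanity-check whether a wrong answer comes from a backend issue rather than the model itself is to keep everything on the CPU and compare the output. A rough sketch, assuming the CLI accepts llama.cpp's usual -ngl/--n-gpu-layers option and using placeholder file names:

# offload zero layers so the language model runs entirely on the CPU backend
./llama-qwen2vl-cli -m Qwen2-VL-2B-Instruct-F16.gguf --mmproj qwen2vl-vision.gguf --image demo.jpg -ngl 0 -p "Describe this image."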
Is it supposed to work with llama-server yet? I used the GGUF from here: https://huggingface.co/tensorblock/Qwen2-VL-7B-Instruct-GGUF/blob/main/README.md and it returns nonsense in the UI.
@husnoo Not yet. You have to compile llama.cpp and use the llama-qwen2vl-cli binary on the command line. Read the doc above for details.
After some preliminary investigation, I believe the issue is that
On M3-Max 64GB
Looks like
Q4_K_M is giving the error for me.
* Barebone Qwen2VL LLM convertor
* Add Qwen2VL cli entrypoint
* [WIP] add qwen2vl arch
* Verify m-rope output
* Add vl-rope/2d-rope support for qwen2vl ViT
* update qwen2vl cli tool
* update 5D tensor op workaround
* [WIP] qwen2vl vision model
* make batch and clip utils compatible with qwen2vl
* [WIP] create inference workflow, gguf convert script but fix
* correcting vision-rope behavior, add the missing last layer back to ViT
* add arg parser to qwen2vl_surgery
* replace variable size array with vector
* cuda-gdb cmake preset
* add fp32 mrope, vision rope kernel
* add fp16 support for qwen2vl and m-rope
* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`
* fix rope op mode switching, out dated func args
* update `llama_hparams`
* update to keep up stream changes
* resolve linter, test errors
* add makefile entry, update speical image padding token
* add mrope unit test, fix few compiler warnings
* rename `mrope` related function, params
* minor updates on debug util, bug fixs
* add `m-rope` testcase to `test-backend-ops`
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* fix traililng whitespce
* store `llama_hparams.rope_sections` with fixed size array
* update position id tensor size check in GGML_OP_ROPE
* minor updates
* update `ggml_backend_*_supports_op` of unsupported backends
* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR implements the Qwen2VL model as requested in #9246.
The main changes include:
- llama_context.n_pos_per_token, to support more than one position id per token
- examples/llava/qwen2vl-cli.cpp, to handle Qwen2VL data preprocess steps & prompts

TODO:
Steps to convert the model and run inference
1. Download the official Qwen/Qwen2-VL-2B-Instruct checkpoint, then convert the LLM part of the model to GGUF format using convert_hf_to_gguf.py:
   python3 convert_hf_to_gguf.py "/path/to/Qwen2-VL-2B-Instruct/model-dir"
2. Convert the vision encoder to GGUF format with qwen2_vl_surgery.py.
3. Build llama-qwen2vl-cli in the same way you would build llama-llava-cli.
4. Run the command (it's recommended to resize the image to a resolution below 640x640, so it won't take forever to run on the CPU backend). A sketched example follows these steps.
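For step 4, a sketch of what the command might look like; the GGUF file names and the image are placeholders, and the flags are taken from the ShowUI script earlier in this thread rather than from the final documentation:

./llama-qwen2vl-cli -m Qwen2-VL-2B-Instruct-F16.gguf --mmproj qwen2vl-vision.gguf --image demo-640x480.jpg --temp 0 -p "Describe this image."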
Future work: