Eval bug: Qwen2-VL Hallucinates image content on Vulkan backend #10843

Open
stduhpf opened this issue Dec 15, 2024 · 12 comments
@stduhpf (Contributor) commented Dec 15, 2024

Name and Version

.\build\bin\Release\llama-cli.exe --version

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
version: 4329 (89d604f)
built with MSVC 19.41.34120.0 for x64

Operating systems

Windows

GGML backends

Vulkan

Hardware

Ryzen 5900X +RX 5700 XT

Models

Qwen2-VL-7B-Instruct-IQ4_NL + mmproj-Qwen2-VL-7B-Instruct-f32

Problem description & steps to reproduce

When I run it with the Vulkan build, the description given by the model has nothing to do with the image passed as an argument (regardless of the -ngl value; even -ngl 0 is broken). The exact same setup works perfectly fine with the CPU backend.

I know the Vulkan backend doesn't support Qwen2-VL yet, but according to #10361 (comment), this should only cause slowdowns, not invalid outputs.

Relevant log output

Image input:

(attached image: Untitled.png)

-ngl 0

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in   843.10 ms
encode_image_with_clip: all 1 segments encoded in   843.17 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in   845.06 ms by CLIP (    2.34 ms per image patch)

The image shows a person wearing a black and white striped shirt, a black jacket, and black pants, standing in front of a black background. The person is also holding a black and white striped umbrella. The context of this image could be a fashion or clothing advertisement, showcasing the person's outfit and accessories. The black and white striped shirt, jacket, and umbrella create a monochromatic look, which is often used in fashion photography to emphasize the clothing and accessories. The black background helps to highlight the person and their outfit, making them the focal point of the image.
llama_perf_context_print:        load time =    6644.91 ms
llama_perf_context_print: prompt eval time =    2276.84 ms /   391 tokens (    5.82 ms per token,   171.73 tokens per second)
llama_perf_context_print:        eval time =   11500.85 ms /   115 runs   (  100.01 ms per token,    10.00 tokens per second)
llama_perf_context_print:       total time =   18275.28 ms /   506 tokens

-ngl 99

> .\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99
[...]
encode_image_with_clip: step 1 of 1 encoded in  3248.68 ms
encode_image_with_clip: all 1 segments encoded in  3248.76 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  3249.79 ms by CLIP (    9.00 ms per image patch)

The image appears to be a logo or a symbol, but it is not clear what it represents. It could be a brand logo, a company logo, or a symbol for a specific organization or group. Without additional context or information, it is difficult to determine the exact meaning or purpose of the image.
llama_perf_context_print:        load time =    9346.17 ms
llama_perf_context_print: prompt eval time =    1009.47 ms /   391 tokens (    2.58 ms per token,   387.33 tokens per second)
llama_perf_context_print:        eval time =    1500.12 ms /    61 runs   (   24.59 ms per token,    40.66 tokens per second)
llama_perf_context_print:       total time =   10889.94 ms /   452 tokens

CPU backend for comparison

> .\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
encode_image_with_clip: step 1 of 1 encoded in  8483.38 ms
encode_image_with_clip: all 1 segments encoded in  8483.47 ms
encode_image_with_clip: load_image_size 512 512
encode_image_with_clip: image embedding created: 361 tokens

encode_image_with_clip: image encoded in  8484.85 ms by CLIP (   23.50 ms per image patch)

The image appears to be a simple text-based graphic with the words "READABLE TEXT" written in a bold, black font. The context of this image could be related to demonstrating or emphasizing the importance of clear and legible text, possibly in the context of design, typography, or user interface (UI) design. It might be used to highlight the importance of making text easy to read and understand for users.
llama_perf_context_print:        load time =   21741.16 ms
llama_perf_context_print: prompt eval time =   10924.92 ms /   391 tokens (   27.94 ms per token,    35.79 tokens per second)
llama_perf_context_print:        eval time =    8322.39 ms /    83 runs   (  100.27 ms per token,     9.97 tokens per second)
llama_perf_context_print:       total time =   30185.33 ms /   474 tokens
@ggerganov (Owner)

Could you do a quick test and see if it works with an F16 vision projector:

.\build\bin\Release\llama-quantize.exe .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf f16

.\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0 -ngl 99

@stduhpf (Contributor, Author) commented Dec 15, 2024

It's not working :(

.\build\bin\Release\llama-quantize.exe  .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf  .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf f16
main: build = 4333 (a0974156)
main: built with MSVC 19.41.34120.0 for x64
main: quantizing '.\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf' to '.\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf' as F16
llama_model_loader: loaded meta data with 20 key-value pairs and 521 tensors from .\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = clip
llama_model_loader: - kv   1:                        general.description str              = image encoder for Qwen2VL
llama_model_loader: - kv   2:                          general.file_type u32              = 0
llama_model_loader: - kv   3:                      clip.has_text_encoder bool             = false
llama_model_loader: - kv   4:                    clip.has_vision_encoder bool             = true
llama_model_loader: - kv   5:                    clip.has_qwen2vl_merger bool             = true
llama_model_loader: - kv   6:                        clip.projector_type str              = qwen2vl_merger
llama_model_loader: - kv   7:                              clip.use_silu bool             = false
llama_model_loader: - kv   8:                              clip.use_gelu bool             = false
llama_model_loader: - kv   9:                     clip.vision.patch_size u32              = 14
llama_model_loader: - kv  10:                     clip.vision.image_size u32              = 560
llama_model_loader: - kv  11:               clip.vision.embedding_length u32              = 1280
llama_model_loader: - kv  12:                 clip.vision.projection_dim u32              = 3584
llama_model_loader: - kv  13:           clip.vision.attention.head_count u32              = 16
llama_model_loader: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                    clip.vision.block_count u32              = 32
llama_model_loader: - kv  16:            clip.vision.feed_forward_length u32              = 0
llama_model_loader: - kv  17:                               general.name str              = Qwen2-VL-7B-Instruct
llama_model_loader: - kv  18:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
llama_model_loader: - kv  19:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
llama_model_loader: - type  f32:  521 tensors
llama_model_quantize: failed to quantize: unknown model architecture: 'clip'
main: failed to quantize model from '.\models\mmproj-Qwen2-VL-7B-Instruct-f32.gguf'

stable-diffusion.cpp's CLI does allow me to convert it to f16, but I think it strips off important metadata:

.\buildcpu\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf --mmproj .\models\mmproj-Qwen2-VL-7B-Instruct-f16.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
build: 4333 (a0974156) with MSVC 19.41.34120.0 for x64
llama_model_loader: loaded meta data with 37 key-value pairs and 339 tensors from .\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 VL 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2-VL
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 VL 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-VL-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["multimodal", "image-text-to-text"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv  14:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  15:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv  16:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  18:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 25
llama_model_loader: - kv  22:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["─á ─á", "─á─á ─á─á", "i n", "─á t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /models_out/Qwen2-VL-7B-Instruct-GGUF...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:   28 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:  169 tensors
llm_load_vocab: special tokens cache size = 14
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2vl
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 8
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = IQ4_NL - 4.5 bpw
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.13 GiB (4.66 BPW)
llm_load_print_meta: general.name     = Qwen2 VL 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =  4226.55 MiB
.....................................................................................
key general.file_type not found in file

@ggerganov (Owner)

Ah, I think you have to use the surgery script:

python ./examples/llava/qwen2_vl_surgery.py Qwen/Qwen2-VL-2B-Instruct --data_type fp16
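
For the 7B model used in this issue, the same script can presumably be pointed at the 7B checkpoint instead (untested here; the repo id below is an assumption):

python ./examples/llava/qwen2_vl_surgery.py Qwen/Qwen2-VL-7B-Instruct --data_type fp16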

@stduhpf (Contributor, Author) commented Dec 15, 2024

Is it the same mmproj for the 2B and the 7B model?

@stduhpf (Contributor, Author) commented Dec 15, 2024

> Is it the same mmproj for the 2B and the 7B model?

It seems not

@stduhpf (Contributor, Author) commented Dec 15, 2024

> Could you do a quick test and see if it works with an F16 vision projector:

>.\build\bin\Release\llama-qwen2vl-cli.exe -m .\models\Qwen2-VL-2B-Instruct-Q8_0.gguf --mmproj .\qwen-qwen2-vl-2b-instruct-vision.gguf -p 'What could be the context of this image.' --image '.\Pictures\Untitled.png' --seed 0 --temp 0
[...]
clip_model_load: model name:   Qwen/Qwen2-VL-2B-Instruct
clip_model_load: description:  image encoder for Qwen2VL
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    521
clip_model_load: n_kv:         20
clip_model_load: ftype:        f16
[...]

CPU:

The image shows the text "READABLE TEXT." This text is likely used to indicate that the content or information presented is easy to read and understand. It could be used in various contexts such as a website, a document, or a presentation where the goal is to make the information accessible to a wide audience.

Vulkan (ngl 99):

The image appears to be a stylized representation of a person wearing a hat and a coat. The hat and coat are the main focus, and the background is a simple, minimalistic design. The context of this image could be related to a fashion advertisement, a promotional poster, or a branding image. The hat and coat might be part of a collection or a series of items, such as a hat and coat set, a fashion line, or a brand identity.

Still not working

@jeffbolznv (Collaborator)

Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).
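
(A minimal sketch of how that option could be enabled at configure time, assuming the relevant CMake options are named GGML_VULKAN and GGML_VULKAN_CHECK_RESULTS in the current tree:)

cmake -B buildv -DGGML_VULKAN=ON -DGGML_VULKAN_CHECK_RESULTS=ON
cmake --build buildv --config Release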

@stduhpf (Contributor, Author) commented Dec 15, 2024

> Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).

ggml-vulkan.obj : error LNK2019: unresolved external symbol ggml_graph_compute_with_ctx referenced in function "void __cdecl ggml_vk_check_results_0(struct ggml_tensor *)" (?ggml_vk_check_results_0@@YAXPEAUggml_tensor@@@Z) [C:\llama.cpp\buildv\ggml\src\ggml-vulkan\ggml-vulkan.vcxproj] C:\llama.cpp\buildv\bin\Release\ggml-vulkan.dll : fatal error LNK1120: 1 unresolved externals [C:\llama.cpp\buildv\ggml\src\ggml-vulkan\ggml-vulkan.vcxproj]

@jeffbolznv (Collaborator)

To fix those linker issues you need to add the ggml-cpu sources to ggml-vulkan.
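
(A rough sketch of what that could look like in ggml/src/ggml-vulkan/CMakeLists.txt; the ggml-cpu file names and paths are assumptions about the current layout and the list may be incomplete:)

# hypothetical: compile the CPU backend sources into the ggml-vulkan target so that
# ggml_graph_compute_with_ctx is available to ggml_vk_check_results_0
target_sources(ggml-vulkan PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/../ggml-cpu/ggml-cpu.c
    ${CMAKE_CURRENT_SOURCE_DIR}/../ggml-cpu/ggml-cpu.cpp
)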

@slaren (Collaborator) commented Dec 15, 2024

Building with -DBUILD_SHARED_LIBS=OFF should also work.
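
For example (a sketch combining the flags mentioned in this thread; a static build links the CPU backend into the same binary, so the unresolved ggml_graph_compute_with_ctx symbol goes away):

cmake -B buildv -DGGML_VULKAN=ON -DGGML_VULKAN_CHECK_RESULTS=ON -DBUILD_SHARED_LIBS=OFF
cmake --build buildv --config Release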

@stduhpf (Contributor, Author) commented Dec 15, 2024

> Can you try enabling GGML_VULKAN_CHECK_RESULTS and see if it identifies the broken op? You might need to manually add the cpu backend source files to ggml-vulkan (I think this broke when the backends were refactored).

1 node_0 op=IM2COL avg_err=0
2 node_3 op=MUL_MAT avg_err=0.00111936
3  (reshaped) (permuted) (cont) op=CONT avg_err=0
4 node_7 op=IM2COL avg_err=0
5 node_10 op=MUL_MAT avg_err=0.00109479
6  (reshaped) (permuted) (cont) op=CONT avg_err=0
7 node_14 op=ADD avg_err=0
8  (permuted) (cont) op=CONT avg_err=0
9  (permuted) (cont) (reshaped) (reshaped) (permuted) (cont) op=CONT avg_err=0
10 node_22 op=NORM avg_err=3.37601e-09
11 node_23 op=MUL avg_err=0
12 node_24 op=ADD avg_err=0
13 node_25 op=MUL_MAT avg_err=0.000117832
14 node_26 op=ADD avg_err=0
15  (reshaped) (permuted) (cont) op=CONT avg_err=0
16 node_31 op=MUL_MAT avg_err=0.0010295
17 node_32 op=ADD avg_err=0
C:\llama.cpp\ggml\src\ggml.c:3513: GGML_ASSERT(a->ne[2] == b->ne[0]) failed

@LostRuins (Collaborator) commented Dec 16, 2024

I can confirm this issue happens even with no layers offloaded. On the CPU backend it works fine.

Model is BF16, projector F16. Same assert as above.
