llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. #8984

Merged

8 commits merged into ggerganov:master on Aug 20, 2024

Conversation

@cyzero-kim (Contributor) commented Aug 11, 2024

master:

clip_model_load: CLIP using CPU backend

encode_image_with_clip: image encoded in  4291.85 ms by CLIP (    7.45 ms per image patch)

 The image features a happy dog wearing a blue bandana with a red ribbon and tongue sticking out. The dog's tongue is pink, and it appears to be enjoying a walk. It is wearing a leash and seems to be looking at the camera, possibly posing for a picture. The dog is standing next to a tree, creating a natural and outdoor environment.

llama_print_timings:        load time =   44230.33 ms
llama_print_timings:      sample time =       4.17 ms /    83 runs   (    0.05 ms per token, 19918.41 tokens per second)
llama_print_timings: prompt eval time =   38950.50 ms /   616 tokens (   63.23 ms per token,    15.81 tokens per second)
llama_print_timings:        eval time =   11133.88 ms /    82 runs   (  135.78 ms per token,     7.36 tokens per second)
llama_print_timings:       total time =   55502.11 ms /   698 tokens

PR:

clip_model_load: CLIP using Vulkan backend

encode_image_with_clip: image encoded in   933.11 ms by CLIP (    1.62 ms per image patch)

 A brown dog with its mouth open, showing its teeth and tongue, is the main subject in the image. This dog is located on the left side of the image and seems to be looking to its left. The image also includes a few other elements, such as a clock on the right side, a small green leaf on the right bottom corner, and a red ball in the bottom right part of the image.

llama_print_timings:        load time =   38104.42 ms
llama_print_timings:      sample time =       3.43 ms /    84 runs   (    0.04 ms per token, 24518.39 tokens per second)
llama_print_timings: prompt eval time =   34997.53 ms /   616 tokens (   56.81 ms per token,    17.60 tokens per second)
llama_print_timings:        eval time =   14383.80 ms /    83 runs   (  173.30 ms per token,     5.77 tokens per second)
llama_print_timings:       total time =   52679.99 ms /   699 tokens

Full logs:

.\llama-llava-cli.exe -m 'C:\work\llm\ggml-model-q4_k.gguf' --mmproj C:\work\llm\mmproj-model-f16.gguf --image C:\work\llm\buddy.jpeg -ngl 15
Log start
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:\work\llm\ggml-model-q4_k.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 15 repeating layers to GPU
llm_load_tensors: offloaded 15/33 layers to GPU
llm_load_tensors: Intel(R) Iris(R) Xe Graphics buffer size =  1750.59 MiB
llm_load_tensors:        CPU buffer size =  3891.24 MiB
.................................................................................................
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 2
clip_model_load: alignment:    32
clip_model_load: n_tensors:    377
clip_model_load: n_kv:         18
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 18 key-value pairs and 377 tensors from C:\work\llm\mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   8:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv   9:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  10:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  11:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  12:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  13:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  14:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  15:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  16:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  17:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  235 tensors
clip_model_load: - type  f16:  142 tensors
clip_model_load: CLIP using Vulkan backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     595.49 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  595.49 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Iris(R) Xe Graphics KV buffer size =   480.00 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size =   544.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: Intel(R) Iris(R) Xe Graphics compute buffer size =   193.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 191
encode_image_with_clip: image embedding created: 576 tokens

encode_image_with_clip: image encoded in   956.78 ms by CLIP (    1.66 ms per image patch)

 The image depicts a close-up view of a dog's face with its mouth wide open, revealing its teeth. The dog appears to be a large and powerful breed, with its facial muscles clearly visible. The dog's teeth are clearly displayed, showing the strength and resilience of its jaw. The scene evokes an emotion of awe and admiration for the dog's impressive features.

llama_print_timings:        load time =   39887.31 ms
llama_print_timings:      sample time =       4.71 ms /    92 runs   (    0.05 ms per token, 19541.21 tokens per second)
llama_print_timings: prompt eval time =   36266.98 ms /   616 tokens (   58.87 ms per token,    16.99 tokens per second)
llama_print_timings:        eval time =   16155.81 ms /    91 runs   (  177.54 ms per token,     5.63 tokens per second)
llama_print_timings:       total time =   56258.79 ms /   707 tokens

llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <[email protected]>
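
For context on what the new shader has to compute: GGML_OP_ACC produces dst = a and then accumulates b into a strided view of dst described by byte strides (nb1, nb2, nb3) and a byte offset. The sketch below is a simplified CPU-side reference of those semantics for the contiguous 2-D f32 case; the function name and the 2-D restriction are illustrative, not the actual ggml or Vulkan code.

#include <cstddef>
#include <cstring>

// Simplified reference for GGML_OP_ACC, 2-D contiguous f32 case:
// dst = a, then b is added into the view of dst selected by the
// row stride nb1 (in bytes) and the byte offset.
void acc_f32_2d(const float * a, const float * b, float * dst,
                size_t a_rows, size_t a_cols,  // shape of a (and dst)
                size_t b_rows, size_t b_cols,  // shape of b
                size_t nb1, size_t offset) {   // dst view: row stride and offset, in bytes
    std::memcpy(dst, a, a_rows * a_cols * sizeof(float)); // dst = a
    char * base = reinterpret_cast<char *>(dst) + offset;
    for (size_t i1 = 0; i1 < b_rows; ++i1) {              // dst[view] += b
        float * row = reinterpret_cast<float *>(base + i1 * nb1);
        for (size_t i0 = 0; i0 < b_cols; ++i0) {
            row[i0] += b[i1 * b_cols + i0];
        }
    }
}

In the CLIP graph this appears with shapes like a = 1024×577 and b = 1024×576 (see the test case in the review below), consistent with the 576 patch embeddings being accumulated into an embedding tensor that also holds the class token.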
@github-actions github-actions bot added Vulkan Issues specific to the Vulkan backend examples ggml changes relating to the ggml tensor library for machine learning labels Aug 11, 2024
@cyzero-kim changed the title from "llava: Add ACC OP for GPU acceleration to the Vulkan backend in the L…" to "llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model." Aug 12, 2024
@0cc4m 0cc4m self-requested a review August 12, 2024 20:30
@0cc4m (Collaborator) left a comment

This implementation isn't correct yet. You can check that with test-backend-ops -o ACC; here are the results from my system:

» build_vk/bin/test-backend-ops -o ACC
ggml_vulkan: Found 3 Vulkan devices:
Vulkan0: AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64
Vulkan1: Tesla P40 (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32
Vulkan2: NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
Testing 4 backends

Backend 1/4 (CPU)
  Skipping CPU backend
Backend 2/4 (Vulkan0)
  Backend name: Vulkan0
  ACC(type=f32,ne_a=[1024,577,1,1],ne_b=[1024,576,1,1]): [ACC] NMSE = 1.002194223 > 0.000000100 FAIL
  1341/1342 tests passed
  Backend Vulkan0: FAIL

Backend 3/4 (Vulkan1)
  Backend name: Vulkan1
  ACC(type=f32,ne_a=[1024,577,1,1],ne_b=[1024,576,1,1]): [ACC] NMSE = 1.000925279 > 0.000000100 FAIL
  1341/1342 tests passed
  Backend Vulkan1: FAIL

Backend 4/4 (Vulkan2)
  Backend name: Vulkan2
  ACC(type=f32,ne_a=[1024,577,1,1],ne_b=[1024,576,1,1]): [ACC] NMSE = 1.000223190 > 0.000000100 FAIL
  1341/1342 tests passed
  Backend Vulkan2: FAIL

1/4 backends passed
FAIL

Do you want to fix this, or would you prefer me to? I don't mind; it's not a complicated operator, and I have the most experience with the backend.
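
For reference, the check behind those FAIL lines is a normalized mean squared error between the backend output and the CPU reference, compared against a small threshold (the 0.000000100 in the log). A rough sketch of that criterion, with the harness details simplified:

#include <cstddef>

// NMSE = sum((out - ref)^2) / sum(ref^2); the test fails when it
// exceeds the threshold (1e-7, matching the log above).
double nmse(const float * ref, const float * out, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / norm;
}

bool acc_result_ok(const float * ref, const float * out, size_t n) {
    return nmse(ref, out, n) < 1e-7;
}

An NMSE close to 1.0, as in all three failing runs, usually means an entire term is missing from the result rather than a small precision difference.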

Signed-off-by: Changyeon Kim <[email protected]>
@cyzero-kim (Contributor, Author) commented Aug 15, 2024

@0cc4m Thank you for letting me know about the OP test method; your comments have greatly contributed to my growth. As you mentioned, a parameter was missing. I confirmed this and made the necessary corrections. Here are the results from the retest.

PS C:\work\llm\cyzero\llama.cpp.latest> .\build\bin\Release\test-backend-ops.exe -o ACC
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Iris(R) Xe Graphics (Intel Corporation) | uma: 1 | fp16: 1 | warp size: 32
Testing 2 backends

Backend 1/2 (CPU)
Skipping CPU backend
Backend 2/2 (Vulkan0)
Backend name: Vulkan0
ACC(type=f32,ne_a=[1024,577,1,1],ne_b=[1024,576,1,1]): OK
1342/1342 tests passed
Backend Vulkan0: OK

2/2 backends passed
OK


encode_image_with_clip: image encoded in 882.93 ms by CLIP ( 1.53 ms per image patch)

The image shows a happy golden retriever dog with a blue bandana around its neck. The dog is sitting down on the grass and looking at the camera with a smile on its face. Its tongue is hanging out, showing a playful and joyful expression. The dog appears to be a beloved and well-cared-for pet.
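
For completeness, the shapes in that test case correspond to a ggml_acc call along the following lines, using the public ggml API. The nb1/nb2/nb3 strides and the byte offset are stored as op parameters, and the backend has to apply them to dst; the one-row offset below is an assumption that simply makes the 577-vs-576 shapes line up.

#include "ggml.h"

// Sketch of the operator the ACC test exercises: a is 1024x577 and
// b is 1024x576; b is accumulated into a view of the result whose
// strides come from a and whose offset skips one row (assumed).
struct ggml_tensor * build_acc_case(struct ggml_context * ctx) {
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1024, 577);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1024, 576);
    return ggml_acc(ctx, a, b, a->nb[1], a->nb[2], a->nb[3], a->nb[1] /* offset */);
}

This lines up with the later "[fix] Use nb1 and nb2 for dst" commit: the shader needs to index the destination with the strides from the op parameters rather than deriving them from the source tensor.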

@cyzero-kim cyzero-kim requested a review from 0cc4m August 16, 2024 14:00
@0cc4m (Collaborator) left a comment

Looks good and passes the tests; only a small change is needed and then it's ready to merge.

@0cc4m (Collaborator) left a comment

Thank you, that looks correct.

@0cc4m 0cc4m merged commit 2f3c146 into ggerganov:master Aug 20, 2024
52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model. (ggerganov#8984)

* llava: Add ACC OP for GPU acceleration to the Vulkan backend in the LLAVA CLIP model.

- The CLIP model now prioritizes the Vulkan backend over the CPU when Vulkan is available.
- A GGML_OP_ACC shader has been added.
- The encoding performance of the CLIP model improved from 4.2s on the CPU to 0.9s on the GPU.

Signed-off-by: Changyeon Kim <[email protected]>

* fix-up coding style.

Signed-off-by: Changyeon Kim <[email protected]>

* Fix-up the missing initial parameter to resolve the compilation warning.

Signed-off-by: Changyeon Kim <[email protected]>

* [fix] Add missing parameters.

Signed-off-by: Changyeon Kim <[email protected]>

* [fix] Use nb1 and nb2 for dst.

Signed-off-by: Changyeon Kim <[email protected]>

* Fix check results ggml_acc call

---------

Signed-off-by: Changyeon Kim <[email protected]>
Co-authored-by: 0cc4m <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024