
Support video understanding #9165

Open · wants to merge 10 commits into master

Conversation

Contributor

@tc-mb commented Aug 25, 2024

Dear llama.cpp official,

Hi, as I promised before, following the MiniCPM-V 2.6 merge I am submitting a PR to support video understanding. Because llama.cpp does not currently support video file processing, this PR may stay open for a while so that we can fully discuss how to integrate video capabilities into the code, but I am ready to actively support its review.

For MiniCPM-V 2.6, we took the approach of extracting frames from the video file and feeding each frame to the model sequentially. At the code level, I introduced the open-source library FFmpeg to implement the frame extraction, and added a "video" parameter to the llama.cpp args so that a video file can be read. A sketch of this approach is shown below.
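For reference, the sketch below shows roughly what such frame extraction looks like with the FFmpeg C API (libavformat/libavcodec/libswscale). It is only an illustration of the approach, not the exact code in this PR: the function name extract_frames, the rgb_frame struct, and the configurable fps_sample rate are assumptions made for the example.

// Sketch: decode a video and keep roughly `fps_sample` frames per second of video,
// converted to packed RGB24 so they can be handed to the image pipeline one by one.
// Error handling and decoder draining at EOF are omitted for brevity.
extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libswscale/swscale.h>
}
#include <cstdint>
#include <vector>

struct rgb_frame { int width, height; std::vector<uint8_t> data; }; // packed RGB24

static std::vector<rgb_frame> extract_frames(const char * path, double fps_sample = 1.0) {
    std::vector<rgb_frame> frames;

    AVFormatContext * fmt = nullptr;
    if (avformat_open_input(&fmt, path, nullptr, nullptr) < 0) return frames;
    avformat_find_stream_info(fmt, nullptr);

    const int vstream = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
    const AVCodecParameters * par = fmt->streams[vstream]->codecpar;
    const AVCodec  * dec = avcodec_find_decoder(par->codec_id);
    AVCodecContext * ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(ctx, par);
    avcodec_open2(ctx, dec, nullptr);

    AVPacket * pkt   = av_packet_alloc();
    AVFrame  * frame = av_frame_alloc();
    double next_t = 0.0; // timestamp (in seconds) of the next frame to keep

    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == vstream && avcodec_send_packet(ctx, pkt) == 0) {
            while (avcodec_receive_frame(ctx, frame) == 0) {
                const double t = frame->pts * av_q2d(fmt->streams[vstream]->time_base);
                if (frame->pts == AV_NOPTS_VALUE || t < next_t) continue; // not at a sample point yet
                next_t = t + 1.0 / fps_sample;

                // convert whatever pixel format the decoder produced into packed RGB24
                SwsContext * sws = sws_getContext(frame->width, frame->height, (AVPixelFormat) frame->format,
                                                  frame->width, frame->height, AV_PIX_FMT_RGB24,
                                                  SWS_BILINEAR, nullptr, nullptr, nullptr);
                rgb_frame out { frame->width, frame->height,
                                std::vector<uint8_t>((size_t) frame->width * frame->height * 3) };
                uint8_t * dst[1]      = { out.data.data() };
                int       linesize[1] = { 3 * frame->width };
                sws_scale(sws, frame->data, frame->linesize, 0, frame->height, dst, linesize);
                sws_freeContext(sws);
                frames.push_back(std::move(out));
            }
        }
        av_packet_unref(pkt);
    }

    av_frame_free(&frame);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    return frames;
}

Each extracted frame can then be treated the same way as a decoded --image file and appended to the prompt in order.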

Before use, install FFmpeg in your environment:

brew install ffmpeg
brew install pkg-config

Run the quantized int4 version:

./llama-minicpmv-cli -m ../MiniCPM-V-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --video xx.mp4 -p "What is in the video?"

That is the only difference when using video input. I look forward to your testing and discussion.

Best regards,
MiniCPM-V official ^_^

@github-actions bot added the examples and python (python script changes) labels on Aug 25, 2024
@Dampfinchen

Dampfinchen commented Aug 27, 2024

Very interesting. Video understanding could be the next big thing. Thank you for the contribution!

Makefile (outdated)
 llama-minicpmv-cli: examples/llava/minicpmv-cli.cpp \
 	examples/llava/llava.cpp \
 	examples/llava/llava.h \
 	examples/llava/clip.cpp \
 	examples/llava/clip.h \
 	$(OBJ_ALL)
-	$(CXX) $(CXXFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) -Wno-cast-qual
+	$(CXX) $(CXXFLAGS) $(FFMPEG_CFLAGS) $< $(filter-out %.h $<,$^) -o $@ $(LDFLAGS) $(FFMPEG_LIBS) -Wno-cast-qual
Collaborator

It would be nice to enable video support only behind a special flag, for example LLAMA_FFMPEG (the same way as LLAMA_CURL).

Also, don't forget to add support for linking FFmpeg in the CMake build.
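For illustration only, such a guard might look roughly like the following in the Makefile. The LLAMA_FFMPEG switch and the -DLLAMA_USE_FFMPEG define are hypothetical names for this sketch; FFMPEG_CFLAGS/FFMPEG_LIBS match the variables already used in the compile line above:

# Hypothetical opt-in switch, mirroring how LLAMA_CURL is handled,
# e.g. `make LLAMA_FFMPEG=1 llama-minicpmv-cli`
ifdef LLAMA_FFMPEG
    FFMPEG_CFLAGS := $(shell pkg-config --cflags libavformat libavcodec libavutil libswscale) -DLLAMA_USE_FFMPEG
    FFMPEG_LIBS   := $(shell pkg-config --libs   libavformat libavcodec libavutil libswscale)
endif

When the flag is not set, both variables expand to nothing, so the link line above still works; the video code path itself would then be compiled only under #ifdef LLAMA_USE_FFMPEG.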

Contributor Author

OK, I will try it.

Collaborator

I have taken a stab at implementing this compiler flag in an amending PR -- it may or may not be useful to you:
OpenBMB#32

@tc-mb If you like it, feel free to merge that one -- if you do, it should smoothly merge my changes into your PR here. If you don't want it, then no hard feelings -- I won't be offended. :) I'm simply a fan of your work, and generally wanted to make an attempt at helping this PR along.

@saket424

@tc-mb
What is the recommended ffmpeg command that you are using to convert the video into multiple JPEGs (mjpeg?)?

for example

ffmpeg -i ./clip.mp4 -vf fps=1/3,scale=480:480:force_original_aspect_ratio=decrease -q:v 2 ./f/frame_%04d.jpg

@saket424

Ah, I see: --video takes an mp4 file as input and does the frame sampling internally.

static void show_additional_info(int /*argc*/, char ** argv) {
    LOG_TEE("\n example usage: %s -m <llava-v1.5-7b/ggml-model-q5_k.gguf> --mmproj <llava-v1.5-7b/mmproj-model-f16.gguf> [--video <path/to/an/video.mp4>] [--image <path/to/an/image.jpg>] [--image <path/to/another/image.jpg>] [--temp 0.1] [-p \"describe the image in detail.\"]\n", argv[0]);
    LOG_TEE("  note: a lower temperature value like 0.1 is recommended for better quality.\n");
}

@saket424

@tc-mb
I hope I am not the only one experiencing this issue: I ask for a description in English and I get Chinese instead.

./llama-minicpmv-cli -m ./mini2.6/ggml-model-Q4_K_M.gguf --mmproj ./mini2.6/mmproj-model-f16.gguf --image ./f/frame_0001.jpg --image ./f/frame_0002.jpg --image ./f/frame_0003.jpg --image ./f/frame_0004.jpg --temp 0.1 -p "describe the image in detail in english language" -c 4096
Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from ./mini2.6/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151666]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151666]  = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 128244
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151666
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 4.35 GiB (4.91 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 151644 '<|im_start|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 128244 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  4458.57 MiB
....................................................................................
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ./mini2.6/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                      clip.minicpmv_version i32              = 3
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     996.02 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   742.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   263.65 ms
encode_image_with_clip: all 1 segments encoded in   263.71 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   264.15 ms by CLIP (    4.13 ms per image patch)
process_image: image token past: 3
process_image: image token past: 69
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   130.90 ms
encode_image_with_clip: all 1 segments encoded in   130.94 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   131.36 ms by CLIP (    2.05 ms per image patch)
process_image: image token past: 69
process_image: image token past: 135
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   126.81 ms
encode_image_with_clip: all 1 segments encoded in   126.85 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.92 ms by CLIP (    1.98 ms per image patch)
process_image: image token past: 135
process_image: image token past: 201
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   126.85 ms
encode_image_with_clip: all 1 segments encoded in   126.89 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.95 ms by CLIP (    1.98 ms per image patch)
process_image: image token past: 201
process_image: image token past: 267
用户: 视频中显示了什么场景?
AI: 视频中显示了两个人在更衣室里发生冲突的场景。一个人穿着黑色上衣和白色裤子,另一个人穿着黑色上衣和黑色裤子。

[Translation: "User: What scene is shown in the video?" / "AI: The video shows two people in a confrontation in a locker room. One is wearing a black top and white pants, the other a black top and black pants."]

llama_print_timings:        load time =    4954.32 ms
llama_print_timings:      sample time =       2.73 ms /    50 runs   (    0.05 ms per token, 18288.22 tokens per second)
llama_print_timings: prompt eval time =    4012.86 ms /   271 tokens (   14.81 ms per token,    67.53 tokens per second)
llama_print_timings:        eval time =    5478.07 ms /    49 runs   (  111.80 ms per token,     8.94 tokens per second)
llama_print_timings:       total time =   10569.70 ms /   320 tokens

@saket424

@tc-mb
When I give it a single image I get English output, but when I give it two or more images I only get Chinese output, even though I explicitly say "no Chinese, please" in the prompt.

@saket424

saket424 commented Aug 27, 2024

A fifteen-second video clip seems to work fine and produces English output. It would be great if we could specify the rate at which libav samples frames from the clip, e.g. 0.3 fps rather than the default of 1 fps; a sketch of what such an option could look like is below.
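Purely as a sketch of that idea (the --video-fps flag, the video_params struct, and its fields are hypothetical and do not exist in this PR; --video is the flag the PR adds), a configurable sampling rate could be plumbed through the CLI arguments like this:

#include <cstdlib>
#include <string>

// Hypothetical option parsing: lets the user choose the sampling rate
// (e.g. 0.3 fps = one frame every ~3.3 s) instead of a hard-coded 1 fps.
struct video_params {
    std::string video_path;
    double      video_fps = 1.0; // frames sampled per second of video
};

static bool parse_video_arg(int & i, int argc, char ** argv, video_params & params) {
    const std::string arg = argv[i];
    if (arg == "--video" && i + 1 < argc) {
        params.video_path = argv[++i];
        return true;
    }
    if (arg == "--video-fps" && i + 1 < argc) {
        params.video_fps = std::atof(argv[++i]);
        return true;
    }
    return false; // not one of these flags; let the normal parser handle it
}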

./llama-minicpmv-cli -m ./mini2.6/ggml-model-Q4_K_M.gguf --mmproj ./mini2.6/mmproj-model-f16.gguf --video ./assets/fight.mp4 --temp 0.1 -p "describe the video in detail" -c 4096
Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from ./mini2.6/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151666]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151666]  = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 128244
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151666
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 4.35 GiB (4.91 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 151644 '<|im_start|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 128244 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  4458.57 MiB
....................................................................................
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ./mini2.6/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                      clip.minicpmv_version i32              = 3
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     996.02 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   742.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
frame_len: inf
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   259.82 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   133.52 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   122.71 ms
encode_image_with_clip: all 3 segments encoded in   516.23 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   517.59 ms by CLIP (    2.70 ms per image patch)
process_image: image token past: 3
process_image: image token past: 204
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   133.12 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   116.41 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   116.46 ms
encode_image_with_clip: all 3 segments encoded in   366.11 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   366.94 ms by CLIP (    1.91 ms per image patch)
process_image: image token past: 204
process_image: image token past: 405
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.93 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.06 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   116.86 ms
encode_image_with_clip: all 3 segments encoded in   361.96 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   362.14 ms by CLIP (    1.89 ms per image patch)
process_image: image token past: 405
process_image: image token past: 606
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.16 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.65 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   115.95 ms
encode_image_with_clip: all 3 segments encoded in   360.88 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   361.10 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 606
process_image: image token past: 807
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   130.70 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.39 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   116.85 ms
encode_image_with_clip: all 3 segments encoded in   363.06 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   363.27 ms by CLIP (    1.89 ms per image patch)
process_image: image token past: 807
process_image: image token past: 1008
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   127.82 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   110.62 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   116.34 ms
encode_image_with_clip: all 3 segments encoded in   354.92 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   355.15 ms by CLIP (    1.85 ms per image patch)
process_image: image token past: 1008
process_image: image token past: 1209
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   128.92 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   114.38 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   117.02 ms
encode_image_with_clip: all 3 segments encoded in   360.44 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   360.64 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 1209
process_image: image token past: 1410
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.02 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   112.81 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   118.51 ms
encode_image_with_clip: all 3 segments encoded in   360.49 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   360.74 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 1410
process_image: image token past: 1611
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.13 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.78 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   116.56 ms
encode_image_with_clip: all 3 segments encoded in   361.65 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   361.89 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 1611
process_image: image token past: 1812
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   127.78 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   114.77 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   112.61 ms
encode_image_with_clip: all 3 segments encoded in   355.27 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   355.48 ms by CLIP (    1.85 ms per image patch)
process_image: image token past: 1812
process_image: image token past: 2013
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   128.14 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   116.01 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   117.17 ms
encode_image_with_clip: all 3 segments encoded in   361.42 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   361.63 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 2013
process_image: image token past: 2214
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   127.22 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.59 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   117.27 ms
encode_image_with_clip: all 3 segments encoded in   360.19 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   360.42 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 2214
process_image: image token past: 2415
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   132.04 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   115.48 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   115.21 ms
encode_image_with_clip: all 3 segments encoded in   362.84 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   363.05 ms by CLIP (    1.89 ms per image patch)
process_image: image token past: 2415
process_image: image token past: 2616
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.99 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   113.54 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   118.48 ms
encode_image_with_clip: all 3 segments encoded in   362.12 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   362.34 ms by CLIP (    1.89 ms per image patch)
process_image: image token past: 2616
process_image: image token past: 2817
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.22 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   112.81 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   115.57 ms
encode_image_with_clip: all 3 segments encoded in   357.70 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   357.90 ms by CLIP (    1.86 ms per image patch)
process_image: image token past: 2817
process_image: image token past: 3018
uhd_slice_image: multiple 2
uhd_slice_image: image_size: 1280 720; source_image size: 602 336
uhd_slice_image: image_size: 1280 720; best_grid: 2 1
uhd_slice_image: refine_image_size: 840 476; refine_size: 840 476
clip_image_preprocess: 602 336
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 3 encoded in   129.90 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 3 encoded in   113.44 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 3 encoded in   117.38 ms
encode_image_with_clip: all 3 segments encoded in   360.83 ms
encode_image_with_clip: load_image_size 1280 720
encode_image_with_clip: image embedding created: 192 tokens

encode_image_with_clip: image encoded in   361.03 ms by CLIP (    1.88 ms per image patch)
process_image: image token past: 3018
process_image: image token past: 3219
The video begins with a news broadcast from FOX 11, indicating the time as 5 PM with a temperature of 67°F in Santa Ana, and transitions to a series of clips showing a physical altercation in a school hallway. The altercation involves two individuals, one in a black shirt and beige pants, and the other in a white shirt and black pants. The black-shirted individual appears to be restraining or attacking the white-shirted person, who is on the ground, against a backdrop of blue lockers and yellow and white floor markings. The sequence of clips captures various stages of the confrontation, including attempts to control the situation, with the black-shirted individual using their weight and positioning to maintain dominance over the white-shirted person. The camera angles shift to focus on different aspects of the altercation, including the individuals' footwear and the surrounding environment, which includes lockers and a tiled floor. The video concludes with a graphic stating "ONLY ON FOX 11" and a shot of the FOX 11 news studio with anchors seated at a desk, suggesting the broadcast of the news segment.

@tc-mb
Contributor Author

tc-mb commented Aug 28, 2024

@saket424
Sorry, due to the time difference, my reply may seem a bit slow.

I looked at the log you sent and noticed something. The input prompt appears after "用户" ("User") and it is in Chinese, and the model's reply appears after "AI"; given a Chinese prompt, a Chinese answer seems reasonable.
But I am not sure whether you have modified the CLI code, because with video input the output should not contain the "用户" and "AI" parts at all.
I will adjust the output interface today for testing.

@saket424

@tc-mb
I know no Chinese, so I certainly don't have any Chinese in my input prompt. Here is another run:

./llama-minicpmv-cli -m ./mini2.6/ggml-model-Q4_K_M.gguf --mmproj ./mini2.6/mmproj-model-f16.gguf --image ./f/frame_0001.jpg --image ./f/frame_0002.jpg --image ./f/frame_0003.jpg --image ./f/frame_0004.jpg --temp 0.7 -p "describe the images in detail. Please output in English!" -c 4096
Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from ./mini2.6/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151666]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151666]  = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 128244
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151666
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 4.35 GiB (4.91 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 151644 '<|im_start|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 128244 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  4458.57 MiB
....................................................................................
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ./mini2.6/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                      clip.minicpmv_version i32              = 3
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     996.02 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   742.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   265.91 ms
encode_image_with_clip: all 1 segments encoded in   265.98 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   266.68 ms by CLIP (    4.17 ms per image patch)
process_image: image token past: 3
process_image: image token past: 69
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   145.91 ms
encode_image_with_clip: all 1 segments encoded in   145.97 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   146.46 ms by CLIP (    2.29 ms per image patch)
process_image: image token past: 69
process_image: image token past: 135
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   125.97 ms
encode_image_with_clip: all 1 segments encoded in   126.01 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.07 ms by CLIP (    1.97 ms per image patch)
process_image: image token past: 135
process_image: image token past: 201
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   126.55 ms
encode_image_with_clip: all 1 segments encoded in   126.58 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.64 ms by CLIP (    1.98 ms per image patch)
process_image: image token past: 201
process_image: image token past: 267
这是一段视频片段,展示了两个人在一个蓝色背景和黄色地板的区域进行肢体冲突。穿着黑色短袖上衣和白色裤子的男子与穿着黑色夹克和灰色裤子的男子发生争执。冲突包括推搡、摔打和拳打脚踢。穿黑色短袖上衣的男子看起来占了上风,最终将穿黑色夹克的男子摔倒在地板上,用脚踩在他身上。视频中可以看到黄色的地板和蓝色的背景,暗示这是一个类似储物柜或更衣室的环境。

[Translation: This is a video clip showing two people in a physical altercation in an area with a blue background and a yellow floor. A man in a black short-sleeved top and white pants clashes with a man in a black jacket and gray pants. The confrontation includes shoving, throwing each other down, and punching and kicking. The man in the black short-sleeved top appears to gain the upper hand, eventually throwing the man in the black jacket onto the floor and stepping on him. The yellow floor and blue background suggest a locker-room-like setting.]

llama_print_timings:        load time =    5408.42 ms
llama_print_timings:      sample time =       5.85 ms /   116 runs   (    0.05 ms per token, 19832.45 tokens per second)
llama_print_timings: prompt eval time =    4432.80 ms /   271 tokens (   16.36 ms per token,    61.14 tokens per second)
llama_print_timings:        eval time =   12764.70 ms /   115 runs   (  111.00 ms per token,     9.01 tokens per second)
llama_print_timings:       total time =   18338.22 ms /   386 tokens
(ytvenv) anand@nitro17:~/moondream-stuff/llama.cpp$

@tc-mb
Contributor Author

tc-mb commented Aug 28, 2024

@tc-mb I know no chinese and so I certainly dont have have any chinese in my input prompt . Here is another run

./llama-minicpmv-cli -m ./mini2.6/ggml-model-Q4_K_M.gguf --mmproj ./mini2.6/mmproj-model-f16.gguf --image ./f/frame_0001.jpg --image ./f/frame_0002.jpg --image ./f/frame_0003.jpg --image ./f/frame_0004.jpg --temp 0.7 -p "describe the images in detail. Please output in English!" -c 4096
Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 339 tensors from ./mini2.6/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = model
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151666]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151666]  = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 128244
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 25
llm_load_vocab: token to piece cache size = 0.9309 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151666
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 4.35 GiB (4.91 BPW) 
llm_load_print_meta: general.name     = model
llm_load_print_meta: BOS token        = 151644 '<|im_start|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: UNK token        = 128244 '<unk>'
llm_load_print_meta: PAD token        = 0 '!'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  4458.57 MiB
....................................................................................
clip_model_load: description:  image encoder for MiniCPM-V
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    455
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 455 tensors from ./mini2.6/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                clip.has_minicpmv_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                        general.description str              = image encoder for MiniCPM-V
clip_model_load: - kv   6:                        clip.projector_type str              = resampler
clip_model_load: - kv   7:                      clip.minicpmv_version i32              = 3
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 448
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1152
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4304
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 0
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000001
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 26
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.500000, 0.500000, 0.500000]
clip_model_load: - kv  18:                              clip.use_gelu bool             = true
clip_model_load: - type  f32:  285 tensors
clip_model_load: - type  f16:  170 tensors
clip_model_load: CLIP using CUDA backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  0
clip_model_load: minicpmv_projector:  1
clip_model_load: model size:     996.02 MB
clip_model_load: metadata size:  0.16 MB
clip_model_load: params backend buffer size =  996.02 MB (455 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   742.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    15.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 396
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   265.91 ms
encode_image_with_clip: all 1 segments encoded in   265.98 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   266.68 ms by CLIP (    4.17 ms per image patch)
process_image: image token past: 3
process_image: image token past: 69
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   145.91 ms
encode_image_with_clip: all 1 segments encoded in   145.97 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   146.46 ms by CLIP (    2.29 ms per image patch)
process_image: image token past: 69
process_image: image token past: 135
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   125.97 ms
encode_image_with_clip: all 1 segments encoded in   126.01 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.07 ms by CLIP (    1.97 ms per image patch)
process_image: image token past: 135
process_image: image token past: 201
uhd_slice_image: multiple 1
clip_image_preprocess: 602 336
clip_image_build_graph: 602 336
encode_image_with_clip: step 1 of 1 encoded in   126.55 ms
encode_image_with_clip: all 1 segments encoded in   126.58 ms
encode_image_with_clip: load_image_size 480 270
encode_image_with_clip: image embedding created: 64 tokens

encode_image_with_clip: image encoded in   126.64 ms by CLIP (    1.98 ms per image patch)
process_image: image token past: 201
process_image: image token past: 267
这是一段视频片段,展示了两个人在一个蓝色背景和黄色地板的区域进行肢体冲突。穿着黑色短袖上衣和白色裤子的男子与穿着黑色夹克和灰色裤子的男子发生争执。冲突包括推搡、摔打和拳打脚踢。穿黑色短袖上衣的男子看起来占了上风,最终将穿黑色夹克的男子摔倒在地板上,用脚踩在他身上。视频中可以看到黄色的地板和蓝色的背景,暗示这是一个类似储物柜或更衣室的环境。

llama_print_timings:        load time =    5408.42 ms
llama_print_timings:      sample time =       5.85 ms /   116 runs   (    0.05 ms per token, 19832.45 tokens per second)
llama_print_timings: prompt eval time =    4432.80 ms /   271 tokens (   16.36 ms per token,    61.14 tokens per second)
llama_print_timings:        eval time =   12764.70 ms /   115 runs   (  111.00 ms per token,     9.01 tokens per second)
llama_print_timings:       total time =   18338.22 ms /   386 tokens
(ytvenv) anand@nitro17:~/moondream-stuff/llama.cpp$

Strange. I have reproduced the problem. Thank you very much for helping me find it; I hadn't noticed it before.
I will track down the cause, fix it as soon as possible, and submit a commit.

@chigkim
Copy link

chigkim commented Aug 28, 2024

@tc-mb, Is it possible to support the interactive option -i so we can ask follow-up questions, like you can with image input? Right now it just describes the video and quits even if you specify -i.
Thanks so much!

@saket424
Copy link

saket424 commented Aug 29, 2024

@tc-mb
The binary crashes when a very high-resolution JPG image is fed in. Instead, it should either assert with a clear message that the image resolution is too high, or, since libav is already linked, resize the image automatically; a minimal check sketch follows the logs below. I hope you can reproduce this error and fix the crash.

yuri.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 4080x3072
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
uhd_slice_image: multiple 9
uhd_slice_image: image_size: 4080 3072; source_image size: 518 392
uhd_slice_image: image_size: 4080 3072; best_grid: 3 3
zsh: segmentation fault
segmentation fault

ffmpeg -i yuri.jpg -vf scale=1280:-1 yuri-small.jpg

yuri-small.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 1280x964
uhd_slice_image: multiple 7
uhd_slice_image: image_size: 1280 964; source_image size: 518 392
uhd_slice_image: image_size: 1280 964; best_grid: 3 2
uhd_slice_image: refine_image_size: 1260 952; refine_size: 1260 952
clip_image_preprocess: 518 392
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 518 392
encode_image_with_clip: step 1 of 7 encoded in 1068.46 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 7 encoded in 1001.44 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 7 encoded in 996.33 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 4 of 7 encoded in 997.90 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 5 of 7 encoded in 998.75 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 6 of 7 encoded in 995.54 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 7 of 7 encoded in 997.09 ms
encode_image_with_clip: all 7 segments encoded in 7055.78 ms
encode_image_with_clip: load_image_size 1280 964
encode_image_with_clip: image embedding created: 448 tokens
no crash -- inference successful
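Something like the following pre-check would already turn the crash into a clear error instead of a segfault (a rough sketch only; the pixel limit and the stand-in image struct are placeholders, not llama.cpp's real clip types):

#include <cstdint>
#include <cstdio>

struct image_u8_sketch { int nx; int ny; };   // stand-in for the real image struct

// reject images above an arbitrary pixel budget before slicing/encoding
static bool image_resolution_ok(const image_u8_sketch & img, int64_t max_pixels = 24LL * 1000 * 1000) {
    const int64_t n = (int64_t) img.nx * img.ny;
    if (n > max_pixels) {
        fprintf(stderr, "error: image is %dx%d (%lld pixels), above the %lld-pixel limit; please downscale it first\n",
                img.nx, img.ny, (long long) n, (long long) max_pixels);
        return false;
    }
    return true;
}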

@tc-mb
Copy link
Contributor Author

tc-mb commented Aug 29, 2024

@tc-mb, Is it possible to support the interactive option -i so we can ask follow-up questions, like you can with image input? Right now it just describes the video and quits even if you specify -i. Thanks so much!

No problem. I will update the code this week so that you can use the video understanding feature in -i interactive mode.

@tc-mb
Copy link
Contributor Author

tc-mb commented Aug 29, 2024

@tc-mb The binary crashes when a very high-resolution JPG image is fed in. Instead, it should either assert with a clear message that the image resolution is too high, or, since libav is already linked, resize the image automatically. I hope you can reproduce this error and fix the crash.

yuri.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 4080x3072
clip_image_build_graph: 448 448
clip_model_load: compute allocated memory: 102.80 MB
uhd_slice_image: multiple 9
uhd_slice_image: image_size: 4080 3072; source_image size: 518 392
uhd_slice_image: image_size: 4080 3072; best_grid: 3 3
zsh: segmentation fault
segmentation fault

ffmpeg -i yuri.jpg -vf scale=1280:-1 yuri-small.jpg

yuri-small.jpg (Baseline), yuvj420p(pc, bt470bg/unknown/unknown), 1280x964
uhd_slice_image: multiple 7
uhd_slice_image: image_size: 1280 964; source_image size: 518 392
uhd_slice_image: image_size: 1280 964; best_grid: 3 2
uhd_slice_image: refine_image_size: 1260 952; refine_size: 1260 952
clip_image_preprocess: 518 392
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_preprocess: 420 476
clip_image_build_graph: 518 392
encode_image_with_clip: step 1 of 7 encoded in 1068.46 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 2 of 7 encoded in 1001.44 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 3 of 7 encoded in 996.33 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 4 of 7 encoded in 997.90 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 5 of 7 encoded in 998.75 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 6 of 7 encoded in 995.54 ms
clip_image_build_graph: 420 476
encode_image_with_clip: step 7 of 7 encoded in 997.09 ms
encode_image_with_clip: all 7 segments encoded in 7055.78 ms
encode_image_with_clip: load_image_size 1280 964
encode_image_with_clip: image embedding created: 448 tokens
no crash -- inference successful

I'm sorry, I can't reproduce the bug you posted. I tested large images here and they all work: both square and rectangular images, even much larger than the size you mentioned, are usable. Could you send me the image, or check whether the crash is caused by insufficient memory? I suspect that is the most likely cause.

The idea of llama.cpp is that edge devices can also run large models, so the program repeatedly allocates small buffers during inference instead of reserving one very large block at initialization. This is convenient for optimizing performance on edge devices, but it means an out-of-memory condition can surface in the middle of execution.

My code is inherited from the original llava implementation, whose source does not check the result of each malloc. If an allocation fails, an invalid pointer is passed downstream, and the error is only reported when that pointer is used, which may be many functions away from the real problem.

Later this week I will add checks after the malloc calls in the multimodal code, so that a failed allocation is detected immediately.
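For illustration, this is the kind of check I mean (a minimal sketch only; malloc_checked and the example call site are made-up names, not the actual clip.cpp code):

#include <cstdio>
#include <cstdlib>

// allocate or fail loudly, instead of handing an invalid pointer to downstream code
static void * malloc_checked(size_t n, const char * what) {
    void * p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "%s: failed to allocate %zu bytes for %s\n", __func__, n, what);
        abort();
    }
    return p;
}

// hypothetical call site:
// float * emb = (float *) malloc_checked((size_t) n_tokens * n_embd * sizeof(float), "image embedding");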

@saket424
Copy link

I have documented the crash here #9230
I have 96 GB of RAM on my Mac, and I do not think I ran out of memory as you surmise.

@tc-mb
Copy link
Contributor Author

tc-mb commented Aug 29, 2024

I have documented the crash here #9230. I have 96 GB of RAM on my Mac, and I do not think I ran out of memory as you surmise.

I have found the issue and submitted a PR; you can look at it directly here:

#9237

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Aug 30, 2024
@chigkim
Copy link

chigkim commented Sep 5, 2024

@tc-mb is there a separate PR for the -i interactive option for follow-up questions, or are you planning to push more commits to this one?
Thanks so much!

common/common.cpp Outdated
@HanClinto
Copy link
Collaborator

I have tested this out, and I was able to successfully get it to answer questions about a video file -- very exciting!!

The portion where the video frames are encoded takes a very long time -- adding a message that says something like "encoding video frame 7 of 16" or something may be a nice thing to add. I'm also wondering about other ways to speed up video processing, such as adding a frame_skip parameter (so that only every N frames is processed).
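Roughly what I have in mind, as a sketch only (frame_t, encode_frame and the surrounding loop are stand-ins, not the PR's actual code):

#include <cstdio>
#include <vector>

struct frame_t { /* decoded image data */ };

static void encode_frame(const frame_t & f) { (void) f; /* ... clip encoding would go here ... */ }

// encode every `frame_skip`-th frame and report progress as we go
static void encode_video_frames(const std::vector<frame_t> & frames, int frame_skip = 1) {
    const int n = (int) frames.size();
    for (int i = 0; i < n; i += frame_skip) {
        printf("encoding video frame %d of %d\n", i + 1, n);
        encode_frame(frames[i]);
    }
}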

Overall great work, and I'll be very excited for us to get video support added to llama.cpp!

…nabled optionally by setting LLAMA_FFMPEG=1 in call to make.
HanClinto and others added 2 commits September 17, 2024 17:04
…PEG=1 to help users know exactly how to recompile with video support. Suggestion by @Galunid.
@tc-mb
Copy link
Contributor Author

tc-mb commented Sep 29, 2024

I have tested this out, and I was able to successfully get it to answer questions about a video file -- very exciting!!

The portion where the video frames are encoded takes a very long time -- adding a message that says something like "encoding video frame 7 of 16" or something may be a nice thing to add. I'm also wondering about other ways to speed up video processing, such as adding a frame_skip parameter (so that only every N frames is processed).

Overall great work, and I'll be very excited for us to get video support added to llama.cpp!

I'm sorry, I was busy with another project and responded a little late.
I'm very happy to get your suggestions. The ffmpeg compiler flag is a good approach; I'm taking the time to adapt the code to it.
The frame extraction function I used is fairly simple. I will check whether ffmpeg offers any way to accelerate it, or whether there are other ways to speed up frame extraction. ^_^
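One idea I want to try is sampling a fixed number of frames by seeking, instead of decoding every frame and dropping most of them. Below is a standalone sketch of that approach with the FFmpeg C API; it is only an illustration of the idea under my current assumptions (it ignores streams whose duration is AV_NOPTS_VALUE and other edge cases), not the code in this PR:

// compile: g++ sample_frames.cpp $(pkg-config --cflags --libs libavformat libavcodec libavutil)
extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libavutil/avutil.h>
}
#include <cstdio>
#include <cstdlib>

// seek to the keyframe before `ts` (stream time_base units) and decode forward until we reach it
static AVFrame * grab_frame_at(AVFormatContext * fmt, AVCodecContext * dec, int stream_idx, int64_t ts) {
    av_seek_frame(fmt, stream_idx, ts, AVSEEK_FLAG_BACKWARD);
    avcodec_flush_buffers(dec);
    AVPacket * pkt   = av_packet_alloc();
    AVFrame  * frame = av_frame_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == stream_idx && avcodec_send_packet(dec, pkt) == 0) {
            while (avcodec_receive_frame(dec, frame) == 0) {
                if (frame->best_effort_timestamp >= ts) {
                    av_packet_unref(pkt);
                    av_packet_free(&pkt);
                    return frame;   // caller frees with av_frame_free
                }
            }
        }
        av_packet_unref(pkt);
    }
    av_packet_free(&pkt);
    av_frame_free(&frame);
    return nullptr;
}

int main(int argc, char ** argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s <video> <n_frames>\n", argv[0]); return 1; }
    const int n_frames = atoi(argv[2]);

    AVFormatContext * fmt = nullptr;
    if (avformat_open_input(&fmt, argv[1], nullptr, nullptr) < 0) return 1;
    if (avformat_find_stream_info(fmt, nullptr) < 0) return 1;

    const int idx = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
    if (idx < 0) return 1;
    AVStream * st = fmt->streams[idx];

    const AVCodec * codec = avcodec_find_decoder(st->codecpar->codec_id);
    AVCodecContext * dec  = avcodec_alloc_context3(codec);
    avcodec_parameters_to_context(dec, st->codecpar);
    if (avcodec_open2(dec, codec, nullptr) < 0) return 1;

    // spread n_frames sample points evenly over the container duration (AV_TIME_BASE units)
    const AVRational us_tb = {1, AV_TIME_BASE};
    for (int i = 0; i < n_frames; i++) {
        const int64_t ts_us = fmt->duration * (i + 1) / (n_frames + 1);
        const int64_t ts    = av_rescale_q(ts_us, us_tb, st->time_base);
        AVFrame * frame = grab_frame_at(fmt, dec, idx, ts);
        if (frame) {
            printf("sampled frame %d of %d: %dx%d\n", i + 1, n_frames, frame->width, frame->height);
            av_frame_free(&frame);   // a real integration would convert to RGB and pass it to the clip encoder
        }
    }

    avcodec_free_context(&dec);
    avformat_close_input(&fmt);
    return 0;
}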

ffmpeg compiler flag for video understanding
@github-actions github-actions bot added the build Compilation issues label Oct 9, 2024
@saket424
Copy link

@tc-mb
This pull request no longer applies cleanly and needs to be refreshed. Can you take another whirl at it?
Thanks

@tc-mb
Copy link
Contributor Author

tc-mb commented Nov 12, 2024

@tc-mb This pull request no longer applies cleanly and needs to be refreshed. Can you take another whirl at it? Thanks

OK, I will adapt it to the current main branch and make the change this week.

@tc-mb
Copy link
Contributor Author

tc-mb commented Nov 19, 2024

@tc-mb This pull request no longer applies cleanly and needs to be refreshed. Can you take another whirl at it? Thanks

OK, I will adapt it to the current main branch and make the change this week.

I've been a little busy in the past two weeks, and I will revise it as soon as possible.
