
Bug: GGML_SCHED_MAX_SPLITS must be increased to run BigLlama-3.1-681B-Instruct using GPU acceleration #9044

Closed
nicoboss opened this issue Aug 15, 2024 · 2 comments · Fixed by #9047

@nicoboss
Contributor

nicoboss commented Aug 15, 2024

What happened?

When running inference on BigLlama-3.1-681B-Instruct using GPU acceleration, llama.cpp crashed with GGML_ASSERT(i_split < GGML_SCHED_MAX_SPLITS) failed. CPU inference works without any issues. This issue occurs regardless of which GPU backend is used or whether any layers are offloaded to the GPU.

Increasing GGML_SCHED_MAX_SPLITS to 4096 fixed this crash and made GPU-accelerated inference work without any issues: https://github.com/ggerganov/llama.cpp/blob/4b9afbbe9037f8a2d659097c0c7d9fce32c6494c/ggml/src/ggml-backend.c#L1022
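
For reference, the workaround is a one-line change to the compile-time constant in ggml/src/ggml-backend.c; a minimal sketch, assuming the stock value is 2048 (as the max(2048, …) proposal below implies):

```c
// ggml/src/ggml-backend.c -- workaround only, requires rebuilding llama.cpp
// #define GGML_SCHED_MAX_SPLITS 2048   // assumed stock value at this revision
#define GGML_SCHED_MAX_SPLITS 4096      // large enough for this 210-layer model
```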

Issue of GGML_SCHED_MAX_SPLITS being a compile-time constant

Having GGML_SCHED_MAX_SPLITS as a compile-time constant is problematic, as changing it requires recompiling llama.cpp from source. While this is relatively easy if you use llama.cpp directly, as soon as you deal with third-party software that uses backend-specific pre-built llama-cpp-python bindings (like oobabooga/text-generation-webui), changing GGML_SCHED_MAX_SPLITS is infeasible for the general user.

Possible solutions

  • Bump GGML_SCHED_MAX_SPLITS to 4096
  • Make llama.cpp automatically set GGML_SCHED_MAX_SPLITS to the optimal value based on the model it is instructed to load

Evaluation of possible solutions

I believe determining the optimal value of GGML_SCHED_MAX_SPLITS at runtime, based on the model being loaded, is simple. At the following location we store the actual number of splits into sched->n_splits:
https://github.com/ggerganov/llama.cpp/blob/4b9afbbe9037f8a2d659097c0c7d9fce32c6494c/ggml/src/ggml-backend.c#L1618

Here is the only place where GGML_SCHED_MAX_SPLITS is used outside of the assert and some disabled debug code:
https://github.com/ggerganov/llama.cpp/blob/4b9afbbe9037f8a2d659097c0c7d9fce32c6494c/ggml/src/ggml-backend.c#L1868-L1875
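
For readers who do not follow the links, the two referenced spots look roughly like this (a reconstruction, not verbatim code from the pinned revision; GGML_SCHED_MAX_COPIES and the exact expressions are assumptions):

```c
// ggml_backend_sched_split_graph(): the final split count is only recorded here,
// after all splits and their copy tensors have already been created
sched->n_splits = n_splits;

// ggml_backend_sched_new(): outside the assert, GGML_SCHED_MAX_SPLITS only fixes
// how many tensors the scheduler's ggml context is sized for, up front
const size_t nodes_size = graph_size + GGML_SCHED_MAX_SPLITS*GGML_SCHED_MAX_COPIES*2;
```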

Proposed solution

The assert could be removed and GGML_SCHED_MAX_SPLITS replaced with max(2048, sched->n_splits) at ggml-backend.c#L1868 and ggml-backend.c#L1874 to resolve this issue.
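
In code, the idea would be something like the following sketch (illustrative only; as the reply below points out, sched->n_splits is not yet known at this point):

```c
// hypothetical change in ggml_backend_sched_new() -- not workable in practice,
// because sched->n_splits has not been computed yet when the context is sized
const int max_splits = MAX(2048, sched->n_splits);
const size_t nodes_size = graph_size + max_splits*GGML_SCHED_MAX_COPIES*2;
```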

Name and Version

version: 3590 (4b9afbb)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

root@AI:~/llama.cpp# ./llama-cli -m /mradermacher/tmp/BigLlama-3.1-681B-Instruct.Q5_K_M.gguf -p "I believe the meaning of life is" -n 128 -c 7000 -ngl 1
Log start
main: build = 3590 (4b9afbbe)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 1723729499
llama_model_loader: loaded meta data with 32 key-value pairs and 1894 tensors from /mradermacher/tmp/BigLlama-3.1-681B-Instruct.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 405B Instruct
llama_model_loader: - kv   3:                       general.organization str              = Meta Llama
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   6:                         general.size_label str              = 405B
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Meta Llama 3.1 405B Instruct
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Met...
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["mergekit", "merge"]
llama_model_loader: - kv  12:                          llama.block_count u32              = 210
llama_model_loader: - kv  13:                       llama.context_length u32              = 131072
llama_model_loader: - kv  14:                     llama.embedding_length u32              = 16384
llama_model_loader: - kv  15:                  llama.feed_forward_length u32              = 53248
llama_model_loader: - kv  16:                 llama.attention.head_count u32              = 128
llama_model_loader: - kv  17:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  18:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  19:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  20:                          general.file_type u32              = 17
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  422 tensors
llama_model_loader: - type q5_K: 1261 tensors
llama_model_loader: - type q6_K:  211 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 16384
llm_load_print_meta: n_layer          = 210
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 53248
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 680.67 B
llm_load_print_meta: model size       = 447.87 GiB (5.65 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 405B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    1.77 MiB
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/211 layers to GPU
llm_load_tensors:        CPU buffer size = 458616.72 MiB
llm_load_tensors:      CUDA0 buffer size =  2112.13 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 7008
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size = 11442.75 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    54.75 MiB
llama_new_context_with_model: KV self size  = 11497.50 MiB, K (f16): 5748.75 MiB, V (f16): 5748.75 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
ggml/src/ggml-backend.c:1552: GGML_ASSERT(i_split < GGML_SCHED_MAX_SPLITS) failed
Aborted
nicoboss added the bug-unconfirmed and medium severity labels on Aug 15, 2024
@slaren
Member

slaren commented Aug 15, 2024

The assert could be removed and GGML_SCHED_MAX_SPLITS replaced with max(2048, sched->n_splits) at ggml-backend.c#L1868 and ggml-backend.c#L1874 to resolve this issue.

n_splits is not known at this point, so this is not really a workable solution. I would like to remove GGML_SCHED_MAX_SPLITS, but the underlying problem that complicates this is that the number of tensors that need to be allocated is proportional to the number of splits, while the ggml contexts have a fixed size, so we need to know how many tensors will be created before creating the ggml context. Since the tensors are created as each split is added, we do not know how many splits will be needed before the tensors are created. I think this needs to be addressed in ggml by allowing ggml contexts to grow dynamically rather than using fixed-size buffers, but that is not going to happen soon. In the meantime, graph_size could be used as the value of GGML_SCHED_MAX_SPLITS, since at most there is one split for each node in the graph.
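
A sketch of that suggestion (the variable name follows the follow-up PR mentioned below; surrounding code omitted and the nodes_size expression is an assumption):

```c
// in ggml_backend_sched_new(), where graph_size is already available as a parameter:
// at most one split per graph node, so graph_size is a safe upper bound
const size_t ggml_sched_max_splits = graph_size;

// size the context that holds the per-split copy tensors from this bound
// instead of the fixed GGML_SCHED_MAX_SPLITS macro
const size_t nodes_size = graph_size + ggml_sched_max_splits*GGML_SCHED_MAX_COPIES*2;
```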

@nicoboss
Contributor Author

Thanks a lot for your great explanation of the underlying problem. It helped me improve my general understanding and made it clear why my first idea would not work, saving me the time I would have spent debugging it. I implemented your suggested fix by using graph_size as ggml_sched_max_splits in #9047.
