
Bug: Can't quantize 405B Mega merge #8528

Closed
bartowski1182 opened this issue Jul 17, 2024 · 4 comments · Fixed by #8530
Labels: bug-unconfirmed, low severity (used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)

Comments

@bartowski1182 (Contributor)

What happened?

Trying to quantize https://huggingface.co/TensorWave/Meta-Llama-3-405B-Instruct-Up-Merge

I was able to convert without issue, but when trying to quantize I get an annoyingly generic assert:

GGML_ASSERT: src/llama.cpp:3973: n <= N_MAX

Anything I can do to get more useful outputs or debugging?

Name and Version

b3389

What operating system are you seeing the problem on?

No response

Relevant log output

main: build = 3389 (73cf442e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/models_out/Meta-Llama-3-405B-Instruct-Up-Merge-GGUF/Meta-Llama-3-405B-Instruct-Up-Merge-f16.gguf' to '/models_out/Meta-Llama-3-405B-Instruct-Up-Merge-GGUF/Meta-Llama-3-405B-Instruct-Up-Merge-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 22 key-value pairs and 4242 tensors from /models_out/Meta-Llama-3-405B-Instruct-Up-Merge-GGUF/Meta-Llama-3-405B-Instruct-Up-Merge-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-405B-Instruct-Up-Merge
llama_model_loader: - kv   2:                          llama.block_count u32              = 471
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  943 tensors
llama_model_loader: - type  f16: 3299 tensors
GGML_ASSERT: src/llama.cpp:3973: n <= N_MAX
bartowski1182 added the bug-unconfirmed and low severity labels on Jul 17, 2024
@slaren (Collaborator) commented Jul 17, 2024

#7359 broke models with more than 256 layers.

@bartowski1182 (Contributor, Author)

Ooo I see... On purpose or as a consequence of supporting that model? Could it be patched or is it a hard limit?

@compilade (Collaborator) commented Jul 17, 2024

On purpose or as a consequence of supporting that model? Could it be patched or is it a hard limit?

It's a consequence of keeping llama_hparams trivially copyable with a compile-time known size while having layer-wise hyper-parameters. Increasing the limit to 512 would make llama_hparams take 6.16 KiB instead of 3.16 KiB, but the struct's size is pretty much the only thing that changes.

Making the layer-wise hparams take less space when not needed is something which I'll likely fix eventually, so that the limit only applies to models which need layer-wise hparams.

@Haus1 commented Jul 17, 2024

Ooo I see... On purpose or as a consequence of supporting that model? Could it be patched or is it a hard limit?

It appears to be an arbitrary limit, even though an int64 can handle an absurd 9x10^18 before overflowing. I don't know why, but limits like this seem fairly unique to the machine-learning space, even though they make the code needlessly brittle and user-hostile.
