It is incompatible with flash attention, because flash attention doesn't support the scaling / soft-capping that Gemma-2 uses. There's an open PR in llama.cpp that adds a compatible implementation: ggerganov/llama.cpp#8542
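For reference, here is a minimal sketch of what that soft-capping does (NumPy is used purely for illustration; the cap values 50.0 and 30.0 come from this model's GGUF metadata below). Flash attention kernels fuse the QK^T, softmax, and V steps into a single pass and never materialize the attention scores, so there is no point at which the tanh cap could be applied:

import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    # Gemma-2 style soft-capping: squashes values smoothly into (-cap, cap).
    return cap * np.tanh(logits / cap)

# From the metadata: gemma2.attn_logit_softcapping = 50.0 (attention scores),
# gemma2.final_logit_softcapping = 30.0 (output logits).
scores = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
print(soft_cap(scores, 50.0))  # extreme scores saturate near +/-50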
Expected Behavior
Installed flash_attn==2.6.1 and passed flash_attn=True to Llama() for Gemma-2.
This should enable flash_attn.
Current Behavior
The Llama engine shows:
llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
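The fallback itself is deliberate. Roughly, the guard in llama_new_context_with_model behaves like this Python paraphrase (the real check is C++ inside llama.cpp; the names here are illustrative, not the actual source):

def resolve_flash_attn(flash_attn_requested: bool, attn_soft_cap: bool) -> bool:
    # If the model's hyperparameters enable attention soft-capping,
    # the flash_attn request is silently cleared and a warning is logged.
    if flash_attn_requested and attn_soft_cap:
        print("llama_new_context_with_model: flash_attn is not compatible "
              "with attn_soft_cap - forcing off")
        return False
    return flash_attn_requested

# Gemma-2 models set attn_soft_cap, so the request is overridden:
assert resolve_flash_attn(True, attn_soft_cap=True) is False

So flash_attn=True is accepted by the Python API but overridden at context creation time.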
Environment and Context
1x NVIDIA A100 GPU (80 GB VRAM), 24 CPU cores, 220 GB RAM
Running bartowski/gemma-2-27b-it-GGUF
Steps to Reproduce
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model",
    flash_attn=True,
    n_gpu_layers=-1,
    n_ctx=1024,
    verbose=True,
)
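One way to confirm up front that the fallback will trigger is to check the loaded GGUF metadata, which llama-cpp-python exposes on the Llama object (it appears to be the same dict the verbose "Model metadata:" line below prints; treat the attribute name as an assumption if your version differs):

# After constructing llm as above:
if "gemma2.attn_logit_softcapping" in llm.metadata:
    # llama.cpp will have forced flash_attn off for this model,
    # regardless of flash_attn=True being passed in.
    print("model uses attention logit soft-capping; flash_attn is disabled")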
llama_model_loader: loaded meta data with 33 key-value pairs and 508 tensors from static/gemma2/gemma-2-27b-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
llama_model_loader: - kv 4: gemma2.block_count u32 = 46
llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - kv 29: quantize.imatrix.file str = /models/gemma-2-27b-it-GGUF/gemma-2-2...
llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 322
llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 185 tensors
llama_model_loader: - type q8_0: 323 tensors
llm_load_vocab: special tokens cache size = 364
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4608
llm_load_print_meta: n_layer = 46
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 4096
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 36864
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 27B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 27.23 B
llm_load_print_meta: model size = 26.94 GiB (8.50 BPW)
llm_load_print_meta: general.name = gemma-2-27b-it
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_print_meta: EOT token = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size = 0.45 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 47/47 layers to GPU
llm_load_tensors: CPU buffer size = 1195.31 MiB
llm_load_tensors: CUDA0 buffer size = 27591.06 MiB
..............................................................................................
llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 736.00 MiB
llama_new_context_with_model: KV self size = 736.00 MiB, K (f16): 368.00 MiB, V (f16): 368.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 509.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1850
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
Model metadata: {'quantize.imatrix.chunks_count': '128', 'gemma2.attn_logit_softcapping': '50.000000', 'gemma2.attention.value_length': '128', 'gemma2.attention.sliding_window': '4096', 'gemma2.attention.head_count': '32', 'gemma2.feed_forward_length': '36864', 'gemma2.block_count': '46', 'tokenizer.ggml.pre': 'default', 'gemma2.embedding_length': '4608', 'general.file_type': '7', 'gemma2.attention.layer_norm_rms_epsilon': '0.000001', 'gemma2.context_length': '8192', 'tokenizer.chat_template': "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}", 'general.architecture': 'gemma2', 'gemma2.final_logit_softcapping': '30.000000', 'gemma2.attention.head_count_kv': '16', 'tokenizer.ggml.add_eos_token': 'false', 'quantize.imatrix.file': '/models/gemma-2-27b-it-GGUF/gemma-2-27b-it.imatrix', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'general.name': 'gemma-2-27b-it', 'tokenizer.ggml.bos_token_id': '2', 'tokenizer.ggml.eos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'gemma2.attention.key_length': '128', 'quantize.imatrix.dataset': '/training_data/calibration_datav3.txt', 'quantize.imatrix.entries_count': '322'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>
Is this expected, or am I missing something? Please help.