
Gemma 2: flash_attn is not compatible with attn_soft_cap - forcing off #1598

Open
iamsaurabhgupt opened this issue Jul 13, 2024 · 2 comments


Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Yes] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Yes] I carefully followed the README.md.
  • [Yes] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Yes] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Installed flash_attn==2.6.1 and passed flash_attn=True to Llama() for Gemma-2.
This should enable flash_attn.

Current Behavior

The Llama engine shows:
llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1

Environment and Context

1 x NVIDIA A100 GPU (80 GB), 24 CPU cores, 220 GB RAM
Running bartowski/gemma-2-27b-it-GGUF

$ python3 --version
Python 3.10

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Download bartowski/gemma-2-27b-it-GGUF
  2. gemma_engine = Llama(
    model_path="/path/to/model",
    flash_attn=True,
    n_gpu_layers=-1,
    n_ctx=1024,
    verbose=True,
    )
  3. See verbose logs:
    llama_model_loader: loaded meta data with 33 key-value pairs and 508 tensors from static/gemma2/gemma-2-27b-it-Q8_0.gguf (version GGUF V3 (latest))
    llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
    llama_model_loader: - kv 0: general.architecture str = gemma2
    llama_model_loader: - kv 1: general.name str = gemma-2-27b-it
    llama_model_loader: - kv 2: gemma2.context_length u32 = 8192
    llama_model_loader: - kv 3: gemma2.embedding_length u32 = 4608
    llama_model_loader: - kv 4: gemma2.block_count u32 = 46
    llama_model_loader: - kv 5: gemma2.feed_forward_length u32 = 36864
    llama_model_loader: - kv 6: gemma2.attention.head_count u32 = 32
    llama_model_loader: - kv 7: gemma2.attention.head_count_kv u32 = 16
    llama_model_loader: - kv 8: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
    llama_model_loader: - kv 9: gemma2.attention.key_length u32 = 128
    llama_model_loader: - kv 10: gemma2.attention.value_length u32 = 128
    llama_model_loader: - kv 11: general.file_type u32 = 7
    llama_model_loader: - kv 12: gemma2.attn_logit_softcapping f32 = 50.000000
    llama_model_loader: - kv 13: gemma2.final_logit_softcapping f32 = 30.000000
    llama_model_loader: - kv 14: gemma2.attention.sliding_window u32 = 4096
    llama_model_loader: - kv 15: tokenizer.ggml.model str = llama
    llama_model_loader: - kv 16: tokenizer.ggml.pre str = default
    llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
    llama_model_loader: - kv 18: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000...
    llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
    llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 2
    llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 1
    llama_model_loader: - kv 22: tokenizer.ggml.unknown_token_id u32 = 3
    llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 0
    llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
    llama_model_loader: - kv 25: tokenizer.ggml.add_eos_token bool = false
    llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
    llama_model_loader: - kv 27: tokenizer.ggml.add_space_prefix bool = false
    llama_model_loader: - kv 28: general.quantization_version u32 = 2
    llama_model_loader: - kv 29: quantize.imatrix.file str = /models/gemma-2-27b-it-GGUF/gemma-2-2...
    llama_model_loader: - kv 30: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
    llama_model_loader: - kv 31: quantize.imatrix.entries_count i32 = 322
    llama_model_loader: - kv 32: quantize.imatrix.chunks_count i32 = 128
    llama_model_loader: - type f32: 185 tensors
    llama_model_loader: - type q8_0: 323 tensors
    llm_load_vocab: special tokens cache size = 364
    llm_load_vocab: token to piece cache size = 1.6014 MB
    llm_load_print_meta: format = GGUF V3 (latest)
    llm_load_print_meta: arch = gemma2
    llm_load_print_meta: vocab type = SPM
    llm_load_print_meta: n_vocab = 256000
    llm_load_print_meta: n_merges = 0
    llm_load_print_meta: vocab_only = 0
    llm_load_print_meta: n_ctx_train = 8192
    llm_load_print_meta: n_embd = 4608
    llm_load_print_meta: n_layer = 46
    llm_load_print_meta: n_head = 32
    llm_load_print_meta: n_head_kv = 16
    llm_load_print_meta: n_rot = 128
    llm_load_print_meta: n_swa = 4096
    llm_load_print_meta: n_embd_head_k = 128
    llm_load_print_meta: n_embd_head_v = 128
    llm_load_print_meta: n_gqa = 2
    llm_load_print_meta: n_embd_k_gqa = 2048
    llm_load_print_meta: n_embd_v_gqa = 2048
    llm_load_print_meta: f_norm_eps = 0.0e+00
    llm_load_print_meta: f_norm_rms_eps = 1.0e-06
    llm_load_print_meta: f_clamp_kqv = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: f_logit_scale = 0.0e+00
    llm_load_print_meta: n_ff = 36864
    llm_load_print_meta: n_expert = 0
    llm_load_print_meta: n_expert_used = 0
    llm_load_print_meta: causal attn = 1
    llm_load_print_meta: pooling type = 0
    llm_load_print_meta: rope type = 2
    llm_load_print_meta: rope scaling = linear
    llm_load_print_meta: freq_base_train = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: n_ctx_orig_yarn = 8192
    llm_load_print_meta: rope_finetuned = unknown
    llm_load_print_meta: ssm_d_conv = 0
    llm_load_print_meta: ssm_d_inner = 0
    llm_load_print_meta: ssm_d_state = 0
    llm_load_print_meta: ssm_dt_rank = 0
    llm_load_print_meta: model type = 27B
    llm_load_print_meta: model ftype = Q8_0
    llm_load_print_meta: model params = 27.23 B
    llm_load_print_meta: model size = 26.94 GiB (8.50 BPW)
    llm_load_print_meta: general.name = gemma-2-27b-it
    llm_load_print_meta: BOS token = 2 ''
    llm_load_print_meta: EOS token = 1 ''
    llm_load_print_meta: UNK token = 3 ''
    llm_load_print_meta: PAD token = 0 ''
    llm_load_print_meta: LF token = 227 '<0x0A>'
    llm_load_print_meta: EOT token = 107 '<end_of_turn>'
    llm_load_print_meta: max token length = 93
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
    llm_load_tensors: ggml ctx size = 0.45 MiB
    llm_load_tensors: offloading 46 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 47/47 layers to GPU
    llm_load_tensors: CPU buffer size = 1195.31 MiB
    llm_load_tensors: CUDA0 buffer size = 27591.06 MiB
    ..............................................................................................
    llama_new_context_with_model: flash_attn is not compatible with attn_soft_cap - forcing off
    llama_new_context_with_model: n_ctx = 2048
    llama_new_context_with_model: n_batch = 512
    llama_new_context_with_model: n_ubatch = 512
    llama_new_context_with_model: flash_attn = 0
    llama_new_context_with_model: freq_base = 10000.0
    llama_new_context_with_model: freq_scale = 1
    llama_kv_cache_init: CUDA0 KV buffer size = 736.00 MiB
    llama_new_context_with_model: KV self size = 736.00 MiB, K (f16): 368.00 MiB, V (f16): 368.00 MiB
    llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
    llama_new_context_with_model: CUDA0 compute buffer size = 509.00 MiB
    llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
    llama_new_context_with_model: graph nodes = 1850
    llama_new_context_with_model: graph splits = 2
    AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
    Model metadata: {'quantize.imatrix.chunks_count': '128', 'gemma2.attn_logit_softcapping': '50.000000', 'gemma2.attention.value_length': '128', 'gemma2.attention.sliding_window': '4096', 'gemma2.attention.head_count': '32', 'gemma2.feed_forward_length': '36864', 'gemma2.block_count': '46', 'tokenizer.ggml.pre': 'default', 'gemma2.embedding_length': '4608', 'general.file_type': '7', 'gemma2.attention.layer_norm_rms_epsilon': '0.000001', 'gemma2.context_length': '8192', 'tokenizer.chat_template': "{{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}", 'general.architecture': 'gemma2', 'gemma2.final_logit_softcapping': '30.000000', 'gemma2.attention.head_count_kv': '16', 'tokenizer.ggml.add_eos_token': 'false', 'quantize.imatrix.file': '/models/gemma-2-27b-it-GGUF/gemma-2-27b-it.imatrix', 'tokenizer.ggml.add_space_prefix': 'false', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'general.name': 'gemma-2-27b-it', 'tokenizer.ggml.bos_token_id': '2', 'tokenizer.ggml.eos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '3', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.add_bos_token': 'true', 'gemma2.attention.key_length': '128', 'quantize.imatrix.dataset': '/training_data/calibration_datav3.txt', 'quantize.imatrix.entries_count': '322'}
    Available chat formats from metadata: chat_template.default
    Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
    ' + message['content'] | trim + '<end_of_turn>
    ' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
    '}}{% endif %}
    Using chat eos_token:
    Using chat bos_token:

Is this expected, or am I missing something? Please help.

@iamsaurabhgupt (Author)

Can anyone please guide me on this issue?
Is Gemma-2 incompatible with flash attention? @abetlen

tc-wolf (Contributor) commented Jul 22, 2024

It is incompatible with flash attention, because the flash attention kernels don't support the logit scaling / soft-capping that Gemma-2 uses. There's an open PR in llama.cpp to add a compatible implementation: ggerganov/llama.cpp#8542
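
For anyone wondering what the soft-cap is, here is a minimal numpy sketch of the operation, using the gemma2.attn_logit_softcapping = 50.0 and gemma2.final_logit_softcapping = 30.0 values from the metadata above. This is only an illustration of the math, not llama.cpp's actual kernel code:

    import numpy as np

    def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
        """Smoothly bound values to (-cap, cap) with a scaled tanh."""
        return cap * np.tanh(logits / cap)

    # Gemma-2 applies this to the raw attention scores before the softmax
    # (cap = 50.0) and again to the final output logits (cap = 30.0).
    raw_scores = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
    print(soft_cap(raw_scores, 50.0))  # values end up bounded to roughly +/-50

Because the fused flash-attention kernel computes the softmax incrementally inside the kernel, applying that extra tanh needs explicit kernel support (which is what the linked PR adds), so for now llama.cpp disables flash_attn for Gemma-2 and prints the warning you see.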
