Support FP8 model fallback KVCache to bfloat16 #1505

changwangss · 2024-11-20T12:25:29Z

I plan to load an FP8 model with the following config, where Linear layers are FP8 while the KV cache and the other ops stay in bf16.

FP8Config(allowlist={"types": ["Linear"], "names": []}, blocklist={"types": [], "names": []})

When run_generation.py calls model.generate, the following error is raised.

    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/changwang/workspace/vllm/optimum-habana-fork/optimum/habana/transformers/models/llama/modeling_llama.py", line 1278, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1556, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1606, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/changwang/workspace/vllm/optimum-habana-fork/optimum/habana/transformers/models/llama/modeling_llama.py", line 962, in forward
    hidden_states, self_attn_weights, present_key_value = self.pre_attn(
  File "/home/changwang/workspace/vllm/optimum-habana-fork/optimum/habana/transformers/models/llama/modeling_llama.py", line 1019, in pre_attn
    hidden_states, attn_weights, present_key_value = self.self_attn.pre_attn_forward(
  File "/home/changwang/workspace/vllm/optimum-habana-fork/optimum/habana/transformers/models/llama/modeling_llama.py", line 682, in pre_attn_forward
    key_states = self.k_cache.update(past_key_value[0], key_states, 2, token_idx, self.inp_seq_len)
  File "/home/changwang/workspace/vllm/optimum-habana-fork/optimum/habana/transformers/models/llama/modeling_llama.py", line 426, in update
    prev.index_copy_(dim, idx - 1, cur)
RuntimeError: index_copy_(): self and source expected to have the same dtype, but got (self) Float8_e4m3fn and (source) BFloat16
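
The dtype check behind this error can be reproduced outside the model. The snippet below is a minimal sketch, not the optimum-habana code: it runs on CPU, uses float32 as a stand-in for the FP8 cache buffer (the check only cares that the two dtypes differ), and the shapes and variable names are illustrative assumptions.

```python
import torch

# Stand-in for the pre-allocated cache: (batch, heads, seq_len, head_dim).
# float32 plays the role of the FP8 buffer here (assumption for portability).
prev = torch.zeros(1, 2, 4, 8, dtype=torch.float32)
# New key states for a single token, produced in bf16.
cur = torch.ones(1, 2, 1, 8, dtype=torch.bfloat16)
# Sequence position to write into (must be an int64 index tensor).
idx = torch.tensor([1])

try:
    # Same call shape as in KVCache.update: copy `cur` into `prev` along dim 2.
    prev.index_copy_(2, idx, cur)
    mismatch_raised = False
except RuntimeError:
    # self and source have different dtypes, so the copy is rejected.
    mismatch_raised = True

# Allocating the cache in the dtype of the incoming states avoids the mismatch.
prev_bf16 = torch.zeros(1, 2, 4, 8, dtype=cur.dtype)
prev_bf16.index_copy_(2, idx, cur)
```

This is why the dtype passed to torch.zeros when the cache is first allocated has to match the dtype of key_states, which is what the change in this PR targets.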

HolyFalafel · 2024-11-21T06:43:20Z

optimum/habana/transformers/models/llama/modeling_llama.py

@@ -628,10 +628,14 @@ def pre_attn_forward(
            else:
                if past_key_value is None:
                    past_key = torch.zeros(
-                        key_states.shape, dtype=self.get_k_proj_weight_dtype(), device=key_states.device
+                        key_states.shape, 
+                        dtype=torch.bfloat16 if isinstance(self.k_cache, KVCache) else self.get_k_proj_weight_dtype(),


Why not use the function?
The default value is:
self.k_proj.weight.dtype

For the recipe FP8Config(allowlist={"types": ["Linear"], "names": []}, blocklist={"types": [], "names": []}), self.k_proj.weight.dtype is torch.float8_e4m3fn, but the past_key dtype should be torch.bfloat16.

HolyFalafel · 2024-11-21T06:43:28Z

optimum/habana/transformers/models/llama/modeling_llama.py

                    )
                    past_value = torch.zeros(
-                        key_states.shape, dtype=self.get_k_proj_weight_dtype(), device=key_states.device
+                        key_states.shape,
+                        dtype=torch.bfloat16 if isinstance(self.v_cache, KVCache) else self.get_k_proj_weight_dtype(),


schoi-habana · 2025-01-18T00:46:20Z

optimum/habana/transformers/models/llama/modeling_llama.py

@@ -628,10 +628,14 @@ def pre_attn_forward(
            else:
                if past_key_value is None:
                    past_key = torch.zeros(
-                        key_states.shape, dtype=self.get_k_proj_weight_dtype(), device=key_states.device
+                        key_states.shape, 
+                        dtype=torch.bfloat16 if isinstance(self.k_cache, KVCache) else self.get_k_proj_weight_dtype(),


In `__init__()`, self.k_proj is initialized as a KVCache(). Isn't isinstance(self.k_cache, KVCache) always true? When does it hit the else branch?

self.k_proj is a Linear layer. And no, the check is not always true: self.k_cache is a KVCache for bf16 and a PatchedKVCache for fp8.
Scenario: self.k_proj's weight is fp8, but past_key_value needs to use the bf16 data type. Two examples:
When accuracy does not meet the 1% loss target, we selectively fall back operators such as kv_cache to bf16.
When loading the neuralmagic model (https://huggingface.co/neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8), the checkpoint contains only fp8 Linear weights and no KV cache scale.
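
The branch discussed above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: KVCache and PatchedKVCache below are empty stand-in classes, PatchedKVCache is assumed not to subclass KVCache (which is what the reply implies), select_cache_dtype is a hypothetical helper, and dtypes are represented as strings.

```python
class KVCache:
    """Stand-in for the unquantized (bf16) cache module."""


class PatchedKVCache:
    """Stand-in for the FP8-patched cache; assumed NOT to subclass KVCache."""


def select_cache_dtype(cache, k_proj_weight_dtype):
    """Mirror the conditional from the diff (hypothetical helper).

    An unpatched KVCache stores bf16 regardless of the Linear weight dtype;
    otherwise the cache follows the k_proj weight dtype.
    """
    if isinstance(cache, KVCache):
        return "bfloat16"
    return k_proj_weight_dtype


# Linear-only FP8 recipe: k_proj weights are fp8, the cache module is unpatched.
linear_only = select_cache_dtype(KVCache(), "float8_e4m3fn")
# Full FP8 recipe: the cache module was patched, so the else branch is taken.
full_fp8 = select_cache_dtype(PatchedKVCache(), "float8_e4m3fn")
```

Under these assumptions the else branch is reached only when the cache module itself has been patched for FP8, which answers the question of when the isinstance check is false.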

astachowiczhabana · 2025-02-19T11:05:25Z

@changwangss can you please resolve conflicts?

Support FP8 model fallback kvcache to bfloat16

ed64add

changwangss requested review from mandy-li and libinta as code owners November 20, 2024 12:25

changwangss requested a review from a user November 20, 2024 12:25

improve

0863e51

HolyFalafel reviewed Nov 21, 2024

schoi-habana reviewed Jan 18, 2025

Merge branch 'main' into patch-2

eac6367
