GPT2Model StaticCache support #35761
base: main
Conversation
Force-pushed from 278bcf7 to dedb154.
Thanks for the PR!
Not entirely sure it's worth adding, as GPT2 is a very small model, no longer particularly optimized, and fairly old, so the amount of work is a bit high...
Let's make sure we test the cross-attention path with a kv cache, as I am not even sure it was supported before.
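An illustrative check of that path (not the PR's actual test): run a tiny GPT2 with add_cross_attention=True, decode the last token from the cache returned by a prefill pass, and compare with the full uncached forward. The APIs used are standard transformers/PyTorch; the config sizes are arbitrary.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=2, n_head=2, n_embd=32, add_cross_attention=True)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (1, 5))
encoder_hidden_states = torch.randn(1, 7, config.n_embd)

with torch.no_grad():
    # full forward without cache vs. prefill on 4 tokens + cached decode of the 5th
    full = model(input_ids, encoder_hidden_states=encoder_hidden_states).logits
    prefill = model(input_ids[:, :4], encoder_hidden_states=encoder_hidden_states, use_cache=True)
    step = model(
        input_ids[:, 4:],
        past_key_values=prefill.past_key_values,
        encoder_hidden_states=encoder_hidden_states,
        use_cache=True,
    )

torch.testing.assert_close(step.logits[:, -1], full[:, -1], atol=1e-4, rtol=1e-4)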
past_key_value: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
The big issue with this is that we are breaking backward compatibility for people who use layer_past. We need to deprecate layer_past!
Added the @deprecate_kwarg decorator to this forward and 3 more (including in GPT2Model). Note that this is an inner class for attention / the inner block, so it does not affect the external model interface.
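For reference, a minimal sketch of how that decorator works, assuming it lives at transformers.utils.deprecation as in recent releases; the class name and target version below are made up for the example and are not the PR's exact values.

from typing import Optional

import torch
from transformers.cache_utils import Cache
from transformers.utils.deprecation import deprecate_kwarg


class ToyAttention(torch.nn.Module):
    # callers still passing layer_past= get the value remapped to past_key_value
    # with a deprecation warning, so old call sites keep working for now
    @deprecate_kwarg("layer_past", new_name="past_key_value", version="4.48")
    def forward(self, hidden_states, past_key_value: Optional[Cache] = None):
        return hidden_states, past_key_value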
# based on pattern from src/transformers/models/whisper/modeling_whisper.py::WhisperDecoder
return_legacy_cache = False
if use_cache:
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            if self.config.add_cross_attention and not isinstance(past_key_values, EncoderDecoderCache):
                past_key_values = EncoderDecoderCache(past_key_values, DynamicCache())
        elif not isinstance(past_key_values, Cache):
            return_legacy_cache = True
            logger.warning_once(
                "Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.49.0. "
                "You should pass an instance of `Cache` instead, e.g. "
                "`past_key_values=DynamicCache.from_legacy_cache(past_key_values)`."
            )
            if self.config.add_cross_attention:
                past_key_values = EncoderDecoderCache.from_legacy_cache(past_key_values)
            else:
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
    elif past_key_values is None:
        return_legacy_cache = True
        logger.warning_once(
            "Passing `use_cache=True` and `past_key_values=None` will produce cache output in the legacy format. "
            "This behavior is deprecated and will be changed in Transformers v4.49.0. "
            "To obtain the output past_key_values as a `Cache` instance, pass an instance of `Cache` instead, e.g. "
            "`past_key_values=DynamicCache.from_legacy_cache(past_key_values)`."
        )
        if self.config.add_cross_attention:
            past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())
        else:
            past_key_values = DynamicCache()
From the look of it, we are adding quite complex code here, which I am not a big fan of.
Let's go with this for now, but it would be nice to have a single warning simply saying that one or the other is deprecated. As it stands this is not very readable and there are too many code paths, when you should have (see the sketch below):
past_key_values is None -> create a DynamicCache
past_key_values is not None -> convert to a DynamicCache (not even sure the cross-attention cache was even supported)
add_cross_attention -> create an EncoderDecoderCache with past_key_values and a new DynamicCache
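A minimal sketch of that simplified flow; the names follow the snippet earlier in the thread, and the warning text is illustrative rather than the exact wording to ship.

return_legacy_cache = False
if use_cache:
    if past_key_values is None:
        past_key_values = DynamicCache()
    elif not isinstance(past_key_values, Cache):
        return_legacy_cache = True
        logger.warning_once(
            "Passing a tuple of `past_key_values` is deprecated; pass a `Cache` instance instead, "
            "e.g. `past_key_values=DynamicCache.from_legacy_cache(past_key_values)`."
        )
        past_key_values = DynamicCache.from_legacy_cache(past_key_values)
    if self.config.add_cross_attention and not isinstance(past_key_values, EncoderDecoderCache):
        # pair the self-attention cache with a fresh cross-attention cache
        past_key_values = EncoderDecoderCache(past_key_values, DynamicCache())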
I simplified this logic according to the outline above.
I agree in principle, but my friends use it for Tortoise text-to-speech and intend to compile it (with modifications for static shapes) to accelerate inference.
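For context, a sketch of the kind of usage this PR is meant to enable (not code from the PR; GPT2 accepting cache_implementation="static" is exactly what the PR adds):

from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
inputs = tokenizer("Hello, my name is", return_tensors="pt")

# "static" asks generate() to allocate a fixed-length StaticCache instead of a
# growing DynamicCache, which is what makes static-shape compilation possible
out = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    cache_implementation="static",
)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# once the shapes are static, the per-step forward can additionally be compiled, e.g.
# model.forward = torch.compile(model.forward, mode="reduce-overhead")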
I made an effort to patch the cross-attention parts of the code as well. The relevant tests seem to pass.
@Rocketknight1 @ArthurZucker, could you please approve the remaining check workflows?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I copied _update_causal_mask() and _prepare_4d_causal_attention_mask_with_cache_position() from LlamaModel.
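Roughly, the copied helper builds the causal mask from absolute cache positions rather than from the current sequence length, which is what lets it work with a fixed-size StaticCache. A simplified standalone sketch (not the exact Llama code, which also folds in padding masks):

import torch

def causal_mask_with_cache_position(query_len, kv_len, cache_position, dtype=torch.float32):
    # (1, 1, query_len, kv_len) additive mask: kv slot j is visible to query i
    # iff j <= cache_position[i], so a fixed kv_len (StaticCache) works too
    mask = torch.full((query_len, kv_len), torch.finfo(dtype).min, dtype=dtype)
    visible = torch.arange(kv_len) <= cache_position.reshape(-1, 1)
    return mask.masked_fill(visible, 0.0)[None, None, :, :]

# decode step: one new token at absolute position 5 against a static cache of length 8
print(causal_mask_with_cache_position(1, 8, torch.tensor([5])))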
Some tests are still failing; both failures may be linked to attention implementations. So far I have been unable to figure out the reasons for the failures. I'd appreciate advice or help from the maintainers.
cc: @gante