Fix 1383 Llama model on transformers=4.41 [WIP] #11280
Conversation
Tested on Max1100 and documented Llama2-7B model metrics on issue #1383; performance metrics on transformers 4.41 are similar to 4.38.
The rest LGTM.
if cache_position is not None:
    # for transformers 4.38.0
    causal_mask = attention_mask[:, :, cache_position, : kv_seq_len]
What is the reason to remove causal_mask here?
The diff is compared against the wrong place; the 4_38 path was not touched.
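For reference, a minimal runnable sketch of the two slicing styles under discussion. The shapes, values, and the assumption that the 4.41 path receives a mask already restricted to the current query rows are illustrative only, not taken from this PR.

import torch

# Illustration only: batch=1, head=1, kv_seq_len=6, two new query tokens
# at positions 4 and 5.
attention_mask = torch.zeros(1, 1, 6, 6)   # full additive causal mask
cache_position = torch.tensor([4, 5])      # positions of the current query tokens
kv_seq_len = 6

# transformers 4.38-style slicing (the branch quoted above): select the
# mask rows that correspond to the current positions.
causal_mask_438 = attention_mask[:, :, cache_position, :kv_seq_len]

# transformers 4.41-style slicing (assumption): the mask handed to the
# attention layer already holds only the query rows, so only the key
# dimension needs slicing.
presliced_mask = attention_mask[:, :, cache_position, :]
causal_mask_441 = presliced_mask[:, :, :, :kv_seq_len]

assert torch.equal(causal_mask_438, causal_mask_441)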
next_cache = next_decoder_cache if use_cache else None
if return_legacy_cache:
    next_cache = next_cache.to_legacy_cache()
Need to double-check whether next_decoder_cache is a DynamicFP8Cache.
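A minimal sketch of where the type check being asked for could sit, assuming DynamicFP8Cache is ipex-llm's own cache subclass; the function name and the isinstance guard are illustrative, not the PR's actual code.

from transformers.cache_utils import DynamicCache

def maybe_to_legacy_cache(next_decoder_cache, use_cache, return_legacy_cache):
    # Sketch only: convert to the legacy tuple format only for caches where
    # to_legacy_cache() is known to apply; a DynamicFP8Cache (assumed ipex-llm
    # subclass of DynamicCache) may need its own handling before conversion.
    next_cache = next_decoder_cache if use_cache else None
    if return_legacy_cache and isinstance(next_cache, DynamicCache):
        next_cache = next_cache.to_legacy_cache()
    return next_cache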
Description
Add llama_attention_forward_4_41 and llama_model_forward_4_41.
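A hedged sketch of how the version-specific forwards could be selected at patch time; the import path and the 4_38 function names are assumptions based on ipex-llm's existing layout, and only the 4_41 names come from this description.

import transformers
from packaging import version

def select_llama_forwards():
    # Sketch only: pick the forward implementations that match the installed
    # transformers version. The import path and 4_38 names are assumed.
    from ipex_llm.transformers.models.llama import (
        llama_attention_forward_4_38,
        llama_attention_forward_4_41,
        llama_model_forward_4_38,
        llama_model_forward_4_41,
    )

    if version.parse(transformers.__version__) >= version.parse("4.41.0"):
        return llama_attention_forward_4_41, llama_model_forward_4_41
    return llama_attention_forward_4_38, llama_model_forward_4_38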