
Adopt dynamo cache size to current layer definition #737

Open
wants to merge 1 commit into habana_main

Conversation

@anko-intel commented on Jan 24, 2025

Adapt the dynamo cache size setting to the current behavior, where the number of graphs compiled in one forward pass equals:
number of LlamaDecoderLayer's + 2 (RMSNorm, VocabParallelEmbedding)

This is an alternative approach to a hot fix for the performance regression introduced by vllm-project#11967.
The hot fix #709 restores the previous performance results for torch.compile mode.
This one partially recovers throughput, but at a large cost in warmup time, since many more graphs end up being compiled during warmup.
Without increasing the cache size, torch.compile hits its recompile limit and falls back to eager mode, which gives low throughput.
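For illustration, a minimal sketch of the idea (not the PR's actual diff; the helper name `set_dynamo_cache_size` and the `num_hidden_layers` argument are placeholders): raise dynamo's cache size limit so that all graphs produced by one forward pass, one per LlamaDecoderLayer plus the two extra graphs for RMSNorm and VocabParallelEmbedding, fit without triggering the fallback to eager mode.

```python
# Sketch only: assumes torch._dynamo.config.cache_size_limit is the relevant knob;
# the helper and its argument are hypothetical names, not code from this PR.
import torch._dynamo


def set_dynamo_cache_size(num_hidden_layers: int) -> None:
    # One graph per decoder layer, plus 2 (RMSNorm, VocabParallelEmbedding).
    graphs_per_forward = num_hidden_layers + 2
    # Only raise the limit; never shrink a value the user may have set higher.
    torch._dynamo.config.cache_size_limit = max(
        torch._dynamo.config.cache_size_limit, graphs_per_forward
    )
```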

@anko-intel changed the title from "Set dynamo cache size for torch compile" to "Adopt dynamo cache size to current layer definition" on Jan 24, 2025