Adopt dynamo cache size to current layer definition #737

anko-intel · 2025-01-24T08:16:06Z

Adapt the dynamo cache size setting to the current behavior, where the number of graphs compiled in one forward pass equals:
the number of LlamaDecoderLayer instances + 2 (RMSNorm, VocabParallelEmbedding)

This is an alternative approach to a hot fix for the performance regression introduced by vllm-project#11967.
The hot fix #709 restores the previous performance results for torch compile mode.
This one only partially recovers throughput, and at a large cost in warmup time, because many more graphs are compiled during warmup.
Without increasing the cache size, torch.compile reaches the limit and falls back to eager mode, which gives low throughput.
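
As a rough illustration of the sizing rule above, here is a minimal sketch, not the code from this PR. The helper names `required_dynamo_cache_size` and `apply_dynamo_cache_size` are hypothetical, and the layer count of 32 is only an example value. The sketch assumes the limit is applied through `torch._dynamo.config.cache_size_limit`; the actual change may set it differently.

```python
def required_dynamo_cache_size(num_decoder_layers: int, extra_graphs: int = 2) -> int:
    """Return the number of graphs expected in one forward pass.

    extra_graphs defaults to 2 to account for the RMSNorm and
    VocabParallelEmbedding graphs mentioned in the description.
    """
    if num_decoder_layers < 0 or extra_graphs < 0:
        raise ValueError("layer and graph counts must be non-negative")
    return num_decoder_layers + extra_graphs


def apply_dynamo_cache_size(limit: int) -> bool:
    """Raise the dynamo cache size limit if torch is installed.

    Hypothetical helper: returns False when torch is not available.
    The limit is never lowered below its current value.
    """
    try:
        import torch._dynamo
    except ImportError:
        return False
    current = torch._dynamo.config.cache_size_limit
    torch._dynamo.config.cache_size_limit = max(current, limit)
    return True


if __name__ == "__main__":
    # Example only: a model with 32 decoder layers needs room for 34 graphs.
    limit = required_dynamo_cache_size(32)
    print(limit)
    apply_dynamo_cache_size(limit)
```

Taking the maximum with the existing value keeps the sketch from shrinking a limit that was already raised elsewhere.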

Set dynamo cache size for torch compile

f78b021

Adapt the dynamo cache size setting to the current behavior, where the number of graphs compiled in one forward pass equals the number of LlamaDecoderLayer instances + 2 (RMSNorm, VocabParallelEmbedding)

anko-intel requested review from kzawora-intel, madamczykhabana, michalkuligowski, mgawarkiewicz, vivekgoe and afierka-intel as code owners January 24, 2025 08:16

anko-intel changed the title ~~Set dynamo cache size for torch compile~~ Adopt dynamo cache size to current layer definition Jan 24, 2025
