Add VLLM_SCHED_PREFILL_KVC_FREEPCT #89

sanyalington · 2024-07-18T09:45:41Z

Add the VLLM_SCHED_PREFILL_KVC_FREEPCT feature, which schedules prefills only when a given percentage of the KV cache is free.
To enable this feature, export VLLM_SCHED_PREFILL_KVC_FREEPCT set to a float greater than 0.0 (the default of 0.0 leaves the feature disabled).
This helps in large-batch-size offline inference, where prefills can be batched together and scheduled once a certain percentage of the KV cache is free. While the free portion of the KV cache is below VLLM_SCHED_PREFILL_KVC_FREEPCT, no prefills are scheduled and decodes get priority. As sequences finish decoding, KV cache is freed up and prefills can resume.
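The gating rule described above can be sketched as a small predicate. The PR text does not say whether the value is a fraction or a percentage, so this sketch assumes VLLM_SCHED_PREFILL_KVC_FREEPCT is a percentage (0-100) of the total GPU KV-cache blocks. The helper names `kv_cache_free_pct` and `can_schedule_prefill` are hypothetical and are not functions from `vllm/core/scheduler.py`.

```python
def kv_cache_free_pct(num_free_blocks: int, num_total_blocks: int) -> float:
    """Percentage (0-100) of KV-cache blocks that are currently free."""
    if num_total_blocks <= 0:
        raise ValueError("num_total_blocks must be positive")
    return 100.0 * num_free_blocks / num_total_blocks


def can_schedule_prefill(num_free_blocks: int, num_total_blocks: int,
                         threshold_pct: float) -> bool:
    """Hypothetical gate: admit prefills only when enough KV cache is free.

    threshold_pct is assumed to be a percentage; 0.0 (the default in the
    diff) disables the gate, so prefills are always allowed.
    """
    if threshold_pct <= 0.0:
        return True
    # "Below the threshold" blocks prefills, so equality still admits them.
    return kv_cache_free_pct(num_free_blocks, num_total_blocks) >= threshold_pct
```

For example, with 1,000 blocks of which 200 are free and a threshold of 25.0, `can_schedule_prefill(200, 1000, 25.0)` is False, so that scheduling step runs decodes only. Once finished sequences release 50 more blocks (25% free), prefills are admitted again and can be batched together.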


gshtras · 2024-07-18T15:33:25Z

vllm/core/scheduler.py

@@ -24,6 +24,9 @@
 ARTIFICIAL_PREEMPTION_MAX_CNT = 500


+VLLM_SCHED_PREFILL_KVC_FREEPCT = float(
+    os.getenv("VLLM_SCHED_PREFILL_KVC_FREEPCT", 0.0))  # noqa


Please move to envs.py

github-actions · 2024-10-30T01:59:47Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

Add VLLM_SCHED_PREFILL_KVC_FREEPCT feature to schedule prefill only when percentage of kv cache is free (commit 402f99d)

shajrawi requested a review from gshtras July 18, 2024 15:16

gshtras reviewed Jul 18, 2024


github-actions bot added the stale label Oct 30, 2024
