On a system with a 1x H100 80GB GPU, running vLLM with Llama 70B (FP8), throughput drops from 1000+ tok/s to around 100 tok/s when I enable lm-format-enforcer. Profiling showed that the core parsing done by lm-format-enforcer is fast enough; the large slowdown appears to come from building, moving, and applying the logit mask between vLLM and lm-format-enforcer.
Outlines and vLLM are changing how the logits are passed around; see the progress in vllm-project/vllm#3567.
I would like to request that a similar strategy be replicated for lm-format-enforcer. From what I understand, the idea could be replicated by keeping a mask cache in the `VLLMLogitsProcessor` for common sets of `allowed_tokens` returned by the token enforcer, to avoid converting from a `list[int]` (on CPU) to a `tensor[128k]` (on GPU) on every step. A sketch follows below.
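To make the idea concrete, here is a minimal sketch of such a mask cache, assuming a vLLM-style `__call__(input_ids, logits)` logits-processor interface. The class name, the `_mask_cache` attribute, and the `get_allowed_tokens` call are hypothetical stand-ins for illustration, not lm-format-enforcer's actual API:

```python
import torch

class CachedMaskLogitsProcessor:
    """Hypothetical sketch: cache GPU logit masks keyed by the allowed-token set."""

    def __init__(self, token_enforcer, vocab_size: int, device: str = "cuda"):
        self.token_enforcer = token_enforcer  # assumed to yield allowed token ids per step
        self.vocab_size = vocab_size
        self.device = device
        self._mask_cache: dict[frozenset[int], torch.Tensor] = {}

    def __call__(self, input_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        allowed = frozenset(self.token_enforcer.get_allowed_tokens(input_ids))
        mask = self._mask_cache.get(allowed)
        if mask is None:
            # Build the mask once on CPU, move it to the GPU, and reuse it for
            # every later step that produces the same allowed-token set.
            mask = torch.full((self.vocab_size,), float("-inf"))
            mask[list(allowed)] = 0.0
            mask = mask.to(self.device)
            self._mask_cache[allowed] = mask
        # Adding the cached mask leaves allowed logits unchanged (+0)
        # and drives disallowed logits to -inf.
        return logits + mask
```

Because constrained decoding tends to cycle through a small number of parser states, the cache hit rate should be high and the expensive list-to-GPU-tensor conversion would happen only on the first occurrence of each set.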
Another optimization that might help is to work with the complement of `allowed_tokens` when that set is large: start from a tensor of zeros and assign -inf to the `not_allowed_tokens` positions (see the sketch after this paragraph). That said, I expect the mask cache already covers this case, because when `allowed_tokens` is very large we are probably in a free-text state where the parser returns the same set repeatedly, so we can rely on the cache instead.
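For completeness, a sketch of the complement idea, building the mask from whichever side is smaller. The `build_mask` helper is hypothetical, under the assumption that the dominant per-step cost is creating and transferring the index tensor:

```python
import torch

def build_mask(allowed: set[int], vocab_size: int, device: str = "cuda") -> torch.Tensor:
    """Hypothetical helper: build an additive logit mask from the smaller token set."""
    if len(allowed) <= vocab_size // 2:
        # Few allowed tokens: start fully blocked, then open the allowed ones.
        mask = torch.full((vocab_size,), float("-inf"))
        mask[torch.tensor(sorted(allowed))] = 0.0
    else:
        # Most tokens allowed (near free-text): start fully open,
        # then block only the complement.
        not_allowed = [t for t in range(vocab_size) if t not in allowed]
        mask = torch.zeros(vocab_size)
        if not_allowed:
            mask[torch.tensor(not_allowed)] = float("-inf")
    return mask.to(device)
```

Either way, the index tensor stays proportional to the smaller of the two sets, which bounds the CPU-side work and the CPU-to-GPU transfer in both the highly constrained and the near-free-text case.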