On a system with a 1x H100 80GB GPU, running vLLM with Llama 70B (FP8), throughput drops from 1000+ tok/s to around 100 tok/s when I enable lm-format-enforcer. Profiling showed that the core parsing done by lm-format-enforcer is fast enough; the large slowdown appears to come from building, moving, and applying the logit mask between vLLM and lm-format-enforcer.
Outlines and vLLM are changing how the logits are passed around; see the progress in vllm-project/vllm#3567.
I would like to request that a similar strategy be replicated for lm-format-enforcer. From what I understand, the idea could be replicated by keeping a mask cache in the `VLLMLogitsProcessor` for common sets of `allowed_tokens` returned by the token enforcer, to avoid converting from a `list[int]` (on CPU) to a `tensor[128k]` (on GPU) on every step. A sketch follows below.
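To make the idea concrete, here is a minimal sketch of such a mask cache, assuming a vLLM-style `__call__(input_ids, logits)` logits-processor interface. The class name, the `_mask_cache` attribute, and the `get_allowed_tokens` call are hypothetical stand-ins for illustration, not lm-format-enforcer's actual API:

```python
import torch

class CachedMaskLogitsProcessor:
    """Hypothetical sketch: cache GPU logit masks keyed by the allowed-token set."""

    def __init__(self, token_enforcer, vocab_size: int, device: str = "cuda"):
        self.token_enforcer = token_enforcer  # assumed to yield allowed token ids per step
        self.vocab_size = vocab_size
        self.device = device
        self._mask_cache: dict[frozenset[int], torch.Tensor] = {}

    def __call__(self, input_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        allowed = frozenset(self.token_enforcer.get_allowed_tokens(input_ids))
        mask = self._mask_cache.get(allowed)
        if mask is None:
            # Build the mask once on CPU, move it to the GPU, and reuse it for
            # every later step that produces the same allowed-token set.
            mask = torch.full((self.vocab_size,), float("-inf"))
            mask[list(allowed)] = 0.0
            mask = mask.to(self.device)
            self._mask_cache[allowed] = mask
        # Adding the cached mask leaves allowed logits unchanged (+0)
        # and drives disallowed logits to -inf.
        return logits + mask
```

Because constrained decoding tends to cycle through a small number of parser states, the cache hit rate should be high and the expensive list-to-GPU-tensor conversion would happen only on the first occurrence of each set.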
Another optimization that might help is to work with the complement of `allowed_tokens` when that set is large: start from a tensor of zeros and assign -inf to the `not_allowed_tokens` positions (see the sketch after this paragraph). That said, I expect the mask cache already covers this case, because when `allowed_tokens` is very large we are probably in a free-text state where the parser returns the same set repeatedly, so we can rely on the cache instead.
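For completeness, a sketch of the complement idea, building the mask from whichever side is smaller. The `build_mask` helper is hypothetical, under the assumption that the dominant per-step cost is creating and transferring the index tensor:

```python
import torch

def build_mask(allowed: set[int], vocab_size: int, device: str = "cuda") -> torch.Tensor:
    """Hypothetical helper: build an additive logit mask from the smaller token set."""
    if len(allowed) <= vocab_size // 2:
        # Few allowed tokens: start fully blocked, then open the allowed ones.
        mask = torch.full((vocab_size,), float("-inf"))
        mask[torch.tensor(sorted(allowed))] = 0.0
    else:
        # Most tokens allowed (near free-text): start fully open,
        # then block only the complement.
        not_allowed = [t for t in range(vocab_size) if t not in allowed]
        mask = torch.zeros(vocab_size)
        if not_allowed:
            mask[torch.tensor(not_allowed)] = float("-inf")
    return mask.to(device)
```

Either way, the index tensor stays proportional to the smaller of the two sets, which bounds the CPU-side work and the CPU-to-GPU transfer in both the highly constrained and the near-free-text case.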