[Request] Replicate outlines vLLM performance improvements #125

Open
laurens-gs opened this issue Jul 29, 2024 · 0 comments


laurens-gs commented Jul 29, 2024

On a system with a 1x H100 80GB GPU, vLLM, and Llama 70B (FP8), throughput drops from 1000+ tok/s to around 100 tok/s when I enable lm-format-enforcer. Profiling revealed that the core parsing done by lm-format-enforcer is sufficiently fast; the large slowdown seems to come from applying and moving around the logit mask between vLLM and lm-format-enforcer.

Outlines and vLLM are changing how the logits are passed around; see the progress in vllm-project/vllm#3567.

I would like to request that a similar strategy be replicated for lm-format-enforcer. From what I understand, the idea could be replicated by keeping a mask cache in the VLLMLogitsProcessor for common sets of allowed_tokens returned by the token enforcer, so we avoid converting from list[int] (on CPU) to a tensor over the ~128k vocabulary (on GPU) many times. A sketch of what I mean follows.
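
A minimal sketch of the caching idea, not lm-format-enforcer's actual code: the class name, constructor, and the way allowed_tokens reaches the processor are hypothetical, and the point is only the frozenset-keyed mask cache.

```python
import torch

class MaskCachingLogitsProcessor:
    """Hypothetical logits processor that caches one GPU mask per allowed-token set."""

    def __init__(self, vocab_size: int, device: torch.device):
        self.vocab_size = vocab_size
        self.device = device
        # One cached GPU mask per distinct set of allowed tokens, so a
        # repeated set costs a dict lookup instead of a CPU->GPU transfer.
        self._mask_cache: dict[frozenset, torch.Tensor] = {}

    def _mask_for(self, allowed_tokens: list) -> torch.Tensor:
        key = frozenset(allowed_tokens)
        mask = self._mask_cache.get(key)
        if mask is None:
            mask = torch.full((self.vocab_size,), float("-inf"))
            mask[list(key)] = 0.0
            mask = mask.to(self.device)  # the only host-to-device copy for this set
            self._mask_cache[key] = mask
        return mask

    def __call__(self, allowed_tokens: list, logits: torch.Tensor) -> torch.Tensor:
        # Adding the cached mask leaves allowed logits unchanged and
        # sends every other logit to -inf.
        return logits + self._mask_for(allowed_tokens)
```
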

Another optimization I imagine could help is working with the complement of allowed_tokens when the allowed_tokens set is large: e.g. start with a tensor of zeros and assign -inf to the set of not_allowed_tokens. That said, I expect the mask cache already covers this case, because a very large allowed_tokens set usually means we are in a free-text state where the parser repeatedly returns the same set, so we can rely on the cache instead.
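
For illustration, a hypothetical build_mask helper (assumed name, not an existing API) that switches to the complement representation when most of the vocabulary is allowed:

```python
import torch

def build_mask(allowed_tokens: list, vocab_size: int) -> torch.Tensor:
    """Hypothetical helper: build an additive logit mask on the CPU."""
    if len(allowed_tokens) > vocab_size // 2:
        # Free-text-like state: start from zeros and ban the small complement.
        allowed = torch.zeros(vocab_size, dtype=torch.bool)
        allowed[allowed_tokens] = True
        mask = torch.zeros(vocab_size)
        mask[~allowed] = float("-inf")
    else:
        # Constrained state: start from -inf and permit the small allowed set.
        mask = torch.full((vocab_size,), float("-inf"))
        mask[allowed_tokens] = 0.0
    return mask
```
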

laurens-gs changed the title from "Replicate outlines vLLM performance improvements" to "[Request] Replicate outlines vLLM performance improvements" on Jul 29, 2024