
ExllamaV2 optimizations #88

Merged
merged 5 commits into noamgat:main on Apr 19, 2024
Conversation

bdashore3
Contributor

Currently, building the initial token tree is inefficient and can cause slow ingestion of tokens (for example, when processing a JSON schema). This is especially evident with models that have large vocabularies, such as Cohere Command-R, Gemma, and Qwen: generation locks up and can take hours to process. These commits optimize that initial build when creating an ExllamaV2 LMFE filter.
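As an illustrative sketch (not the actual LMFE or ExllamaV2 code), the "initial token tree" can be pictured as a character-level prefix tree over the tokenizer vocabulary. With vocabularies of 250k+ tokens, rebuilding such a structure per request is expensive, so it should be built once per tokenizer and cached. The `vocab` shape below is a hypothetical `token string -> token id` mapping:

```python
def build_token_tree(vocab):
    """Build a character-level prefix tree over a tokenizer vocabulary.

    vocab: dict mapping token string -> token id (hypothetical shape).
    Each node holds its children and, if a token ends here, its id.
    """
    root = {"children": {}, "token_id": None}
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node["children"].setdefault(
                ch, {"children": {}, "token_id": None}
            )
        node["token_id"] = token_id
    return root


# Toy vocabulary; a real model vocabulary has hundreds of thousands of entries.
vocab = {"a": 0, "ab": 1, "abc": 2, "b": 3}
tree = build_token_tree(vocab)

# Walking the tree char-by-char finds which tokens match a given prefix.
assert tree["children"]["a"]["token_id"] == 0
assert tree["children"]["a"]["children"]["b"]["token_id"] == 1
```

The point of the optimization in this PR is precisely that this kind of one-time build must be fast (and reused) rather than redone on every filter creation.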

Tests: Running Command-R with a JSON schema in TabbyAPI using LMFE v0.9.5 would not start generating. With these commits, generation starts immediately.

References #75

Thanks @turboderp for creating these commits.

@noamgat noamgat merged commit 6e87b80 into noamgat:main Apr 19, 2024
1 check passed
@noamgat
Owner

noamgat commented Apr 19, 2024

Merged, thanks @bdashore3 and @turboderp for the contribution!
