ExllamaV2 optimizations #88

bdashore3 · 2024-04-14T15:49:38Z

Currently, building the initial token tree is inefficient and can make token ingestion very slow (for example, when a JSON schema is used). This is most evident with models that have large vocabularies, such as Cohere Command-R, Gemma, and Qwen: generation locks up and can take hours to process. These commits optimize that initial build when creating an ExllamaV2 LMFE filter.
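For readers unfamiliar with the structure being discussed, here is a minimal sketch of a token prefix tree built in a single pass over the vocabulary. It is an illustration only, not the code in `lmformatenforcer/tokenizerprefixtree.py`: the names `TrieNode`, `build_token_trie`, and `tokens_with_prefix` and the toy vocabulary are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One node per character prefix (hypothetical structure, for illustration)."""
    children: dict = field(default_factory=dict)   # char -> TrieNode
    token_ids: list = field(default_factory=list)  # ids of tokens ending exactly here


def build_token_trie(vocab):
    """Insert each (token_id, decoded_text) pair once.

    Total work is proportional to the number of characters in the vocabulary,
    so the cost grows linearly with vocabulary size.
    """
    root = TrieNode()
    for token_id, text in vocab:
        node = root
        for ch in text:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.token_ids.append(token_id)
    return root


def tokens_with_prefix(root, prefix):
    """Return the sorted ids of all tokens whose text starts with `prefix`."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.token_ids)
        stack.extend(current.children.values())
    return sorted(found)


# Toy vocabulary; a real tokenizer would supply tens or hundreds of thousands of entries.
toy_vocab = [(0, "a"), (1, "ab"), (2, "abc"), (3, "b"), (4, "{"), (5, '{"')]
trie = build_token_trie(toy_vocab)
print(tokens_with_prefix(trie, "ab"))  # [1, 2]
```

With a vocabulary of a few hundred thousand tokens, any per-token work that rescans the vocabulary or rebuilds large sets becomes the dominant cost, which is the kind of overhead this PR targets.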

Tests: Ran command-r with a JSON schema in TabbyAPI using LMFE v0.9.5; generation would not start. With these commits, generation starts immediately.

References #75

Thanks @turboderp for creating these commits.

noamgat · 2024-04-19T07:03:28Z

Merged, thanks @bdashore3 and @turboderp for the contribution!

turboderp added 5 commits February 18, 2024 19:08

Extract vocab from ExLlamaV2Tokenizer id_to_piece list

d34296a

Optimize initialization of JsonFreetextTokenCache

980b632

Precompute token lengths and simplify set construction (current bottleneck), further 2x speedup

4695657

Merge remote-tracking branch 'origin/main'

6257b74

# Conflicts: # lmformatenforcer/tokenizerprefixtree.py

Merge pull request #1 from noamgat/main

13b0ef9

Merge recent changes
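As a rough illustration of the idea named in the third commit message (precomputing token lengths and simplifying set construction), the sketch below measures each token's length once and then builds one allowed-token set per maximum length incrementally. It is a hypothetical example and not the actual `JsonFreetextTokenCache` code; the function name `build_length_cache` and the toy data are assumptions.

```python
from collections import defaultdict


def build_length_cache(token_texts):
    """Map each maximum length L to the ids of tokens with len(text) <= L.

    Each token's length is computed exactly once, and every set extends the
    previous one instead of rescanning the whole vocabulary per length.
    """
    by_length = defaultdict(list)
    for token_id, text in token_texts.items():
        by_length[len(text)].append(token_id)

    cache = {}
    running = set()
    for length in range(max(by_length, default=0) + 1):
        running.update(by_length.get(length, ()))
        cache[length] = frozenset(running)
    return cache


# Toy data standing in for a tokenizer's id -> text mapping.
toy_tokens = {0: "a", 1: "ab", 2: "abc", 3: "b"}
length_cache = build_length_cache(toy_tokens)
print(sorted(length_cache[2]))  # [0, 1, 3]
```

A lookup such as `length_cache[2]` then answers "which tokens fit in two remaining characters" without touching the vocabulary again.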

noamgat merged commit 6e87b80 into noamgat:main Apr 19, 2024
1 check passed
