Initialization is slow #75
There is a WIP branch for speeding this up, it's almost ready, and it
addresses almost everything you raised here. There's one bug left to solve
and then I'll merge.
On Sun, Feb 18, 2024, 22:28 turboderp wrote:
Hi. So I'm a bit confused by the contribution guidelines and don't want to
submit an unsolicited PR.
But I was playing around with this, and the initialization seemed quite
slow. Especially with Qwen models that have a 150k vocabulary, it takes
over a minute to build the prefix trie (most of it spent in
JsonFreetextTokenCache.freeze), but even with smaller vocabularies it
would take upwards of 10 seconds.
So I made some optimizations here
<main...turboderp:lm-format-enforcer:main>.
Would you like me to submit a PR?
Briefly, the changes are:
First, ExLlamaV2 integration reads the vocabulary straight from the
ExLlamaV2Tokenizer instead of calling decode() on every token. This is
somewhat faster, especially using the HF Tokenizer with Qwen, where the
decode method is surprisingly expensive. This also fixes some bugs, I
think?
```python
token_0 = tokenizer.encode("0")[0]  # Returns multiple tokens if "0" encodes to more than one token
decoded_after_0 = tokenizer.decode(tensor_after_0)[1:]  # Seems to assume "0" encodes to one token
decoded_regular = tokenizer.decode(token_0)  # Always returns "0"
is_word_start_token = len(decoded_after_0) > len(decoded_regular)  # Considers all tokens that decode to more than one character to start a new word
```
So the output isn't exactly the same, but I assume it's more correct,
since tokens will have is_word_start_token == True precisely when they
are word start tokens.
Second change is to JsonFreetextTokenCache which now constructs the cache
using intersections on sets of ints, and avoids having to convert back to
token IDs at the end.
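To make the set-based construction concrete, here is a minimal toy sketch (not the actual JsonFreetextTokenCache code; vocab, build_caches and allowed_token_ids are illustrative names): allowlists are kept as sets of token IDs and combined by intersection, so nothing has to be mapped back from strings to IDs at the end.
```python
# Toy sketch of building allowlists as sets of ints (illustrative only; `vocab`
# is assumed to be a {token_id: token_str} mapping).
import json
from typing import Dict, Set, Tuple

def build_caches(vocab: Dict[int, str], max_token_len: int = 16) -> Tuple[Set[int], Dict[int, Set[int]]]:
    json_safe: Set[int] = set()   # IDs whose text can appear verbatim inside a JSON string
    fits_in: Dict[int, Set[int]] = {n: set() for n in range(1, max_token_len + 1)}
    for token_id, text in vocab.items():
        try:
            if json.loads(f'"{text}"') == text:   # survives a JSON round trip unchanged
                json_safe.add(token_id)
        except json.JSONDecodeError:
            pass
        for n in range(max(len(text), 1), max_token_len + 1):
            fits_in[n].add(token_id)   # token is short enough for every bound >= its length
    return json_safe, fits_in

def allowed_token_ids(json_safe: Set[int], fits_in: Dict[int, Set[int]], max_len: int) -> Set[int]:
    # Constraints combine as intersections of integer sets, so the result is already
    # a set of token IDs and nothing needs to be converted back from strings.
    bound = max(1, min(max_len, max(fits_in)))
    return json_safe & fits_in[bound]
```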
I've tested it on Mistral, Llama and Qwen and confirmed that the resulting
cache is identical, except for:
- The resulting tuples are sorted by token ID instead of by token
text. I couldn't see anywhere this would matter, though. (?)
- There are duplicate tokens in most models. For example in Mistral,
token 37 is <0x22>, which is an ASCII double quote, while token 28739
is \". The way the cache was built before, only the last of these
tokens was considered:
self.token_str_to_num[token_str] = token_int
And as a result, wherever " is a valid string, only token 28739 would be
considered a valid token ID. So I *think* (?) it's more correct to allow
both tokens in that case.
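As a small illustration of the difference (a sketch using the two Mistral token IDs mentioned above, not the library's actual code):
```python
from collections import defaultdict

# Two distinct Mistral token IDs that both decode to a double quote.
tokens = [(37, '"'), (28739, '"')]

# A plain dict keeps only the last ID per string, so token 37 is dropped.
token_str_to_num = {}
for token_id, token_str in tokens:
    token_str_to_num[token_str] = token_id
print(token_str_to_num)        # {'"': 28739}

# A string -> set-of-IDs mapping keeps both candidates.
token_str_to_ids = defaultdict(set)
for token_id, token_str in tokens:
    token_str_to_ids[token_str].add(token_id)
print(dict(token_str_to_ids))  # {'"': {37, 28739}}
```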
In any case, it does seem to still work in my tests, and it initializes
10-20x faster.
v0.9.0 was just released with the improvement. Can you update and check if the performance is now better?
It is a lot better, yes, and somewhat faster than my approach. It still has the bugs in the ExLlamaV2 integration, and it's unusably slow for Qwen because it calls decode() on every token. I tested it on a few different models, and here is how long initialization takes: [timing table omitted]
All seem to work correctly (except for the Qwen models that I lost patience with). It's a small and self-contained change:
```python
def _build_regular_tokens_list(tokenizer: ExLlamaV2Tokenizer) -> List[Tuple[int, str, bool]]:
    vocab_size = tokenizer.tokenizer.vocab_size()
    all_special_ids = set(tokenizer.extended_id_to_piece.keys())
    all_special_ids.update({ tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id, tokenizer.unk_token_id })
    id_to_piece = tokenizer.get_id_to_piece_list()
    regular_tokens = []
    for token_idx in range(vocab_size):
        if token_idx in all_special_ids:
            continue
        decoded = id_to_piece[token_idx]
        is_word_start_token = len(decoded) > 0 and decoded[0] == " "
        regular_tokens.append((token_idx, decoded, is_word_start_token))
    return regular_tokens
```
Last column above is 0.9.0 with this change applied.
Any news on this?
The reason for the usage of decode is that it is the only way (as far as I know) to know which token is a word start token. In most tokenizers the leading space does not appear in this mapping, but we need it to build the correct prefix tree. Is there a solution for this?
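For illustration, a rough sketch of that decode-based check (paraphrased, not the enforcer's exact code; the GPT-2 tokenizer is just a convenient example): decode the token with a known prefix token in front and see whether a leading space appears.
```python
# Sketch: detect word-start tokens via decode(), since the raw vocab pieces may
# hide the leading space behind markers like "Ġ" or "▁".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prefix_id = tokenizer.encode("0")[0]            # a short, known prefix token

def starts_word(token_id: int) -> bool:
    decoded = tokenizer.decode([prefix_id, token_id])
    return decoded[1:].startswith(" ")          # text after the "0" prefix keeps its space

print(starts_word(tokenizer.encode(" hello")[0]))  # True
print(starts_word(tokenizer.encode("hello")[0]))   # False
```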
The data is usually available in the tokenizer model, HF Tokenizers included. Anyway, the ExLlama tokenizer creates the id-to-piece mapping for this reason, with a bunch of logic to handle the various formats, producing a list with a common format. Here are the contents of the mapping for Mistral, Llama, Orion and Gemma: [listings omitted]
Word start tokens are then simply the ones that begin with a space, and a sequence can be decoded as just the concatenation of the respective pieces for each token ID. The ExLlama tokenizer also needs this to build a trie (used for token healing and other stuff), which is the same kind of data structure as the prefix tree here.
One place where this all breaks a bit is with multi-token characters, such as UTF-8 emitted by the model using byte fallback tokens. I've also had issues with how Chinese characters are encoded in Tiktoken (seems to be almost UTF-8, but not quite?). But none of that really pertains to the leading space, and it would be an issue in any case if you're assuming a token always represents at least one character (as I think you probably have to for this kind of character-based constrained sampling).
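As a concrete sketch of reading that information from the vocabulary itself (not code from this thread; the model name is only an example), BPE-style vocabs mark a leading space with "Ġ" and SentencePiece vocabs with the metaspace "▁":
```python
# Sketch: derive word-start tokens from the vocab pieces instead of decode().
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example model
vocab = tokenizer.get_vocab()                       # {piece_string: token_id}

word_start_ids = {
    token_id
    for piece, token_id in vocab.items()
    if piece.startswith(("Ġ", "▁"))                 # the piece encodes its own leading space
}
print(f"{len(word_start_ids)} of {len(vocab)} tokens start a new word")
```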
Thanks for the detailed response!
The unit test could lazy-import exllamav2 and skip if it doesn't exist to avoid optional dependency and CI issues.
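For instance, a test along these lines could work (a sketch only; the fixture path is hypothetical, and pytest.importorskip handles the skip when exllamav2 isn't installed):
```python
# Sketch of an optional-dependency test (fixture path is hypothetical).
import pytest

def test_exllamav2_regular_tokens_list():
    exllamav2 = pytest.importorskip("exllamav2")  # skipped cleanly if not installed
    from lmformatenforcer.integrations.exllamav2 import _build_regular_tokens_list

    # A real test would point at a small fixture model; this path is a placeholder.
    config = exllamav2.ExLlamaV2Config()
    config.model_dir = "/path/to/a/small/exl2/model"
    config.prepare()
    tokenizer = exllamav2.ExLlamaV2Tokenizer(config)

    regular_tokens = _build_regular_tokens_list(tokenizer)
    assert regular_tokens
    assert all(isinstance(token_id, int) for token_id, _, _ in regular_tokens)
```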
Yes, I'm only suggesting the code snippet above. #76 is a bigger improvement as far as tokenizerprefixtree.py is concerned. As for the integration, though, the original is unusably slow and I'm not sure how useful a unit test would be. I let it finish now just to be sure, with the latest Tokenizers library (0.15.2) on this Windows PC I'm currently on and the Qwen1.5 tokenizer model.
It also looks like the unit test would fail anyway because the current implementation isn't correct. Here's a snippet with the current implementation: [output omitted]
Vs. with the fix above: [output omitted]
Here's some intermediates for the Llama SPM tokenizer, token idx 293:
```python
tensor_after_0 = torch.tensor(token_0.tolist() + [token_idx], dtype=torch.long)  # tensor([31822, 31852, 293])
decoded_after_0 = tokenizer.decode(tensor_after_0)[1:]  # 'ion'
decoded_regular = tokenizer.decode(token_0)  # '0'
is_word_start_token = len(decoded_after_0) > len(decoded_regular)  # True
```
All of this is dealt with in ExLlamaV2's tokenizer, specifically to produce a reliable token -> string mapping that strips out all of the normalization.
Hi, I'd like to take a deeper look into this. Can you send a reproducing snippet (which model correctly activates the Qwen1.5 tokenizer, etc.)?
Sure, here's a snippet:
```python
import sys, os
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler
from pydantic import BaseModel
from lmformatenforcer.integrations.exllamav2 import ExLlamaV2TokenEnforcerFilter
from lmformatenforcer import JsonSchemaParser
import time
from typing import List, Tuple
# Initialize model, load only tokenizer
model_directory = "/mnt/str/models/smaug-72b-exl2/4.0bpw/"
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
class TestSchema(BaseModel):
    value_a: str
    value_b: str
schema_parser = JsonSchemaParser(TestSchema.schema())
# Create filter
print("Init filter...")
time_a = time.time()
lmfe_filter1 = ExLlamaV2TokenEnforcerFilter(schema_parser, tokenizer)
time_b = time.time() - time_a
print(f"Duration: {time_b:.4f}")
# Patch function and repeat
def _build_regular_tokens_list_new(tokenizer: ExLlamaV2Tokenizer) -> List[Tuple[int, str, bool]]:
    vocab_size = tokenizer.tokenizer.vocab_size()
    all_special_ids = set(tokenizer.extended_id_to_piece.keys())
    all_special_ids.update({ tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id, tokenizer.unk_token_id })
    id_to_piece = tokenizer.get_id_to_piece_list()
    regular_tokens = []
    for token_idx in range(vocab_size):
        if token_idx in all_special_ids:
            continue
        decoded = id_to_piece[token_idx]
        is_word_start_token = len(decoded) > 0 and decoded[0] == " "
        regular_tokens.append((token_idx, decoded, is_word_start_token))
    return regular_tokens
import lmformatenforcer.integrations.exllamav2
lmformatenforcer.integrations.exllamav2._build_regular_tokens_list = _build_regular_tokens_list_new
print("Init filter (patched)...")
time_a = time.time()
lmfe_filter2 = ExLlamaV2TokenEnforcerFilter(schema_parser, tokenizer)
time_b = time.time() - time_a
print(f"Duration: {time_b:.4f}")
# Compare regular token lists
max_diff = 20
for a, b in zip(lmfe_filter1.token_enforcer.regular_tokens, lmfe_filter2.token_enforcer.regular_tokens):
    if a == b: continue
    print(f"A: {repr(a):30} != B: {repr(b):30}")
    max_diff -= 1
    if max_diff == 0: break
```
Note that the behavior might depend on your library versions. The behavior is the same for all Qwen models, and you can reproduce it with for instance this one, though it also shows up, to a lesser extent, with other models.
[script output omitted]
And with the HF tokenizer (reading tokenizer.json rather than tokenizer.model): [output omitted]
As for the output, it seems to be identical, except the existing version recognizes any multi-character token as a word start token. Which seems wrong to me? Not sure what effect that has, since it still seems to constrain generation correctly regardless, at least for JSON. There's also some example code here you could look at.
Merged, released in v0.9.6. Thank you @turboderp and @bdashore3!