Validate special token ids are in range when loading GGUF model #3635
Conversation
Is the test workflow borked?

Happens to everyone, especially if you find something exciting and care about it :)
I've been looking at the Python side and I actually can't replicate how the special EOS token got set to 32000. Possibly it was converted using an older version. Also, I can't even get it to convert with the existing `added_tokens.json`:

```json
{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0
}
```

I don't know if that's wrong, but the convert script expects added tokens to be... well, added. This is redefining tokens that are already in the main vocab, but convert expects the added tokens to come with ids past the normal vocab. I think the error here adds to the confusion:

```python
if expected_ids != actual_ids:
    raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
```

(Unfortunately I think I'm responsible for that mistake.) Anyway, I'm not even sure how that model converter even managed to convert the model to GGUF; maybe they hacked `convert.py`.
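(For context, a minimal sketch of the sequencing check around that error; the function wrapper and variable names here are approximate, not the exact `convert.py` implementation:)

```python
# Approximate sketch of convert.py's added-token sequencing check.
# added_tokens maps token text -> id, as read from added_tokens.json.
def check_added_tokens(added_tokens: dict[str, int], vocab_size: int) -> None:
    actual_ids = sorted(added_tokens.values())
    # Added tokens are expected to extend the base vocab, so their ids must
    # be contiguous and start immediately after the last base-vocab id.
    expected_ids = list(range(vocab_size, vocab_size + len(actual_ids)))
    if expected_ids != actual_ids:
        raise Exception(f"Expected added token IDs to be sequential and "
                        f"start at {vocab_size}; got {actual_ids}")

# The added_tokens.json above fails for a 32000-token base vocab:
# check_added_tokens({"</s>": 2, "<s>": 1, "<unk>": 0}, 32000)
#   raises: Expected added token IDs to be sequential and start at 32000; got [0, 1, 2]
```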
Yeah, that error message confused me at first :) I believe the issue with added_tokens < vocab_size is fixed by another PR though, so hopefully no need to look at that? #3585

I pointed out the issue in the base repo to Undi, so he fixed it. As to how it happened in the first place: he may have had different config files at the time he made his GGUFs, or yes, it's entirely possible he hacked convert.py! I talked to him about it the other day and he couldn't remember how he made them or what config files he had at the time. And he never noticed the llama.cpp problem because he tested with SillyTavern, which must not have the same bug. I did GGUFs for that model at the weekend, with the corrected config files (EOS should be 2).

It's always an adventure with Undi models 🤣
Ah, thanks.
The thing that confused me is setting `"eos_token_id": 32000` while the added tokens define:

```json
{
  "added_tokens": [
    {
      "id": 32000,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ]
}
```
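(A tiny illustration, not project code, of why that entry conflicts: the base SentencePiece vocab already contains `</s>`, so id 32000 redefines an existing token rather than adding a new one, and in the converted GGUF, which kept n_vocab = 32000, that id also falls out of range:)

```python
# Illustration only: base vocab vs. the "added" token above.
base_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}  # first entries of a 32000-token vocab
n_vocab = 32000                                  # valid ids are 0 .. 31999
added = {"</s>": 32000}                          # from the added_tokens entry above

for text, tid in added.items():
    if text in base_vocab:
        print(f"{text!r} (id {tid}) redefines base-vocab id {base_vocab[text]}")
    if not 0 <= tid < n_vocab:
        print(f"id {tid} is out of range for n_vocab {n_vocab}")
```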
Haha, at least it's a really good model for being so small. Looking at the output, I can actually imagine it coming from one of the weaker 70Bs, which really hasn't been the case in the past even with 30Bs.
Needs coordination with #3585, and then we can merge.
I don't think there should be a conflict; they're kind of doing different stuff. #3585 only touches the added-tokens handling. Pretty sure it wouldn't even cause a merge conflict.
Force-pushed from 9d14041 to 76b05fc:

- Small optimization for llama_byte_to_token SPM mode
- Update conversion scripts accordingly

Co-authored-by: Georgi Gerganov <[email protected]>
Is this actually approved, or am I missing something about the #3585 coordination issue?
This pull adds validation for special token ids that are out of range. Invalid ones will be ignored and a warning message is displayed when loading the model. This model can reproduce the issue: https://huggingface.co/Undi95/Mistral-11B-OmniMix-bf16-GGUF (see the issue for the HF version with metadata).
I also included a small optimization for `llama_byte_to_token` SPM mode: `snprintf` is kind of overkill there since converting a `uint8_t` to hex is so trivial. There's also slightly more friendly behavior if a BPE model is missing a newline token: assert rather than blindly dereferencing the tokenize result.
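(For illustration, a Python sketch of the `<0xXX>` byte-piece format involved here — this mirrors the SPM byte-fallback convention, not the llama.cpp implementation, which is C++:)

```python
# Sketch (not llama.cpp code): SPM represents raw bytes as pieces like "<0x0A>".
# The C++ change replaces an snprintf-style "<0x%02X>" formatting call with
# manual hex digits, since the conversion is only two table lookups.
HEX = "0123456789ABCDEF"

def byte_to_piece(byte: int) -> str:
    assert 0 <= byte <= 0xFF
    return "<0x" + HEX[byte >> 4] + HEX[byte & 0x0F] + ">"

print(byte_to_piece(ord("\n")))  # -> <0x0A>
```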
Of course, we shouldn't end up in a situation where the GGUF file has invalid token ids in the first place. I looked at `convert.py` and the GGUF Python side. The problem is that the special vocab handling doesn't have access to other information, like the model's vocab size, so it can't validate the token ids. I updated `SpecialVocab` to take an optional `n_vocab` argument so it can validate that the ids are in range (the in-repo conversion scripts were also updated to pass this), and refactored to make the logic a bit clearer as well.
Closes #3634