Validate special token ids are in range when loading GGUF model #3635
Conversation
Is the test workflow borked?

Happens to everyone, especially if you find something exciting and care about it :)
I've been looking at the Python side and I actually can't replicate how the special EOS token got set to 32000. Possibly it was converted using an older version. Also, I can't even get it to convert with the existing `added_tokens.json`:

```json
{
  "</s>": 2,
  "<s>": 1,
  "<unk>": 0
}
```

I don't know if that's wrong, but the convert script expects added tokens to be... well, added. This is redefining tokens that are already in the main vocab, but convert expects the added tokens to come with ids past the normal vocab. I think the error here adds to the confusion:

```python
if expected_ids != actual_ids:
    raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
```

(Unfortunately I think I'm responsible for that mistake.) Anyway, I'm not even sure how that model converter even managed to convert the model to GGUF; maybe they hacked `convert.py`.
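(For context, a minimal sketch of the sequencing check around that error; the function wrapper and variable names here are approximate, not the exact `convert.py` implementation:)

```python
# Approximate sketch of convert.py's added-token sequencing check.
# added_tokens maps token text -> id, as read from added_tokens.json.
def check_added_tokens(added_tokens: dict[str, int], vocab_size: int) -> None:
    actual_ids = sorted(added_tokens.values())
    # Added tokens are expected to extend the base vocab, so their ids must
    # be contiguous and start immediately after the last base-vocab id.
    expected_ids = list(range(vocab_size, vocab_size + len(actual_ids)))
    if expected_ids != actual_ids:
        raise Exception(f"Expected added token IDs to be sequential and "
                        f"start at {vocab_size}; got {actual_ids}")

# The added_tokens.json above fails for a 32000-token base vocab:
# check_added_tokens({"</s>": 2, "<s>": 1, "<unk>": 0}, 32000)
#   raises: Expected added token IDs to be sequential and start at 32000; got [0, 1, 2]
```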
Yeah, that error message confused me at first :) I believe the issue with added_tokens < vocab_size is fixed by another PR though, so hopefully no need to look at that? #3585

I pointed out the issue in the base repo to Undi, so he fixed it. As to how it happened in the first place: he may have had different config files at the time he made his GGUFs, or yes, it's entirely possible he hacked convert.py! I talked to him about it the other day and he couldn't remember how he made them or what config files he had at the time. And he never noticed the llama.cpp problem because he tested with SillyTavern, which must not have the same bug. I did GGUFs for that model at the weekend, with the corrected config files (EOS should be 2).

It's always an adventure with Undi models 🤣
Ah, thanks.
The thing that confused me is setting `"eos_token_id": 32000` while the added tokens define:

```json
{
  "added_tokens": [
    {
      "id": 32000,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ]
}
```
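(A tiny illustration, not project code, of why that entry conflicts: the base SentencePiece vocab already contains `</s>`, so id 32000 redefines an existing token rather than adding a new one, and in the converted GGUF, which kept n_vocab = 32000, that id also falls out of range:)

```python
# Illustration only: base vocab vs. the "added" token above.
base_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}  # first entries of a 32000-token vocab
n_vocab = 32000                                  # valid ids are 0 .. 31999
added = {"</s>": 32000}                          # from the added_tokens entry above

for text, tid in added.items():
    if text in base_vocab:
        print(f"{text!r} (id {tid}) redefines base-vocab id {base_vocab[text]}")
    if not 0 <= tid < n_vocab:
        print(f"id {tid} is out of range for n_vocab {n_vocab}")
```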
Haha, at least it's a really good model for being so small. Looking at the output, I can actually imagine it coming from one of the weaker 70Bs, which really hasn't been the case in the past even with 30Bs.
Needs coordination with #3585, and then we can merge.
I don't think there should be a conflict; they're kind of doing different stuff. #3585 only touches the added-tokens handling. Pretty sure it wouldn't even cause a merge conflict.
Force-pushed from 9d14041 to 76b05fc:

- Small optimization for llama_byte_to_token SPM mode
- Update conversion scripts accordingly

Co-authored-by: Georgi Gerganov <[email protected]>
Is this actually approved, or am I missing something about the #3585 coordination issue?
This pull adds validation for special token ids that are out of range. Invalid ones will be ignored and a warning message is displayed when loading the model. This model can reproduce the issue: https://huggingface.co/Undi95/Mistral-11B-OmniMix-bf16-GGUF (see the issue for the HF version with metadata).
I also included a small optimization for `llama_byte_to_token` SPM mode: `snprintf` is kind of overkill there since converting a `uint8_t` to hex is so trivial. There's also slightly more friendly behavior if a BPE model is missing a newline token: assert rather than blindly dereferencing the tokenize result.
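(For illustration, a Python sketch of the `<0xXX>` byte-piece format involved here — this mirrors the SPM byte-fallback convention, not the llama.cpp implementation, which is C++:)

```python
# Sketch (not llama.cpp code): SPM represents raw bytes as pieces like "<0x0A>".
# The C++ change replaces an snprintf-style "<0x%02X>" formatting call with
# manual hex digits, since the conversion is only two table lookups.
HEX = "0123456789ABCDEF"

def byte_to_piece(byte: int) -> str:
    assert 0 <= byte <= 0xFF
    return "<0x" + HEX[byte >> 4] + HEX[byte & 0x0F] + ">"

print(byte_to_piece(ord("\n")))  # -> <0x0A>
```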
Of course, we shouldn't end up in a situation where the GGUF file has invalid token ids in the first place. I looked at `convert.py` and the GGUF Python side. The problem is that the special vocab handling doesn't have access to other information, like the model's vocab size, so it can't validate the token ids. I updated `SpecialVocab` to take an optional `n_vocab` argument so it can validate that the ids are in range (the in-repo conversion scripts were also updated to pass this), and refactored to make the logic a bit clearer as well.
Closes #3634