
Ignore tokens if their IDs in added token list are below the vocab's base size #3585

Closed
seungduk-yanolja wants to merge 5 commits

Conversation

@seungduk-yanolja

@seungduk-yanolja commented Oct 11, 2023

MistralOrca has added tokens whose IDs are below 30,000, which causes some issues when converting models.
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json
These tokens are actually the same as in the base model, so they are unnecessary, but it would be better to handle this kind of exception.
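
For context, the linked added_tokens.json mixes IDs that already exist in the base vocabulary with genuinely new ones. A rough sketch of the shape of the file, reconstructed from the discussion below rather than copied verbatim (the 32,000-entry base vocab size is inferred from the conversion log later in this thread):

# Hypothetical reconstruction of added_tokens.json; IDs 0-2 already exist in the
# base SentencePiece vocab, so only the last two entries are genuinely new.
added_tokens = {
    "<unk>": 0,
    "<s>": 1,
    "</s>": 2,
    "<|im_end|>": 32000,
    "<|im_start|>": 32001,
}

vocab_size = 32000  # base SentencePiece vocab size (assumed)

# convert.py expects added-token IDs to be sequential starting at vocab_size,
# so the low-ID entries above trip its sanity check.
offending = {piece: idx for piece, idx in added_tokens.items() if idx < vocab_size}
print(offending)  # {'<unk>': 0, '<s>': 1, '</s>': 2}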

@goerch
Collaborator

goerch commented Oct 11, 2023

This probably isn't the only problem; @staviq is working on similar issues. How do you validate the tokenizer of this model?

@staviq
Contributor

staviq commented Oct 11, 2023

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.

Can you provide a link to the particular model? I'd like to check it out.

@seungduk-yanolja
Author

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.
>
> Can you provide a link to the particular model? I'd like to check it out.

https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json

@staviq
Contributor

staviq commented Oct 11, 2023

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.
>
> Can you provide a link to the particular model? I'd like to check it out.
>
> https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json

That's the one I'm testing #3538 on, specifically because it has a mess in its JSON files.

If you simply delete tokens 0,1,2 from added_tokens.json, everything works just fine, and those tokens are still in the vocab and are recognised. It makes no sense for them to even be there.

Not to mention they specify <s>, </s>, <|im_start|>, and <|im_end|>, but then say BOS is <s> while EOS is <|im_end|>, while still using <|im_start|>.
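
A quick way to confirm that those low-ID pieces really are already in the base vocabulary is a check along these lines (a sketch assuming the sentencepiece package and the model's tokenizer.model file; not part of this PR):

import sentencepiece as spm

# Load the base SentencePiece model that ships with the checkpoint.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.vocab_size())  # expected: 32000 for the Mistral base tokenizer
# IDs 0, 1 and 2 should already map to the pieces that added_tokens.json re-declares.
print(sp.id_to_piece(0), sp.id_to_piece(1), sp.id_to_piece(2))  # '<unk>' '<s>' '</s>'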

@goerch
Collaborator

goerch commented Oct 11, 2023

> If you simply delete tokens 0,1,2 from added_tokens.json

According to the explanations in #3538 we'll have to ignore (or special-case) them in the future, I think.

@seungduk-yanolja
Author

> If you simply delete tokens 0,1,2 from added_tokens.json
>
> According to the explanations in #3538 we'll have to ignore (or special-case) them in the future, I think.

Changed the code to ignore them rather than override. PTAL.

@seungduk-yanolja changed the title from "Override tokens if their IDs in added token list are below the vocab's base size" to "Ignore tokens if their IDs in added token list are below the vocab's base size" on Oct 12, 2023
@seungduk-yanolja
Author

@goerch should I abandon this PR, or will you review it?


@v-kamelowy left a comment


Can confirm it is working.

Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32002, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=WindowsPath('Mistral-7B-OpenOrca'))
[...]
[291/291] Writing tensor output.weight | size 32002 x 4096 | type F16 | T+ 21
Wrote Mistral-7B-OpenOrca_f16.gguf

@TheBloke
Contributor

TheBloke commented Oct 15, 2023

Thanks for this! I had to implement my own code to strip out tokens from added_tokens.json where their ID was less than vocab_size. It'd be great if this was handled in convert.py instead.
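
For reference, a minimal sketch of that kind of workaround, i.e. a standalone script run before convert.py that drops the low-ID entries (the file name and vocab size are assumptions, and this is not TheBloke's actual code):

import json

VOCAB_SIZE = 32000  # base SentencePiece vocab size of the model being converted

with open("added_tokens.json", "r", encoding="utf-8") as f:
    added = json.load(f)

# Keep only the tokens that genuinely extend the base vocabulary.
kept = {piece: idx for piece, idx in added.items() if idx >= VOCAB_SIZE}

with open("added_tokens.json", "w", encoding="utf-8") as f:
    json.dump(kept, f, indent=2, ensure_ascii=False)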

Owner

@ggerganov left a comment


@seungduk-yanolja We should merge after addressing @cebtenzzre's review

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 27, 2023

This is what the diff looks like with the merge conflict resolved and @cebtenzzre's suggestions applied to c8d6a1f:

diff --git a/convert.py b/convert.py
index 0680f71..bfbfab2 100755
--- a/convert.py
+++ b/convert.py
@@ -366,16 +366,19 @@ class SentencePieceVocab:
             added_tokens = {}
 
         vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
-        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
-        actual_ids   = sorted(added_tokens.values())
-        if expected_ids != actual_ids:
-            raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
 
-        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
-        self.added_tokens_list = [text for (text, idx) in items]
-        self.vocab_size_base: int = vocab_size
-        self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_list)
-        self.fname_tokenizer = fname_tokenizer
+        new_tokens       = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
+        expected_new_ids = list(range(vocab_size, vocab_size + len(new_tokens)))
+        actual_new_ids   = sorted(new_tokens.keys())
+
+        if expected_new_ids != actual_new_ids:
+            raise ValueError(f"Expected new token IDs {expected_new_ids} to be sequential; got {actual_new_ids}")
+
+        # Token pieces that were added to the base vocabulary.
+        self.added_tokens_list  = [new_tokens[id] for id in actual_new_ids]
+        self.vocab_size_base    = vocab_size
+        self.vocab_size         = self.vocab_size_base + len(self.added_tokens_list)
+        self.fname_tokenizer    = fname_tokenizer
         self.fname_added_tokens = fname_added_tokens
 
     def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:

I changed it to keep self.added_tokens_list instead of renaming it to self.new_tokens_list, because otherwise the __repr__ implementation is broken and __init__ higher up already creates the added_tokens_list field. There's no particular reason to change the name here.

Seems like the original author of this pull hasn't been active for a bit.

edit: Actually, there's no compelling reason to change i to id in sentencepiece_tokens either. Is there a way we can go ahead and get these changes merged, since this was approved quite some time ago?

@ggerganov
Owner

Made a new PR with the proposed patch: #3831

@ggerganov closed this Oct 28, 2023