
Ignore tokens if their IDs in added token list are below the vocab's base size #3585

Closed
seungduk-yanolja wants to merge 5 commits

Conversation

@seungduk-yanolja

@seungduk-yanolja commented Oct 11, 2023

MistralOrca has added tokens whose IDs are below 30,000, which causes some issues when converting models.
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json
These tokens are actually the same as in the base model, so they are unnecessary, but it would be better to handle this kind of exception.
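
For context, the linked added_tokens.json mixes IDs that already exist in the base vocabulary with genuinely new ones. A rough sketch of the shape of the file, reconstructed from the discussion below rather than copied verbatim (the 32,000-entry base vocab size is inferred from the conversion log later in this thread):

# Hypothetical reconstruction of added_tokens.json; IDs 0-2 already exist in the
# base SentencePiece vocab, so only the last two entries are genuinely new.
added_tokens = {
    "<unk>": 0,
    "<s>": 1,
    "</s>": 2,
    "<|im_end|>": 32000,
    "<|im_start|>": 32001,
}

vocab_size = 32000  # base SentencePiece vocab size (assumed)

# convert.py expects added-token IDs to be sequential starting at vocab_size,
# so the low-ID entries above trip its sanity check.
offending = {piece: idx for piece, idx in added_tokens.items() if idx < vocab_size}
print(offending)  # {'<unk>': 0, '<s>': 1, '</s>': 2}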

@goerch
Collaborator

goerch commented Oct 11, 2023

This probably isn't the only problem; @staviq is working on similar issues. How do you validate the tokenizer of this model?

@staviq
Contributor

staviq commented Oct 11, 2023

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.

Can you provide a link to the particular model? I'd like to check it out.

@seungduk-yanolja
Author

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.
>
> Can you provide a link to the particular model? I'd like to check it out.

https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json

@staviq
Contributor

staviq commented Oct 11, 2023

> MistralOrca has added tokens whose IDs are below 30,000. It causes some issues when converting models.
>
> Can you provide a link to the particular model? I'd like to check it out.
>
> https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json

That's the one I'm testing #3538 on, specifically because it has a mess in its JSON files.

If you simply delete tokens 0,1,2 from added_tokens.json, everything works just fine, and those tokens are still in the vocab and are recognised. It makes no sense for them to even be there.

Not to mention they specify <s>, </s>, <|im_start|>, and <|im_end|>, but then say BOS is <s> while EOS is <|im_end|>, while still using <|im_start|>.
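
A quick way to confirm that those low-ID pieces really are already in the base vocabulary is a check along these lines (a sketch assuming the sentencepiece package and the model's tokenizer.model file; not part of this PR):

import sentencepiece as spm

# Load the base SentencePiece model that ships with the checkpoint.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

print(sp.vocab_size())  # expected: 32000 for the Mistral base tokenizer
# IDs 0, 1 and 2 should already map to the pieces that added_tokens.json re-declares.
print(sp.id_to_piece(0), sp.id_to_piece(1), sp.id_to_piece(2))  # '<unk>' '<s>' '</s>'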

@goerch
Collaborator

goerch commented Oct 11, 2023

> If you simply delete tokens 0,1,2 from added_tokens.json

According to the explanations in #3538 we'll have to ignore (or special-case) them in the future, I think.

@seungduk-yanolja
Author

> If you simply delete tokens 0,1,2 from added_tokens.json
>
> According to the explanations in #3538 we'll have to ignore (or special-case) them in the future, I think.

Changed the code to ignore them rather than override. PTAL.

@seungduk-yanolja changed the title from "Override tokens if their IDs in added token list are below the vocab's base size" to "Ignore tokens if their IDs in added token list are below the vocab's base size" on Oct 12, 2023
@seungduk-yanolja
Author

@goerch should I abandon this PR, or will you review it?


@v-kamelowy left a comment


Can confirm it is working.

Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32002, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=WindowsPath('Mistral-7B-OpenOrca'))
[...]
[291/291] Writing tensor output.weight | size 32002 x 4096 | type F16 | T+ 21
Wrote Mistral-7B-OpenOrca_f16.gguf

@TheBloke
Contributor

TheBloke commented Oct 15, 2023

Thanks for this! I had to implement my own code to strip out tokens from added_tokens.json where their ID was less than vocab_size. It'd be great if this was handled in convert.py instead.
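
For reference, a minimal sketch of that kind of workaround, i.e. a standalone script run before convert.py that drops the low-ID entries (the file name and vocab size are assumptions, and this is not TheBloke's actual code):

import json

VOCAB_SIZE = 32000  # base SentencePiece vocab size of the model being converted

with open("added_tokens.json", "r", encoding="utf-8") as f:
    added = json.load(f)

# Keep only the tokens that genuinely extend the base vocabulary.
kept = {piece: idx for piece, idx in added.items() if idx >= VOCAB_SIZE}

with open("added_tokens.json", "w", encoding="utf-8") as f:
    json.dump(kept, f, indent=2, ensure_ascii=False)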

Owner

@ggerganov left a comment


@seungduk-yanolja We should merge after addressing @cebtenzzre's review

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 27, 2023

This is what the diff looks like with the merge conflict resolved and @cebtenzzre's suggestions applied to c8d6a1f:

diff --git a/convert.py b/convert.py
index 0680f71..bfbfab2 100755
--- a/convert.py
+++ b/convert.py
@@ -366,16 +366,19 @@ class SentencePieceVocab:
             added_tokens = {}
 
         vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
-        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
-        actual_ids   = sorted(added_tokens.values())
-        if expected_ids != actual_ids:
-            raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
 
-        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
-        self.added_tokens_list = [text for (text, idx) in items]
-        self.vocab_size_base: int = vocab_size
-        self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_list)
-        self.fname_tokenizer = fname_tokenizer
+        new_tokens       = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
+        expected_new_ids = list(range(vocab_size, vocab_size + len(new_tokens)))
+        actual_new_ids   = sorted(new_tokens.keys())
+
+        if expected_new_ids != actual_new_ids:
+            raise ValueError(f"Expected new token IDs {expected_new_ids} to be sequential; got {actual_new_ids}")
+
+        # Token pieces that were added to the base vocabulary.
+        self.added_tokens_list  = [new_tokens[id] for id in actual_new_ids]
+        self.vocab_size_base    = vocab_size
+        self.vocab_size         = self.vocab_size_base + len(self.added_tokens_list)
+        self.fname_tokenizer    = fname_tokenizer
         self.fname_added_tokens = fname_added_tokens
 
     def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:

I changed it to keep self.added_tokens_list instead of renaming it to self.new_tokens_list, because otherwise the __repr__ implementation is broken and __init__ higher up already creates the added_tokens_list field. There's no particular reason to change the name here.

Seems like the original author of this pull hasn't been active for a bit.

edit: Actually, there's no compelling reason to change i to id in sentencepiece_tokens either. Is there a way we can go ahead and get these changes merged, since this was approved quite some time ago?

@ggerganov
Owner

Made a new PR with the proposed patch: #3831

@ggerganov closed this Oct 28, 2023