Ignore tokens if their IDs in added token list are below the vocab's base size #3585
Conversation
MistralOrca has added tokens whose IDs are below 30,000, which causes some issues when converting models. Actually, these tokens are the same as in the base model, so they are unnecessary, but handling this kind of exception gracefully would be better.
Can you provide a link to the particular model? I'd like to check this out.
https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/blob/main/added_tokens.json
That's the one I'm testing #3538 on, specifically because it has a mess in its JSON files. If you simply delete tokens 0, 1, 2 from added_tokens.json, everything works just fine, and those tokens are still in the vocab and are recognised. It makes no sense for them to even be there. Not to mention they specify …
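For context, a minimal sketch of why the pre-patch check in convert.py fails on this file (the token IDs mirror the added_tokens.json linked above; the exact values are illustrative):

```python
# Roughly what Open-Orca/Mistral-7B-OpenOrca ships in added_tokens.json:
# three entries that duplicate base-vocabulary tokens, plus two new ones.
added_tokens = {"<unk>": 0, "<s>": 1, "</s>": 2,
                "<|im_end|>": 32000, "<|im_start|>": 32001}
vocab_size = 32000  # Mistral's base SentencePiece vocab size

# The pre-patch validation expects every added ID to extend the base vocab.
expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
actual_ids = sorted(added_tokens.values())

print(expected_ids)  # [32000, 32001, 32002, 32003, 32004]
print(actual_ids)    # [0, 1, 2, 32000, 32001]
# expected_ids != actual_ids, so the converter aborts with an Exception.
```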
According to the explanations in #3538, we'll have to ignore (or special-case) them in the future, I think.
Changed the code to ignore, not override. PTAL.
@goerch should I abandon this PR, or will you review it?
Can confirm it is working.
```text
Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00001-of-00002.bin
Loading model file Mistral-7B-OpenOrca\pytorch_model-00002-of-00002.bin
params = Params(n_vocab=32002, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, f_norm_eps=1e-05, f_rope_freq_base=10000.0, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=WindowsPath('Mistral-7B-OpenOrca'))
[...]
[291/291] Writing tensor output.weight | size 32002 x 4096 | type F16 | T+ 21
Wrote Mistral-7B-OpenOrca_f16.gguf
```
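For what it's worth, n_vocab=32002 in that log is consistent with the fix: the 32,000-token base vocabulary plus the two genuinely new tokens (<|im_start|> and <|im_end|>), with the three duplicated base tokens ignored.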
Thanks for this! I had to implement my own code to strip out tokens from …
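A workaround along those lines (a hypothetical pre-processing helper; the filename and base vocab size are assumptions, not part of this PR) might look like:

```python
import json
from pathlib import Path

def strip_duplicate_added_tokens(path: Path, base_vocab_size: int = 32000) -> None:
    """Drop added tokens whose IDs fall inside the base vocabulary (assumed helper)."""
    added = json.loads(path.read_text(encoding="utf-8"))
    kept = {piece: id for piece, id in added.items() if id >= base_vocab_size}
    path.write_text(json.dumps(kept, indent=2), encoding="utf-8")

strip_duplicate_added_tokens(Path("Mistral-7B-OpenOrca/added_tokens.json"))
```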
@seungduk-yanolja We should merge after addressing @cebtenzzre's review
This is what the diff looks like with the merge conflict resolved and @cebtenzzre's suggestions applied to c8d6a1f:

```diff
diff --git a/convert.py b/convert.py
index 0680f71..bfbfab2 100755
--- a/convert.py
+++ b/convert.py
@@ -366,16 +366,19 @@ class SentencePieceVocab:
         else:
             added_tokens = {}
 
         vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
-        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
-        actual_ids = sorted(added_tokens.values())
-        if expected_ids != actual_ids:
-            raise Exception(f"Expected added token IDs to be sequential and start at {vocab_size}; got {actual_ids}")
-        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
-        self.added_tokens_list = [text for (text, idx) in items]
-        self.vocab_size_base: int = vocab_size
-        self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_list)
-        self.fname_tokenizer = fname_tokenizer
+        new_tokens = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
+        expected_new_ids = list(range(vocab_size, vocab_size + len(new_tokens)))
+        actual_new_ids = sorted(new_tokens.keys())
+
+        if expected_new_ids != actual_new_ids:
+            raise ValueError(f"Expected new token IDs {expected_new_ids} to be sequential; got {actual_new_ids}")
+
+        # Token pieces that were added to the base vocabulary.
+        self.added_tokens_list = [new_tokens[id] for id in actual_new_ids]
+        self.vocab_size_base = vocab_size
+        self.vocab_size = self.vocab_size_base + len(self.added_tokens_list)
+        self.fname_tokenizer = fname_tokenizer
         self.fname_added_tokens = fname_added_tokens
 
     def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
```
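Standalone, the filtering logic from that diff behaves like this on the MistralOrca token list (a sketch using the same names as the diff; the token IDs come from the linked added_tokens.json):

```python
added_tokens = {"<unk>": 0, "<s>": 1, "</s>": 2,
                "<|im_end|>": 32000, "<|im_start|>": 32001}
vocab_size = 32000

# Keep only tokens that actually extend the base vocabulary.
new_tokens = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
expected_new_ids = list(range(vocab_size, vocab_size + len(new_tokens)))
actual_new_ids = sorted(new_tokens.keys())

if expected_new_ids != actual_new_ids:
    raise ValueError(f"Expected new token IDs {expected_new_ids} to be sequential; got {actual_new_ids}")

added_tokens_list = [new_tokens[id] for id in actual_new_ids]
print(added_tokens_list)  # ['<|im_end|>', '<|im_start|>'] -- IDs 0, 1, 2 are ignored
```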
I changed it to just use … Seems like the original author of this pull hasn't been active for a bit.

edit: Actually there's no compelling reason to change …
Made a new PR with the proposed patch: #3831