convert.py: Mistral models converted from tokenizer.json display <0x0A> instead of newlines. #4622
Comments
The HF tokenizer treats all tokens of the form <0xNN> as byte tokens:

let bytes = if token.len() == 6 && token.starts_with("<0x") && token.ends_with('>') {
    if let Ok(byte) = u8::from_str_radix(&token[3..5], 16) {
        Some(byte)
    } else {
        None
    }
} else {
    None
};
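As a rough illustration, the same check can be written in Python (a hypothetical helper, not part of convert.py), which also shows why <0x0A> corresponds to a newline byte:

def parse_byte_token(token: str) -> int | None:
    # Mirrors the HF tokenizers rule quoted above: tokens shaped exactly like
    # "<0xNN>" (6 characters) are raw byte tokens.
    if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
        try:
            return int(token[3:5], 16)
        except ValueError:
            return None
    return None

print(parse_byte_token("<0x0A>"))   # 10 -> the byte value of "\n"
print(parse_byte_token("hello"))    # None -> ordinary token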
This should fix it:
diff --git a/convert.py b/convert.py
index 7a3cd615..710f196b 100755
--- a/convert.py
+++ b/convert.py
@@ -394,10 +394,13 @@ class VocabLoader:
if self.spm.is_byte(token_id):
toktype = gguf.TokenType.BYTE
else:
+ token = self.reverse_vocab[token_id]
if token_id == self.unk_token_id:
toktype = gguf.TokenType.UNKNOWN
- if token_id in self.special_ids:
+ elif token_id in self.special_ids:
toktype = gguf.TokenType.CONTROL
+ elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+ toktype = gguf.TokenType.BYTE
return toktype
Generation
Vocab Type Check
from pathlib import Path
from convert import Params, VocabLoader, load_some_model  # assumes this runs from the llama.cpp repo root
def is_same_vocab(v1, v2):
    v1_set = set()
    v2_set = set()
    for text, score, toktype in v1.all_tokens():
        v1_set.add((text, toktype))
    for text, score, toktype in v2.all_tokens():
        v2_set.add((text, toktype))
    return v1_set == v2_set
model_path = Path("/workspace/Mistral-7B-v0.1")
params = Params.load(load_some_model(model_path))
vocab_tokenizer_model = VocabLoader(params, model_path)
# remove tokenizer.model here
params = Params.load(load_some_model(model_path))
vocab_hf = VocabLoader(params, model_path)
is_same_vocab(vocab_hf, vocab_tokenizer_model)
>> True
Code Diff
diff --git a/convert.py b/convert.py
index 7a3cd61..a9ccb69 100755
--- a/convert.py
+++ b/convert.py
@@ -357,6 +357,7 @@ class VocabLoader:
for tok in self.tokenizer.all_special_tokens
}
self.special_ids: set[int] = set(self.tokenizer.all_special_ids)
+ self.reverse_vocab = {id: encoded_tok for encoded_tok, id in self.tokenizer.get_vocab().items()}
self.vocab_size_base: int = self.tokenizer.vocab_size
self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_dict)
self.fname_tokenizer: Path = fname_tokenizer
@@ -371,14 +372,13 @@ class VocabLoader:
def hf_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
tokenizer = self.tokenizer
- reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.get_vocab().items()}
added_tokens_ids = set(self.added_tokens_dict.values())
for i in range(self.vocab_size_base):
if i in added_tokens_ids:
continue
- text = reverse_vocab[i].encode("utf-8")
+ text = self.reverse_vocab[i].encode("utf-8")
yield text, self.get_token_score(i), self.get_token_type(i)
def get_token_type(self, token_id: int) -> gguf.TokenType:
@@ -394,10 +394,13 @@ class VocabLoader:
if self.spm.is_byte(token_id):
toktype = gguf.TokenType.BYTE
else:
+ token = self.reverse_vocab[token_id]
if token_id == self.unk_token_id:
toktype = gguf.TokenType.UNKNOWN
- if token_id in self.special_ids:
+ elif token_id in self.special_ids:
toktype = gguf.TokenType.CONTROL
+ elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+ toktype = gguf.TokenType.BYTE
return toktype
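For context, the reverse_vocab added in __init__ is simply an id-to-token map built from the HF tokenizer. A quick sketch of what it contains (hypothetical local path; in Llama/Mistral vocabs the <0x0A> byte token typically sits at id 13):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/Mistral-7B-v0.1")  # hypothetical path
reverse_vocab = {tok_id: tok for tok, tok_id in tokenizer.get_vocab().items()}

# Byte tokens appear with their literal "<0xNN>" spelling, which is why the
# patched get_token_type() can recognize them by string shape alone.
print(reverse_vocab[13])    # typically "<0x0A>" for Llama/Mistral vocabs
print(len(reverse_vocab))   # full vocab size, e.g. 32000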
I've checked that the code you posted works and I've also checked the vocab type and it's the same. Thank you.
I'll look into it. There are a bunch of issues related to the vocab and conversion script, e.g. issues #4493, #4360, etc. There are also vocab mismatches occurring at a higher frequency after the latest merge with the updated convert.py.
Phi models aren't the only ones affected; it shows up with Mistral, Mixtral, Llama-1, Llama-2, etc.
@strutive07 Please open a PR yourself; I am not familiar with the convert.py code.
Thanks very much for the diagnosis and fixes, @slaren and @strutive07!
This issue follows on from the discussions we had at the end of @strutive07's PR which added support for tokenizer.json, here: #3633
Summary
Llama and Mistral models GGUF-converted from tokenizer.json experience an issue with newlines, printing <0x0A> instead of \n. The issue does not exist when tokenizer.model is used for the same model.
This represents an issue for some new fine-tunes which do not include tokenizer.model. Sometimes this is simply a mistake, and the base model file can be used. But in some cases the models have extended or changed the vocab in tokenizer.json, and a new SPM model would need to be created. (Something that I've not yet been able to figure out how to do.)
Steps to reproduce
1. Take a model that includes both tokenizer.model and tokenizer.json.
2. Run convert.py on it, and verify that the output is as expected. Because tokenizer.model is present, it will be used in preference to tokenizer.json, and no issue will exist.
3. Remove tokenizer.model to force tokenizer.json to be used, and re-run convert.py.
4. Note that \n is now represented as <0x0A> in the output.
Testing the same using Hugging Face transformers does not show an issue:
Llama example
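A minimal sketch of such a round-trip check with Hugging Face transformers, assuming a local Llama/Mistral checkpoint path:

from transformers import AutoTokenizer

# Hypothetical local path; any Llama/Mistral checkpoint with tokenizer.json works.
tokenizer = AutoTokenizer.from_pretrained("/workspace/Mistral-7B-v0.1")

ids = tokenizer.encode("Hello\nWorld", add_special_tokens=False)
decoded = tokenizer.decode(ids)
print(repr(decoded))  # contains a real "\n", not the literal text "<0x0A>"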
CC @ArthurZucker