
convert.py: Mistral models converted from tokenizer.json display <0x0A> instead of newlines. #4622

Closed
TheBloke opened this issue Dec 24, 2023 · 6 comments · Fixed by #4641

@TheBloke
Contributor

This issue follows on from the discussions we had at the end of @strutive07 's PR which added support for tokenizer.json, here: #3633

Summary

Llama and Mistral models converted to GGUF from tokenizer.json have an issue with newlines: the literal token text <0x0A> is printed instead of \n. The issue does not exist when tokenizer.model is used for the same model.

This is a problem for some new fine-tunes which do not include tokenizer.model. Sometimes the omission is simply a mistake and the base model's tokenizer.model can be used. But in some cases the models have extended or changed the vocab in tokenizer.json, and a new SPM model would need to be created (something I've not yet been able to figure out how to do).

Steps to reproduce

  1. Download any Llama or Mistral 7B repo which contains tokenizer.model and tokenizer.json:
pip3 install --upgrade 'huggingface-hub>=0.18'     # if not installed
huggingface-cli download mistralai/Mistral-7B-v0.1 --local-dir test-mistral  --local-dir-use-symlinks False
  2. Run convert.py on it and verify that the output is as expected. Because tokenizer.model is present, it will be used in preference to tokenizer.json, and no issue will occur.
$ ls -al /workspace/test-mistral/tokenizer.model
-rw-rw-r-- 1 quant quant 482K Dec 24 17:14 /workspace/test-mistral/tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/with-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-mistral/with-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the
  3. Remove tokenizer.model to force tokenizer.json to be used, and re-run convert.py:
$ mv /workspace/test-mistral/tokenizer.model /workspace/test-mistral/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-mistral --outtype f16 --outfile /workspace/test-mistral/no-tokenizer.model.fp16.gguf
  4. Test inference and note that \n is now represented as <0x0A> in the output:
$ ./main -m /workspace/test-mistral/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

 A haiku example is 5-7-5 syllables.<0x0A><0x0A>The first line has five syllables, the second line has seven syllables and the
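
The literal <0x0A> pieces come straight from the vocab in tokenizer.json. A quick sketch to confirm (assuming the same local path as above and the usual layout with the vocab under model.vocab):

# Sketch: inspect the raw vocab stored in tokenizer.json (path from the repro above).
import json

with open("/workspace/test-mistral/tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]
print("\n" in vocab)        # False - there is no literal newline token
print(vocab.get("<0x0A>"))  # 13    - the newline is stored as a byte-fallback piece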

Testing the same text with Hugging Face transformers does not show the issue:

In [1]: import os
   ...: from transformers import AutoTokenizer
   ...: print ("tokenizer.model exists:", os.path.exists("/workspace/test-mistral/tokenizer.model"))
   ...: tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
   ...: encoded = tokenizer(""" A haiku example is 5-7-5 syllables.
   ...:
   ...: The first line has five syllables, the second line has seven syllables and the""")
   ...: print(f"Tokens: {encoded.input_ids}")
   ...: print(f"Decoded again: '{tokenizer.decode(encoded.input_ids)}'")
tokenizer.model exists: False
Tokens: [1, 28705, 330, 3631, 23550, 2757, 349, 28705, 28782, 28733, 28787, 28733, 28782, 5747, 584, 2561, 28723, 13, 13, 1014, 907, 1407, 659, 3359, 5747, 584, 2561, 28725, 272, 1676, 1407, 659, 6671, 5747, 584, 2561, 304, 272]
Decoded again: '<s>  A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the'

In [2]: tokenizer.decode(13)
Out[2]: '\n'
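
The contrast is visible inside transformers itself: the raw vocab piece for token id 13 is the byte-fallback string, and only the decode step turns it back into a newline. A small sketch (same local model directory as above):

# Sketch: raw piece for id 13 vs. what decode() produces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/test-mistral/")
print(tokenizer.convert_ids_to_tokens(13))  # '<0x0A>' - the raw byte-fallback piece
print(tokenizer.decode([13]))               # '\n'     - byte fallback applied during decoding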

Llama example

$ huggingface-cli download meta-llama/Llama-2-7b-chat-hf  --local-dir test-llama2  --local-dir-use-symlinks False

$ mv test-llama2/tokenizer.model test-llama2/dead.tokenizer.model

$ python3 ./convert.py /workspace/test-llama2 --outtype f16 --outfile /workspace/test-llama2/no-tokenizer.model.fp16.gguf

$ ./main -m /workspace/test-llama2/no-tokenizer.model.fp16.gguf -p "A haiku example is " -n 30 --temp 0

A haiku example is 5 syllables, 7 syllables, and 5 syllables.<0x0A>A haiku is a traditional form of Japanese poetry that

CC @ArthurZucker

@slaren
Collaborator

slaren commented Dec 24, 2023

The HF tokenizer treats all tokens in the format <0x..> as special byte tokens:

            let bytes = if token.len() == 6 && token.starts_with("<0x") && token.ends_with('>') {
                if let Ok(byte) = u8::from_str_radix(&token[3..5], 16) {
                    Some(byte)
                } else {
                    None
                }
            } else {
                None
            };

https://github.com/huggingface/tokenizers/blob/11462596d11d886e501091e28acbe1174385087a/tokenizers/src/decoders/byte_fallback.rs#L30-L38
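
The same check, translated to Python for reference (a standalone sketch, not code from tokenizers or convert.py):

# Sketch mirroring the byte_fallback decoder above: a token of the exact form
# "<0xNN>" is parsed back into its raw byte; anything else is left alone.
def byte_fallback(token):
    if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
        try:
            return bytes([int(token[3:5], 16)])
        except ValueError:
            return None
    return None

print(byte_fallback("<0x0A>"))  # b'\n'
print(byte_fallback("hello"))   # None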

@slaren
Collaborator

slaren commented Dec 24, 2023

This should fix it:

diff --git a/convert.py b/convert.py
index 7a3cd615..710f196b 100755
--- a/convert.py
+++ b/convert.py
@@ -394,10 +394,13 @@ class VocabLoader:
             if self.spm.is_byte(token_id):
                 toktype = gguf.TokenType.BYTE
         else:
+            token = self.reverse_vocab[token_id]
             if token_id == self.unk_token_id:
                 toktype = gguf.TokenType.UNKNOWN
-            if token_id in self.special_ids:
+            elif token_id in self.special_ids:
                 toktype = gguf.TokenType.CONTROL
+            elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+                toktype = gguf.TokenType.BYTE

         return toktype
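
With this change, pieces of the form <0xNN> coming from tokenizer.json should be written to the GGUF with token type BYTE (matching what the tokenizer.model path already does via spm.is_byte()), so they can be detokenized back to raw bytes instead of being printed literally.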

@strutive07
Contributor

strutive07 commented Dec 25, 2023

Generation

 A haiku example is 5-7-5 syllables.

The first line has five syllables, the second line has seven syllables and the

Vocab Type Check

from pathlib import Path

# Assumes this is run from the llama.cpp repo root so that convert.py is importable.
from convert import Params, VocabLoader, load_some_model

def is_same_vocab(v1, v2):
    v1_set = set()
    v2_set = set()

    for text, score, toktype in v1.all_tokens():
        v1_set.add((text, toktype))

    for text, score, toktype in v2.all_tokens():
        v2_set.add((text, toktype))

    return v1_set == v2_set

model_path = Path("/workspace/Mistral-7B-v0.1")

params = Params.load(load_some_model(model_path))
vocab_tokenizer_model = VocabLoader(params, model_path)

# remove tokenizer.model here

params = Params.load(load_some_model(model_path))
vocab_hf = VocabLoader(params, model_path)

is_same_vocab(vocab_hf, vocab_tokenizer_model)
>> True

Code Diff

diff --git a/convert.py b/convert.py
index 7a3cd61..a9ccb69 100755
--- a/convert.py
+++ b/convert.py
@@ -357,6 +357,7 @@ class VocabLoader:
             for tok in self.tokenizer.all_special_tokens
         }
         self.special_ids: set[int] = set(self.tokenizer.all_special_ids)
+        self.reverse_vocab = {id: encoded_tok for encoded_tok, id in self.tokenizer.get_vocab().items()}
         self.vocab_size_base: int = self.tokenizer.vocab_size
         self.vocab_size: int = self.vocab_size_base + len(self.added_tokens_dict)
         self.fname_tokenizer: Path = fname_tokenizer
@@ -371,14 +372,13 @@ class VocabLoader:
 
     def hf_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
         tokenizer = self.tokenizer
-        reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.get_vocab().items()}
         added_tokens_ids = set(self.added_tokens_dict.values())
 
         for i in range(self.vocab_size_base):
             if i in added_tokens_ids:
                 continue
 
-            text = reverse_vocab[i].encode("utf-8")
+            text = self.reverse_vocab[i].encode("utf-8")
             yield text, self.get_token_score(i), self.get_token_type(i)
 
     def get_token_type(self, token_id: int) -> gguf.TokenType:
@@ -394,10 +394,13 @@ class VocabLoader:
             if self.spm.is_byte(token_id):
                 toktype = gguf.TokenType.BYTE
         else:
+            token = self.reverse_vocab[token_id]
             if token_id == self.unk_token_id:
                 toktype = gguf.TokenType.UNKNOWN
-            if token_id in self.special_ids:
+            elif token_id in self.special_ids:
                 toktype = gguf.TokenType.CONTROL
+            elif len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
+                toktype = gguf.TokenType.BYTE
 
         return toktype
 

@slaren

I've checked that the code you posted works, and I've also verified that the vocab types are the same. Thank you.
Since you suggested the fix, would you be willing to open a PR for it? If reverse_vocab is moved to __init__ as in the code I posted above, it works fine. Could you please open a PR that includes that change as well?

@teleprint-me
Contributor

teleprint-me commented Dec 25, 2023

I'll look into it. There are a bunch of issues related to the vocab and the conversion script, e.g. #4493, #4360, etc.

Vocab mismatches are also occurring at a higher frequency since the latest merge of the updated convert.py:

llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).

Phi models aren't the only models affected by it. It shows up with Mistral, Mixtral, Llama-1, Llama-2, etc.

@slaren
Collaborator

slaren commented Dec 25, 2023

@strutive07 Please open a PR yourself; I am not familiar with the convert.py code.

@TheBloke
Contributor Author

Thanks very much for the diagnosis and fixes, @slaren and @strutive07 !
