
Tokenizer fixes #8379

Draft · jaime-m-p wants to merge 9 commits into master
Conversation

@jaime-m-p (Collaborator) commented Jul 8, 2024

More tokenizer fixes.



Examples of vocab differences:

INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR  detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR  detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR  detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
ERROR  detokenize=False id=32003 expected='<extra_id_96>' result='[PAD32003]'
ERROR  detokenize=False id=32004 expected='<extra_id_95>' result='[PAD32004]'
ERROR  detokenize=False id=32005 expected='<extra_id_94>' result='[PAD32005]'
ERROR  detokenize=False id=32006 expected='<extra_id_93>' result='[PAD32006]'
ERROR  detokenize=False id=32007 expected='<extra_id_92>' result='[PAD32007]'
ERROR  detokenize=False id=32008 expected='<extra_id_91>' result='[PAD32008]'
ERROR  detokenize=False id=32009 expected='<extra_id_90>' result='[PAD32009]'
INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR  detokenize=True id=100002 expected='�' result='ø'
ERROR  detokenize=True id=100003 expected='�' result='ö'
ERROR  detokenize=True id=100004 expected='�' result='ú'
ERROR  detokenize=True id=100005 expected='�' result='ÿ'
ERROR  detokenize=True id=100006 expected='�' result='õ'
ERROR  detokenize=True id=100007 expected='�' result='÷'
ERROR  detokenize=True id=100008 expected='�' result='û'
ERROR  detokenize=True id=100009 expected='�' result='ý'
ERROR  detokenize=True id=100010 expected='�' result='À'
ERROR  detokenize=True id=100011 expected='�' result='ù'
INFO VOCABFILE: './models/ggml-vocab-command-r.gguf'
ERROR  detokenize=True id=264 expected='\u200d' result='[UNK_BYTE_0xe2808d\u200d]'
ERROR  detokenize=True id=265 expected='‼' result='[UNK_BYTE_0xe280bc‼]'
ERROR  detokenize=True id=266 expected='⁉' result='[UNK_BYTE_0xe28189⁉]'
ERROR  detokenize=True id=267 expected='⃣' result='[UNK_BYTE_0xe283a3⃣]'
ERROR  detokenize=True id=268 expected='™' result='[UNK_BYTE_0xe284a2™]'
ERROR  detokenize=True id=269 expected='ℹ' result='[UNK_BYTE_0xe284b9ℹ]'
ERROR  detokenize=True id=270 expected='↔' result='[UNK_BYTE_0xe28694↔]'
ERROR  detokenize=True id=271 expected='↕' result='[UNK_BYTE_0xe28695↕]'
ERROR  detokenize=True id=272 expected='↖' result='[UNK_BYTE_0xe28696↖]'
ERROR  detokenize=True id=273 expected='↗' result='[UNK_BYTE_0xe28697↗]'
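
For context, a minimal sketch of the kind of per-id comparison that produces logs like these, assuming a HF reference tokenizer and some llama.cpp detokenize binding (compare_vocabs and llamacpp_detokenize are hypothetical names, not the harness's actual API):

# Decode every token id with the reference (HF) tokenizer ("expected") and
# with a llama.cpp binding ("result"), and report any mismatches.
from transformers import AutoTokenizer

def compare_vocabs(hf_dir: str, llamacpp_detokenize):
    tokenizer = AutoTokenizer.from_pretrained(hf_dir)
    max_token_id = max(tokenizer.get_vocab().values())
    for token_id in range(max_token_id + 1):
        expected = tokenizer.decode([token_id], skip_special_tokens=False)
        result = llamacpp_detokenize([token_id])
        if expected != result:
            print(f"ERROR  detokenize=True id={token_id} "
                  f"expected={expected!r} result={result!r}")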

jaime-m-p added 5 commits July 9, 2024 00:55
Some models ('jais' and 'command-r') copy the original UTF-8 bytes on error.
Others ('deepseek') seem to use the replacement character 0xFFFD.
@jaime-m-p jaime-m-p marked this pull request as draft July 8, 2024 23:34
@github-actions bot added the testing and python labels Jul 8, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
Fix pyparse problems: gcc inline functions

Test l/r-strip for more than 4 spaces

Improve mismatch range localization

Compare vocabs

Options to manage token text decoding errors:

Some models ('jais' and 'command-r') copy the original UTF-8 bytes on error.
Others ('deepseek') seem to use the replacement character 0xFFFD.
max_token_id = max(self.model.get_vocab().values())
if detokenize:
    ids = list(range(max_token_id + 1))
    vocab = self.model.batch_decode(ids, skip_special_tokens=False)
@compilade (Collaborator) commented Jul 9, 2024

Do you think this should be used in the convert script(s) instead of directly getting the strings from tokenizer.vocab?

EDIT: this might be a bad idea, since the tokenizer merges won't directly match with the strings from the vocab if that's done
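
To illustrate the concern in the EDIT above: BPE merges are stored in the same byte-level encoding as the raw vocab strings, so replacing vocab entries with batch_decode output would orphan the merge rules. A minimal sketch with hypothetical values:

# A byte-level BPE vocab entry vs. its decoded text form.
raw_vocab_entry = "\u0120the"    # "Ġthe", as stored in vocab.json and merges.txt
decoded_entry = " the"           # what batch_decode would return instead
merge_rule = ("\u0120t", "he")   # merges refer to the raw byte-level form
assert "".join(merge_rule) == raw_vocab_entry   # matches the raw vocab
assert "".join(merge_rule) != decoded_entry     # but not the decoded strings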

@@ -36,7 +36,7 @@ def __init__(self, path_llama_h: str = None, path_includes: list[str] = [], path
         self.lib.llama_backend_init()

     def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str):
-        cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
+        cmd = ["gcc", "-O0", "-fno-inline", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
A collaborator commented:

I think -fno-inline is redundant with -O0. And -O0 alone works, while -fno-inline alone doesn't.

Anyway, I suggest resolving the conflict with master.
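
For reference, a self-contained sketch of the preprocessing trick this diff is about, using -O0 alone as suggested above; load_libllama and its signature are illustrative, not the harness's actual API, and gcc is assumed to be on PATH:

import subprocess
import cffi

def load_libllama(path_llama_h: str, path_includes: list[str], path_libllama: str):
    # Preprocess the header so cffi's C parser only sees plain declarations:
    # -E -P expands macros and strips line markers, the -D flags erase
    # GCC-specific keywords, and -O0 avoids inline function bodies.
    cmd = ["gcc", "-O0", "-E", "-P",
           "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
    cmd += [f"-I{path}" for path in path_includes]
    cmd += [path_llama_h]
    header = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    ffi = cffi.FFI()
    ffi.cdef(header)                 # declare the preprocessed API to cffi
    return ffi, ffi.dlopen(path_libllama)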

@compilade (Collaborator) commented Jul 9, 2024

INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR  detokenize=True id=100002 expected='�' result='ø'
ERROR  detokenize=True id=100003 expected='�' result='ö'
ERROR  detokenize=True id=100004 expected='�' result='ú'
ERROR  detokenize=True id=100005 expected='�' result='ÿ'
ERROR  detokenize=True id=100006 expected='�' result='õ'
ERROR  detokenize=True id=100007 expected='�' result='÷'
ERROR  detokenize=True id=100008 expected='�' result='û'
ERROR  detokenize=True id=100009 expected='�' result='ý'
ERROR  detokenize=True id=100010 expected='�' result='À'
ERROR  detokenize=True id=100011 expected='�' result='ù'

These are part of the added_tokens of deepseek-llm, and they are exactly as in result. Not sure where expected takes its tokens from, but it is not correct if it doesn't take the added_tokens into account.
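
A hedged sketch of the fix this implies when building the reference vocab, assuming the transformers API; reference_vocab is a hypothetical helper, and the explicit overlay makes the added_tokens override visible:

from transformers import AutoTokenizer

def reference_vocab(hf_dir: str) -> dict[int, str]:
    tokenizer = AutoTokenizer.from_pretrained(hf_dir)
    max_token_id = max(tokenizer.get_vocab().values())
    vocab = {i: tokenizer.convert_ids_to_tokens(i) for i in range(max_token_id + 1)}
    # added_tokens_decoder maps id -> AddedToken for tokens layered on top of
    # the base vocab (e.g. deepseek's ids 100002..100011); let them win.
    for token_id, added_token in tokenizer.added_tokens_decoder.items():
        vocab[token_id] = added_token.content
    return vocab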

INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR  detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR  detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR  detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
...

These are also part of the added tokens (of t5), but in this case it's llama.cpp that is wrong. This does seem useful for debugging the convert script(s)!
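
For illustration, the '[PAD{id}]' strings in the t5 log look like the placeholder convention used to fill vocab ids that have no known string; a minimal sketch (pad_vocab is a hypothetical helper, not the convert script's actual code) of that convention and how it can clobber real added tokens:

def pad_vocab(strings_by_id: dict[int, str], vocab_size: int) -> list[str]:
    # Correct for truly missing ids; wrong if ids like 32000 ('<extra_id_99>')
    # were never merged into strings_by_id from the added tokens.
    return [strings_by_id.get(i, f"[PAD{i}]") for i in range(vocab_size)]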

compilade added a commit that referenced this pull request Jul 13, 2024
* test-tokenizer-random : add a failing edge case for falcon
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 13, 2024
test-tokenizer-random : reduce potential conflicts with ggerganov#8379
@mofosyne added the Review Complexity : Low label Jul 13, 2024
compilade added a commit that referenced this pull request Jul 14, 2024
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with #8379

* test-tokenizer-random : add a failing edge case for falcon
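
Regarding the Gemma 16-space note in the commit message above, a hypothetical repro sketch, assuming local access to the Gemma tokenizer files (the model id and model_file path are placeholders):

from transformers import AutoTokenizer
import sentencepiece as spm

hf = AutoTokenizer.from_pretrained("google/gemma-2b")            # assumed model id
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")    # assumed local path

text = " " * 31
print(hf.tokenize(text))              # HF reportedly prefers the 16-space token
print(sp.encode(text, out_type=str))  # raw SentencePiece picks lengthier space tokens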
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 25, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 27, 2024
Labels: python, testing, Review Complexity : Low

3 participants