Tokenizer fixes #8379
base: master
Conversation
Some models ('jais' and 'command-r') copy the original UTF-8 bytes on error. Others ('deepseek') seem to use the replacement character 0xFFFD.
* Fix pyparse problems: gcc inline functions
* Test l/r-strip for more than 4 spaces
* Improve mismatch range localization
* Compare vocabs
* Options to manage token text decoding errors: some models ('jais' and 'command-r') copy the original UTF-8 bytes on error, others ('deepseek') seem to use the replacement character 0xFFFD (see the sketch below).
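A minimal sketch of the two error-handling behaviours described above, assuming they can be modelled with Python's codec error handlers; the function names are hypothetical and do not exist in llama.cpp, they only illustrate the difference between keeping the original bytes and substituting U+FFFD.

```python
# Hypothetical sketch only; neither function is part of llama.cpp.

def decode_copy_original(token_bytes: bytes) -> str:
    # 'jais' / 'command-r' style: invalid UTF-8 is preserved rather than lost
    # (surrogateescape round-trips the raw bytes through str and back).
    return token_bytes.decode("utf-8", errors="surrogateescape")

def decode_replace(token_bytes: bytes) -> str:
    # 'deepseek' style: invalid sequences become U+FFFD (the replacement character).
    return token_bytes.decode("utf-8", errors="replace")

bad = b"\xe2\x28\xa1"          # not valid UTF-8
print(decode_replace(bad))     # contains '\ufffd'
print(decode_copy_original(bad).encode("utf-8", errors="surrogateescape") == bad)  # True
```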
max_token_id = max(self.model.get_vocab().values())
if detokenize:
    ids = list(range(max_token_id + 1))
    vocab = self.model.batch_decode(ids, skip_special_tokens=False)
Do you think this should be used in the convert script(s) instead of directly getting the strings from tokenizer.vocab?
EDIT: this might be a bad idea, since the tokenizer merges won't directly match the strings from the vocab if that's done.
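As a rough illustration of why detokenized strings and raw vocab entries can differ (and why the merges would then not match), here is a hedged sketch using a Hugging Face tokenizer; the model name is only an example, and the comparison loop is not part of this PR.

```python
# Sketch only: compare raw vocab entries against per-token detokenized text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")  # example model

max_token_id = max(tok.get_vocab().values())
ids = list(range(max_token_id + 1))
detok = tok.batch_decode(ids, skip_special_tokens=False)   # mirrors the diff above
raw = {i: s for s, i in tok.get_vocab().items()}

for i in ids:
    if raw.get(i) != detok[i]:
        # e.g. a byte-level BPE entry like 'Ġthe' detokenizes to ' the',
        # so the strings used in the merges no longer line up with the vocab.
        print(i, repr(raw.get(i)), repr(detok[i]))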
tests/test-tokenizer-random.py (outdated)
@@ -36,7 +36,7 @@ def __init__(self, path_llama_h: str = None, path_includes: list[str] = [], path
         self.lib.llama_backend_init()

     def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str):
-        cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
+        cmd = ["gcc", "-O0", "-fno-inline", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
I think -fno-inline is redundant with -O0. And -O0 alone works, while -fno-inline alone doesn't.
Anyway, I suggest resolving the conflict with master.
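For context, a hedged sketch of how a preprocessing command like the one in the diff might be used to feed the header to cffi. The paths are placeholders, and the stated effect of -O0 (keeping glibc's __OPTIMIZE__-guarded inline definitions out of the preprocessed output so cffi's C parser can handle it) is an assumption drawn from the comment above, not something confirmed in the PR.

```python
# Sketch under stated assumptions: preprocess llama.h with gcc, hand the result to cffi.
import subprocess
import cffi

def load_libllama(path_llama_h: str, path_libllama: str):
    cmd = [
        "gcc", "-O0", "-E", "-P",     # -O0: assumed to suppress optimize-guarded inline defs
        "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)=",
        path_llama_h,
    ]
    header = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    ffi = cffi.FFI()
    ffi.cdef(header)                  # the real script may need extra cleanup of the header
    return ffi, ffi.dlopen(path_libllama)

# ffi, lib = load_libllama("include/llama.h", "build/libllama.so")
# lib.llama_backend_init()
```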
These are part of the
These are also part of the added tokens (of
* test-tokenizer-random : add a failing edge case for falcon
* test-tokenizer-random : reduce potential conflicts with ggerganov#8379
* llama : fix mpt and olmo pre-tokenizer
* llama : pre-tokenize non-special user-defined tokens first
* llama : fix detection of control-like user-defined tokens
* convert_hf : identify which user-defined tokens are control tokens
  Only used in _set_vocab_gpt2() for now.
* convert_hf : identify more added control tokens for SPM tokenizers
  This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion.
  There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over longer space tokens, while the SentencePiece tokenizer does not do this. (The implementation in llama.cpp has the same behavior as SentencePiece.)
* llama : fix wrong pre-tokenization of byte tokens
* llama : fix Viking pre-tokenizer regex
  The order was previously wrong, which caused errors in some tests.
* llama : fix command-r detokenization
* convert_hf : reduce usages of the UNKNOWN token type
* llama : add UNKNOWN tokens in the special tokens cache
* convert_hf : reduce usages of UNKNOWN for InternLM2
  This makes the changes from #8321 more consistent with the other changes made here.
* test-tokenizer-random : reduce potential conflicts with #8379
* test-tokenizer-random : add a failing edge case for falcon
jaime-m-p force-pushed from fb46a15 to 3db5058
More tokenizer fixes.
Examples of vocab differences: