llama : support raw NUL bytes in tokens #8992
Open
+28
−18
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There are some models with NUL (
\0
) in their vocab, notably InternLM2 which has both<0x00>
and a literal\0
token.The GGUF format uses counted strings, so using NUL would not be a problem if
gguf_get_val_str
returned something more than a NUL-terminatedconst char *
.This adds
gguf_get_val_str_n
andgguf_get_arr_str_n
to get the length of a string stored in GGUF metadata. The constructor forstd::string
supports using a length, so it's used insrc/llama.cpp
.Keeping the NUL token in InternLM2 no longer makes
src/llama.cpp
crash, and so it should now correctly tokenize strings containing NUL bytes when using that tokenizer.Technically,
gguf_get_val_str_n
might not strictly be needed because metadata usually never contains NUL bytes, but I'm including it for consistency withgguf_get_arr_str_n
.Alternative implementations
A separate function to get the length of a string is not ideal, because each string accessed in GGUF needs 2 calls to be safe with NUL bytes.
Some other ideas:
gguf_get_arr_str
could return agguf_str
ggml.h
std::string
gguf_get_arr_str
could return the length by modifying a value by referencestd::string
I could be missing something, in which case I'd welcome a cleaner solution.
TODO
gguf_get_val_str_n
andgguf_get_arr_str_n
be something else thanint
, likesize_t
oruint64_t
?gguf_set_val_str
also support NUL bytes in strings?test-tokenizer-random.py
Note that I consider this to be somewhat low priority.