-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Unigram tokenizer needed by T5 and FLAN-T5 model families #8089
Conversation
llama : add Unigram tokenizer
llama : fix preventing crashes when precompiled_charsmap is not present
llama.cpp
Outdated
vocab.n_precompiled_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx); | ||
vocab.precompiled_charsmap = (char *) malloc(vocab.n_precompiled_charsmap); | ||
memcpy((void*) vocab.precompiled_charsmap, gguf_get_arr_data(ctx, precompiled_charsmap_keyidx), vocab.n_precompiled_charsmap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a memory leak here. Use std::vector<char>
instead of:
uint32_t n_precompiled_charsmap = 0;
char * precompiled_charsmap = NULL;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I replaced it with a vector as suggested, but I had to move endianness correction code from llm_tokenizer_ugm
to llm_load_vocab
- reference to vocab is const in tokenizer, so manipulating the precompiled_charsmap
vector buffer would require const casts.
// initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores | ||
std::vector<struct best_tokenization> tokenization_results(input_len + 1, {0, 0, -FLT_MAX}); | ||
// at the beginning tokenization score is zero | ||
tokenization_results[0] = { 0, 0, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be:
// initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
std::vector<struct best_tokenization> tokenization_results(input_len + 1, {vocab.special_unk_id, 0, -FLT_MAX});
// at the beginning tokenization score is zero
tokenization_results[0] = { vocab.special_unk_id, 0, 0 };
Currently, the string of a single space character
tokenizes to a single PAD token [0]
, while the AutoTokenizer
returns an empty array of tokens in this case. With the change above, llama.cpp
returns a single UNK token [2]
, which is still incorrect though. Or at least, it does not match the AutoTokenizer
result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a second PR from a series of PRs adding support for T5 and FLAN-T5 models.
This PR adds implementation of the Unigram tokenizer used in T5 and FLAN-T5 models. It also adds T5 model architecture, tensors and model header parameters to allow testing the tokenizer with llama-tokenize command.