DO NOT MERGE Add olmo2 tokenizer to convert script (leaving open for discussion) #10535
base: master
Conversation
Hmm, this still fails to run for some reason, saying: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'olmo2'. Must have missed adding it somewhere, looking... |
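(For context on that error, a minimal sketch of my understanding, not verified against this exact PR: the convert script writes the pre-tokenizer name into the GGUF metadata, and llama.cpp's loader rejects any name it doesn't also know on the C++ side. Method names below follow the gguf Python package; treat them as assumptions if your version differs.)

```python
# Sketch: writing the pre-tokenizer name into GGUF metadata with the
# gguf Python package. If "olmo2" is not also registered inside
# llama.cpp (enum value + pre-tokenizer regex), loading fails with
# "unknown pre-tokenizer type: 'olmo2'".
import gguf

writer = gguf.GGUFWriter("out.gguf", arch="olmo2")
writer.add_tokenizer_model("gpt2")   # BPE-style tokenizer
writer.add_tokenizer_pre("olmo2")    # must be known to the C++ loader
```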
Okay, confirmed: with these changes it seems to recognize the model without issue (haven't confirmed coherency, but imatrix is working now, which points to generation working). |
It is necessary to check that the pre-tokenizer regex matches the one being used in llama.cpp as well, otherwise a new one needs to be added. |
@slaren are you referring to this: https://github.com/ggerganov/llama.cpp/pull/10535/files#diff-57725fa4c99fac28d0283eaa8b59c5e030355af0c8fbaf33145a64e3f2aa3420R6409 or is there something else i'm missing |
That's one part of it, but as it is, it is just using the default BPE regex. If the regex is different, it needs to be added here: Line 369 in c9b00a7
|
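(For reference, the default GPT-2 pre-tokenization pattern is published with the original GPT-2 code; a quick sketch of what it does, assuming that's the "default BPE regex" meant here. This uses the third-party regex module, since the \p{...} classes aren't supported by the stdlib re.)

```python
# Sketch: the GPT-2 pre-tokenizer split, using the `regex` module.
import regex

GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_SPLIT.findall("Hello world, it's 2024!"))
# -> ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```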
Ahh I see. Looking in https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/raw/main/tokenizer.json I don't see any regex defined; does that imply the default is fine? |
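(A quick way to check what a tokenizer.json declares; a sketch, with the file path being whatever you downloaded from the repo above.)

```python
# Sketch: print the pre_tokenizer section of a HF tokenizer.json to see
# whether it declares an explicit split regex or relies on defaults.
import json

with open("tokenizer.json") as f:
    tok = json.load(f)

print(json.dumps(tok.get("pre_tokenizer"), indent=2))
```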
It seems to use the GPT2 tokenizer so I imagine it needs to use the GPT2 regex (which olmo also uses). @2015aroras might know if there are changes to the tokenizer compared to olmo. |
I'm not very familiar with tokenizers, so bear with me. OLMo 1 and OLMo 2 have different tokenizers with different vocab sizes. When I implemented OLMo 2 here, I tested only the base OLMo 2 model, since that's all I had at hand. It seems to already have a check hash. |
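(For anyone following along, the check hash works roughly like this; a sketch, where the real probe string lives in convert_hf_to_gguf.py and is much longer than the placeholder below. The script encodes a fixed probe text and hashes the resulting token IDs, so two tokenizers that split text differently get different hashes.)

```python
# Sketch of the tokenizer fingerprint used by convert_hf_to_gguf.py:
# hash the token IDs produced for a fixed probe string. The probe text
# below is a placeholder, not the real one from the script.
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
chktxt = "placeholder probe text 123 \u00e9\u00e8"
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
print(chkhsh)
```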
Yes, I noticed the base model went fine but the instruct versions didn't; there must be some other change to the tokenizer that's not as obvious. |
Using this with the 13B model seems to yield a fair conversation. No idea why one of the prompts got no response the first time, though.
|
Seems reasonable enough; is this made using my PR? |
Can you use llama-tokenize to see if it's tokenizing stuff properly? |
After trying it with llama-tokenize I can see that, no, it's not tokenizing correctly, even with my changes. |
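(A simple way to run that comparison; a sketch with an arbitrary prompt: print the HF token IDs and diff them against what llama-tokenize reports for the converted GGUF.)

```python
# Sketch: reference tokenization from the HF side, to compare against
# the output of `llama-tokenize -m model.gguf -p "<same prompt>"`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
prompt = "Hello world, it's 2024!"
print(tok.encode(prompt))
```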
I used the PR, but didn't try llama-tokenize. We're looking internally into the tokenizer differences. I'll report back when we have a conclusion. |
Okay thanks sounds good, I'll also take a look now that I'm home |
Well, here's something @2015aroras: the non-instruct has a regex string in the pre_tokenizer, the instruct doesn't.
non-instruct: (pre_tokenizer declares a split regex)
instruct: (no pre_tokenizer regex)
I'm guessing that's what's making the difference. Also, the post_processors are different, but I'm not sure it matters, since llama.cpp is detecting it fine and presumably won't use that anyway post-conversion. |
Yup so adding the pre_tokenizer makes it pass without my changes |
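(For reproducibility, the experiment above amounts to something like this; a sketch where the file paths are placeholders, and copying the base model's block over is exactly the assumption being tested.)

```python
# Sketch: copy the pre_tokenizer block from the base model's
# tokenizer.json into the instruct model's, so both pre-tokenize the
# same way, then re-run the conversion.
import json

with open("olmo2-base/tokenizer.json") as f:
    base = json.load(f)
with open("olmo2-instruct/tokenizer.json") as f:
    inst = json.load(f)

inst["pre_tokenizer"] = base["pre_tokenizer"]

with open("olmo2-instruct/tokenizer.json", "w") as f:
    json.dump(inst, f, ensure_ascii=False, indent=2)
```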
However, even with that change it doesn't seem to be tokenizing correctly :/
|
But maybe that's just how the model behaves by default...? I don't see any tokens for |
Any updates, @2015aroras? |
This isn't the correct solution: the tokenizer is wrong for the instruct models, so nothing should be added to llama.cpp to support it; instead, the tokenizer.json should be changed.
The checksum for the new tokenizer was missing from convert_hf_to_gguf.py; it has been added there and to convert_hf_to_gguf_update.py.
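(The shape of that fix, for reference; a sketch where the hash, the pre-type name, and the repo entry are placeholders for the real values: get_vocab_base_pre() in convert_hf_to_gguf.py gains a new chkhsh branch, and convert_hf_to_gguf_update.py gains a matching entry in its models list so the hash can be regenerated.)

```python
# Sketch of the fix in convert_hf_to_gguf.py's get_vocab_base_pre():
# map the instruct tokenizer's hash to a pre-tokenizer type that
# llama.cpp already knows. Both values below are placeholders.
if chkhsh == "<hash printed by convert_hf_to_gguf_update.py>":
    # ref: https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct
    res = "olmo2"  # placeholder: whichever pre-type matches the regex

# And the matching entry in convert_hf_to_gguf_update.py's models list:
# {"name": "olmo2", "tokt": TOKENIZER_TYPE.BPE,
#  "repo": "https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct"},
```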