Bug: BLOOM pre-tokenizer is missing #8741

Closed · Exploder98 opened this issue Jul 28, 2024 · 1 comment · Fixed by #8850

Labels: bug-unconfirmed, medium severity

@Exploder98 (Contributor)

What happened?

I tried to convert a BLOOM-based model (https://huggingface.co/TurkuNLP/gpt3-finnish-large) to GGUF. First, I had to change the architecture in the model's config.json to BloomForCausalLM, and with that change I got the following error from the conversion script:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  bc01ce58980e1db43859146dc51b1758b3b88729b217a74792e9f8d43e479d21
WARNING:hf-to-gguf:**************************************************************************************
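
For reference, the architecture change mentioned above was just an edit to the model's config.json; roughly this (a minimal sketch, the path is a placeholder):

```python
import json

# Hypothetical sketch: point "architectures" at BloomForCausalLM so the
# conversion script treats the checkpoint as a BLOOM model.
path = "gpt3-finnish-large/config.json"  # placeholder path
with open(path) as f:
    config = json.load(f)
config["architectures"] = ["BloomForCausalLM"]
with open(path, "w") as f:
    json.dump(config, f, indent=2)
```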

I also tried to convert one of the original BLOOM models (560m) and got the same error, only with a different hash. It seems that BLOOM's pre-tokenizer was never added when the pre-tokenizers were reworked in #6920. Since BLOOM is listed as a supported model in the README, conversion should work.
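
As far as I can tell, the chkhsh in the warning is a fingerprint of the tokenizer's output on a fixed test string, so two models only share a hash if they pre-tokenize identically; a sketch of the idea (assuming the upstream logic, with chktxt standing in for the script's long test string):

```python
from hashlib import sha256
from transformers import AutoTokenizer

# Sketch of how convert_hf_to_gguf.py fingerprints a pre-tokenizer:
# encode a fixed test string and hash the resulting token ids.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/gpt3-finnish-large")
chktxt = "..."  # stand-in for the script's actual test string (elided)
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # an unrecognized hash triggers the warning above
```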

Name and Version

$ ./bin/llama-cli --version
version: 3481 (5e2727f)
built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Exploder98 added the bug-unconfirmed and medium severity labels on Jul 28, 2024
@Exploder98 (Contributor, Author)

Alright, I did some digging. It looks like the pre-tokenizer regex for both BLOOM and gpt3-finnish is the same as Poro's. I hacked together some changes that set the regex for both of these models to the Poro one, and the auto-generated tokenizer tests pass.

Should I add BLOOM and gpt3-finnish as additional models in convert_hf_to_gguf_update.py and in the C++ sources? Or should the conversion script just be edited so that the pre-tokenizer is set to Poro for them? A sketch of the second option follows below.
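
For concreteness, the second option would amount to roughly this inside get_vocab_base_pre in convert_hf_to_gguf.py (a sketch only; the hash is the one printed above for gpt3-finnish-large, and whether to reuse the "poro-chat" name or introduce a new one is exactly the open question):

```python
# Sketch: map the observed fingerprint to the existing Poro pre-tokenizer.
# The BLOOM 560m hash would need its own branch as well.
if chkhsh == "bc01ce58980e1db43859146dc51b1758b3b88729b217a74792e9f8d43e479d21":
    # ref: https://huggingface.co/TurkuNLP/gpt3-finnish-large
    res = "poro-chat"
```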
