Bug: BLOOM pre-tokenizer is missing #8741

Closed · Exploder98 opened this issue Jul 28, 2024 · 1 comment · Fixed by #8850

Labels: bug-unconfirmed, medium severity

@Exploder98 (Contributor)

What happened?

I tried to convert a BLOOM-based model (https://huggingface.co/TurkuNLP/gpt3-finnish-large) to GGUF. First, I had to change the architecture in the model's config.json to BloomForCausalLM, and with that change I got the following error from the conversion script:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  bc01ce58980e1db43859146dc51b1758b3b88729b217a74792e9f8d43e479d21
WARNING:hf-to-gguf:**************************************************************************************
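
For reference, the architecture change mentioned above was just an edit to the model's config.json; roughly this (a minimal sketch, the path is a placeholder):

```python
import json

# Hypothetical sketch: point "architectures" at BloomForCausalLM so the
# conversion script treats the checkpoint as a BLOOM model.
path = "gpt3-finnish-large/config.json"  # placeholder path
with open(path) as f:
    config = json.load(f)
config["architectures"] = ["BloomForCausalLM"]
with open(path, "w") as f:
    json.dump(config, f, indent=2)
```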

I also tried to convert one of the original BLOOM models (560m) and got the same error, only with a different hash. It seems that BLOOM's pre-tokenizer was never added when the pre-tokenizers were reworked in #6920. Since BLOOM is listed as a supported model in the README, conversion should work.
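
As far as I can tell, the chkhsh in the warning is a fingerprint of the tokenizer's output on a fixed test string, so two models only share a hash if they pre-tokenize identically; a sketch of the idea (assuming the upstream logic, with chktxt standing in for the script's long test string):

```python
from hashlib import sha256
from transformers import AutoTokenizer

# Sketch of how convert_hf_to_gguf.py fingerprints a pre-tokenizer:
# encode a fixed test string and hash the resulting token ids.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/gpt3-finnish-large")
chktxt = "..."  # stand-in for the script's actual test string (elided)
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # an unrecognized hash triggers the warning above
```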

Name and Version

$ ./bin/llama-cli --version
version: 3481 (5e2727f)
built with cc (GCC) 14.1.1 20240720 for x86_64-pc-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Exploder98 added the bug-unconfirmed and medium severity labels on Jul 28, 2024
@Exploder98 (Contributor, Author)

Alright, I did some digging. It looks like the pre-tokenizer regex for both BLOOM and gpt3-finnish is the same as Poro's. I hacked together some changes that set the regex for both of these models to the Poro one, and the auto-generated tokenizer tests pass.

Should I add BLOOM and gpt3-finnish as additional models in convert_hf_to_gguf_update.py and in the C++ sources? Or should the conversion script just be edited so that the pre-tokenizer is set to Poro for them? A sketch of the second option follows below.
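
For concreteness, the second option would amount to roughly this inside get_vocab_base_pre in convert_hf_to_gguf.py (a sketch only; the hash is the one printed above for gpt3-finnish-large, and whether to reuse the "poro-chat" name or introduce a new one is exactly the open question):

```python
# Sketch: map the observed fingerprint to the existing Poro pre-tokenizer.
# The BLOOM 560m hash would need its own branch as well.
if chkhsh == "bc01ce58980e1db43859146dc51b1758b3b88729b217a74792e9f8d43e479d21":
    # ref: https://huggingface.co/TurkuNLP/gpt3-finnish-large
    res = "poro-chat"
```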
