convert-hf-to-gguf.py breaks on phi-2 #7219
BPE pre-tokenizer was not recognized
Trying to fix it. Keep tabs on my PR #7117. If Phi-1 works, then Phi-1.5 and Phi-2 should work as well; they all used the same vocab. Phi-1 is registering as Phi-2, but Phi-1.5 and Phi-2 do not. What makes this super weird is that the Phi-2 vocab registers as its own instead of the gpt-2 vocab like it's supposed to. @mofosyne This is definitely a bug.

23:14:05 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ ./main --color -e -s 1337 -c 256 -n 256 -p "Create a function that returns a list of a prime numbers based on a given input in Python" -m /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf
Log start
main: build = 2893 (dc020985)
main: built with cc (GCC) 14.1.1 20240507 for x86_64-pc-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 21 key-value pairs and 453 tensors from /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = Phi2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv 5: phi2.block_count u32 = 32
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.pre str = phi-2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 259 tensors
llama_model_loader: - type f16: 194 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi-2'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf'
main: error: unable to load model

The pre-tokenizer is registered as phi-2 instead of gpt-2:

llama_model_loader: - kv 13: tokenizer.ggml.pre str = phi-2
Okay, yeah. This is definitely a bug. I was able to fix it.

23:29:52 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ ./main --color -e -s 1337 -c 256 -n 256 -p "Create a function that returns a list of a prime numbers based on a given input in Python" -m /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf
Log start
main: build = 2893 (dc020985)
main: built with cc (GCC) 14.1.1 20240507 for x86_64-pc-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 21 key-value pairs and 453 tensors from /mnt/valerie/models/microsoft/phi-2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = Phi2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv 5: phi2.block_count u32 = 32
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.pre str = gpt-2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 259 tensors
llama_model_loader: - type f16: 194 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 5.18 GiB (16.01 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.21 MiB
llm_load_tensors: CPU buffer size = 5303.65 MiB
.............................................................................................
llama_new_context_with_model: n_ctx = 256
llama_new_context_with_model: n_batch = 256
llama_new_context_with_model: n_ubatch = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 80.00 MiB
llama_new_context_with_model: KV self size = 80.00 MiB, K (f16): 40.00 MiB, V (f16): 40.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.20 MiB
llama_new_context_with_model: CPU compute buffer size = 52.50 MiB
llama_new_context_with_model: graph nodes = 1161
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 256, n_batch = 2048, n_predict = 256, n_keep = 0
Create a function that returns a list of a prime numbers based on a given input in Python.
```python
def prime_numbers(n):
primes = []
for i in range(2, n):
is_prime = True
for j in range(2, i):
if i % j == 0:
is_prime = False
break
if is_prime:
primes.append(i)
return primes
print(prime_numbers(20)) # [2, 3, 5, 7, 11, 13, 17, 19]
```
### Exercise 3:
Create a function that takes two arguments and returns their product using recursion.
```python
def recursive_product(a, b):
if b == 0:
return 1
else:
return a + recursive_product(a, b - 1)
print(recursive_product(5, 4)) # 20
```
### Exercise 4:
Create a function that takes a list of integers and returns a new list with only even numbers using list comprehension.
```python
def even_numbers(a):
return [n for
llama_print_timings: load time = 363.29 ms
llama_print_timings: sample time = 5.65 ms / 256 runs ( 0.02 ms per token, 45309.73 tokens per second)
llama_print_timings: prompt eval time = 223.51 ms / 18 tokens ( 12.42 ms per token, 80.53 tokens per second)
llama_print_timings: eval time = 32266.33 ms / 255 runs ( 126.53 ms per token, 7.90 tokens per second)
llama_print_timings: total time = 32525.11 ms / 273 tokens
Log end
Thanks. But it seems we now have a mismatch as in #4622 (comment).
@CrispStrobe No, this is a different issue. The tokenizers are hashed and then identified that way. The model configuration is registered into a factory and then processed. The vocabulary metadata isn't being identified correctly. The issue you linked relates to the llama.cpp runtime; this issue relates to conversion metadata, so the vocab mismatch is out of scope here. Edit: Now you have me wondering whether the vocab mismatch is related after all, 😅.
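For context, here is a minimal sketch of the registration-factory pattern described above. The names are simplified and the dispatch on the `architectures` field of `config.json` is assumed; this is only the shape of the mechanism, not the actual convert-hf-to-gguf.py code:

```python
# Minimal sketch (not the real converter code): model classes register
# themselves under an HF architecture string, and the converter later looks
# the class up from the model's config.json.
_model_classes: dict[str, type] = {}

class Model:
    @classmethod
    def register(cls, *names):
        def wrapper(model_cls):
            for name in names:
                _model_classes[name] = model_cls
            return model_cls
        return wrapper

    @classmethod
    def from_model_architecture(cls, arch: str) -> type:
        try:
            return _model_classes[arch]
        except KeyError:
            raise NotImplementedError(f"Architecture {arch!r} not supported") from None

@Model.register("PhiForCausalLM")
class Phi2Model(Model):
    pass

# phi-1, phi-1.5 and phi-2 all report "PhiForCausalLM" in config.json,
# so all three resolve to the same converter class:
print(Model.from_model_architecture("PhiForCausalLM"))
```

This is why all three Phi checkpoints land in the same converter class, even though the pre-tokenizer metadata they end up with can differ.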
Yeah, I think I found it.

```python
@Model.register("PhiForCausalLM")
class Phi2Model(Model):
    # omitting for brevity

    def set_gguf_parameters(self):
        # omitting for brevity
        self.gguf_writer.add_name("Phi2")
        self.gguf_writer.add_tokenizer_pre("gpt-2")  # <- Need this
        self.gguf_writer.add_context_length(self.find_hparam(["n_positions", "max_position_embeddings"]))
        # omitting for brevity
```

Need feedback.
Doesn't this lead to the same thing as in my diff above, once we reach the switch (vocab.type) in struct llm_tokenizer_bpe in llama.cpp? But you seem to be much more familiar with the codebase. I was also wondering about set_vocab.
Yes, you're right. You're probably using the

First, the hash needs to be included for the vocab. Then the line for adding the pre-tokenizer needs to be added as well. Then voilà! It should work. ✨

You can do this by pulling in my forked branch, or by doing it manually on the branch you're using and referencing the changes I made in my fork. Doing it manually is fairly involved if you're unfamiliar with the code base.

The update script would need the line for generating the hash:

```python
{"name": "phi", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-1", },
```

The convert script would need the hash check:

```python
if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
    # ref: https://huggingface.co/microsoft/phi-1
    res = "phi"
```

The convert script would need the vocab metadata:

```python
self.gguf_writer.add_tokenizer_pre("gpt-2")
```

And that's it.
Thanks. Your example above seems to mix Phi-1 and Phi-2, though? My workaround was either to use an older llama.cpp version for Phi-2 or to use the fix linked above, but I was unsure whether it was consistent with the overall logic.
My patch works for Phi-1, Phi-1.5, and Phi-2. They all use the same vocab.
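One quick way to sanity-check that shared-vocab claim, assuming the public microsoft/phi-1, microsoft/phi-1_5, and microsoft/phi-2 repositories and the transformers package (an illustration, not something from the thread):

```python
from transformers import AutoTokenizer

repos = ["microsoft/phi-1", "microsoft/phi-1_5", "microsoft/phi-2"]
vocabs = [AutoTokenizer.from_pretrained(repo).get_vocab() for repo in repos]

# True if the three tokenizers map exactly the same strings to the same ids,
# which is what would let a single "phi" fingerprint cover all of them.
print(all(vocab == vocabs[0] for vocab in vocabs[1:]))
```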
I took a look at your notebook and noticed you're using a custom model,
This was possible earlier, before the BPE pre-tokenizer fixes. Now it leads to:

File "/kaggle/working/llama.cpp/./convert-hf-to-gguf.py", line 432, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

I thought this would be easily solved by updating the hashes, but it seems I cannot get past "llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'phi2'" without code changes.

How is this supposed to be done? Like so? And why does the script break when the pre-tokenizer string has no known match, instead of just falling back to a default, as illustrated in this diff?