Bug: Converted HF LoRA adapter on Llama 3.1 not loading. #9114

Closed
Ujjawal-K-Panchal opened this issue Aug 21, 2024 · 4 comments · Fixed by #9117
Labels
bug-unconfirmed, critical severity (used to report critical severity bugs in llama.cpp, e.g. crashing, corruption, data loss)

Comments

@Ujjawal-K-Panchal
Contributor

Ujjawal-K-Panchal commented Aug 21, 2024

What happened?

In short: using the standard procedure from the documentation, I am unable to attach a converted LoRA adapter (HF -> GGUF) to a Llama 3.1 GGUF model.

Procedure:

  1. Fine-tune the Llama 3.1 HF model with a PEFT LoRA adapter, then save the adapter to a directory, say lora-dir/, for later use. (Using trl.SFTTrainer; saved via the output_dir parameter; a rough sketch of this step follows the list.)
  2. Convert the Llama 3.1 model from the HF repo to GGUF via the prescribed method (convert_hf_to_gguf.py).
  3. Quantize the Llama 3.1 GGUF to Q4_K_M following the instructions in examples/quantize/README.md.
  4. Convert the saved LoRA adapter to a bf16 GGUF with the following command:
    python convert_lora_to_gguf.py ../lora-dir/ --outfile ../lora-dir/llama31-lora.gguf --outtype bf16 --base ../models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/
  5. Try running the CLI with the above:
    ./llama-cli -m ../modelstore/llama31-Q4_K_M-v2.gguf --lora ./lora-dir/llama31-freedom-lora-v010.gguf
  • Step 5 fails with: llama_lora_adapter_init: failed to apply lora adapter: LoRA tensor 'rope_freqs.weight' has unexpected suffix'. See the full log output below.
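
For context, step 1 was roughly the following (a minimal sketch only; the dataset, target modules, and hyperparameters here are placeholders, not my exact setup):

# Minimal sketch of step 1 (placeholders, not my exact setup): LoRA fine-tune
# of Llama 3.1 with peft + trl.SFTTrainer, saving only the adapter to lora-dir/.
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")  # placeholder dataset

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder modules
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    args=TrainingArguments(output_dir="../lora-dir/", num_train_epochs=1),
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
trainer.save_model("../lora-dir/")  # writes only the LoRA adapter files, not the full model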

Additional notes:

  • Note: I tried running llama-cli on the base model alone (no adapter), and it worked flawlessly.
  • Theory: I recently saw an issue here about the rope scaling changes in Llama 3.1 relative to Llama 3, and a PR that fixed it. Those changes might not have been carried over to convert_lora_to_gguf.py (see the sketch after these notes).
  • Note: snapshot 8c22764a7e3675c50d4c7c9a4edb474456022b16 is the current default for the Llama 3.1 HF repo.
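
To illustrate the theory (a sketch of my guess only, not the actual change in #9117): the error suggests the loader expects every adapter tensor name to carry a LoRA suffix, and the Llama 3.1 rope_freqs.weight tensor is not such a pair, so a converter would presumably have to skip it rather than emit it:

# Sketch of my guess only -- NOT the actual fix in #9117.
# The loader rejects adapter tensors whose names do not end in .lora_a / .lora_b,
# which is what the "has unexpected suffix" error for rope_freqs.weight suggests.
def is_lora_pair_tensor(name: str) -> bool:
    """True only for tensors that belong to a LoRA A/B pair."""
    return name.endswith(".lora_a") or name.endswith(".lora_b")

# hypothetical tensor names, for illustration only
names = ["blk.0.attn_q.weight.lora_a", "blk.0.attn_q.weight.lora_b", "rope_freqs.weight"]
print([n for n in names if is_lora_pair_tensor(n)])
# -> ['blk.0.attn_q.weight.lora_a', 'blk.0.attn_q.weight.lora_b']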

Name and Version

version: 3484 (4730fac)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

........................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  8984.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   264.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 420
llama_lora_adapter_init_internal: loading lora adapter from '../lora-dir/llama31-lora.gguf' ...
llama_lora_adapter_init: failed to apply lora adapter: LoRA tensor 'rope_freqs.weight' has unexpected suffix'
llama_init_from_gpt_params: error: failed to apply lora adapter '../lora-dir/llama31-lora.gguf'
main: error: unable to load model
@Ujjawal-K-Panchal added the bug-unconfirmed and critical severity labels Aug 21, 2024
@ngxson
Collaborator

ngxson commented Aug 21, 2024

Can you try the conversion script from #9117?

@Ujjawal-K-Panchal
Contributor Author

Thank you so much for the quick response! Testing this.

@Ujjawal-K-Panchal
Contributor Author

The conversion script mentioned above now works perfectly. I also tried different quantizations and see no problems. A snippet of the log output is below:

llama_lora_adapter_init_internal:        CPU LoRA buffer size =    72.00 MiB
llama_lora_adapter_init_internal: loaded 192 tensors from lora file

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1


 [end of text]

llama_print_timings:        load time =   15135.41 ms
llama_print_timings:      sample time =       0.59 ms /     4 runs   (    0.15 ms per token,  6779.66 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =     428.97 ms /     4 runs   (  107.24 ms per token,     9.32 tokens per second)
llama_print_timings:       total time =     430.61 ms /     4 tokens
Log end

Thanks for the help!

@Ujjawal-K-Panchal
Contributor Author

Keeping this open till the PR is merged.
