
Feature Request: Instructions how to correctly use/convert original llama3.1 instruct .pth model #8808

Closed
scalvin1 opened this issue Aug 1, 2024 · 13 comments
Labels
enhancement New feature or request

Comments


scalvin1 commented Aug 1, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Can someone please add (or point me to) instructions on how to correctly set everything up to get from the downloaded FaceMeta .pth weights to .gguf (and then onwards to Q8_0)?

I am running a local 8B instance with llama-server and CUDA.

Keep up the great work!

Motivation

With all the half-broken llama3.1 gguf files uploaded to hf by brownie point kids, it would make sense to drop a few words on how to convert and quantize the original/official Meta llama 3.1 weights for use with a local llama.cpp. (Somehow everyone seems to get the weights from hf, but why not source these freely available weights from the actual source?)

My attempts still leave me hazy on whether the rope scaling is done correctly, even though I use the latest transformers (for .pth to .safetensors) and then the latest git version of llama.cpp for convert_hf_to_gguf.py.

The closest description I could find (edit: note that this is valid for llama3, not llama3.1 with the larger 128k token context) is here: https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/

Possible Implementation

Please add a couple of lines on llama3.1 "from META .pth to GGUF" to a README or as an answer to this issue.

scalvin1 added the enhancement label on Aug 1, 2024

xocite commented Aug 2, 2024

Hey, here's what I've captured so far. I have the model stored at ~/meta-llama-3.1-8b-instruct.

vector jiff ~/repo $ git clone [email protected]:meta-llama/llama-recipes.git
vector jiff ~/repo $ git clone [email protected]:huggingface/transformers.git
vector jiff ~/repo $ cd transformers
vector jiff ~/repo/transformers $ python3 -m venv .
vector jiff ~/repo/transformers $ source bin/activate
(transformers) vector jiff ~/repo/transformers $ pip install -r ../llama-recipes/requirements.txt
(transformers) vector jiff ~/repo/transformers $ pip install protobuf blobfile
(transformers) vector jiff ~/repo/transformers $ python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ~/meta-llama-3.1-8b-instruct --model_size 8B --llama_version 3.1 --output_dir ~/meta-llama-3.1-8b-instruct

Source

Next, you'll also need to pull some files from the Hugging Face repository into the model directory.

vector jiff ~/meta-llama-3.1-8b-instruct $ git init .
vector jiff ~/meta-llama-3.1-8b-instruct $ git remote add origin https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
vector jiff ~/meta-llama-3.1-8b-instruct $ GIT_LFS_SKIP_SMUDGE=1 git fetch
vector jiff ~/meta-llama-3.1-8b-instruct $ git checkout origin/main -- config.json generation_config.json special_tokens_map.json tokenizer.json tokenizer_config.json
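It may be worth double-checking that these arrived as real files: with GIT_LFS_SKIP_SMUDGE=1, anything tracked by Git LFS is checked out as a small pointer stub instead of its actual content (just a cautionary aside, not a confirmed cause of the error further down). For example:

# print the start of each fetched file; a Git LFS pointer stub begins with
# "version https://git-lfs.github.com/spec/v1" instead of JSON
head -c 200 config.json tokenizer.json tokenizer_config.json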

Finally, run the quantisation steps as instructed.

vector jiff ~/repo $ git clone https://github.com/ggerganov/llama.cpp
vector jiff ~/repo $ cd llama.cpp
vector jiff ~/repo/llama.cpp $ git log -1 --pretty=format:"%H - %s" origin/HEAD
afbb4c1322a747d2a7b4bf67c868148f8afcc6c8 - ggml-cuda: Adding support for unified memory (#8035)
vector jiff ~/repo/llama.cpp $ python3 -m venv .
vector jiff ~/repo/llama.cpp $ source bin/activate
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install -r requirements.txt
(llama.cpp) vector jiff ~/repo/llama.cpp $ pip install transformers~=4.43.3
(llama.cpp) vector jiff ~/repo/llama.cpp $ ln -s ~/meta-llama-3.1-8b-instruct models/meta-llama-3.1-8b-instruct 
(llama.cpp) vector jiff ~/repo/llama.cpp $ python3 convert_hf_to_gguf.py --outtype bf16 models/meta-llama-3.1-8b-instruct --outfile models/meta-llama-3.1-8b-instruct/meta-llama-3.1-8b-instruct-bf16.gguf

Unfortunately, I get stuck at this stage with the following error. I haven't been able to resolve this yet.

Traceback (most recent call last):
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3717, in <module>
    main()
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 3711, in main
    model_instance.write()
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 401, in write
    self.prepare_metadata(vocab_only=False)
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 394, in prepare_metadata
    self.set_vocab()
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 1470, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 693, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jiff/predictors/repos/llama.cpp/convert_hf_to_gguf.py", line 713, in _create_vocab_sentencepiece
    tokenizer.LoadFromFile(str(tokenizer_path))
  File "/home/jiff/predictors/repos/llama.cpp/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: could not parse ModelProto from models/meta-llama-3.1-8b-instruct/tokenizer.model


scalvin1 commented Aug 2, 2024

You are missing some Python package (protobuf?). Pull it in with pip install.
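For example (a guess in line with that, assuming the llama.cpp venv from the steps above is active):

# protobuf is the package most often missing when the tokenizer ModelProto cannot be parsed
pip install protobuf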

What you describe is what I gathered from https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/ and I managed to get a working gguf file. I could also quantize to Q8_0 without problems and I have a model that seems to work alright. What I am just not very sure about is whether the rope scaling for context buffers up to 128k works correctly.
In light of some comments on a similar issue (for example https://www.reddit.com/r/LocalLLaMA/comments/1eeyw0v/i_keep_getting_this_error_in_llama_31_8b_llamacpp/): I always get 291 tensors after my conversions, and I never got any complaints about llama.cpp expecting 292 (a correctly converted model would have this extra one). This is puzzling me.

  • Do we need to go through the described procedure (above, or voorloop link) with special flags/parameters?
  • Does the quantization need special flags?
  • Does the llama-cli or llama-server call need special flags/parameters?

As said before, I would like to see an official description somewhere on how to correctly make use of the official Meta Llama 3.1 (Instruct) weights with llama.cpp. Also, maybe a few words on how to test and make sure large contexts work correctly.
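For checking what actually ended up in a converted file, one rough option is the gguf_dump.py helper shipped with llama.cpp (the script path and the model filename below are examples and may differ between checkouts):

# dump the GGUF metadata and tensor list, then filter for rope/context entries;
# this also shows the total tensor count discussed above
python3 gguf-py/scripts/gguf_dump.py models/meta-llama-3.1-8b-instruct-bf16.gguf | grep -i -E "rope|context|tensor"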

@brandenvs

@xocite, I ran into a similar problem and found it could be due to a missing config file, specifically tokenizer_config.json. I haven't tried to resolve the issue fully, but I did make a gist where I used the save_pretrained method to download the tokenizer and tokenizer_config JSON files: https://gist.github.com/brandenvs/2dad0e864fc2a0a5ee176213ae43b902
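In rough outline, that approach looks like the snippet below (my paraphrase rather than the gist itself; it assumes access to the gated meta-llama repo and a target directory of your choosing):

# fetch the tokenizer files from the official HF repo and write them next to the converted model
python3 -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct').save_pretrained('models/meta-llama-3.1-8b-instruct')"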

@Justin-12138

@scalvin1 I got it running well by following your instructions.


xocite commented Aug 4, 2024

I tried again by downloading the model directly from HF and it quantised fine using the convert_hf_to_gguf.py script.


scalvin1 commented Aug 5, 2024

Your comments are beside the point.
What I want is clear instructions to convert the model downloaded directly from Meta to the best-possible feature complete Q8_0 gguf.
I want to cut out the dodgy middle men and make sure that I know what I get.

To spell it out again: huggingface is not the primary source of the llama weights and there is no reason to trust the uploaders to have done the conversion optimally. Especially if there are no clear instructions in this project on how this should be done.


quitrk commented Aug 7, 2024

I'm a bit confused, isn't https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct the official Meta repo? It already contains the model converted from PyTorch, so it can be converted further to gguf using the convert_hf_to_gguf.py script.
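For reference, that path looks roughly like this (a sketch; the repo is gated, so huggingface-cli login is needed first, and the local directory and file names are placeholders):

# download the already-converted safetensors repo, then convert straight to GGUF
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir Meta-Llama-3.1-8B-Instruct-hf
python3 convert_hf_to_gguf.py --outtype bf16 Meta-Llama-3.1-8B-Instruct-hf --outfile meta-llama-3.1-8b-instruct-bf16.gguf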


scalvin1 commented Aug 7, 2024

As final words in this neglected issue, I summarize my findings.

To the commenter above: huggingface is not the official repository. You download it from Zuckerboy's Facebook here: https://llama.meta.com/llama-downloads/

What needs to happen next is converting it to Hugging Face format using the transformers library in a Python venv (it will download gigabytes of wheels...).

(as on the voorloop webpage)
Create a virtual environment:

python3 -m venv .venv 

Activate the environment

source .venv/bin/activate

Install the transformers library and dependencies (possibly protobuf and some other missing dependencies?)

pip install transformers transformers[torch] tiktoken blobfile sentencepiece

Convert it

python .venv/lib/python3.11/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B-Instruct/ --model_size 8B --output_dir hf --llama_version 3.1 --instruct True

After that, convert it to a 32-bit gguf (to avoid any loss in fidelity)

python3 convert_hf_to_gguf.py --outtype f32 --outfile ../meta-llama-3.1-8B-instruction_f32.gguf ../hf/
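(If disk space is tight, the conversion was also run earlier in this thread with --outtype bf16, which roughly halves the intermediate file; I have not checked whether f32 vs bf16 makes any measurable difference after the Q8_0 quantization below.)

python3 convert_hf_to_gguf.py --outtype bf16 --outfile ../meta-llama-3.1-8B-instruction_bf16.gguf ../hf/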

Finally, quantize it to whatever fits your hardware. A good choice is 8bit:

llama.cpp-testing/llama-quantize meta-llama-3.1-8B-instruction_f32.gguf meta-llama-3.1-8B-instruction_Q8_0.gguf Q8_0

The convert and quantize steps above will generate gguf models with 291 tensors that seem to work with longer contexts (note that longer contexts seem to need a lot more RAM or VRAM).
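As a rough way to test that, the commands below are one option (filenames are the ones from above, any sizeable plain-text file works as perplexity input, context size and -ngl offload should be tuned to your hardware, and none of this is an official validation procedure):

# quick perplexity sanity check on the quantized model
./llama-perplexity -m meta-llama-3.1-8B-instruction_Q8_0.gguf -f wiki.test.raw
# serve with a larger context window and partial GPU offload, then try a long prompt against it
./llama-server -m meta-llama-3.1-8B-instruction_Q8_0.gguf -c 32768 -ngl 20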

Note: I have not validated this approach and was hoping someone in the know could drop some official comments as to how to correctly apply the process outlined here.

Anyway, it seems to work for me like this with no completely obvious flaws.


quitrk commented Aug 8, 2024

What I meant was whether the repo itself is the official presence of Meta on Hugging Face, which seems to be the case.

@Vaibhavs10
Collaborator

> Your comments are beside the point.
> What I want is clear instructions to convert the model downloaded directly from Meta to the best-possible feature complete Q8_0 gguf.
> I want to cut out the dodgy middle men and make sure that I know what I get.
>
> To spell it out again: huggingface is not the primary source of the llama weights and there is no reason to trust the uploaders to have done the conversion optimally. Especially if there are no clear instructions in this project on how this should be done.

Hi @scalvin1 - I'm VB from the open source team at Hugging Face. We're not a middle man. The weights uploaded in the meta-llama org are the official weights, converted together with Meta.

The steps you mentioned are exactly how Meta converted the weights as well. Everything works seamlessly now, but this required changes with respect to RoPE scaling; you can read more about it here: #8650

Let me know if you have any questions! 🤗
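For anyone who converted before those changes landed, re-converting with a current llama.cpp checkout should pick them up (a sketch reusing the paths from earlier in this thread):

# update llama.cpp and its conversion dependencies, then redo the GGUF conversion
git pull
pip install -r requirements.txt
python3 convert_hf_to_gguf.py --outtype bf16 models/meta-llama-3.1-8b-instruct --outfile models/meta-llama-3.1-8b-instruct-bf16.gguf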


codiak commented Aug 14, 2024

Trying to go from downloading the raw Llama 3.1 weights from Meta to using them for inference in Python led me here. It was partly due to wanting to handle the download of the weights manually (rather than pass a repo as a parameter) and partly because I wanted a better understanding (and control) of the formats for fine-tuning.

Inspired by this thread and the resources linked here, I put together a guide for taking the raw .pth weights and getting inference running in a Python script with llama.cpp:
https://github.com/codiak/llama-raw-to-py

@scalvin1
Author

There is not much to add anymore; it all seems to be working the way it was described. Closing the issue.


sglbl commented Sep 17, 2024

Thank you for this reply, I was looking for this. I have a question: I have 16 GB RAM and 6 GB VRAM (NVIDIA GeForce RTX 4050).

When I try this command:

python ~/.local/lib/python3.10/site-packages/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir ../../Meta-Llama-3.1-8B-Instruct/original --model_size 8B --output_dir hf --llama_version 3.1 --instruct True

I get a "Killed" error.

Is there a workaround for this? Also, the map location is fixed to CPU in the convert script.
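A generic workaround, assuming "Killed" really is the kernel OOM killer running out of memory during the CPU-side conversion (the size below is just an example), is to add temporary swap before rerunning:

# create a temporary 32 GB swap file so the conversion has enough virtual memory
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# remove it again afterwards: sudo swapoff /swapfile && sudo rm /swapfile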
