Feature Request: Instructions on how to correctly use/convert the original Llama 3.1 Instruct .pth model #8808
Comments
Hey, here's what I've captured so far. I have the model stored in a local directory.
Next you'll also need to pull some files from the Hugging Face repository into the model directory (see the sketch at the end of this comment).
Finally, run the quantisation steps as instructed.
Unfortunately, I get stuck at this stage with an error that I haven't been able to resolve yet.
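One way to pull those extra files, as a rough sketch (the repo id, file list, and target directory are assumptions; the tokenizer files mentioned later in this thread are the usual candidates, and the gated repo needs an access token):

```bash
# Assumption: only the tokenizer files are missing next to the local weights;
# the target directory is a placeholder for your model directory.
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct \
    tokenizer.json tokenizer_config.json \
    --local-dir ./Meta-Llama-3.1-8B-Instruct
```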
You are missing some Python package (protobuf?). Pull it in with pip install. What you describe is what I gathered from https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/ and I managed to get a working gguf file. I could also quantize to Q8_0 without problems, and I have a model that seems to work alright. What I am just not very sure about is whether the rope scaling for context buffers up to 128k works correctly.
As said before, I would like to see an official description somewhere of how to correctly make use of the official Meta Llama 3.1 (Instruct) weights with llama.cpp. Maybe also a few words on how to test and make sure large contexts work correctly.
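For the missing package mentioned at the start of this comment, a minimal guess at the fix (protobuf is what the error suggests; sentencepiece is another frequent culprit in tokenizer conversions):

```bash
# Guess at the missing dependencies; install whatever the traceback actually names.
pip install protobuf sentencepiece
```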
@xocite, I ran into a similar problem and found it could be due to a missing config file, specifically tokenizer_config.json. I haven't fully resolved the issue, but I did make a gist where I use the save_pretrained method to download the tokenizer and tokenizer_config JSON files: https://gist.github.com/brandenvs/2dad0e864fc2a0a5ee176213ae43b902
@scalvin1 I got it working by following your instructions.
I tried again by downloading the model directly from HF and it quantised fine using the conversion script.
Your comments are beside the point. To spell it out again: Hugging Face is not the primary source of the Llama weights, and there is no reason to trust the uploaders to have done the conversion optimally, especially when there are no clear instructions in this project on how it should be done.
I'm a bit confused, isn't https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct the official Meta repo? It already contains the model converted from PyTorch, so it can be converted further to GGUF using the convert_hf_to_gguf.py script.
As final words in this neglected issue, I'll summarize my findings. To the commenter above: Hugging Face is not the official repository; you download the weights directly from Meta here: https://llama.meta.com/llama-downloads/. What needs to happen next is converting them to Hugging Face format using the transformers library inside a Python venv (it will download gigabytes of wheels...), as on the voorloop page.
Create and activate the environment
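A minimal sketch, assuming Python 3 and a venv created in the working directory:

```bash
# Create and activate an isolated environment for the conversion tooling.
python3 -m venv venv
source venv/bin/activate
```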
Install the transformers library and its dependencies (possibly protobuf and some other missing packages)
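Something along these lines; the exact package set is a guess, so trim or extend it based on the errors you actually hit:

```bash
# transformers and torch handle the .pth -> safetensors conversion;
# the rest are dependencies that the conversion may ask for.
pip install --upgrade transformers torch tiktoken blobfile protobuf sentencepiece
```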
Convert it
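A sketch of that conversion step, assuming the raw Meta download sits in ./Meta-Llama-3.1-8B-Instruct; the flag names should be double-checked against the convert_llama_weights_to_hf script shipped with your transformers version:

```bash
# Convert the original .pth checkpoint to Hugging Face format (safetensors).
python -m transformers.models.llama.convert_llama_weights_to_hf \
    --input_dir ./Meta-Llama-3.1-8B-Instruct \
    --model_size 8B \
    --llama_version 3.1 \
    --output_dir ./Meta-Llama-3.1-8B-Instruct-hf
```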
After that, convert it to 32bit gguf (to avoid any loss in fidelity)
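With a current llama.cpp checkout, roughly (the output filename is a placeholder):

```bash
# Produce an unquantized 32-bit GGUF from the Hugging Face directory.
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct-hf \
    --outtype f32 \
    --outfile Meta-Llama-3.1-8B-Instruct-f32.gguf
```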
Finally, quantize it to whatever fits your hardware. A good choice is 8bit:
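For example, with the llama-quantize tool built from the same llama.cpp checkout (depending on your build it may live under build/bin/):

```bash
# Quantize the f32 GGUF down to Q8_0.
./llama-quantize Meta-Llama-3.1-8B-Instruct-f32.gguf \
    Meta-Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0
```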
The above two steps generate GGUF models with 291 tensors that seem to work with longer contexts (note that longer contexts appear to need a lot more RAM or VRAM). Note: I have not validated this approach and was hoping someone in the know could drop some official comments on how to correctly apply the process outlined here. Anyway, it seems to work for me like this, with no completely obvious flaws.
What I meant was whether the repo itself is Meta's official presence on Hugging Face, which seems to be the case.
Hi @scalvin1 - I'm VB from the open source team at Hugging Face. We're not a middle man; the weights uploaded in the meta-llama org are the official ones. The steps you mentioned are exactly how Meta converted the weights as well. Everything works seamlessly now, but this required changes with respect to RoPE scaling; you can read more about it here: #8650. Let me know if you have any questions! 🤗
Trying to go from downloading the raw Llama 3.1 weights from Meta to using them for inference in Python led me here. It was partly because I wanted to handle the download of the weights manually (rather than pass a repo id as a parameter) and partly because I wanted a better understanding of (and control over) the formats for fine-tuning. Inspired by this thread and the resources linked here, I put together a guide for taking the raw weights through the conversion.
There is not much to add any more; it all seems to be working the way it was described. Closing the issue.
Prerequisites
Feature Description
Can someone please add (or point me to) instructions on how to correctly set everything up to get from the .pth weights downloaded from FaceMeta to .gguf (and then onwards to Q8_0)?
I am running a local 8B instance with llama-server and CUDA.
Keep up the great work!
Motivation
With all the half-broken llama3.1 gguf files uploaded to hf by brownie point kids, it would make sense to drop a few words on how to convert and quantize the original/official Meta llama 3.1 weights for use with a local llama.cpp. (Somehow everyone seems to get the weights from hf, but why not source these freely available weights from the actual source?)
My attempts still leave me hazy on whether the RoPE scaling is done correctly, even though I use the latest transformers (for .pth to .safetensors) and then the latest git version of llama.cpp for convert_hf_to_gguf.py.
The closest description I could find (edit: note that this is valid for llama3, not llama3.1 with the larger 128k token context) is here: https://voorloopnul.com/blog/quantize-and-run-the-original-llama3-8b-with-llama-cpp/
Possible Implementation
Please add a couple of lines on Llama 3.1 "from Meta .pth to GGUF" to a README or as an answer to this issue.