Feature Request: Proper Llama 3.1 Support in llama.cpp #8650
Comments
Also, adding to this: proper function-calling support in the server would be great, since Llama 3.1 now supports tooling/function calling. |
It looks like they've added a new EOS token called <|eom_id|>, alongside the already existing <|end_of_text|> and <|eot_id|> ones, something to look out for. |
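For anyone handling stop tokens manually, here is a minimal sketch of treating <|eom_id|> as an extra stop string, assuming the llama-cpp-python bindings (the model path and prompt are placeholders, not something from this thread):

```python
# Hypothetical sketch: treat the new <|eom_id|> token as an additional stop string
# when generating with llama-cpp-python. Model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf", n_ctx=8192)

output = llm.create_completion(
    prompt="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"
           "<|start_header_id|>assistant<|end_header_id|>\n\n",
    max_tokens=256,
    # Stop on both the usual end-of-turn token and the new end-of-message token.
    stop=["<|eot_id|>", "<|eom_id|>"],
)
print(output["choices"][0]["text"])
```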
So, what does a proper template look like now? |
The new template doesn't seem to use the new EOS token, so the existing templates should work fine AFAIK. It might only be used for tool calls or something like that; not sure yet... |
IMO, support for function calling is easier (and more stable) to do in Python, for example via ... I tried implementing the same thing for the functionary model before, but the code is very hard to maintain. Edit: people seem to misunderstand my point. What I'm trying to say is: in reality, most models are trained to call tools in Python, so the tooling must be in Python from the beginning. |
Converting llama-3.1 seems to make it set the |
Yes, currently Llama 3.1 8B seems a bit dumber than Llama 3 8B. I do not know whether it is a GGUF problem or llama.cpp itself. For instance, with the same question on https://groq.com/ I always get the proper answer (36), while locally with Llama 3.1 8B (Q8) I hardly get a proper answer once in 5 attempts. |
Do you know what parameters Groq is using? Maybe they have a lower temperature? Edit: just tested with Q8_0 at temp 0.0 and it gave me the correct result each time, but it usually fails at higher temps. |
There seems to be a change in the way RoPE is used, see: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/commit/13f04ed6f85ef2aa2fd11b960a275c3e31a8069e Also, for long contexts the model isn't working unless I use a RoPE base frequency of 8000000 for a 48K context (just an example).
|
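Until the RoPE changes land properly, that workaround amounts to overriding the base frequency at load time. A minimal sketch with the llama-cpp-python bindings (model path and values are taken from the comment above as placeholders, not a verified fix):

```python
# Rough sketch of the workaround described above: override the RoPE base frequency
# when loading the model. Values are illustrative, not tuned.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    n_ctx=49152,               # ~48K context
    rope_freq_base=8000000.0,  # raised base frequency, as suggested in the comment above
)
```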
Same observation here. Not sure if it's an issue with the model or with llama.cpp (tested a Q6_K quant with b3438), but for now 3.1 feels way worse than 3.0: temperature 0 fails with both of those. Tested with an empty system prompt and with "You're a helpful assistant." - neither of those works well. Tried with |
I did some local tests of the Q8_0 8B model in llama.cpp with a 4096 context size, and with a low temperature (0.01) it often enters generation loops, repeating the same sentences over and over. I noticed the same problem with this model when using the OpenRouter API. Attached is an example prompt causing problems: prompt-llama-3.1.txt Command line: It also happens when using the CUDA backend: Did anyone experience similar problems? |
With temp 0 I always get 34. |
Yes... Llama 3.1 8B seems dumber even than Llama 3 8B; something is off... the GGUF, llama.cpp, or both ;) |
The |
I have just converted the model from HF to GGUF and then quantized to Q8 with the following extra options: --leave-output-tensor --token-embedding-type f16. The model seems to be responding quite well, especially since I prompt in Dutch exclusively. |
Investigation has led me to figure out why the smaug-bpe pre-tokenizer was being used instead of llama-bpe. It seems to be a problem with the transformers library not prefixing a BOS token. Example:

```python
from transformers import AutoTokenizer

llama_3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_3_1_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

text = "Hello"
print(llama_3_tokenizer.encode(text))
print(llama_3_1_tokenizer.encode(text))
```

Output:
It seems like the official code prefixes the BOS token. |
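If the 3.1 tokenizer really does skip the BOS prefix, a small workaround sketch is to prepend it manually (same tokenizer as in the snippet above):

```python
# Workaround sketch: prepend the BOS token id ourselves if the tokenizer's
# post-processor does not add it.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
ids = tok.encode("Hello", add_special_tokens=False)
if not ids or ids[0] != tok.bos_token_id:
    ids = [tok.bos_token_id] + ids  # prepend <|begin_of_text|>
print(ids)
```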
ffs ...
Edit: in exchange for function calling it's worth it, I suppose |
Dangerous assumption |
I did a Q6_K quant. First I added the model to convert_hf_to_gguf_update.py and ran it, but still got the smaug pre-tokenizer, so I just replaced smaug with llama in the convert script:
Seems to be doing fine. I don't use the llama.cpp tokenizer for BOS or the chat template; I do BOS+template myself in a modified server, with the exact same template as 3.0. Tests:
Gemma 27b gave this response to the prompt:
Math also looks OK:
The goldcoin thing also works.
|
I ran some quick benches on Llama 3.1 and it does look to be giving a performance boost over 3. As far as I am aware, the long-RoPE changes should not impact these benchmarks, as my max tokens is 2500 for the test (for CoT). Based on these results I think it's running well on llama.cpp for short contexts (I am running version 3428). These benches are my own custom prompts, not the standard evaluation harness. I zero-shot everything and require the model to follow a circularly shifted answer double-check prompt to score a success on all MC (TQA2 and BOOLQ are both A/B MC in my runs). This ensures the model actually solidly knew the answer and did not luck out based on random answer positioning. Gemma 2 9B is still the smartest 8B-class model I have ever run. However, Llama 3.1 with 128k context becomes very interesting once the long-RoPE issue is sorted out. Gemma 2 9B is only 8k context and its context memory has very high overhead (the VRAM/token ratio is high).
|
For the record, I wonder if it's being recognized as smaug-bpe because smaug was the Llama 3 tokenizer but with some changes to the post_processor that match what Llama 3.1 was released with? So they actually tokenize the same way, and that's why the chksum matches. If you look at the tokenizer.json in Llama 3, there's a TemplateProcessing step that doesn't exist in smaug or Llama 3.1. That said, smaug flips the ignore_merges flag, so not sure if that would make a bigger difference... |
The more I look, the more I feel the smaug-bpe thing is a non-factor. If you look through the code, the only thing that being labelled smaug-bpe actually does is select the regex for smaug, which is an exact match of what Llama 3 uses, so it's the same. It just happens that Llama 3.1 tokenizes identically to smaug-bpe instead of Llama 3, but in the end it doesn't actually matter. |
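For context on the checksum matching mentioned above, a rough sketch of how the convert script identifies a pre-tokenizer: it hashes the token ids produced for a fixed test string and looks the digest up in a table. The test string below is a stand-in; the real script uses a much longer one.

```python
# Sketch of the checksum mechanism used to pick a pre-tokenizer name. Llama 3.1
# happened to produce the same digest as smaug-bpe at the time, hence the label.
import hashlib
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
chktxt = "Hello world"  # stand-in for the script's long, fixed test string
chktok = tok.encode(chktxt)
chkhsh = hashlib.sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # the convert script compares a digest like this against its known list
```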
@steampunque can you by chance compare to https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf to see if it's the same? I got the right answer on math and your gold coin question |
This may actually be an Ollama issue with the Modelfile, as the config.json is different than expected; per the paper, it was changed from 3.0 to 3.1 |
Great feature thank you |
I think you're right @bartowski1182! When I try to do what @m18coppola did, the results are not good. But when I just convert to GGUF without changing convert_hf_to_gguf.py, the model seems more intelligent. I think the RoPE settings, which @dranger003 pointed out, might be messing things up for the model's generations. 🤔 |
Could part of the problem be caused by wrong generation parameters? Llama-3.1-8B-Instruct's generation_config.json states: temperature: 0.6. Would this make a difference?
|
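One way to check what the model actually ships with is to load the generation config directly; a small sketch assuming the transformers library and access to the gated repo:

```python
# Print the sampling parameters published in the model's generation_config.json.
from transformers import GenerationConfig

cfg = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(cfg.temperature, cfg.top_p, cfg.do_sample)
```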
Hi guys, can anyone help me out? While trying to load the latest GGUF, which has the fix mentioned by @tristandruyen, I am getting the following error.
Here I have compiled the latest version of llama-server with llama.cpp commit id 01245f5. |
I believe this is expected, the changes will break compatibility forward and backwards |
As @bartowski1182 already said, this is expected: commit 01245f5 is the latest master and does not include the RoPE scaling fixes from #8676. Follow the steps from here to add the fixes into your local llama.cpp. |
I made some experiments for the 8B quantized base model:

Quantization starting from FP16
```
git lfs install
```
Perplexity:
```
./llama-perplexity -m Meta-Llama-3.1-8B.FP16.gguf -f wikitext-2-raw/wiki.test.raw
```

Quantization starting from BF16 (UPDATE)
```
git lfs install
```
Perplexity:
```
./llama-perplexity -m Meta-Llama-3.1-8B-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw
```
|
Can you also add IQ1xx, IQ2xx, IQ3xx and IQ4xx? |
I will try in the next few days. Right now I am processing some additional Q3, Q4 and the Q5s. |
I think you would need to convert to bf16 or fp32 to have better precision, instead of fp16 |
You are right, but the difference should be very small; a comparison between fp16 and bf16 was done for Llama 3 and the difference was negligible. I think I will repeat the experiments. At least it will be interesting to compare the results of quantization starting from fp16 with quantization starting from fp32/bf16. |
Am I the only one who still sees it, even now? |
@bopm you're seeing it work locally or not work locally? If it's not working, can you provide your exact commands? |
@bartowski1182 never mind, it seems like an Ollama issue, in fact. |
I reported it to the Ollama repo, as it did a pretty decent job for me on a single run, but now it's not feeling good for llama.cpp either. Details are in the issue here.
|
Does your command look similar to this?
|
With these exact params, it's still hallucinating results like 17, 30, 29, and so on. |
Try with temp 0. |
Maybe try an imatrix quant? My imatrix Q4_K_M gets this right every time, even without a low temperature. |
previous comment updated with the BF16 experiments. |
|
Yep, way better; it was only mistaken on the first run, giving me 32, then a stable 36 on all subsequent retries. |
with -temp 0? |
With |
Hello, I have a little question: where is llama-quantize? Do I need to build it myself? |
You can call it from the llama.cpp directory; see #7809 |
FYI after merging #8858 it's now possible to handle |
It's also possible to use custom tool calls (https://huggingface.co/blog/llama31#built-in-tool-calling), avoiding the need for the ipython shell and the eom stuff. Ask the model to produce a Python code block based on your query, extract it, run it, and send its output back into the conversation as described in the link. Works fine: bash-5.1$ ./lmf Is it hotter in NYC, Austin, or Houston right now. |
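A rough sketch of that extract-run-feed-back loop; generate() here is a placeholder for whatever completion call you use (llama-server HTTP API, llama-cpp-python, ...), not a real llama.cpp function:

```python
# Illustrative sketch of the "extract a python block, run it, feed the output back"
# loop described above. generate() is a hypothetical completion callable.
import re
import subprocess

def run_tool_turn(generate, user_query: str) -> str:
    reply = generate(f"Write a python code block that answers: {user_query}")
    fence = "`" * 3
    match = re.search(fence + r"python\n(.*?)" + fence, reply, re.DOTALL)
    if match is None:
        return reply  # the model answered directly, no tool call needed
    code = match.group(1)
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True)
    # Send the tool output back as a follow-up turn and let the model summarise it.
    return generate(
        f"The code produced this output:\n{result.stdout}\n"
        f"Answer the question: {user_query}"
    )
```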
Update the generation_config file. This is a template issue, not a model issue, with Llama 3.1 |
Prerequisites
Feature Description
Llama 3.1 was just released and it is a significant leg up from the previous series of models: https://huggingface.co/blog/llama31
Whilst the overall architecture is the same, it requires some modelling updates, primarily around RoPE scaling: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298
It'd be great to add support for those so that the generations are more coherent and make sense.
Motivation
Note: without the modelling changes, the generations might look coherent, but they are far from great and far from the true potential of the model!
Possible Implementation
Here's the corresponding transformers implementation: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298
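For reference, a minimal sketch of the frequency-scaling scheme that the linked transformers function implements; the default factors below are the values published with the Llama 3.1 config and are assumptions here rather than something taken from this repo:

```python
# Sketch of Llama 3.1's "low/high frequency" RoPE scaling, mirroring the linked
# transformers implementation. Short-wavelength frequencies are kept, long ones are
# scaled down by `factor`, and the band in between is smoothly interpolated.
import math

def llama31_scale_rope_freqs(inv_freqs, factor=8.0, low_freq_factor=1.0,
                             high_freq_factor=4.0, old_context_len=8192):
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    scaled = []
    for freq in inv_freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)               # high-frequency band: keep as-is
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)      # low-frequency band: scale down fully
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled
```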