
I finally found proof that server output can be different (now also compared vs groq) - model name: llama3 8b Instruct #6955

Closed
x4080 opened this issue Apr 28, 2024 · 13 comments


@x4080

x4080 commented Apr 28, 2024

Hi, I don't know if this is a bug or not. Previously I noticed that the answer from the server is different from regular llama.cpp.
Now I can prove it; here goes:

First, this is using regular llama.cpp (and also the output from groq):

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>"

testprompt.txt

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
<instruction>
-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :
{"function":"<function name you want to call>", params:<parameters for the function>}
-Dont answer user's query yourself, always use agent calls
-You can only call functions provided below :
<functions>
- ask_expertcoder_agent : {
    "description":"Always use this ask programming or code related queries",
    "parameters": {
        {
            "request":"user's request"
            "type":"string"
        }
    }
    "required":["request"]
<functions/>
<instruction/>

<chatHistory>
</chatHistory>
<|eot_id|><|start_header_id|>user<|end_header_id|>
how to display current date in dd/mm/yyyy format using python
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Output:

{"function":"ask_expertcoder_agent", "params": {"request": "How to display current date in dd/mm/yyyy format using Python?"}}

Here's using the server:

./build/bin/./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -c 4096 -ngl 99 -t 1 --chat-template llama3 

Here's the JSON used to call it:

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself, always use agent calls\\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}
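
For reference, here is a minimal sketch (not from the original report) of how such a request body can be POSTed to the server's OpenAI-compatible chat endpoint, assuming the default http://localhost:8080 address; the system prompt is abbreviated:

# Minimal sketch, assuming the llama.cpp server above is listening on the
# default port 8080 and exposes the OpenAI-compatible /v1/chat/completions route.
# The system prompt placeholder must be replaced with the full text shown above.
import json
import requests

payload = {
    "model": "gpt-3.5-turbo",   # ignored by the llama.cpp server, as noted later in this thread
    "temperature": 0.1,
    "top_p": 1,
    "top_k": 5,
    "min_p": 0.1,
    "repeat_penalty": 1.1,
    "stream": False,
    "messages": [
        {"role": "system", "content": "...system prompt from above..."},
        {"role": "user", "content": "how to display current date in dd/mm/yyyy format using python"},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(json.dumps(resp.json(), indent=2))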

Here's the output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "You can use the `datetime` module in Python to display the current date in dd/mm/yyyy format. Here's an example:\n```\nimport datetime\n\ncurrent_date = datetime.date.today()\nformatted_date = current_date.strftime(\"%d/%m/%Y\")\n\nprint(formatted_date)\n```\nThis will output the current date in the format dd/mm/yyyy.\n\nHere's a breakdown of how it works:\n\n* `datetime.date.today()` returns the current date as a `date` object.\n* `strftime()` is a method that formats the date object into a string. The `%d/%m/%Y` format code specifies the desired output format:\n\t+ `%d`: day of the month (01-31)\n\t+ `%m`: month (01-12)\n\t+ `%Y`: year (in four digits)\n\nBy combining these format codes, you get the desired output format: dd/mm/yyyy.\n\nYou can also use the `datetime.datetime.now()` function to get the current date and time, if needed:\n```\nimport datetime\n\ncurrent_datetime = datetime.datetime.now()\nformatted_date = current_datetime.strftime(\"%d/%m/%Y %H:%M:%S\")\n\nprint(formatted_date)\n```\nThis will output the current date and time in the format dd/mm/yyyy HH:MM:SS.",
        "role": "assistant"
      }
    }
  ],
  "created": 1714266899,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 263,
    "prompt_tokens": 186,
    "total_tokens": 449
  },
  "id": "chatcmpl-pphq2J7AlzKpo5ynIdbsQ74EPm9VadRi"
}

So basically the function-call JSON is not generated; instead the model directly writes the code.

I understand it seems impossible that results can differ between the server and regular llama.cpp, but it did happen.

PS: I also tried ollama, and its output is like the server's.

Is this a bug?

Thanks

@phymbert
Collaborator

You are not using any seed, are you?

@Jeximo
Contributor

Jeximo commented Apr 28, 2024

Now I can prove it, here goes

@x4080 -s SEED, --seed SEED: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

Please remove the RNG randomness by setting a seed, e.g. --seed 7.

@x4080
Author

x4080 commented Apr 28, 2024

Hi, I didn't use any seed, so should I add a seed instead?

Edit: I tried using a seed on both, and the results still don't change:

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>" --seed 1

and

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "seed":1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}

@Jeximo
Contributor

Jeximo commented Apr 28, 2024

Why do you expect the server to call the gpt-3.5 JSON?

Maybe try curl --request POST --url http://localhost:8080/completion --data '{"prompt": "whats 5+5?", "temperature": 0, "seed": 1, "n_predict": 128}' for server.

@x4080
Author

x4080 commented Apr 29, 2024

The model name is ignored by the llama.cpp server; I use it because this code used to call the ChatGPT API.

@x4080
Author

x4080 commented Apr 29, 2024

Ok, today I tried the new GGUF fix https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF and updated llama.cpp,
and now the server and regular llama.cpp results are the same without a seed. Maybe I can close this for now.

@x4080 x4080 closed this as completed Apr 29, 2024
@mirekphd

mirekphd commented May 17, 2024

I think this was closed prematurely, @Jeximo. Simple math questions with only one correct answer, and with sampling turned off by zero temperature, are all too easy: they yield predictably fixed answers even if the custom seed value does not get passed to the inference engine (which I believe it wasn't).

Kindly rerun your test with something like tweet generation instead of math operations (repeated several times with the same prompt and seed), with the exact opposite, i.e. maximized, temperature (and possibly also top_p). Please also look into the HTTP client logs to see whether the seed is being set to the expected values.
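
For example, a minimal repeatability check along those lines might look like this (a sketch, not from the thread; the prompt, sampling values, and default server address are assumptions):

# Minimal sketch: send the same high-temperature request with a fixed seed
# several times and check whether the server returns identical text each time.
import requests

payload = {
    "temperature": 1.0,   # maximized temperature, as suggested above
    "top_p": 1.0,
    "seed": 42,           # fixed seed that should make sampling deterministic
    "stream": False,
    "messages": [{"role": "user", "content": "Write a short tweet about spring."}],
}

outputs = set()
for _ in range(5):
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
    outputs.add(resp.json()["choices"][0]["message"]["content"])

# If the seed is honored, all five completions should be identical.
print(f"{len(outputs)} distinct completion(s) out of 5")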

@x4080
Author

x4080 commented May 17, 2024

@mirekphd, I found out what makes the results differ between the server and regular llama.cpp:

  • the system and user prompt can make a difference, depending on whether a newline is added or not
  • seed and temperature may lower the variance
  • for JSON results, I found that using completion instead of chat completion is better, since we can add a prefix to what the LLM should generate (see the sketch after this list)
  • but sometimes I still get different results because of the "instability" of the LLM (maybe because of quantization?); for example, where the chat history is placed in the system prompt makes a difference

edit: and the repeat penalty
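
To illustrate the completion-with-prefix point, here is a minimal sketch (an assumption-based example, not from the thread) using the raw /completion endpoint shown earlier, with the prompt ending in a JSON prefix so the model continues it:

# Minimal sketch: end the prompt with '{"function":' so the model is nudged to
# continue the JSON object instead of answering the question directly.
# The Llama 3 special tokens follow the prompt file shown earlier in this issue.
import requests

system = "...system prompt from the report above..."
user = "how to display current date in dd/mm/yyyy format using python"

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    '{"function":'   # the prefix the model should continue
)

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "temperature": 0.1, "n_predict": 128, "stop": ["<|eot_id|>"]},
)
# Re-attach the prefix to reconstruct the full JSON string.
print('{"function":' + resp.json()["content"])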

@mirekphd

@x4080 I think the reason for the model response randomness is even simpler here (in the llama.cpp server): the custom seed, when passed to the REST API, is not used by the server, which can even be seen in the client logs. In contrast, in the Python package llama_cpp_python (with a locally accessed llama.cpp backend, without API calls), deterministic responses (over multiple repeats of the test inference) work correctly: it's sufficient to fix the seed to get the same results, regardless of temperature or top_p settings.
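
A minimal sketch of that local, seeded setup with llama-cpp-python (the model path is the one from this thread; the prompt and sampling values are illustrative, and the exact keyword set depends on the installed version):

# Minimal sketch: llama-cpp-python with a fixed seed set at model load time.
# Running this script twice should print identical text if seeding works as
# described above, even at a high temperature.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    seed=1,          # fixed RNG seed for sampling
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short tweet about spring."}],
    temperature=1.0,
    top_p=1.0,
)
print(out["choices"][0]["message"]["content"])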

@x4080
Author

x4080 commented May 18, 2024

@mirekphd that's interesting, I didn't know that about the seed, so is it a feature or a bug? What I found out is about the repeat penalty: in the docs I think the default is 1.1, but in fact it is 1.0

@mirekphd

@mirekphd that's interesting, I didn't know that about the seed, so is it a feature or a bug? What I found out is about the repeat penalty: in the docs I think the default is 1.1, but in fact it is 1.0

Yes, I can confirm your observation of repeat_penalty being too low (1.0 instead of the expected documented 1.1).

Could you file a bug report for this? And I will report the issue with seed. My case is arguably easier to prove by its unwanted side effects (i.e. randomness of responses), as I have achieved reproducibility (even for high temperature and top_p) by simply fixing the seed through the Python package (local binding, without client-server communication).

On the other hand, while harder to prove, your finding is arguably more serious, because it affects all users of the high-level OpenAI API, where the repeat_penalty argument is not even exposed by the API, so there is no easy workaround for this, apart from dropping the openai package altogether and switching to some lower-level / generic HTTP client.
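
For completeness, a minimal sketch of that lower-level workaround: send repeat_penalty directly in the request body of the server's OpenAI-compatible endpoint (as the request JSON earlier in this issue already does), bypassing the openai client; the address and values here are assumptions:

# Minimal sketch: set repeat_penalty explicitly via a plain HTTP request,
# since the openai client does not expose this sampling parameter.
import requests

payload = {
    "temperature": 0.1,
    "repeat_penalty": 1.1,   # set explicitly instead of relying on the server default
    "stream": False,
    "messages": [{"role": "user", "content": "how to display current date in dd/mm/yyyy format using python"}],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])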

@mirekphd

And I will report the issue with seed.

I did it here: #7381

@x4080
Author

x4080 commented May 19, 2024

I think I filed an issue weeks ago: #7109
