
I finally found proof that server output can be different (now also compared vs groq) - model name: llama3 8b Instruct #6955

Closed
x4080 opened this issue Apr 28, 2024 · 13 comments


@x4080

x4080 commented Apr 28, 2024

Hi, I don't know if this is a bug or not. Previously I noticed that the answer from the server is different from regular llama.cpp.
Now I can prove it; here goes:

First, this is using regular llama.cpp (and also the output from groq):

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>"

testprompt.txt

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
<instruction>
-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :
{"function":"<function name you want to call>", params:<parameters for the function>}
-Dont answer user's query yourself, always use agent calls
-You can only call functions provided below :
<functions>
- ask_expertcoder_agent : {
    "description":"Always use this ask programming or code related queries",
    "parameters": {
        {
            "request":"user's request"
            "type":"string"
        }
    }
    "required":["request"]
<functions/>
<instruction/>

<chatHistory>
</chatHistory>
<|eot_id|><|start_header_id|>user<|end_header_id|>
how to display current date in dd/mm/yyyy format using python
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Output:

{"function":"ask_expertcoder_agent", "params": {"request": "How to display current date in dd/mm/yyyy format using Python?"}}

Here's using the server:

./build/bin/./server -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -c 4096 -ngl 99 -t 1 --chat-template llama3 

Here's the JSON used to call it:

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself, always use agent calls\\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}
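
For reference, here is a minimal sketch (not from the original report) of how such a request body can be POSTed to the server's OpenAI-compatible chat endpoint, assuming the default http://localhost:8080 address; the system prompt is abbreviated:

# Minimal sketch, assuming the llama.cpp server above is listening on the
# default port 8080 and exposes the OpenAI-compatible /v1/chat/completions route.
# The system prompt placeholder must be replaced with the full text shown above.
import json
import requests

payload = {
    "model": "gpt-3.5-turbo",   # ignored by the llama.cpp server, as noted later in this thread
    "temperature": 0.1,
    "top_p": 1,
    "top_k": 5,
    "min_p": 0.1,
    "repeat_penalty": 1.1,
    "stream": False,
    "messages": [
        {"role": "system", "content": "...system prompt from above..."},
        {"role": "user", "content": "how to display current date in dd/mm/yyyy format using python"},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(json.dumps(resp.json(), indent=2))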

Here's the output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "You can use the `datetime` module in Python to display the current date in dd/mm/yyyy format. Here's an example:\n```\nimport datetime\n\ncurrent_date = datetime.date.today()\nformatted_date = current_date.strftime(\"%d/%m/%Y\")\n\nprint(formatted_date)\n```\nThis will output the current date in the format dd/mm/yyyy.\n\nHere's a breakdown of how it works:\n\n* `datetime.date.today()` returns the current date as a `date` object.\n* `strftime()` is a method that formats the date object into a string. The `%d/%m/%Y` format code specifies the desired output format:\n\t+ `%d`: day of the month (01-31)\n\t+ `%m`: month (01-12)\n\t+ `%Y`: year (in four digits)\n\nBy combining these format codes, you get the desired output format: dd/mm/yyyy.\n\nYou can also use the `datetime.datetime.now()` function to get the current date and time, if needed:\n```\nimport datetime\n\ncurrent_datetime = datetime.datetime.now()\nformatted_date = current_datetime.strftime(\"%d/%m/%Y %H:%M:%S\")\n\nprint(formatted_date)\n```\nThis will output the current date and time in the format dd/mm/yyyy HH:MM:SS.",
        "role": "assistant"
      }
    }
  ],
  "created": 1714266899,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 263,
    "prompt_tokens": 186,
    "total_tokens": 449
  },
  "id": "chatcmpl-pphq2J7AlzKpo5ynIdbsQ74EPm9VadRi"
}

So basically the function-call JSON is not generated; instead the model directly writes the code.

I understand it seems impossible that results can differ between the server and regular llama.cpp, but it did happen.

PS: I also tried ollama, and its output is like the server's.

Is this a bug?

Thanks

@phymbert
Collaborator

You are not using any seed, are you?

@Jeximo
Contributor

Jeximo commented Apr 28, 2024

Now I can prove it, here goes

@x4080 -s SEED, --seed SEED: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

Please remove the RNG randomness by setting a seed, e.g. --seed 7.

@x4080
Author

x4080 commented Apr 28, 2024

Hi, I didn't use any seed, so should I add a seed instead?

Edit: I tried using a seed on both, and the results still don't change:

./main -ngl 99 -t 0 -m ./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --color --temp 0.1 -n -1 -f prompts/testprompt.txt --min_p 0.1 --top_p 1 --top_k 50 --repeat_penalty 1.1 -c 4096 -r "<|eot_id|>" --seed 1

and

{
  "model": "gpt-3.5-turbo",
  "temperature": 0.1,
  "top_p": 1,
  "top_k": 5,
  "min_p": 0.1,
  "repeat_penalty": 1.1,
  "seed":1,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "-You are an agents manager with agent call functionality to delegate user's query to appropriate agent. Output only in json format without any additional text :\\n{\"function\":\"<function name you want to call>\", params:<parameters for the function>}\\n-Dont answer user's query yourself\n-You can only call functions provided below :\\n<functions>\\n- ask_expertcoder_agent : {\\n    \"description\":\"Always use this ask programming or code related queries\",\\n    \"parameters\": {\\n        {\\n            \"request\":\"user's request\"\\n            \"type\":\"string\"\\n        }\\n    }\\n    \"required\":[\"request\"]\\n<functions/>\\n<instruction/>\\n\\n<chatHistory>\\n</chatHistory>"
    },
    {
      "role": "user",
      "content": "how to display current date in dd/mm/yyyy format using python"
    }
  ]
}

@Jeximo
Contributor

Jeximo commented Apr 28, 2024

Why do you expect the server to call the gpt-3.5 JSON?

Maybe try curl --request POST --url http://localhost:8080/completion --data '{"prompt": "whats 5+5?", "temperature": 0, "seed": 1, "n_predict": 128}' for server.

@x4080
Author

x4080 commented Apr 29, 2024

The model name is ignored by the llama.cpp server; I use it because this code used to call the ChatGPT API.

@x4080
Author

x4080 commented Apr 29, 2024

Ok, today I tried the new GGUF fix https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF and updated llama.cpp,
and now the server and regular llama.cpp results are the same without a seed. Maybe I can close this for now.

@x4080 x4080 closed this as completed Apr 29, 2024
@mirekphd

mirekphd commented May 17, 2024

I think this was closed prematurely, @Jeximo. Simple math questions with only one correct answer, and with sampling turned off by zero temperature, are all too easy: they yield predictably fixed answers even if the custom seed value does not get passed to the inference engine (which I believe it wasn't).

Kindly rerun your test with something like tweet generation instead of math operations (repeated several times with the same prompt and seed), with the exact opposite, i.e. maximized, temperature (and possibly also top_p). Please also look into the HTTP client logs to see whether the seed is being set to the expected values.
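
For example, a minimal repeatability check along those lines might look like this (a sketch, not from the thread; the prompt, sampling values, and default server address are assumptions):

# Minimal sketch: send the same high-temperature request with a fixed seed
# several times and check whether the server returns identical text each time.
import requests

payload = {
    "temperature": 1.0,   # maximized temperature, as suggested above
    "top_p": 1.0,
    "seed": 42,           # fixed seed that should make sampling deterministic
    "stream": False,
    "messages": [{"role": "user", "content": "Write a short tweet about spring."}],
}

outputs = set()
for _ in range(5):
    resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
    outputs.add(resp.json()["choices"][0]["message"]["content"])

# If the seed is honored, all five completions should be identical.
print(f"{len(outputs)} distinct completion(s) out of 5")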

@x4080
Author

x4080 commented May 17, 2024

@mirekphd, I found out what makes the results differ between the server and regular llama.cpp:

  • the system and user prompt can make a difference, depending on whether a newline is added or not
  • seed and temperature may lower the variance
  • for JSON results, I found that using completion instead of chat completion is better, since we can add a prefix to what the LLM should generate (see the sketch after this list)
  • but sometimes I still get different results because of the "instability" of the LLM (maybe because of quantization?); for example, where the chat history is placed in the system prompt makes a difference

edit: and the repeat penalty
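
To illustrate the completion-with-prefix point, here is a minimal sketch (an assumption-based example, not from the thread) using the raw /completion endpoint shown earlier, with the prompt ending in a JSON prefix so the model continues it:

# Minimal sketch: end the prompt with '{"function":' so the model is nudged to
# continue the JSON object instead of answering the question directly.
# The Llama 3 special tokens follow the prompt file shown earlier in this issue.
import requests

system = "...system prompt from the report above..."
user = "how to display current date in dd/mm/yyyy format using python"

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    '{"function":'   # the prefix the model should continue
)

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "temperature": 0.1, "n_predict": 128, "stop": ["<|eot_id|>"]},
)
# Re-attach the prefix to reconstruct the full JSON string.
print('{"function":' + resp.json()["content"])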

@mirekphd

@x4080 I think the reason for the model response randomness is even simpler here (in the llama.cpp server): the custom seed, when passed to the REST API, is not used by the server, which can even be seen in the client logs. In contrast, in the Python package llama_cpp_python (with a locally accessed llama.cpp backend, without API calls), deterministic responses (over multiple repeats of the test inference) work correctly: it's sufficient to fix the seed to get the same results, regardless of temperature or top_p settings.
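
A minimal sketch of that local, seeded setup with llama-cpp-python (the model path is the one from this thread; the prompt and sampling values are illustrative, and the exact keyword set depends on the installed version):

# Minimal sketch: llama-cpp-python with a fixed seed set at model load time.
# Running this script twice should print identical text if seeding works as
# described above, even at a high temperature.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    seed=1,          # fixed RNG seed for sampling
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a short tweet about spring."}],
    temperature=1.0,
    top_p=1.0,
)
print(out["choices"][0]["message"]["content"])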

@x4080
Author

x4080 commented May 18, 2024

@mirekphd that's interesting, I didn't know that about the seed, so is it a feature or a bug? What I found out is about the repeat penalty: in the docs I think the default is 1.1, but in fact it is 1.0

@mirekphd

@mirekphd that's interesting, I didn't know that about the seed, so is it a feature or a bug? What I found out is about the repeat penalty: in the docs I think the default is 1.1, but in fact it is 1.0

Yes, I can confirm your observation of repeat_penalty being too low (1.0 instead of the expected documented 1.1).

Could you file a bug report for this? And I will report the issue with seed. My case is arguably easier to prove by its unwanted side effects (i.e. randomness of responses), as I have achieved reproducibility (even for high temperature and top_p) by simply fixing the seed through the Python package (local binding, without client-server communication).

On the other hand, while harder to prove, your finding is arguably more serious, because it affects all users of the high-level OpenAI API, where the repeat_penalty argument is not even exposed by the API, so there is no easy workaround for this, apart from dropping the openai package altogether and switching to some lower-level / generic HTTP client.
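
For completeness, a minimal sketch of that lower-level workaround: send repeat_penalty directly in the request body of the server's OpenAI-compatible endpoint (as the request JSON earlier in this issue already does), bypassing the openai client; the address and values here are assumptions:

# Minimal sketch: set repeat_penalty explicitly via a plain HTTP request,
# since the openai client does not expose this sampling parameter.
import requests

payload = {
    "temperature": 0.1,
    "repeat_penalty": 1.1,   # set explicitly instead of relying on the server default
    "stream": False,
    "messages": [{"role": "user", "content": "how to display current date in dd/mm/yyyy format using python"}],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])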

@mirekphd

And I will report the issue with seed.

I did it here: #7381

@x4080
Author

x4080 commented May 19, 2024

I think I filed an issue weeks ago: #7109
