Bug: uncached prompt is not used for penalty #8971
Comments
After some more testing I discovered that this bug is even worse than I described above.

How to test: the context size is 256 tokens. Unless I'm misunderstanding something, this breaks all the penalties for almost all use-cases. Tested on 2fb9267.
The bug described in the original issue is caused by a faulty ring-buffer implementation here (line 454 in 82e3b03). It basically always throws out the last token, so when you start with a clean buffer, there will always be just 1 token in the buffer.

I would wait for #9294 to land before trying to address this. At least that PR seems to fix the ring_buffer ejection issue.
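For illustration, here is a minimal, self-contained sketch (assumed structure and names, not the actual llama.cpp ring_buffer) of one way a buggy push can produce the symptom described above, where the buffer never holds more than one token:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical ring buffer, for illustration only (not the llama.cpp implementation).
struct ring_buffer {
    std::vector<int> data;
    size_t capacity;
    size_t first = 0; // index of the oldest element
    size_t count = 0; // number of stored elements

    explicit ring_buffer(size_t cap) : data(cap), capacity(cap) {}

    // Buggy push in the spirit of the report: the new token always lands in the
    // slot of the most recent element, so the buffer never grows past one entry
    // ("it basically always throws out the last token").
    void push_back_buggy(int token) {
        size_t pos = count == 0 ? 0 : (first + count - 1) % capacity;
        data[pos] = token;          // overwrites the newest element instead of appending
        if (count == 0) count = 1;  // count is never incremented past 1
    }

    // Correct push for comparison: append while there is room, evict the oldest when full.
    void push_back(int token) {
        if (count == capacity) {
            data[first] = token;
            first = (first + 1) % capacity;  // drop the oldest token
        } else {
            data[(first + count) % capacity] = token;
            count++;
        }
    }
};

int main() {
    ring_buffer rb(4);
    for (int t = 1; t <= 3; t++) {
        rb.push_back_buggy(t);
    }
    printf("tokens in buffer after 3 buggy pushes: %zu\n", rb.count); // prints 1
}
```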
I think this is fixed on latest master. I tried the instructions above and the printf shows the correct number. Let us know if you spot any other issues.
Yes, the ring_buffer issue is fixed now. However, the first one (with the prompt not being used for penalty) is not. The output is slightly different now (no more zero token), but the result is effectively the same.

```sh
curl -s --data '{"prompt": "Note that the file, line, and message properties are", "n_predict": 4, "repeat_penalty": 1.1, "cache_prompt": true}' http://127.0.0.1:8080/completion > /dev/null
```

On the first try, only the new tokens show up in the debug output. On the second try (exactly the same query), it properly includes all the prior context. Tested on 49006c6.
What happened?
Sometimes the part of the initial prompt that should be considered for the penalties is ignored. Only the newly generated tokens are used for calculating penalty. For now I can assume it has something to do with the prompt caching (explained below).
Let's add the following debug code to `llama_sample_repetition_penalties_impl`, right after the `token_count` map is filled in. It will show the tokens that will be used for the penalty calculation.
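The snippet itself was lost when this page was captured; a sketch along these lines (assuming `token_count` is the `std::unordered_map<llama_token, int>` that the function fills from the recent tokens) would produce that kind of dump:

```cpp
// Reconstructed debug snippet, for illustration only; the exact code from the
// report was not preserved. Paste it inside llama_sample_repetition_penalties_impl,
// right after token_count is populated (add #include <cstdio> if it is not
// already pulled in).
printf("penalty sees %zu distinct tokens:", token_count.size());
for (const auto & kv : token_count) {
    printf(" %d(x%d)", kv.first, kv.second); // token id and its occurrence count
}
printf("\n");
```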
After starting the server and running a request like this:
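(The exact command was not preserved in this copy of the page; the curl below is an illustrative `/completion` call with a made-up prompt, in the same style as the one quoted elsewhere in the thread.)

```sh
# Illustrative request only; the original command from the report was not preserved.
curl -s --data '{"prompt": "The quick brown fox", "n_predict": 8, "repeat_penalty": 1.1}' \
    http://127.0.0.1:8080/completion
```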
the server log shows that the initial prompt is ignored and only the new tokens are used for the penalty.
However, if I run the exact same query a second time, the log now shows all the initial tokens plus one new token at each step.
The bug has something to do with the prompt caching, because it does not happen when the cached prompt is used. But it happens in all other cases, e.g.:

- `cache_prompt = false`
I tested it with CUDA/no-CUDA builds and two different models - the results are the same.
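To make concrete why the missing prompt tokens matter, here is a minimal, self-contained sketch (illustrative code, not taken from llama.cpp) of a simplified repeat-penalty rule applied over a token history. Only tokens that appear in the supplied history are penalized, so a history that excludes the prompt leaves every prompt token at full strength:

```cpp
#include <cstdio>
#include <vector>

// Simplified repeat penalty (illustrative): divide positive logits of tokens
// that occur in `history` by `penalty`, multiply negative ones.
static void apply_repeat_penalty(std::vector<float> & logits,
                                 const std::vector<int> & history,
                                 float penalty) {
    for (int tok : history) {
        float & l = logits[tok];
        l = l > 0.0f ? l / penalty : l * penalty;
    }
}

int main() {
    std::vector<float> logits_buggy    = {2.0f, 2.0f, 2.0f, 2.0f};
    std::vector<float> logits_expected = logits_buggy;

    const std::vector<int> prompt    = {0, 1}; // tokens that came from the prompt
    const std::vector<int> generated = {2};    // tokens generated so far

    // Behaviour described in the issue: only generated tokens are in the history.
    apply_repeat_penalty(logits_buggy, generated, 1.1f);

    // Expected behaviour: prompt + generated tokens are in the history.
    std::vector<int> full = prompt;
    full.insert(full.end(), generated.begin(), generated.end());
    apply_repeat_penalty(logits_expected, full, 1.1f);

    // Token 0 came from the prompt: unpenalized in the buggy case.
    printf("token 0 logit: buggy=%.3f expected=%.3f\n",
           logits_buggy[0], logits_expected[0]); // buggy=2.000 expected=1.818
}
```

With the truncated history, a `repeat_penalty` of 1.1 has no effect on anything that came from the prompt, which matches the behaviour observed in the logs above.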
Name and Version
```
./llama-server --version
version: 3565 (6e02327)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
```
What operating system are you seeing the problem on?
Linux
Relevant log output
No response