Bug: Phi 3.5 mini produces garbage past 4096 context #9127
Comments
For conversation, the server is working fine with Phi-3.5 quantized to 4 bits. But after a while it started outputting tons of blank lines and garbage when told to make a simple HTML page. Hitting the [Reset] button on the chat server's Gradio page at localhost:8080 fixed it for now, and it makes great web pages again. The only thing I can guess is that unusual prompt formats from using other models corrupted the chat history somehow, but I have no way to look into the (now cleared) chat history to check. Will keep testing!
Are you using flash attention or not? I've seen that without flash attention the output is garbage, but with it, it's coherent.
I found that with -fa turned on it was running super slow, but still outputting garbage. Right now it's off and stable at 4096.
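For anyone trying to reproduce the two configurations discussed above, a launch command along these lines should work. This is a sketch, not a confirmed repro: the model filename and port are placeholders, and the flags (`-m`, `-c`, `-fa`, `--port`) are llama.cpp's standard server options.

```shell
# Sketch of the configuration reported as stable above:
# flash attention off, context capped at 4096.
# Model path and port are placeholders.
./llama-server \
  -m models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -c 4096 \
  --port 8080

# To exercise the flash-attention path instead, add -fa:
./llama-server \
  -m models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -c 4096 \
  -fa \
  --port 8080
```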
To do: test whether this is fixed by #9396
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
Phi 3.5 mini doesn't produce <|end|> or <|endoftext|> when the context is set higher than 4096, just endless garbage tokens. Possibly a RoPE scaling issue?
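One way to probe the RoPE-scaling hypothesis is to override the scaling on the command line and see whether the garbage past 4096 tracks it. This is a debugging probe under assumptions, not a fix: the model path is a placeholder, and the `--rope-scaling` / `--rope-freq-scale` flags are llama.cpp's generic RoPE overrides (Phi 3.5's native long-context scheme is read from the GGUF metadata, so a forced linear override only isolates the scaling path).

```shell
# Debugging probe: run with an 8192 context and a forced linear RoPE override.
# --rope-freq-scale 0.5 compresses positions 2x, matching the doubled context.
# Model path is a placeholder.
./llama-server \
  -m models/Phi-3.5-mini-instruct-Q4_K_M.gguf \
  -c 8192 \
  --rope-scaling linear \
  --rope-freq-scale 0.5
```

If output stays coherent with the override but degrades without it, that would point at the model's built-in scaling metadata rather than the attention kernels.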
Name and Version
llama-server, recent compile
What operating system are you seeing the problem on?
No response
Relevant log output
No response