llama.cpp Discussions #23
Replies: 3 comments 2 replies
-
Just FYI, llama.cpp received a big bug fix in the last hour that significantly improves tokens-per-second performance at large context lengths 🔥
-
What version of llama.cpp is this Python binding using? On my machine, the model loads in 3 seconds using llama.cpp directly, but 40 seconds using this Python binding.
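For comparison, here is a quick sketch that times only the model load through the binding (the `Llama` class name and the model path below are assumptions about the high-level API and may differ by version):

```python
import time

from llama_cpp import Llama  # assumed high-level class of this binding

start = time.perf_counter()
# Hypothetical model path -- substitute your own ggml model file.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")
print(f"Model load took {time.perf_counter() - start:.1f}s")
```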
-
Any way to bring "-n N, --n_predict N: number of tokens to predict" to the Python interface? Without it, generation stops after a fixed maximum number of tokens rather than returning the full content. It defaults to 128, but it should be possible to set it to -1 to get the full output.
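A minimal sketch of what this could look like through the binding's high-level API, assuming a `max_tokens` argument that plays the role of `--n_predict` (the exact parameter name and the "-1 means unlimited" behavior are assumptions, not confirmed against the current API):

```python
from llama_cpp import Llama

# Assumed constructor/call signature; details may differ between versions.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")

output = llm(
    "Q: Name the planets in the solar system. A:",
    max_tokens=-1,  # assumption: -1 = no fixed limit, mirroring --n_predict -1
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```

With the default limit (128 at the time of this discussion), long answers get truncated; lifting the limit lets the model run until it emits an end-of-sequence token or hits the context window.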
-
I created this as a place to discuss llama.cpp-specific info that doesn't directly require action for this repo, but is still related :)

Performance-wise, there's some cool stuff being investigated related to generation speed as the context grows. It looks like there's a significant drop in tokens/s that was introduced in the last couple of weeks. If they can find the cause, generation should be much faster :D
ggerganov/llama.cpp#603
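For anyone who wants to check whether they're hit by this from the Python side, a rough sketch that times generation while the context grows (the model path, `n_ctx`, and the OpenAI-style `usage` field are assumptions about the binding's high-level API and may differ by version):

```python
import time

from llama_cpp import Llama  # assumed high-level API

# Hypothetical model path and context size, purely for illustration.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=2048)

prompt = "The quick brown fox"
for step in range(4):
    start = time.perf_counter()
    out = llm(prompt, max_tokens=64)
    elapsed = time.perf_counter() - start

    completion = out["choices"][0]["text"]
    generated = out["usage"]["completion_tokens"]  # assumes OpenAI-style usage field
    prompt += completion  # grow the context for the next round

    # Rough throughput: includes prompt processing, so it understates pure generation speed.
    print(f"step {step}: prompt ~{len(llm.tokenize(prompt.encode()))} tokens, "
          f"{generated / elapsed:.1f} tokens/s")
```

If the upstream regression is present, the tokens/s figure should drop noticeably as the prompt grows.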