sampling : refactor + optimize penalties sampler #10803
Conversation
I don't think there is anything wrong with keeping …
Force-pushed from 9d0f210 to 869ec41
Restored the …
There is a pending issue from the initial refactor that setting a … (see lines 886 to 892 in 8faa1d4).
Decided to move the penalties sampler to the end of the default sampling chain. This will change the default behaviour compared to …
Please don't. This will substantially change default sampler behavior. With DRY, I have always recommended putting it before truncation samplers because then sufficiently penalized tokens get truncated away, which matches the intuitive expectation that repeating tokens should not occur at all under certain conditions. This is what all other inference engines that implement DRY do by default. It's great that the order is now fully configurable, but changing the default will break a lot of settings that work fine now.
Ok, I'll move it back, but we have to figure out something else eventually. We cannot apply these penalties to the full vocabulary because the performance is significantly affected. Depending on your CPU and the number of parallel slots that you use, the slowdown from enabling either of these penalties can be really significant, and I am pretty sure a lot of people out there are using them without even realizing the performance hit. I normally don't use any repetition penalties and therefore haven't noticed this. But recently I started working on text-to-speech (TTS) models, and for some reason (that I don't yet understand completely), they seem to benefit from repetition penalties. And it does not matter if the rep penalty is applied before or after top-k. With the default sampling sequence on …
They are already not for LLMs either, because … In general, it might be a good idea to have sampling parameter suggestions in some form, either in the README or in console warnings. For TTS models it can be a warning to use repetition penalty; for LLMs, to switch …
By good defaults, I am focusing on the performance - not the generation quality. I am worried about what is being applied before top-k. This is where the performance issues are, because we haven't truncated the vocab yet. After top-k, you are free to put anything you like, because it mainly depends on your use case (e.g. for FIM code completion, top-p makes sense IMO) and the performance will be good, as long as you don't set a very big K.
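To make the before/after-top-k distinction concrete, here is a minimal sketch (not code from this PR) of assembling such a chain through the public llama.h sampler API, with top-k placed ahead of the penalties so they only ever see K candidates. The llama_sampler_init_penalties() argument list follows my reading of the refactored signature, and all numeric values are placeholders:

```cpp
#include "llama.h"

// sketch: truncate to K candidates first, then apply the penalties to only those K
static llama_sampler * make_example_chain() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // top-k runs first, so every sampler after it works on at most K tokens
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(64));

    // penalties now cost O(K) per token instead of O(n_vocab)
    llama_sampler_chain_add(chain, llama_sampler_init_penalties(
        /*penalty_last_n =*/ 64,
        /*penalty_repeat =*/ 1.10f,   // 1.0f disables
        /*penalty_freq   =*/ 0.00f,   // 0.0f disables
        /*penalty_present=*/ 0.00f)); // 0.0f disables

    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    return chain; // caller frees with llama_sampler_free()
}
```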
Also, to put the significance of this issue in perspective: the performance penalty from enabling penalties stacks linearly with the number of parallel users of …
In that case, how much is "very big K"? Theoretically, what is the threshold value of K at which it is large enough not to affect creativity, but still enough smaller than the vocab size to improve performance? If it exists, we can set the default value to this threshold and put … (I had an idea of …)
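Just to put the cost side of this trade-off in rough numbers (the vocabulary size below is an example, not a measurement): before any truncation the penalties have to scan every candidate, while after top-k they only scan K of them, so for a hypothetical 128k-token vocab and K = 128:

$$\frac{\text{work before top-}k}{\text{work after top-}k} \approx \frac{n_{\text{vocab}}}{K} = \frac{128\,000}{128} = 1000$$

This says nothing about the quality/creativity threshold, only about the amount of work per token.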
@ngxson I am trying to build the new WebUI and I get this error:

$ ▶ npm run build
> [email protected] build
> vite build
vite v5.4.11 building for production...
transforming (1) index.html
🌼 daisyUI 4.12.14
├─ ✔︎ 32 themes added https://daisyui.com/docs/themes
╰─ ★ Star daisyUI on GitHub https://github.com/saadeghi/daisyui
(node:87476) ExperimentalWarning: CommonJS module /llama.cpp/examples/server/webui/node_modules/tailwindcss/lib/lib/load-config.js is loading ES Module /llama.cpp/examples/server/webui/tailwind.config.js using require().
Support for loading ES Module in require() is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
file:///llama.cpp/examples/server/webui/tailwind.config.js:10
plugins: [
^
ReferenceError: require is not defined
at file:///llama.cpp/examples/server/webui/tailwind.config.js:10:12
at ModuleJobSync.runSync (node:internal/modules/esm/module_job:395:35)
at ModuleLoader.importSyncForRequire (node:internal/modules/esm/loader:329:47)
at loadESMFromCJS (node:internal/modules/cjs/loader:1376:24)
at Module._compile (node:internal/modules/cjs/loader:1528:5)
at Object..js (node:internal/modules/cjs/loader:1698:10)
at Module.load (node:internal/modules/cjs/loader:1303:32)
at Function._load (node:internal/modules/cjs/loader:1117:12)
at TracingChannel.traceSync (node:diagnostics_channel:322:14)
at wrapModuleLoad (node:internal/modules/cjs/loader:218:24)
Node.js v23.3.0

Do you know how to fix it?
It's a known Node.js 22.12 issue; I fixed it via #10779. You just need to merge with the latest master branch.
Force-pushed from f0f1fe7 to a312568
```diff
@@ -21,7 +21,7 @@ const CONFIG_DEFAULT = {
   systemMessage: 'You are a helpful assistant.',
   showTokensPerSecond: false,
   // make sure these default values are in sync with `common.h`
-  samplers: 'dkypmxt',
+  samplers: 'edkypmxt',
```
@ngxson Note that I made this change to the WebUI and AFAIU I have to build the new index.html with npm run build. But this command is failing on my computer with the error in the previous comment.
Don't worry, I can push the built index.html to this PR. If you want to build it on your computer, maybe you can install a version manager like nvm, then run nvm use 22.11.0 to use the correct version.
Note: we can also have a manually run CI job that builds the frontend and outputs an artifact; I'll add this in the future.
Ok thanks. I managed to install 22.11.0 with nvm and it works now 👍
Not sure. I don't think use cases with, let's say, K > 128 ever make sense, but I could be wrong. Anyway, no need to decide and change the behaviour now - we can discuss later.
Putting … A …
It was … As for …

UPD: now that I think about it, any position of penalties would have an effect on the resulting distribution: even with 128 candidates, penalizing repeated ones would shift less probable candidates up. The lower …
I also feel like promoting the top-k sampler to the very front of the sampling chain would be good, as it would cleanly solve the performance problems for default settings. And since the chain is fully customizable, we won't remove any functionality, but just make sure that it's not so easy to start penalizing the entire vocab.

Btw, in #9897 I completely forgot about the penalties sampler being in front of the top-k sampler, so the description there that …

Will open a separate PR for that and we can discuss this further if necessary.
src/llama-sampling.cpp
Outdated
```cpp
    ctx->token_count[token]++;

    // if the ring buffer is full, remove the oldest token
    if (ctx->prev.size() >= (size_t) ctx->penalty_last_n) {
        const auto pop = ctx->prev.front();

        ctx->token_count[pop]--;
        if (ctx->token_count[pop] == 0) {
            ctx->token_count.erase(pop);
        }
    }
```
This change needs extra attention for correctness.
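To make the bookkeeping pattern above easier to follow outside the diff context, here is a self-contained sketch of the same idea, with a std::deque standing in for the ring buffer (names and types are illustrative, not the actual llama.cpp ones):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

using token_t = std::int32_t; // illustrative stand-in for llama_token

struct penalty_window {
    std::size_t penalty_last_n = 64;                 // window size
    std::deque<token_t> prev;                        // recent tokens (stand-in for the ring buffer)
    std::unordered_map<token_t, int> token_count;    // occurrences of each token inside `prev`

    // called once per generated token; keeps `token_count` in sync with `prev`
    void accept(token_t token) {
        if (penalty_last_n == 0) {
            return; // penalties disabled - nothing to track
        }

        token_count[token]++;

        // if the window is full, evict the oldest token and update its count
        if (prev.size() >= penalty_last_n) {
            const token_t pop = prev.front();
            prev.pop_front();

            if (--token_count[pop] == 0) {
                token_count.erase(pop); // keep only tokens still present in the window
            }
        }

        prev.push_back(token);
    }
};
```

The win is that applying the penalties no longer requires recounting the last penalty_last_n tokens for every sampled token; the counts are always current, and the map only ever holds tokens that are still inside the window.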
Co-authored-by: Diego Devesa <[email protected]>
ggml-ci
Force-pushed from 7415f3f to b58ebf3
Refactor, optimize and simplify the penalties sampler. Its position in common_sampler is now customizable, instead of being hardcoded at the front of the chain. The main reason to allow this is that the penalties can be quite expensive to apply to the full vocabulary; now they can be applied after a top-k sampler, for example.

Also, the token count frequencies are now maintained in the sampler instead of being recreated on each token, and the penalize_nl option is removed since it is no longer relevant with new models.

API Changes
- llama_sampler_init_penalties()

Server API changes

- Removed the penalize_nl parameter
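For illustration, a hedged sketch of what a call site could look like after this change; the parameter names reflect my reading of the simplified constructor (the vocab- and newline-related arguments are gone along with penalize_nl) and may not match the header exactly:

```cpp
// hypothetical call site after this PR: only the penalty parameters remain
llama_sampler * penalties = llama_sampler_init_penalties(
    /*penalty_last_n =*/ 64,     // how many recent tokens to consider
    /*penalty_repeat =*/ 1.10f,  // 1.0f disables the repetition penalty
    /*penalty_freq   =*/ 0.00f,  // 0.0f disables the frequency penalty
    /*penalty_present=*/ 0.00f); // 0.0f disables the presence penalty

// since its position is no longer hardcoded, it can be added at any point of an
// existing sampler chain (`chain` here is a llama_sampler_chain_init() result)
llama_sampler_chain_add(chain, penalties);
```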