
sampling : refactor + optimize penalties sampler #10803

Merged: 10 commits merged from gg/sampling-penalties into master on Dec 16, 2024

Conversation

@ggerganov (Owner) commented Dec 12, 2024:

Refactor, optimize and simplify the penalties sampler. Its position in common_sampler is now customizable, instead of being hardcoded at the front of the chain:

... --sampling-seq kep ...

sampler chain: logits -> logit-bias -> top-k -> penalties -> top-p -> dist

The main reason to allow this is that the penalties can be quite expensive to apply over the full vocabulary. Now they can be applied after a top-k sampler, for example.

In addition, the token frequency counts are now maintained inside the sampler instead of being recreated on each token, and the penalize_nl option is removed since it is no longer relevant for new models.
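
For reference, here is a minimal sketch of how such a chain could be assembled with the llama_sampler API, assuming the refactored llama_sampler_init_penalties() signature from this PR; the parameter values are made-up examples, and the logit-bias stage is omitted since it needs the model's vocabulary size:

```cpp
#include "llama.h"

int main() {
    struct llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // logits -> top-k -> penalties -> top-p -> dist
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(64)); // truncate first, so the penalties touch at most 64 candidates
    llama_sampler_chain_add(chain, llama_sampler_init_penalties(
            /*penalty_last_n  =*/ 64,     // how many previous tokens to consider
            /*penalty_repeat  =*/ 1.1f,   // repetition penalty
            /*penalty_freq    =*/ 0.0f,   // frequency penalty
            /*penalty_present =*/ 0.0f)); // presence penalty
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    // during generation: llama_sampler_sample(chain, ctx, -1);

    llama_sampler_free(chain);
    return 0;
}
```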

API Changes

  • Change llama_sampler_init_penalties()

Server API changes

  • Remove penalize_nl parameter

The github-actions bot added the testing, examples, devops and server labels on Dec 12, 2024.
@slaren (Collaborator) commented Dec 12, 2024:

Also, the ignore_eos and penalize_nl options are removed since the former can be achieved through logit biases and the latter is not relevant with new models.

I don't think there is anything wrong with keeping --ignore-eos as a shortcut to the logit bias, although it is true that it is not as useful now as it was when it was added.

@ggerganov force-pushed the gg/sampling-penalties branch from 9d0f210 to 869ec41 on December 12, 2024 19:22
@ggerganov (Owner, PR author):

Restored the --ignore-eos option.

@slaren (Collaborator) commented Dec 12, 2024:

There is a pending issue from the initial refactor: setting --repeat-last-n to -1 does not set it to n_ctx, contrary to what the documentation says. If this cannot be fixed, at least the documentation should be updated.

llama.cpp/common/arg.cpp

Lines 886 to 892 in 8faa1d4

```cpp
add_opt(common_arg(
    {"--repeat-last-n"}, "N",
    string_format("last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)", params.sampling.penalty_last_n),
    [](common_params & params, int value) {
        params.sampling.penalty_last_n = value;
        params.sampling.n_prev = std::max(params.sampling.n_prev, params.sampling.penalty_last_n);
    }
```
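
A hypothetical way to make the documented behaviour hold would be to resolve the -1 sentinel once the context size is known. This is only a sketch and not code from the PR; the helper name is made up and it assumes common.h and llama.h are included:

```cpp
// hypothetical helper (not in the PR): map the documented "-1 = ctx_size"
// sentinel to the real context size once the llama_context exists
static void resolve_penalty_last_n(common_params & params, const llama_context * ctx) {
    if (params.sampling.penalty_last_n < 0) {
        params.sampling.penalty_last_n = (int32_t) llama_n_ctx(ctx);
    }
}
```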

@ggerganov requested a review from ngxson as a code owner on December 12, 2024 20:02
@ggerganov (Owner, PR author):

Decided to move the penalties sampler to the end of the default sampling chain. This will change the default behaviour compared to master, but should result in a significant sampling speedup (more than 5x) when using any penalties.

@p-e-w commented Dec 13, 2024:

Decided to move the penalties sampler to the end of the default sampling chain.

Please don't. This will substantially change default sampler behavior. With DRY, I have always recommended putting it before truncation samplers because then sufficiently penalized tokens get truncated away, which matches the intuitive expectation that repeating tokens should not occur at all under certain conditions. This is what all other inference engines that implement DRY do by default.

It's great that the order is now fully configurable, but changing the default will break a lot of settings that work fine now.

@ggerganov (Owner, PR author):

Ok, I'll move it back, but we have to figure out something else eventually. We cannot apply these penalties to the full vocabulary because the performance is significantly affected. Depending on your CPU and the number of parallel slots you use, the slowdown from enabling either of these penalties can be really significant, and I am pretty sure a lot of people out there are using them without even realizing the performance hit.

I don't normally use any repetition penalties and therefore haven't noticed this. But recently I started working on text-to-speech (TTS) models, and for some reason (that I don't yet fully understand) they seem to benefit from repetition penalties, and it does not matter whether the penalty is applied before or after top-k. With the default sampling sequence on master these penalties take a significant toll on performance, and it took me quite some time to realize this. So the defaults are not good and they should be updated somehow. The primary goal is to not affect performance in a significant way, and to only allow it if the user really understands the implications. Simply setting a repetition penalty of 1.1 should not make any noticeable difference to the speed, as it does now.

@MaggotHATE (Contributor):

So the defaults are not good and they should be updated somehow.

They are already not good for LLMs either, because top_p is not disabled by default; this has been discussed for a long time now. TTS models are new (I assume it's OuteTTS in this case), so it makes sense that they would require specific settings for optimal results - in the same way that top_p is usually suggested to be turned off.

In general, it might be a good idea to provide sampling-parameter suggestions in some form, either in the README or as console warnings. For TTS models it could be a suggestion to use repetition penalty; for LLMs, to switch top_p off. A performance-related warning could be added too.

@ggerganov (Owner, PR author):

By good defaults, I am focusing on the performance - not the generation quality.

I am worried about what is being applied before top-k. This is where the performance issues are because we haven't truncated the vocab yet. After top-k, you are free to put anything you like because it mainly depends on your use case (e.g. for FIM code completion, top-p makes sense IMO) and the performance will be good, as long as you didn't set a very big K.

@ggerganov (Owner, PR author):

Also, to clarify the significance of this issue further: the performance penalty from enabling penalties stacks linearly with the number of parallel users of llama-server, because sampling from the slot results is currently done sequentially. So the more users who enable penalties, the slower the performance will be for all users, regardless of whether they are using penalties themselves.

@MaggotHATE (Contributor) commented Dec 13, 2024:

a very big K

In that case, how much is "very big K"?

Theoretically, what is the threshold value of K that is large enough not to affect creativity, yet enough smaller than the vocab size to improve performance? If it exists, we could set the default value to this threshold and put top_k as the first sampler - but only if it doesn't change the result of the repetition penalty significantly.

(I had an idea of a top_k = -2 special value that would cut the candidate list in half, but the performance gain would be relatively small.)
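
For illustration, one possible reading of the proposed top_k = -2 special value; this helper is hypothetical and not part of llama.cpp:

```cpp
// hypothetical resolution of a "-2 = half of the vocabulary" special value
static int32_t resolve_top_k(int32_t top_k, int32_t n_vocab) {
    if (top_k == -2) {
        return n_vocab / 2;              // the proposed special value
    }
    return top_k <= 0 ? n_vocab : top_k; // <= 0 currently means "no truncation"
}
```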

@ggerganov (Owner, PR author):

@ngxson I am trying to build the new WebUI and I get this error:

```
$ npm run build

> [email protected] build
> vite build

vite v5.4.11 building for production...
transforming (1) index.html
🌼   daisyUI 4.12.14
├─ ✔︎ 32 themes added		https://daisyui.com/docs/themes
╰─ ★ Star daisyUI on GitHub	https://github.com/saadeghi/daisyui

(node:87476) ExperimentalWarning: CommonJS module /llama.cpp/examples/server/webui/node_modules/tailwindcss/lib/lib/load-config.js is loading ES Module /llama.cpp/examples/server/webui/tailwind.config.js using require().
Support for loading ES Module in require() is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
file:///llama.cpp/examples/server/webui/tailwind.config.js:10
  plugins: [
           ^

ReferenceError: require is not defined
    at file:///llama.cpp/examples/server/webui/tailwind.config.js:10:12
    at ModuleJobSync.runSync (node:internal/modules/esm/module_job:395:35)
    at ModuleLoader.importSyncForRequire (node:internal/modules/esm/loader:329:47)
    at loadESMFromCJS (node:internal/modules/cjs/loader:1376:24)
    at Module._compile (node:internal/modules/cjs/loader:1528:5)
    at Object..js (node:internal/modules/cjs/loader:1698:10)
    at Module.load (node:internal/modules/cjs/loader:1303:32)
    at Function._load (node:internal/modules/cjs/loader:1117:12)
    at TracingChannel.traceSync (node:diagnostics_channel:322:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:218:24)

Node.js v23.3.0
```

Do you know how to fix it?

@ngxson (Collaborator) commented Dec 13, 2024:

It's a known nodejs 22.12 issue, I fixed it via #10779

You just need to merge with the latest master branch.

@ggerganov force-pushed the gg/sampling-penalties branch from f0f1fe7 to a312568 on December 13, 2024 11:41

```diff
@@ -21,7 +21,7 @@ const CONFIG_DEFAULT = {
   systemMessage: 'You are a helpful assistant.',
   showTokensPerSecond: false,
   // make sure these default values are in sync with `common.h`
-  samplers: 'dkypmxt',
+  samplers: 'edkypmxt',
```
@ggerganov (Owner, PR author) commented on this diff:

@ngxson Note that I made this change to the WebUI and AFAIU I have to build the new index.html with npm run build. But this command is failing on my computer with the error in the previous comment.

A collaborator replied:

Don't worry, I can push the built index.html to this PR.

If you want to build it on your computer, maybe you can install a version manager like nvm, then run nvm use 22.11.0 to use the correct version.

The collaborator added:

Note: we can also have a manually-run CI job that builds the frontend and outputs an artifact; I'll add this in the future.

@ggerganov (Owner, PR author) replied:

Ok thanks. I managed to install 22.11.0 with nvm and it works now 👍

@ggerganov (Owner, PR author):

In that case, how much is "very big K"?

Not sure. I don't think use cases with, let's say K > 128, ever make sense, but I could be wrong. Anyway, no need to decide and change the behaviour now - we can discuss later.

@p-e-w commented Dec 14, 2024:

Putting top_k first and defaulting it to a high value is a reasonable compromise. AFAIK, top_k is already special-cased because top_k = 1 disables the entire sampler chain and disregards sampler order, even if it would have changed which token is in first position. top_k is a terrible sampler for steering generation anyway, and the value chosen by the user will usually be either 1 (if they want greedy sampling) or whatever the default is.

A top_k of 128 is already extremely generous. In practice, the 50th most probable token often has a probability below 10^-7 already. IMO, llama.cpp could impose a maximum of 128 even when the user chooses 0 (that is, no limit). That would bring the performance benefits to users of frontends that always set a value for top_k and typically default to 0.
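
As a sketch of this suggestion (hypothetical, not part of llama.cpp), a cap like the following would bound the number of candidates that the later samplers, penalties included, have to process:

```cpp
#include <algorithm>
#include <cstdint>

// hypothetical cap (not in llama.cpp): even when the user requests top_k = 0
// ("no limit"), never keep more than K_MAX candidates
static int32_t effective_top_k(int32_t requested_top_k, int32_t n_vocab) {
    constexpr int32_t K_MAX = 128; // value suggested in this discussion
    const int32_t k = (requested_top_k <= 0) ? K_MAX : std::min(requested_top_k, K_MAX);
    return std::min(k, n_vocab);
}
```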

@MaggotHATE (Contributor) commented Dec 14, 2024:

AFAIK, top_k is already special-cased because top_k = 1 disables the entire sampler chain and disregards sampler order

It was temp, and that feature was reworked a while ago, so technically speaking all samplers are non-special now; especially with this PR, they will all be in one sampling chain.

As for the 128 limit, the only times I saw a really large number of candidates were with the p_step sampler - and it wasn't filtering out anything in those cases. As long as we don't have any noise-type sampler, a reasonable limit should not be a problem (if we keep customization for the user, of course).

UPD: now that I think about it, any position of the penalties would have an effect on the resulting distribution: even with 128 candidates, penalizing repeated ones would shift less probable candidates up. The lower the repetition penalty sits in the chain, the fewer viable options we would get, reducing creativity. Maybe that's the reason why it works well when it does (starting from the entire vocab doesn't limit your options significantly) and doesn't work other times (irrelevant tokens become too probable after filtering).

@ggerganov (Owner, PR author):

I also feel like promoting the top-k sampler to the very front of the sampling chain would be good as it will cleanly solve the performance problems for default settings. And since the chain is fully customizable, we won't remove any functionality, but just make sure that it's not so easy to start penalizing the entire vocab.

Btw, in #9897 I completely forgot about the penalties sampler being in front of the top-k sampler. So the description there that --top-k 1 is equivalent to greedy sampling is only true if there are no penalties enabled. Anyway, it's a small detail and if we move the top-k to the front, it wouldn't be an issue.

Will open a separate PR for that and we can discuss this further if necessary.

Comment on lines 1420 to 1431
```cpp
ctx->token_count[token]++;

// if the ring buffer is full, remove the oldest token
if (ctx->prev.size() >= (size_t) ctx->penalty_last_n) {
    const auto pop = ctx->prev.front();

    ctx->token_count[pop]--;
    if (ctx->token_count[pop] == 0) {
        ctx->token_count.erase(pop);
    }
}
```

@ggerganov (Owner, PR author) commented:

This change needs extra attention for correctness.
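
To make the intended invariant easier to review, here is a self-contained sketch of the same sliding-window frequency count using standard containers instead of the internal ring buffer; the struct and field names are illustrative, not llama.cpp internals:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// invariant: token_count holds the number of occurrences of each token
// currently inside the last-N window
struct penalty_window {
    size_t                           capacity;    // plays the role of penalty_last_n
    std::deque<int32_t>              prev;        // plays the role of the ring buffer
    std::unordered_map<int32_t, int> token_count;

    void accept(int32_t token) {
        if (capacity == 0) {
            return; // penalties disabled, nothing to track
        }
        if (prev.size() >= capacity) {
            const int32_t pop = prev.front();
            prev.pop_front();
            if (--token_count[pop] == 0) {
                token_count.erase(pop);
            }
        }
        prev.push_back(token);
        token_count[token]++;
    }
};
```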

@ggerganov force-pushed the gg/sampling-penalties branch from 7415f3f to b58ebf3 on December 16, 2024 09:25
@ggerganov merged commit 644fd71 into master on Dec 16, 2024 (50 checks passed).
@ggerganov deleted the gg/sampling-penalties branch on December 16, 2024 10:31.