
sampling : refactor + optimize penalties sampler #10803

Merged: 10 commits merged from gg/sampling-penalties into master on Dec 16, 2024

Conversation

@ggerganov (Owner) commented Dec 12, 2024:

Refactor, optimize and simplify the penalties sampler. Its position in common_sampler is now customizable, instead of being hardcoded at the front of the chain:

... --sampling-seq kep ...

sampler chain: logits -> logit-bias -> top-k -> penalties -> top-p -> dist

The main reason to allow this is that the penalties can be quite expensive to apply over the full vocabulary. Now they can be applied after a top-k sampler, for example.

In addition, the token frequency counts are now maintained inside the sampler instead of being recreated on each token, and the penalize_nl option is removed since it is no longer relevant for new models.
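
For reference, here is a minimal sketch of how such a chain could be assembled with the llama_sampler API, assuming the refactored llama_sampler_init_penalties() signature from this PR; the parameter values are made-up examples, and the logit-bias stage is omitted since it needs the model's vocabulary size:

```cpp
#include "llama.h"

int main() {
    struct llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // logits -> top-k -> penalties -> top-p -> dist
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(64)); // truncate first, so the penalties touch at most 64 candidates
    llama_sampler_chain_add(chain, llama_sampler_init_penalties(
            /*penalty_last_n  =*/ 64,     // how many previous tokens to consider
            /*penalty_repeat  =*/ 1.1f,   // repetition penalty
            /*penalty_freq    =*/ 0.0f,   // frequency penalty
            /*penalty_present =*/ 0.0f)); // presence penalty
    llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    // during generation: llama_sampler_sample(chain, ctx, -1);

    llama_sampler_free(chain);
    return 0;
}
```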

API Changes

  • Change llama_sampler_init_penalties()

Server API changes

  • Remove penalize_nl parameter

The github-actions bot added the testing, examples, devops and server labels on Dec 12, 2024.
@slaren (Collaborator) commented Dec 12, 2024:

Also, the ignore_eos and penalize_nl options are removed since the former can be achieved through logit biases and the latter is not relevant with new models.

I don't think there is anything wrong with keeping --ignore-eos as a shortcut to the logit bias, although it is true that it is not as useful now as it was when it was added.

@ggerganov force-pushed the gg/sampling-penalties branch from 9d0f210 to 869ec41 on December 12, 2024 19:22
@ggerganov (Owner, PR author):

Restored the --ignore-eos option.

@slaren (Collaborator) commented Dec 12, 2024:

There is a pending issue from the initial refactor: setting --repeat-last-n to -1 does not set it to n_ctx, contrary to what the documentation says. If this cannot be fixed, at least the documentation should be updated.

llama.cpp/common/arg.cpp

Lines 886 to 892 in 8faa1d4

```cpp
add_opt(common_arg(
    {"--repeat-last-n"}, "N",
    string_format("last n tokens to consider for penalize (default: %d, 0 = disabled, -1 = ctx_size)", params.sampling.penalty_last_n),
    [](common_params & params, int value) {
        params.sampling.penalty_last_n = value;
        params.sampling.n_prev = std::max(params.sampling.n_prev, params.sampling.penalty_last_n);
    }
```
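
A hypothetical way to make the documented behaviour hold would be to resolve the -1 sentinel once the context size is known. This is only a sketch and not code from the PR; the helper name is made up and it assumes common.h and llama.h are included:

```cpp
// hypothetical helper (not in the PR): map the documented "-1 = ctx_size"
// sentinel to the real context size once the llama_context exists
static void resolve_penalty_last_n(common_params & params, const llama_context * ctx) {
    if (params.sampling.penalty_last_n < 0) {
        params.sampling.penalty_last_n = (int32_t) llama_n_ctx(ctx);
    }
}
```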

@ggerganov requested a review from ngxson as a code owner on December 12, 2024 20:02
@ggerganov (Owner, PR author):

Decided to move the penalties sampler to the end of the default sampling chain. This will change the default behaviour compared to master, but should result in a significant sampling speedup (more than 5x) when using any penalties.

@p-e-w commented Dec 13, 2024:

Decided to move the penalties sampler to the end of the default sampling chain.

Please don't. This will substantially change default sampler behavior. With DRY, I have always recommended putting it before truncation samplers because then sufficiently penalized tokens get truncated away, which matches the intuitive expectation that repeating tokens should not occur at all under certain conditions. This is what all other inference engines that implement DRY do by default.

It's great that the order is now fully configurable, but changing the default will break a lot of settings that work fine now.

@ggerganov (Owner, PR author):

Ok, I'll move it back, but we have to figure out something else eventually. We cannot apply these penalties to the full vocabulary because the performance is significantly affected. Depending on your CPU and the number of parallel slots you use, the slowdown from enabling either of these penalties can be really significant, and I am pretty sure a lot of people out there are using them without even realizing the performance hit.

I don't normally use any repetition penalties and therefore haven't noticed this. But recently I started working on text-to-speech (TTS) models, and for some reason (that I don't yet fully understand) they seem to benefit from repetition penalties, and it does not matter whether the penalty is applied before or after top-k. With the default sampling sequence on master these penalties take a significant toll on performance, and it took me quite some time to realize this. So the defaults are not good and they should be updated somehow. The primary goal is to not affect performance in a significant way, and to only allow it if the user really understands the implications. Simply setting a repetition penalty of 1.1 should not make any noticeable difference to the speed, as it does now.

@MaggotHATE (Contributor):

So the defaults are not good and they should be updated somehow.

They are already not good for LLMs either, because top_p is not disabled by default; this has been discussed for a long time now. TTS models are new (I assume it's OuteTTS in this case), so it makes sense that they would require specific settings for optimal results - in the same way that top_p is usually suggested to be turned off.

In general, it might be a good idea to provide sampling-parameter suggestions in some form, either in the README or as console warnings. For TTS models it could be a suggestion to use repetition penalty; for LLMs, to switch top_p off. A performance-related warning could be added too.

@ggerganov (Owner, PR author):

By good defaults, I am focusing on the performance - not the generation quality.

I am worried about what is being applied before top-k. This is where the performance issues are because we haven't truncated the vocab yet. After top-k, you are free to put anything you like because it mainly depends on your use case (e.g. for FIM code completion, top-p makes sense IMO) and the performance will be good, as long as you didn't set a very big K.

@ggerganov (Owner, PR author):

Also, to clarify the significance of this issue further: the performance penalty from enabling penalties stacks linearly with the number of parallel users of llama-server, because sampling from the slot results is currently done sequentially. So the more users who enable penalties, the slower the performance will be for all users, regardless of whether they are using penalties themselves.

@MaggotHATE (Contributor) commented Dec 13, 2024:

a very big K

In that case, how much is "very big K"?

Theoretically, what is the threshold value of K that is large enough not to affect creativity, yet enough smaller than the vocab size to improve performance? If it exists, we could set the default value to this threshold and put top_k as the first sampler - but only if it doesn't change the result of the repetition penalty significantly.

(I had an idea of a top_k = -2 special value that would cut the candidate list in half, but the performance gain would be relatively small.)
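
For illustration, one possible reading of the proposed top_k = -2 special value; this helper is hypothetical and not part of llama.cpp:

```cpp
// hypothetical resolution of a "-2 = half of the vocabulary" special value
static int32_t resolve_top_k(int32_t top_k, int32_t n_vocab) {
    if (top_k == -2) {
        return n_vocab / 2;              // the proposed special value
    }
    return top_k <= 0 ? n_vocab : top_k; // <= 0 currently means "no truncation"
}
```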

@ggerganov (Owner, PR author):

@ngxson I am trying to build the new WebUI and I get this error:

```
$ npm run build

> [email protected] build
> vite build

vite v5.4.11 building for production...
transforming (1) index.html
🌼   daisyUI 4.12.14
├─ ✔︎ 32 themes added		https://daisyui.com/docs/themes
╰─ ★ Star daisyUI on GitHub	https://github.com/saadeghi/daisyui

(node:87476) ExperimentalWarning: CommonJS module /llama.cpp/examples/server/webui/node_modules/tailwindcss/lib/lib/load-config.js is loading ES Module /llama.cpp/examples/server/webui/tailwind.config.js using require().
Support for loading ES Module in require() is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
file:///llama.cpp/examples/server/webui/tailwind.config.js:10
  plugins: [
           ^

ReferenceError: require is not defined
    at file:///llama.cpp/examples/server/webui/tailwind.config.js:10:12
    at ModuleJobSync.runSync (node:internal/modules/esm/module_job:395:35)
    at ModuleLoader.importSyncForRequire (node:internal/modules/esm/loader:329:47)
    at loadESMFromCJS (node:internal/modules/cjs/loader:1376:24)
    at Module._compile (node:internal/modules/cjs/loader:1528:5)
    at Object..js (node:internal/modules/cjs/loader:1698:10)
    at Module.load (node:internal/modules/cjs/loader:1303:32)
    at Function._load (node:internal/modules/cjs/loader:1117:12)
    at TracingChannel.traceSync (node:diagnostics_channel:322:14)
    at wrapModuleLoad (node:internal/modules/cjs/loader:218:24)

Node.js v23.3.0
```

Do you know how to fix it?

@ngxson (Collaborator) commented Dec 13, 2024:

It's a known nodejs 22.12 issue, I fixed it via #10779

You just need to merge with the latest master branch.

@ggerganov force-pushed the gg/sampling-penalties branch from f0f1fe7 to a312568 on December 13, 2024 11:41

```diff
@@ -21,7 +21,7 @@ const CONFIG_DEFAULT = {
   systemMessage: 'You are a helpful assistant.',
   showTokensPerSecond: false,
   // make sure these default values are in sync with `common.h`
-  samplers: 'dkypmxt',
+  samplers: 'edkypmxt',
```
@ggerganov (Owner, PR author) commented on this diff:

@ngxson Note that I made this change to the WebUI and AFAIU I have to build the new index.html with npm run build. But this command is failing on my computer with the error in the previous comment.

A collaborator replied:

Don't worry, I can push the built index.html to this PR.

If you want to build it on your computer, maybe you can install a version manager like nvm, then run nvm use 22.11.0 to use the correct version.

The collaborator added:

Note: we can also have a manually-run CI job that builds the frontend and outputs an artifact; I'll add this in the future.

@ggerganov (Owner, PR author) replied:

Ok thanks. I managed to install 22.11.0 with nvm and it works now 👍

@ggerganov (Owner, PR author):

In that case, how much is "very big K"?

Not sure. I don't think use cases with, let's say K > 128, ever make sense, but I could be wrong. Anyway, no need to decide and change the behaviour now - we can discuss later.

@p-e-w commented Dec 14, 2024:

Putting top_k first and defaulting it to a high value is a reasonable compromise. AFAIK, top_k is already special-cased because top_k = 1 disables the entire sampler chain and disregards sampler order, even if it would have changed which token is in first position. top_k is a terrible sampler for steering generation anyway, and the value chosen by the user will usually be either 1 (if they want greedy sampling) or whatever the default is.

A top_k of 128 is already extremely generous. In practice, the 50th most probable token often has a probability below 10^-7 already. IMO, llama.cpp could impose a maximum of 128 even when the user chooses 0 (that is, no limit). That would bring the performance benefits to users of frontends that always set a value for top_k and typically default to 0.
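
As a sketch of this suggestion (hypothetical, not part of llama.cpp), a cap like the following would bound the number of candidates that the later samplers, penalties included, have to process:

```cpp
#include <algorithm>
#include <cstdint>

// hypothetical cap (not in llama.cpp): even when the user requests top_k = 0
// ("no limit"), never keep more than K_MAX candidates
static int32_t effective_top_k(int32_t requested_top_k, int32_t n_vocab) {
    constexpr int32_t K_MAX = 128; // value suggested in this discussion
    const int32_t k = (requested_top_k <= 0) ? K_MAX : std::min(requested_top_k, K_MAX);
    return std::min(k, n_vocab);
}
```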

@MaggotHATE (Contributor) commented Dec 14, 2024:

AFAIK, top_k is already special-cased because top_k = 1 disables the entire sampler chain and disregards sampler order

It was temp, and that feature was reworked a while ago, so technically speaking all samplers are non-special now; especially with this PR, they will all be in one sampling chain.

As for the 128 limit, the only times I saw a really large number of candidates were with the p_step sampler - and it wasn't filtering out anything in those cases. As long as we don't have any noise-type sampler, a reasonable limit should not be a problem (if we keep customization for the user, of course).

UPD: now that I think about it, any position of the penalties would have an effect on the resulting distribution: even with 128 candidates, penalizing repeated ones would shift less probable candidates up. The lower the repetition penalty sits in the chain, the fewer viable options we would get, reducing creativity. Maybe that's the reason why it works well when it does (starting from the entire vocab doesn't limit your options significantly) and doesn't work other times (irrelevant tokens become too probable after filtering).

@ggerganov (Owner, PR author):

I also feel like promoting the top-k sampler to the very front of the sampling chain would be good as it will cleanly solve the performance problems for default settings. And since the chain is fully customizable, we won't remove any functionality, but just make sure that it's not so easy to start penalizing the entire vocab.

Btw, in #9897 I completely forgot about the penalties sampler being in front of the top-k sampler. So the description there that --top-k 1 is equivalent to greedy sampling is only true if there are no penalties enabled. Anyway, it's a small detail and if we move the top-k to the front, it wouldn't be an issue.

Will open a separate PR for that and we can discuss this further if necessary.

Comment on lines 1420 to 1431
```cpp
ctx->token_count[token]++;

// if the ring buffer is full, remove the oldest token
if (ctx->prev.size() >= (size_t) ctx->penalty_last_n) {
    const auto pop = ctx->prev.front();

    ctx->token_count[pop]--;
    if (ctx->token_count[pop] == 0) {
        ctx->token_count.erase(pop);
    }
}
```

@ggerganov (Owner, PR author) commented:

This change needs extra attention for correctness.
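
To make the intended invariant easier to review, here is a self-contained sketch of the same sliding-window frequency count using standard containers instead of the internal ring buffer; the struct and field names are illustrative, not llama.cpp internals:

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// invariant: token_count holds the number of occurrences of each token
// currently inside the last-N window
struct penalty_window {
    size_t                           capacity;    // plays the role of penalty_last_n
    std::deque<int32_t>              prev;        // plays the role of the ring buffer
    std::unordered_map<int32_t, int> token_count;

    void accept(int32_t token) {
        if (capacity == 0) {
            return; // penalties disabled, nothing to track
        }
        if (prev.size() >= capacity) {
            const int32_t pop = prev.front();
            prev.pop_front();
            if (--token_count[pop] == 0) {
                token_count.erase(pop);
            }
        }
        prev.push_back(token);
        token_count[token]++;
    }
};
```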

@ggerganov force-pushed the gg/sampling-penalties branch from 7415f3f to b58ebf3 on December 16, 2024 09:25
@ggerganov merged commit 644fd71 into master on Dec 16, 2024 (50 checks passed).
@ggerganov deleted the gg/sampling-penalties branch on December 16, 2024 10:31.