llama : refactor sampling #8643
Conversation
Force-pushed from f208aa4 to f866cb9
I'm thinking there is no reason to have two separate structs. The API will be simplified:
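(The concrete snippet is not reproduced here. As a rough idea of the kind of simplification being discussed, the unified API could expose a handful of calls along these lines -- hypothetical prototypes for illustration only, not the actual signatures from this PR.)

```cpp
// hypothetical prototypes, loosely following the PR description; real names/params may differ
struct llama_sampling;   // opaque: params, RNG, grammar state, token history

struct llama_sampling * llama_sampling_init(const struct llama_sampling_params * params);
void                    llama_sampling_free(struct llama_sampling * smpl);

// feed the logits of the last decode, then draw one token and record it
void        llama_sampling_set_logits(struct llama_sampling * smpl, const float * logits);
llama_token llama_sampling_sample    (struct llama_sampling * smpl, llama_token_data_array * candidates);
void        llama_sampling_accept    (struct llama_sampling * smpl, llama_token token);
```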
Storing some of the sampling state (particularly the RNG) in a separate struct seems reasonable. I'm not so sure about the change to grammars though. Why is it being handled so differently from all of the other sampling techniques? If it's due to the statefulness, how about other stateful samplers such as mirostat?
Apologies in advance for the wall-of-text -- I'm trying to wrap my head around some things here, and going a bit stream-of-consciousness. If I was smarter or had a better handle on things, I could write more succinctly.
There isn't much in the …

I do still like the idea of keeping grammar reasonably partitioned away from sampling (and doing things like measuring grammar timing separately from sampling timing -- I think that will be very valuable as we continue to optimize and extend the grammar engine, and I saw you added a note about that in one of your earlier commits). Will combining these structures muddy those waters?

I almost wonder if the … But then again, what do you see as the definition of what a … Part of me feels like … I guess I also don't fully understand why the RNG shouldn't live within the confines of the …

"Good fences make good neighbors", as the saying goes, and overall I think I agree with you -- there's a bit of muddiness here, and this refactoring is very welcome.
Having a hard time giving feedback on this part, because I need to get a better handle on what a … is. I'm not sure how best to express the distinctives of each object, but maybe something like:
I'll admit that I don't fully understand how parallelism works within llama.cpp, and all of the different things that can exist with a global context, the individual job runners, how that ties into batching and shared memory, etc etc etc. So my ignorance about that might also play into my confusion on the rest of this as well. The naming conventions in the ownership hierarchy might also want to be standardized. With grammars, "short name" is on top:
And with sampling, the "long name" (…) is on top:
So the unclear ownership chain is perhaps also what's contributing to my vague feelings of confusion and uncertainty. It's also weird to me to see an object structure that has the name of the module itself (…).
👍 I like this change a lot.
These changes make me a bit uncomfortable. I really like the way that grammar is (currently) decoupled from sampling -- it makes the end-to-end grammar integration tests feel as clean as they are, and makes the GBNF validator program possible. The GBNF validator program is of dubious importance (I might be the only person to ever use that program), but the integration tests are pretty cool. Would the grammar integration tests be negatively impacted by bringing sampling into it (when it's previously been able to avoid it), or would it be cleaned up?
I'm trying to understand this one, and struggling a bit. Is part of the reason for this change the way that sampling and grammar are tied together a bit closely right now? In particular, I feel like the most involved piece of coupling between the two modules is the optimization that was added in #4306. That was a very important optimization (and I never want it to go away), but it had the unfortunate side-effect of passing control between grammar and sampling a couple of times in each loop, and it's not always clear (when I'm reading the code, anyways) which module is "on top" and driving the interaction. Especially with that weird …
👍 This is good.
I don't think of grammars as a sampling technique (akin to something like mirostat). Rather, it lives at a layer kinda' above the sampler, because it takes the logits that are calculated by the model and constrains the sampler, preventing it from considering tokens that don't match the grammar by masking out their logits. Grammar places boundaries on the sampler, but it itself is not a sampler.
The 2 structures will continue to exist separately in the internal implementation.
From the user's PoV, I'm looking for ways to eliminate … The current implementation on …
Hm, I think we can definitely consider the grammar as a sampler. If you think about it, in one way or another all sampling techniques do the same thing - given a set of candidate tokens with their respective probabilities, the sampler produces a new subset of tokens with new probabilities. So in that sense, applying grammar constraints can be looked at as a sampler, such as top-k for example.
No, we will keep the tests as they are. They would just need to include the internal header.
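As a concrete illustration of the "grammar as a sampler" view above, here is a minimal sketch (simplified stand-in types rather than the real llama_token_data_array): a top-k cut and a grammar mask are the same kind of transform over the candidate list.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };   // simplified llama_token_data
using candidates_t = std::vector<token_data>;

// top-k: keep only the k highest-logit candidates
void apply_top_k(candidates_t & cur, size_t k) {
    std::sort(cur.begin(), cur.end(),
              [](const token_data & a, const token_data & b) { return a.logit > b.logit; });
    if (cur.size() > k) {
        cur.resize(k);
    }
}

// grammar-style constraint: mask out candidates the grammar does not allow
// (here the "grammar" is just a set of allowed token ids)
void apply_grammar_mask(candidates_t & cur, const std::unordered_set<token_id> & allowed) {
    for (auto & td : cur) {
        if (allowed.count(td.id) == 0) {
            td.logit = -INFINITY;   // effectively removes the token from consideration
        }
    }
}
```

Both take a candidates array in and produce a smaller or reweighted candidates array out, which is the sense in which grammar constraints can be chained like top-k or any other stage.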
Thank you, that is a very helpful reply! Apologies in advance for my newbie perspective on this -- most of what I've learned about LLMs I've learned piecemeal, and I'm trying to learn on the fly. Thank you for your patience with me!
Aaah, this is where my thinking was different. Let me try to align with you.

In the sampling module, we have two classes of functions. As you noted, one takes in a set of candidate tokens with their respective probabilities, and the output is a new subset of tokens (length 0-N) with new probabilities. Examples of this include the logit bias map, guidance prompts, repetition penalties, and grammar constraints. You call these "samplers" -- but I guess I was thinking of them as "pre-samplers". Each one of these is applied in …

The second class of functions takes in a set of tokens and outputs exactly one llama_token id. These are applied in llama_sampling_sample_impl() after preparation is done, and only one of these can be called per run -- they can't be chained together. Examples of this are things like …

In one sense, both classes of functions are similar -- they both return a "set" of llama tokens (in so far as a single ID can be considered a "set") -- but they feel categorically different in that one is called in the … I also got my impression from things in the code like this comment, that refers to grammar as something that's done before sampling:
All of that feeds into why I was thinking of grammar logic as being more of a "constraint" than a "sampler" itself, but overall I'd like to align my understanding with yours.
You are technically correct. "samplers" is more appropriate for the functions that return the final token. So it's better to call the rest of the functions "constraints".
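To make the distinction concrete, a minimal sketch of the two shapes (simplified stand-in types, not the actual llama.cpp structs): a constraint rewrites the candidate set in place, while a sampler consumes it and returns exactly one token.

```cpp
#include <cstdint>
#include <random>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };   // simplified llama_token_data
using candidates_t = std::vector<token_data>;

// "constraint": candidates in, candidates out (top-k, top-p, penalties, grammar, ...);
// any number of these can be chained
using constraint_fn = void (*)(candidates_t & cur);

// "sampler": candidates in, exactly one token out (greedy, dist, mirostat, ...);
// only one of these runs per step
using sampler_fn = token_id (*)(const candidates_t & cur, std::mt19937 & rng);

// example sampler: greedy argmax over the logits (assumes a non-empty candidate list)
token_id sample_greedy(const candidates_t & cur, std::mt19937 & /*rng*/) {
    token_id best       = cur.front().id;
    float    best_logit = cur.front().logit;
    for (const auto & td : cur) {
        if (td.logit > best_logit) {
            best_logit = td.logit;
            best       = td.id;
        }
    }
    return best;
}
```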
Not exactly - the grammar constraints can also be applied after the sampler (see lines 324 to 347 in 50e0535).
I'm revisiting the implementation, and applying the grammar post-sampling does not seem to be equivalent to applying it pre-sampling. While the existing implementation in …
The prepare function on …
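For reference, the post-sampling flow being compared above (the #4306-style optimization) looks roughly like the sketch below; the grammar helpers are assumed names standing in for the real llama.cpp calls.

```cpp
#include <cstdint>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };
using candidates_t = std::vector<token_data>;
struct grammar_state;   // stands in for llama_grammar

// assumed helpers (the real equivalents live in llama.cpp / common/sampling.cpp):
bool     grammar_allows(const grammar_state & g, token_id id);                  // cheap single-token check
void     grammar_mask_candidates(const grammar_state & g, candidates_t & cur);  // full (expensive) constraint pass
token_id sample_token(candidates_t & cur);                                      // whatever final sampler is configured

// "sample first, consult the grammar only on rejection"
token_id sample_with_grammar(const grammar_state & g, candidates_t cur) {
    const token_id id = sample_token(cur);    // 1. sample optimistically, grammar untouched
    if (grammar_allows(g, id)) {
        return id;                            // 2. accepted: the expensive grammar pass was skipped
    }
    grammar_mask_candidates(g, cur);          // 3. rejected: apply the full grammar constraint
    return sample_token(cur);                 //    ... and resample from the constrained set
}
```

The question raised above is whether steps 1 and 3 always yield the same result distribution as masking the candidates before sampling in the first place.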
Force-pushed from 2ad156c to a880be2
Force-pushed from beebdfd to 43440c0
Force-pushed from 299d255 to bebf5d7
Force-pushed from d03a5a2 to 5a9753b
Force-pushed from 267f138 to 5243e3f
Force-pushed from 62984db to 694c4b1
Force-pushed from a5d664c to 6420268
I think this should be good to merge. Will leave it for a day or two for any comments, and then merge.
What would be the path to use this API with GPU sampling? I would expect that we will need to add a function similar to …
The sampling chain is indeed stored in … (see lines 1128 to 1131 in ca74a33).
I am thinking that in the future, we can utilize this information within …
I don't want to go too much into this because ultimately this is a matter of opinion, but I think there would be significant advantages to a design in which samplers are abstract objects that can be combined and extended without having to modify anything else. This would also allow users to implement their own samplers, and it would allow new experimental samplers to be implemented in a separate library.

```cpp
// llama_sampler base class (can be made accessible as a C interface in a similar way ggml-backend does it)
struct llama_sampler {
    virtual void sample_cpu(llama_token_data_array * candidates);
    virtual void sample_ggml(/* to be defined */);
};

// llama_sampler_chain is just another sampler that uses a list of samplers
struct llama_sampler_chain : llama_sampler {
    std::vector<llama_sampler *> samplers;

    void sample_cpu(llama_token_data_array * candidates) override;
    void sample_ggml(/* to be defined */) override;
};

void llama_sampler_chain::sample_cpu(llama_token_data_array * candidates) {
    for (auto sampler : samplers) {
        sampler->sample_cpu(candidates);
    }
}

// user API example:
{
    llama_sampler_t sampler = llama_sampler_chain_new();

    llama_sampler_chain_add(sampler, llama_sampler_top_k_new(10));
    llama_sampler_chain_add(sampler, llama_sampler_temperature_new(0.5));
    llama_sampler_chain_add(sampler, llama_sampler_top_p_new(0.9));
    llama_sampler_chain_add(sampler, llama_sampler_softmax_new());
    llama_sampler_chain_add(sampler, llama_sampler_sample_new());

    // if the sampler needs to be modified later the user can keep the pointer to it:
    // llama_sampler_t temp = llama_sampler_temperature_new(sampler, 0.5);
    // llama_sampler_chain_add(sampler, temp);
    // llama_sampler_temperature_set(temp, 0.7);

    // decode and then sample
    llama_decode(ctx, ...);
    llama_sampler_sample(sampler, ctx, ith);

    // future API: decode with sampling
    llama_decode_sample(ctx, sampler, ...);
}
```
Yes, this is a good suggestion. I will try to update the PR in the proposed way.
Second this. What I currently dislike about these changes is that you seem to get stuck to using …

EDIT: Something that would be nice though is leaving a low-level function to not lose functionality from the current API:

```cpp
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
llama_sampler_process(ctx, &candidates_p, llama_sampler_temperature_new(0.5));
```

idk how feasible that would be for the GPU implementation, but for testing GPU sampling against CPU the option of copying all logits / at least probs over will afaik be needed anyways (and not just the sampled token).
Force-pushed from bb3d182 to f648ca2
It wouldn't make sense to create a new sampler every time. You would be able to do something like this instead:

```cpp
auto smpl_temp = llama_sampler_temperature_new(0.5);
...
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
smpl_temp->sample_cpu(&candidates_p);
```
I just meant some util method to expose sample_cpu/sample_ggml depending on ctx in the C interface, that didn't seem planned from the comments. I'd be fine with something like:

```cpp
void llama_sample_cpu(struct llama_sampler * smpl, struct llama_token_data_array * candidates) {
    smpl->sample_cpu(candidates);
}
```

I saw line 1190 in dcf1359 …
I just noticed the old sampling API isn't even marked as deprecated (I thought it would be; my bad, I just saw this and quickly commented since I thought it would break my project). But IMO this isn't a good choice long-term from a maintenance perspective (new samplers would have to update both APIs). It seems fairly easy to make the new proposed API offer the same functionality as the old API.
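On the deprecation point: llama.h already ships a DEPRECATED(func, hint) macro, so once the new API covers the old one, the legacy calls could be flagged roughly like this (an illustrative sketch; the prototype is paraphrased rather than copied from the header, and this PR does not actually do it):

```cpp
// illustrative only: mark one of the old sampling calls as deprecated in llama.h,
// pointing users at the new llama_sampling_ API
DEPRECATED(LLAMA_API void llama_sample_top_k(
                   struct llama_context   * ctx,
                   llama_token_data_array * candidates,
                   int32_t                  k,
                   size_t                   min_keep),
        "use the llama_sampling_ API instead");
```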
Superseded by #9294
ref #5214

Overview

Remove `struct llama_sampling_context` from `common` and replace it with `struct llama_sampling` in the `llama` library. The entire `common/sampling` functionality is now part of the `llama` library.

API Changes

- `enum llama_sampler_type`
- `struct llama_sampling_params`
- `struct llama_sampling` and new `llama_sampling_` API (replaces the old `llama_sample_` and `llama_grammar_` APIs)
- `LLAMA_API_INTERNAL`

Implementation details

- `common/grammar-parser` is now in `src/llama-grammar`
- `llama_context` no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the `llama_sampling` objects as needed. As a consequence, the `llama_state` no longer serializes the RNG
- `struct llama_sampling` is very similar to the old `common/llama_sampling_context`. It supports the same parameters, grammar, token history and sampler sequences
- Sampling timings are now maintained in `llama_sampling` instead of `llama_context`. The grammar-related computations are timed separately
- `struct llama_sampling` keeps an internal list of token candidates, which is initialized upon passing the logits via `llama_sampling_set_logits`. This internal list can be optionally used by not providing an external candidates array (as in the past), which simplifies the API usage significantly for common use cases

Example

While the old way of maintaining the array of candidate tokens within the user code remains available, there is now a simpler implementation by utilizing the internal list of candidates in `llama_sampling`:
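(The original snippet is not reproduced here; the following is a reconstructed sketch of the simpler flow. `llama_sampling_init`, `llama_sampling_sample` and `llama_sampling_accept` are assumed names for the new API, while `llama_decode`, `llama_get_logits_ith` and `llama_sampling_set_logits` come from the existing header and the description above.)

```cpp
// sketch: sampling via the internal candidates list (assumed API names, see note above)
struct llama_sampling * smpl = llama_sampling_init(sparams);

llama_decode(ctx, batch);

// hand over the logits of the token position we want to sample from;
// the candidates array is now maintained inside smpl instead of in user code
llama_sampling_set_logits(smpl, llama_get_logits_ith(ctx, i));

// passing NULL for the candidates array makes the call use the internal list
const llama_token id = llama_sampling_sample(smpl, NULL);

llama_sampling_accept(smpl, id);
```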
TODO

- grammars : fix resampling logic regression #7424
- Apply "Add token healing to main and server" #7187

Future plan

- `struct llama_sampling` for offloading the sampling to the GPU. Can be extended with whatever extra information is necessary and utilized in the decoding API. Hopefully the current iteration is a good step in that direction.