llama : refactor sampling #8643
Conversation
Force-pushed from f208aa4 to f866cb9
I'm thinking there is no reason to have two separate structs. The API will be simplified:
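(The concrete snippet is not reproduced here. As a rough idea of the kind of simplification being discussed, the unified API could expose a handful of calls along these lines -- hypothetical prototypes for illustration only, not the actual signatures from this PR.)

```cpp
// hypothetical prototypes, loosely following the PR description; real names/params may differ
struct llama_sampling;   // opaque: params, RNG, grammar state, token history

struct llama_sampling * llama_sampling_init(const struct llama_sampling_params * params);
void                    llama_sampling_free(struct llama_sampling * smpl);

// feed the logits of the last decode, then draw one token and record it
void        llama_sampling_set_logits(struct llama_sampling * smpl, const float * logits);
llama_token llama_sampling_sample    (struct llama_sampling * smpl, llama_token_data_array * candidates);
void        llama_sampling_accept    (struct llama_sampling * smpl, llama_token token);
```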
Storing some of the sampling state (particularly the RNG) in a separate struct seems reasonable. I'm not so sure about the change to grammars though. Why is it being handled so differently from all of the other sampling techniques? If it's due to the statefulness, how about other stateful samplers such as mirostat?
Apologies in advance for the wall-of-text -- I'm trying to wrap my head around some things here, and going a bit stream-of-consciousness. If I was smarter or had a better handle on things, I could write more succinctly.
There isn't much in the …

I do still like the idea of keeping grammar reasonably partitioned away from sampling (and doing things like measuring grammar timing separately from sampling timing -- I think that will be very valuable as we continue to optimize and extend the grammar engine, and I saw you added a note about that in one of your earlier commits). Will combining these structures muddy those waters?

I almost wonder if the … But then again, what do you see as the definition of what a … Part of me feels like … I guess I also don't fully understand why the RNG shouldn't live within the confines of the …

"Good fences make good neighbors", as the saying goes, and overall I think I agree with you -- there's a bit of muddiness here, and this refactoring is very welcome.
Having a hard time giving feedback on this part, because I need to get a better handle on what a … is. I'm not sure how best to express the distinctives of each object, but maybe something like:
I'll admit that I don't fully understand how parallelism works within llama.cpp, and all of the different things that can exist with a global context, the individual job runners, how that ties into batching and shared memory, etc etc etc. So my ignorance about that might also play into my confusion on the rest of this as well. The naming conventions in the ownership hierarchy might also want to be standardized. With grammars, "short name" is on top:
And with sampling, the "long name" (…) is on top:
So the unclear ownership chain is perhaps also what's contributing to my vague feelings of confusion and uncertainty. It's also weird to me to see an object structure that has the name of the module itself (…).
👍 I like this change a lot.
These changes make me a bit uncomfortable. I really like the way that grammar is (currently) decoupled from sampling -- it makes the end-to-end grammar integration tests feel as clean as they are, and makes the GBNF validator program possible. The GBNF validator program is of dubious importance (I might be the only person to ever use that program), but the integration tests are pretty cool. Would the grammar integration tests be negatively impacted by bringing sampling into it (when it's previously been able to avoid it), or would it be cleaned up?
I'm trying to understand this one, and struggling a bit. Is part of the reason for this change the way that sampling and grammar are tied together a bit closely right now? In particular, I feel like the most involved piece of coupling between the two modules is the optimization that was added in #4306. That was a very important optimization (and I never want it to go away), but it had the unfortunate side-effect of passing control between grammar and sampling a couple of times in each loop, and it's not always clear (when I'm reading the code, anyways) which module is "on top" and driving the interaction. Especially with that weird …
👍 This is good.
I don't think of grammars as a sampling technique (akin to something like mirostat). Rather, it lives at a layer kinda' above the sampler, because it takes the logits that are calculated by the model and constrains the sampler, preventing it from considering tokens that don't match the grammar by masking out their logits. Grammar places boundaries on the sampler, but it itself is not a sampler.
The 2 structures will continue to exist separately in the internal implementation.
From the user's PoV, I'm looking for ways to eliminate … The current implementation on …
Hm, I think we can definitely consider the grammar as a sampler. If you think about it, in one way or another all sampling techniques do the same thing - given a set of candidate tokens with their respective probabilities, the sampler produces a new subset of tokens with new probabilities. So in that sense, applying grammar constraints can be looked at as a sampler, such as top-k for example.
No, we will keep the tests as they are. They would just need to include the internal header.
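As a concrete illustration of the "grammar as a sampler" view above, here is a minimal sketch (simplified stand-in types rather than the real llama_token_data_array): a top-k cut and a grammar mask are the same kind of transform over the candidate list.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };   // simplified llama_token_data
using candidates_t = std::vector<token_data>;

// top-k: keep only the k highest-logit candidates
void apply_top_k(candidates_t & cur, size_t k) {
    std::sort(cur.begin(), cur.end(),
              [](const token_data & a, const token_data & b) { return a.logit > b.logit; });
    if (cur.size() > k) {
        cur.resize(k);
    }
}

// grammar-style constraint: mask out candidates the grammar does not allow
// (here the "grammar" is just a set of allowed token ids)
void apply_grammar_mask(candidates_t & cur, const std::unordered_set<token_id> & allowed) {
    for (auto & td : cur) {
        if (allowed.count(td.id) == 0) {
            td.logit = -INFINITY;   // effectively removes the token from consideration
        }
    }
}
```

Both take a candidates array in and produce a smaller or reweighted candidates array out, which is the sense in which grammar constraints can be chained like top-k or any other stage.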
Thank you, that is a very helpful reply! Apologies in advance for my newbie perspective on this -- most of what I've learned about LLMs I've learned piecemeal, and I'm trying to learn on the fly. Thank you for your patience with me!
Aaah, this is where my thinking was different. Let me try to align with you.

In the sampling module, we have two classes of functions. As you noted, one takes in a set of candidate tokens with their respective probabilities, and the output is a new subset of tokens (length 0-N) with new probabilities. Examples of this include the logit bias map, guidance prompts, repetition penalties, and grammar constraints. You call these "samplers" -- but I guess I was thinking of them as "pre-samplers". Each one of these is applied in …

The second class of functions takes in a set of tokens and outputs exactly one llama_token id. These are applied in llama_sampling_sample_impl() after preparation is done, and only one of these can be called per run -- they can't be chained together. Examples of this are things like …

In one sense, both classes of functions are similar -- they both return a "set" of llama tokens (in so far as a single ID can be considered a "set") -- but they feel categorically different in that one is called in the … I also got my impression from things in the code like this comment, that refers to grammar as something that's done before sampling:
All of that feeds into why I was thinking of grammar logic as being more of a "constraint" than a "sampler" itself, but overall I'd like to align my understanding with yours.
You are technically correct. "samplers" is more appropriate for the functions that return the final token. So it's better to call the rest of the functions "constraints".
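To make the distinction concrete, a minimal sketch of the two shapes (simplified stand-in types, not the actual llama.cpp structs): a constraint rewrites the candidate set in place, while a sampler consumes it and returns exactly one token.

```cpp
#include <cstdint>
#include <random>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };   // simplified llama_token_data
using candidates_t = std::vector<token_data>;

// "constraint": candidates in, candidates out (top-k, top-p, penalties, grammar, ...);
// any number of these can be chained
using constraint_fn = void (*)(candidates_t & cur);

// "sampler": candidates in, exactly one token out (greedy, dist, mirostat, ...);
// only one of these runs per step
using sampler_fn = token_id (*)(const candidates_t & cur, std::mt19937 & rng);

// example sampler: greedy argmax over the logits (assumes a non-empty candidate list)
token_id sample_greedy(const candidates_t & cur, std::mt19937 & /*rng*/) {
    token_id best       = cur.front().id;
    float    best_logit = cur.front().logit;
    for (const auto & td : cur) {
        if (td.logit > best_logit) {
            best_logit = td.logit;
            best       = td.id;
        }
    }
    return best;
}
```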
Not exactly - the grammar constraints can also be applied after the sampler (see lines 324 to 347 in 50e0535).
I'm revisiting the implementation, and applying the grammar post-sampling does not seem to be equivalent to applying it pre-sampling. While the existing implementation in …
The prepare function on …
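For reference, the post-sampling flow being compared above (the #4306-style optimization) looks roughly like the sketch below; the grammar helpers are assumed names standing in for the real llama.cpp calls.

```cpp
#include <cstdint>
#include <vector>

using token_id = int32_t;
struct token_data { token_id id; float logit; float p; };
using candidates_t = std::vector<token_data>;
struct grammar_state;   // stands in for llama_grammar

// assumed helpers (the real equivalents live in llama.cpp / common/sampling.cpp):
bool     grammar_allows(const grammar_state & g, token_id id);                  // cheap single-token check
void     grammar_mask_candidates(const grammar_state & g, candidates_t & cur);  // full (expensive) constraint pass
token_id sample_token(candidates_t & cur);                                      // whatever final sampler is configured

// "sample first, consult the grammar only on rejection"
token_id sample_with_grammar(const grammar_state & g, candidates_t cur) {
    const token_id id = sample_token(cur);    // 1. sample optimistically, grammar untouched
    if (grammar_allows(g, id)) {
        return id;                            // 2. accepted: the expensive grammar pass was skipped
    }
    grammar_mask_candidates(g, cur);          // 3. rejected: apply the full grammar constraint
    return sample_token(cur);                 //    ... and resample from the constrained set
}
```

The question raised above is whether steps 1 and 3 always yield the same result distribution as masking the candidates before sampling in the first place.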
Force-pushed from 2ad156c to a880be2
Force-pushed from beebdfd to 43440c0
Force-pushed from 299d255 to bebf5d7
Force-pushed from d03a5a2 to 5a9753b
Force-pushed from 267f138 to 5243e3f
Force-pushed from 62984db to 694c4b1
Force-pushed from a5d664c to 6420268
I think this should be good to merge. Will leave it for a day or two for any comments, and then merge.
What would be the path to use this API with GPU sampling? I would expect that we will need to add a function similar to …
The sampling chain is indeed stored in … (see lines 1128 to 1131 in ca74a33).
I am thinking that in the future, we can utilize this information within …
I don't want to go too much into this because ultimately this is a matter of opinion, but I think there would be significant advantages to a design in which samplers are abstract objects that can be combined and extended without having to modify anything else. This would also allow users to implement their own samplers, and it would allow new experimental samplers to be implemented in a separate library.

```cpp
// llama_sampler base class (can be made accessible as a C interface in a similar way ggml-backend does it)
struct llama_sampler {
    virtual void sample_cpu(llama_token_data_array * candidates);
    virtual void sample_ggml(/* to be defined */);
};

// llama_sampler_chain is just another sampler that uses a list of samplers
struct llama_sampler_chain : llama_sampler {
    std::vector<llama_sampler *> samplers;

    void sample_cpu(llama_token_data_array * candidates) override;
    void sample_ggml(/* to be defined */) override;
};

void llama_sampler_chain::sample_cpu(llama_token_data_array * candidates) {
    for (auto sampler : samplers) {
        sampler->sample_cpu(candidates);
    }
}

// user API example:
{
    llama_sampler_t sampler = llama_sampler_chain_new();

    llama_sampler_chain_add(sampler, llama_sampler_top_k_new(10));
    llama_sampler_chain_add(sampler, llama_sampler_temperature_new(0.5));
    llama_sampler_chain_add(sampler, llama_sampler_top_p_new(0.9));
    llama_sampler_chain_add(sampler, llama_sampler_softmax_new());
    llama_sampler_chain_add(sampler, llama_sampler_sample_new());

    // if the sampler needs to be modified later the user can keep the pointer to it:
    // llama_sampler_t temp = llama_sampler_temperature_new(sampler, 0.5);
    // llama_sampler_chain_add(sampler, temp);
    // llama_sampler_temperature_set(temp, 0.7);

    // decode and then sample
    llama_decode(ctx, ...);
    llama_sampler_sample(sampler, ctx, ith);

    // future API: decode with sampling
    llama_decode_sample(ctx, sampler, ...);
}
```
Yes, this is a good suggestion. I will try to update the PR in the proposed way.
Second this. What I currently dislike about these changes is that you seem to get stuck to using …

EDIT: Something that would be nice though is leaving a low-level function to not lose functionality from the current API:

```cpp
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
llama_sampler_process(ctx, &candidates_p, llama_sampler_temperature_new(0.5));
```

idk how feasible that would be for the GPU implementation, but for testing GPU sampling against CPU the option of copying all logits / at least probs over will afaik be needed anyways (and not just the sampled token).
Force-pushed from bb3d182 to f648ca2
It wouldn't make sense to create a new sampler every time. You would be able to do something like this instead:

```cpp
auto smpl_temp = llama_sampler_temperature_new(0.5);
...
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
smpl_temp->sample_cpu(&candidates_p);
```
I just meant some util method to expose sample_cpu/sample_ggml depending on ctx in the C interface, that didn't seem planned from the comments. I'd be fine with something like:

```cpp
void llama_sample_cpu(struct llama_sampler * smpl, struct llama_token_data_array * candidates) {
    smpl->sample_cpu(candidates);
}
```

I saw line 1190 in dcf1359 …
I just noticed the old sampling API isn't even marked as deprecated (I thought it would be; my bad, I just saw this and quickly commented since I thought it would break my project). But IMO this isn't a good choice long-term from a maintenance perspective (new samplers would have to update both APIs). It seems fairly easy to make the new proposed API offer the same functionality as the old API.
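On the deprecation point: llama.h already ships a DEPRECATED(func, hint) macro, so once the new API covers the old one, the legacy calls could be flagged roughly like this (an illustrative sketch; the prototype is paraphrased rather than copied from the header, and this PR does not actually do it):

```cpp
// illustrative only: mark one of the old sampling calls as deprecated in llama.h,
// pointing users at the new llama_sampling_ API
DEPRECATED(LLAMA_API void llama_sample_top_k(
                   struct llama_context   * ctx,
                   llama_token_data_array * candidates,
                   int32_t                  k,
                   size_t                   min_keep),
        "use the llama_sampling_ API instead");
```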
Superseded by #9294
ref #5214

Overview

Remove `struct llama_sampling_context` from `common` and replace it with `struct llama_sampling` in the `llama` library. The entire `common/sampling` functionality is now part of the `llama` library.

API Changes

- `enum llama_sampler_type`
- `struct llama_sampling_params`
- `struct llama_sampling` and new `llama_sampling_` API (replaces the old `llama_sample_` and `llama_grammar_` APIs)
- `LLAMA_API_INTERNAL`

Implementation details

- `common/grammar-parser` is now in `src/llama-grammar`
- `llama_context` no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the `llama_sampling` objects as needed. As a consequence, the `llama_state` no longer serializes the RNG
- `struct llama_sampling` is very similar to the old `common/llama_sampling_context`. It supports the same parameters, grammar, token history and sampler sequences
- Sampling timings are now maintained in `llama_sampling` instead of `llama_context`. The grammar-related computations are timed separately
- `struct llama_sampling` keeps an internal list of token candidates, which is initialized upon passing the logits via `llama_sampling_set_logits`. This internal list can be optionally used by not providing an external candidates array (as in the past), which simplifies the API usage significantly for common use cases

Example

While the old way of maintaining the array of candidate tokens within the user code remains available, there is now a simpler implementation by utilizing the internal list of candidates in `llama_sampling`:
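(The original snippet is not reproduced here; the following is a reconstructed sketch of the simpler flow. `llama_sampling_init`, `llama_sampling_sample` and `llama_sampling_accept` are assumed names for the new API, while `llama_decode`, `llama_get_logits_ith` and `llama_sampling_set_logits` come from the existing header and the description above.)

```cpp
// sketch: sampling via the internal candidates list (assumed API names, see note above)
struct llama_sampling * smpl = llama_sampling_init(sparams);

llama_decode(ctx, batch);

// hand over the logits of the token position we want to sample from;
// the candidates array is now maintained inside smpl instead of in user code
llama_sampling_set_logits(smpl, llama_get_logits_ith(ctx, i));

// passing NULL for the candidates array makes the call use the internal list
const llama_token id = llama_sampling_sample(smpl, NULL);

llama_sampling_accept(smpl, id);
```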
TODO

- grammars : fix resampling logic regression #7424
- Apply "Add token healing to main and server" #7187

Future plan

- `struct llama_sampling` for offloading the sampling to the GPU. Can be extended with whatever extra information is necessary and utilized in the decoding API. Hopefully the current iteration is a good step in that direction.