
llama : refactor sampling #8643

Closed
wants to merge 1 commit into from

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Jul 23, 2024

ref #5214

Overview

Remove struct llama_sampling_context from common and replace it with struct llama_sampling in the llama library. The entire common/sampling functionality is now part of the llama library.

API Changes

  • Add enum llama_sampler_type
  • Add struct llama_sampling_params
  • Add struct llama_sampling and new llama_sampling_ API (replaces the old llama_sample_ and llama_grammar_ APIs)
  • Remove LLAMA_API_INTERNAL
  • Remove Classifier-Free Guidance related API
  • Remove Prompt Penalty support

Implementation details

  • Move common/grammar-parser into src/llama-grammar
  • The llama_context no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the llama_sampling objects as needed (see the lifecycle sketch after this list). As a consequence, the llama_state no longer serializes the RNG
  • The struct llama_sampling is very similar to the old common/llama_sampling_context. It supports the same parameters, grammar, token history and sampler sequences.
  • The sampling timings are now performed by llama_sampling instead of llama_context. The grammar-related computations are timed separately
  • The struct llama_sampling keeps an internal list of token candidates, which is initialized upon passing the logits via llama_sampling_set_logits. This internal list can be optionally used by not providing an external candidates array (as in the past) which simplifies the API usage significantly for common use cases
  • The grammar code has been refactored, hopefully it is a bit easier to read. No functional changes.
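
To make the new ownership model concrete, here is a minimal lifecycle sketch of user code under this change. The init/free/default-params functions are assumptions based on this description, not a final API; the actual sampling calls are shown in the Example section below:

// Hypothetical sketch - signatures assumed, not final
llama_sampling_params sparams = llama_sampling_default_params();   // assumed helper
struct llama_sampling * smpl  = llama_sampling_init(model, sparams); // assumed signature

// ... decode and sample as shown in the Example section below ...

// saving/loading the sampling state (incl. the RNG) is now the user's job,
// since llama_state no longer serializes it
llama_sampling_free(smpl); // assumed counterpart to llama_sampling_init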

Example

While the old way of maintaining the array of candidate tokens within the user code remains available, there is now a simpler implementation by utilizing the internal list of candidates in llama_sampling:

  • before
auto   n_vocab = llama_n_vocab(model);
auto * logits  = llama_get_logits_ith(ctx, i_batch[i]);

std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);

for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    candidates.emplace_back(llama_token_data{ token_id, logits[token_id], 0.0f });
}

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };

// note that we used to pass `llama_context ctx` to the sampling API
llama_sample_top_k(ctx, &candidates_p, top_k, 1);
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temp (ctx, &candidates_p, temp);

const llama_token new_token_id = llama_sample_token(ctx, &candidates_p);
  • after
const auto * logits = llama_get_logits_ith(ctx, i_batch[i]);

llama_sampling_set_logits(smpl, logits);

// we now pass `llama_sampling smpl` and no longer need to maintain the candidates explicitly
llama_sampling_top_k(smpl, nullptr);
llama_sampling_top_p(smpl, nullptr);
llama_sampling_temp (smpl, nullptr);

const llama_token new_token_id = llama_sampling_sample_dist(smpl, nullptr);

TODO

Future plan

  • Utilize the new struct llama_sampling for offloading the sampling to the GPU. Can be extended with whatever extra information is necessary and utilized in the decoding API. Hopefully the current iteration is a good step in that direction.

@github-actions github-actions bot added testing Everything test related examples server labels Jul 23, 2024
@ggerganov ggerganov changed the base branch from gg/llama-reorganize to master July 23, 2024 10:13
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch from f208aa4 to f866cb9 Compare July 23, 2024 10:14
@github-actions github-actions bot added the android Issues specific to Android label Jul 24, 2024
@ggerganov
Owner Author

I'm thinking there is no reason to have two separate structs llama_sampling and llama_grammar, so struct llama_grammar should be absorbed completely into struct llama_sampling and not exposed through the API. Will also move the grammar_parser from common into llama-grammar.cpp.

The API will be simplified (see the usage sketch after this list):

  • Remove:

    • llama_grammar_init
    • llama_grammar_free
    • llama_grammar_copy
  • Rename

    • llama_grammar_sample -> llama_sampling_grammar
    • llama_grammar_accept_token -> llama_sampling_accept_token
  • Add

    • llama_sampling_copy
  • Hide

    • enum llama_gretype
    • struct llama_grammar_element
  • Update

    • llama_sampling_init optionally takes a grammar string
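
A rough usage sketch of the simplified, grammar-through-sampling API described above; the exact signatures (in particular how llama_sampling_init receives the grammar string and root rule) are assumptions based on this list:

// Hypothetical sketch - signatures assumed from the list above
struct llama_sampling * smpl = llama_sampling_init(model, sparams, grammar_str, "root");

llama_sampling_set_logits(smpl, llama_get_logits_ith(ctx, idx));

llama_sampling_grammar(smpl, nullptr);                             // was llama_grammar_sample
const llama_token id = llama_sampling_sample_dist(smpl, nullptr);

llama_sampling_accept_token(smpl, id);                             // was llama_grammar_accept_token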

@martindevans
Contributor

Storing some of the sampling state (particularly the RNG) in a separate llama_sampling object seems like a good change (particularly for batching, where I think different sequences shared an RNG before this change?).

I'm not so sure about the change to grammars though. Why is it being handled so differently to all of the other sampling techniques? If it's due to the statefulness, how about other stateful samplers such as mirostat?

@HanClinto
Collaborator

HanClinto commented Jul 24, 2024

Apologies in advance for the wall-of-text -- I'm trying to wrap my head around some things here, and going a bit stream-of-consciousness. If I was smarter or had a better handle on things, I could write more succinctly.

I'm thinking there is no reason to have two separate structs llama_sampling and llama_grammar, so struct llama_grammar should be absorbed completely in struct llama_sampling and not exposed through the API.

There isn't much in the llama_sampling structure -- mainly just timing metrics, right (and the RNG)?

I do still like the idea of keeping grammar reasonably partitioned away from sampling (and doing things like measuring grammar timing as separate from sampling timing -- I think that will be very valuable as we continue to optimize and extend the grammar engine, and I saw you added a note about that in one of your earlier commits).

Will combining these structures muddy those waters? I almost wonder if the llama_sampling should be merged into llama_grammar struct, rather than the other way 'round.

But then again, what do you see as the definition of what a llama_sampling object is, and how is it distinct from llama_sampling_context and llama_sampling_params?

Part of me feels like llama_sampling should also include a function pointer to the actual sampling function used, but then that's getting into some crossover into llama_sampling_params.

I guess I also don't fully understand why the RNG shouldn't live within the confines of the llama_sampling_context.

"Good fences make good neighbors", as the saying goes, and overall I think I agree with you -- there's a bit of muddiness here, and this refactoring is very welcome.

  • Add

    • llama_sampling_copy

Having a hard time giving feedback on this part, because I need to get a better handle on what a llama_sampling object is, and how it's different from context / param objects.

I'm not sure how best to express the distinctives of each object, but maybe something like:

| object | description | instances |
| --- | --- | --- |
| llama_sampling_params | CLI-type options to configure the sampler | Global, one param config |
| llama_sampling_context | Working space for each inference instance to store sampled tokens (?) | One per job runner |
| llama_sampling | Timing metrics and RNG state for each sampler instance (?) | One per job runner (?) |
| llama_grammar_rules | Parsed grammar rules from GBNF | Global, shared amongst all runners that use this grammar (when grammar present) |
| llama_grammar_stacks | Working stacks for tracking branches of valid grammar trees built along the way as tokens are sampled | One per job runner (when grammar present) |
| llama_grammar | Wrapper for grammar rules and grammar stacks, along with some small scratch-pad memory for unicode characters that span multiple tokens | One per job runner (when grammar present) |

I'll admit that I don't fully understand how parallelism works within llama.cpp, and all of the different things that can exist with a global context, the individual job runners, how that ties into batching and shared memory, etc etc etc. So my ignorance about that might also play into my confusion on the rest of this as well.

The naming conventions in the ownership hierarchy might also want to be standardized. With grammars, "short name" is on top:

  • llama_grammar
    • llama_grammar_rules
    • llama_grammar_stacks

And with sampling, the "long name" (_context) is on top, and llama_sampling is buried:

  • llama_sampling_context
    • llama_sampling_params
    • llama_sampling
    • llama_grammar

So the unclear ownership chain is perhaps also what's contributing to my vague feelings of confusion and uncertainty.

It's also weird to me to see an object structure that is the name of the module itself (llama_grammar and llama_sampling) -- it feels like they should be qualified with something, like renaming llama_grammar to llama_grammar_context to match the sampling paradigm.

Will also move the grammar_parser from common into llama-grammar.cpp

👍 I like this change a lot.

  • Remove:

    • llama_grammar_init
    • llama_grammar_free
    • llama_grammar_copy
      ...
  • Update

    • llama_sampling_init optionally takes a grammar string

These changes make me a bit uncomfortable. I really like the way that grammar is (currently) decoupled from sampling -- it makes the end-to-end grammar integration tests feel as clean as they are, and makes the GBNF validator program possible. The GBNF validator program is of dubious importance (I might be the only person to ever use that program), but the integration tests are pretty cool.

Would the grammar integration tests be negatively impacted by bringing sampling into it (when it's previously been able to avoid it), or would it be cleaned up?

  • Rename

    • llama_grammar_sample -> llama_sampling_grammar
    • llama_grammar_accept_token -> llama_sampling_accept_token

I'm trying to understand this one, and struggling a bit. Is part of the reason for this change because of the way that sampling and grammar are tied together a bit closely right now?

In particular, I feel like the most involved piece of coupling between the two modules is the optimization that was added in #4306. That was a very important optimization (and I never want it to go away), but it had the unfortunate side-effect of passing control between grammar and sampling a couple of times in each loop, and it's not always clear (when I'm reading the code, anyways) which module is "on top" and driving the interaction. Especially with that weird is_resampling parameter and the way it changes the control flow -- it gives me that unsettling feeling I always get when I'm debugging recursive code. It's always felt that way to me -- not because it's bad, but I think it's just an inherent / necessary level of complexity.

  • Hide

    • enum llama_gretype
    • struct llama_grammar_element

👍 This is good.

@HanClinto
Collaborator

I'm not so sure about the change to grammars though. Why is it being handled so differently to all of the other sampling techniques? If it's due to the statefulness, how about other stateful samplers such as mirostat?

I don't think of grammars as a sampling technique (akin to something like mirostat). Rather, it lives at a layer kinda' above the sampler: it takes the logits that are calculated by the model and constrains the sampler, preventing it from considering tokens that don't match the grammar by setting their logits to -INFINITY (i.e. zero probability).

Grammar places boundaries on the sampler, but it itself is not a sampler.
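
As a purely illustrative sketch of that framing (not the actual llama.cpp implementation): a constraint of this kind walks the candidate array and masks disallowed tokens by pushing their logits to -INFINITY, so the downstream sampler can never pick them. The is_allowed predicate below stands in for the real grammar check and is hypothetical:

#include <cmath>      // INFINITY
#include <functional>

// Illustrative only: mask candidates rejected by a grammar (or any other constraint)
static void apply_constraint(llama_token_data_array * candidates,
                             const std::function<bool(llama_token)> & is_allowed) {
    for (size_t i = 0; i < candidates->size; i++) {
        if (!is_allowed(candidates->data[i].id)) {
            candidates->data[i].logit = -INFINITY; // token can no longer be sampled
        }
    }
}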

@ggerganov
Owner Author

ggerganov commented Jul 25, 2024

I do still like the idea of keeping grammar reasonably partitioned away from sampling (and doing things like measuring grammar timing as separate from sampling timing -- I think that will be very valuable as we continue to optimize and extend the grammar engine, and I saw you added a note about that in one of your earlier commits).

The 2 structures will continue to exist separately in the internal llama implementation, but I'm thinking that there is no need to expose llama_grammar through the public API.

But then again, what do you see as the definition of what a llama_sampling object is, and how is it distinct from llama_sampling_context and llama_sampling_params?

From the user's PoV, I'm looking for ways to eliminate common: llama_sampling_context and only have llama: llama_sampling. The goal is for llama_sampling to become an object that holds the entire sampling state. That includes not just the RNG and timings, but also things like previous tokens (for repetition penalties), mirostat state, and parameters like temperature, top-k, top-p, etc. This is because we eventually want to be able to offload the sampling to the GPU as well (see #5214 for more discussion on this topic). So we need to have some object that contains all the information necessary to perform sampling inside the llama library.

The current implementation on master does not provide that, but on the other hand it is quite low-level and allows the user to implement almost any sort of sampling approach in user code. This is also a nice feature to have and I'll be trying to keep this option available as well.
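
To make that goal concrete, here is a rough sketch of the kind of state such an object would need to carry; the struct and field names below are hypothetical and only meant to enumerate what "the entire sampling state" covers in this discussion:

#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch only - not the actual struct from this PR
struct llama_sampling_state_sketch {
    llama_sampling_params    params;       // temperature, top-k, top-p, samplers chain, ...
    std::mt19937             rng;          // RNG (no longer serialized by llama_state)

    std::vector<llama_token> prev;         // token history for repetition penalties
    float                    mirostat_mu;  // mirostat running state

    struct llama_grammar *   grammar;      // grammar state, kept internal to the library

    std::vector<llama_token_data> cur;     // internal candidates, filled from the logits

    int64_t t_sample_us;                   // sampling time, measured by llama_sampling
    int64_t t_grammar_us;                  // grammar time, measured separately
};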

I don't think of grammars as a sampling technique

Hm, I think we can definitely consider the grammar as a sampler. If you think about it, in one way or another all sampling techniques do the same thing - given a set of candidate tokens with their respective probabilities, the sampler produces a new subset of tokens with new probabilities. So in that sense, applying grammar constraints can be viewed as a sampling step, just like top-k for example.

Would the grammar integration tests be negatively impacted?

No, we will keep the tests as they are. They would just need to include the internal header llama-grammar.h instead of LLAMA_API_INTERNAL + llama.h, but this is a good change.

@HanClinto
Collaborator

Thank you, that is a very helpful reply! Apologies in advance for my newbie perspective on this -- most of what I've learned about LLMs I've learned piecemeal, and I'm trying to learn on the fly. Thank you for your patience with me!

If you think about it, in one way or another all sampling techniques do the same thing - given a set of candidate tokens with their respective probabilities, the sampler produces a new subset of tokens with new probabilities.

Aaah, this is where my thinking was different. Let me try to align with you.

In the sampling module, we have two classes of functions. As you noted, one takes in a set of candidate tokens with their respective probabilities, and the output is a new subset of tokens (length 0-N) with new probabilities. Examples of this include the logit bias map, guidance prompts, repetition penalties, and grammar constraints.

You call these "samplers" -- but I guess I was thinking of them as "pre-samplers". Each one of these is applied in llama_sampling_prepare, before sampling logic is applied. They are non-exclusive to each other, and multiple can be applied / chained-together in a single sampling "run".

The second class of functions takes in a set of tokens and outputs exactly one llama_token id. These are applied in llama_sampling_sample_impl() after preparation is done, and only one of these can be called per run -- they can't be chained together. Examples of this are things like llama_sampling_sample_greedy(), llama_sampling_sample_mirostat(), softmax, etc -- basically anything with the format of id = llama_sampling_foo() is what I was thinking of as a "sampler". Do you call this something different, or have a better way for me to think of these?

In one sense, both classes of functions are similar -- that they both return a "set" of llama tokens (in so far as a single ID can be considered a "set") -- but they feel categorically different in that one is called in the prepare function (and many can be applied), and the latter always returns exactly one token (and cannot be chained together).
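
Expressed as signatures, the distinction being drawn here looks roughly like this (the names are only illustrative of the two shapes, not an actual API):

// "Pre-samplers"/constraints: candidates in, narrowed/reweighted candidates out.
// Chainable - several can be applied in a single sampling run.
void apply_top_k    (llama_token_data_array * candidates /*, k             */);
void apply_penalties(llama_token_data_array * candidates /*, token history */);
void apply_grammar  (llama_token_data_array * candidates /*, grammar state */);

// "Samplers": candidates in, exactly one token out. Terminal - one per run.
llama_token sample_greedy  (llama_token_data_array * candidates);
llama_token sample_dist    (llama_token_data_array * candidates /*, rng          */);
llama_token sample_mirostat(llama_token_data_array * candidates /*, tau, eta, mu */);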

I also got my impression from things in the code like this comment, that refers to grammar as something that's done before sampling:

    // apply grammar checks before sampling logic
    if (apply_grammar && ctx_sampling->grammar != NULL) {
        llama_grammar_sample(ctx_sampling->grammar, ctx_main, &cur_p);
    }

All of that feeds into why I was thinking of grammar logic as being more of a "constraint" than a "sampler" itself, but overall I'd like to align my understanding to yours.

@ggerganov
Owner Author

You call these "samplers" -- but I guess I was thinking of them as "pre-samplers". Each one of these is applied in llama_sampling_prepare, before sampling logic is applied. They are non-exclusive to each other, and multiple can be applied / chained-together in a single sampling "run".

You are technically correct. "samplers" is more appropriate for the functions that return the final token. So it's better to call the rest of the functions "constraints".

I also got my impression from things in the code like this comment, that refers to grammar as something that's done before sampling:

Not exactly - the grammar constraints can also be applied after the sampler:

if (ctx_sampling->grammar != NULL && !is_resampling) {
    // Get a pointer to the logits
    float * logits = llama_get_logits_ith(ctx_main, idx);

    // Create an array with a single token data element for the sampled id
    llama_token_data single_token_data = {id, logits[id], 0.0f};
    llama_token_data_array single_token_data_array = { &single_token_data, 1, false };

    // Apply grammar constraints to the single token
    llama_sample_grammar(ctx_main, &single_token_data_array, ctx_sampling->grammar);

    // Check if the token is valid according to the grammar by seeing if its logit has been set to -INFINITY
    bool is_valid = single_token_data_array.data[0].logit != -INFINITY;

    // If the token is not valid according to the grammar, perform resampling
    if (!is_valid) {
        LOG("Resampling because token %d: '%s' does not meet grammar rules\n", id, llama_token_to_piece(ctx_main, id).c_str());

        // Restore logits from the copy
        std::copy(original_logits.begin(), original_logits.end(), logits);

        return llama_sampling_sample_impl(ctx_sampling, ctx_main, ctx_cfg, idx, /* is_resampling= */ true);
    }
}

I'm revisiting the implementation, and applying the grammar post-sampling does not seem to be equivalent to applying it pre-sampling. The existing implementation in common sort of makes this assumption in order to achieve the optimization in #4306, and I'm not very sure this is correct.

In one sense, both classes of functions are similar -- that they both return a "set" of llama tokens (in so far as a single ID can be considered a "set") -- but they feel categorically different in that one is called in the prepare function (and many can be applied), and the latter always returns exactly one token (and cannot be chained together).

The prepare function on master was introduced mostly to avoid code repetition (IIRC). I don't think it makes much sense to have it in the API and will be trying to avoid it in the refactoring

@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch 2 times, most recently from 2ad156c to a880be2 Compare July 26, 2024 18:54
@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Aug 1, 2024
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch from beebdfd to 43440c0 Compare August 5, 2024 07:08
@mofosyne mofosyne added the refactoring Refactoring label Aug 6, 2024
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch from 299d255 to bebf5d7 Compare August 6, 2024 15:32
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch 6 times, most recently from 267f138 to 5243e3f Compare August 21, 2024 08:30
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch 4 times, most recently from 62984db to 694c4b1 Compare August 29, 2024 10:20
@ggerganov ggerganov added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Aug 29, 2024
@ggerganov ggerganov marked this pull request as ready for review August 29, 2024 15:31
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch from a5d664c to 6420268 Compare August 30, 2024 08:13
@ggerganov ggerganov requested a review from slaren August 31, 2024 09:15
@ggerganov
Owner Author

I think this should be good to merge. Will leave it for a day or two for any comments, and then merge

@slaren
Collaborator

slaren commented Sep 2, 2024

Future plan

Utilize the new struct llama_sampling for offloading the sampling to the GPU. Can be extended with whatever extra information is necessary and utilized in the decoding API. Hopefully the current iteration is a good step in that direction.

What would be the path to use this API with GPU sampling? I would expect that we will need to add a function similar to llama_decode_sample(ctx, batch, sampling, n_tokens) to support sampling after evaluation, which will allow us to evaluate multiple tokens without requiring a synchronization with the GPU, minimizing downtime. But for that to be possible, the sampling object needs to represent the entire sampling chain. Maybe that was meant to be the purpose of llama_sampler_type and samplers in llama_sampling_params? Currently that's unused in llama.cpp.
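
A sketch of how such a hypothetical llama_decode_sample call might be used; everything here, including the function itself, is speculative and only illustrates the "no per-token synchronization" idea from the comment above:

// Entirely hypothetical - llama_decode_sample does not exist in this PR
for (int i = 0; i < n_steps; i++) {
    // evaluate the batch and run the whole sampling chain on the device,
    // without synchronizing with the host after every token
    llama_decode_sample(ctx, batch, smpl, batch.n_tokens);

    // ... build the next batch from the device-side sampled tokens ...
}
// synchronize once at the end (or whenever the host actually needs the tokens)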

@ggerganov
Owner Author

The sampling chain is indeed stored in samplers. Currently, the information about the samplers chain is used from the user code only through the llama_sampling_sample() function:

llama.cpp/include/llama.h

Lines 1128 to 1131 in ca74a33

/// @details Sample a token using the configured samplers (see "llama_sampling_params.samplers").
LLAMA_API llama_token llama_sampling_sample(
        struct llama_sampling * smpl,
        llama_token_data_array * candidates);

I am thinking that in the future, we can utilize this information within llama to append the necessary operations for GPU-side sampling.
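
For reference, a sketch of how that configured chain would be consumed from user code in this PR; the enumerator names, the samplers field layout and the default-params helper are assumptions based on this thread:

// Hypothetical sketch - enumerator names and field types assumed
llama_sampling_params sparams = llama_sampling_default_params();  // assumed helper
sparams.samplers = { LLAMA_SAMPLER_TYPE_TOP_K,
                     LLAMA_SAMPLER_TYPE_TOP_P,
                     LLAMA_SAMPLER_TYPE_TEMPERATURE };

struct llama_sampling * smpl = llama_sampling_init(model, sparams);

llama_sampling_set_logits(smpl, llama_get_logits_ith(ctx, idx));

// runs the configured chain (top-k -> top-p -> temperature) and samples a token
const llama_token id = llama_sampling_sample(smpl, /*candidates=*/nullptr);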

@slaren
Collaborator

slaren commented Sep 2, 2024

I don't want to go too much into this because ultimately this is a matter of opinion, but I think there would be significant advantages to a design in which samplers are abstract objects that can be combined and extended without having to modify anything else. This would also allow users to implement their own samplers, and it would allow new experimental samplers to be implemented in a separate library.

// llama_sampler base class (can be made accessible as a C interface in a similar way ggml-backend does it)
struct llama_sampler {
    virtual void sample_cpu(llama_token_data_array * candidates);
    virtual void sample_ggml(/* to be defined */);
};

// llama_sampler_chain is just another sampler that uses a list of samplers
struct llama_sampler_chain : llama_sampler {
    std::vector<llama_sampler *> samplers;

    void sample_cpu(llama_token_data_array * candidates) override;
    void sample_ggml(/* to be defined */) override;
};

void llama_sampler_chain::sample_cpu(llama_token_data_array * candidates) {
    for (auto sampler : samplers) {
        sampler->sample_cpu(candidates);
    }
}

// user API example:
{
    llama_sampler_t sampler = llama_sampler_chain_new();
    llama_sampler_chain_add(sampler, llama_sampler_top_k_new(10));
    llama_sampler_chain_add(sampler, llama_sampler_temperature_new(0.5));
    llama_sampler_chain_add(sampler, llama_sampler_top_p_new(0.9));
    llama_sampler_chain_add(sampler, llama_sampler_softmax_new());
    llama_sampler_chain_add(sampler, llama_sampler_sample_new());

    // if the sampler needs to be modified later the user can keep the pointer to it:
    // llama_sampler_t temp = llama_sampler_temperature_new(sampler, 0.5);
    // llama_sampler_chain_add(sampler, temp);
    // llama_sampler_temperature_set(temp, 0.7);

    // decode and then sample
    llama_decode(ctx, ...);
    llama_sampler_sample(sampler, ctx, ith);

    // future API: decode with sampling
    llama_decode_sample(ctx, sampler, ...);
}

@ggerganov
Owner Author

Yes, this is a good suggestion. I will try to update the PR in the proposed way.

@arlo-phoenix
Contributor

arlo-phoenix commented Sep 2, 2024

I don't want to go too much into this because ultimately this is a matter of opinion, but I think there would be significant advantages to a design in which samplers are abstract objects that can be combined and extended without having to modify anything else. This would also allow users to implement their own samplers, and it would allow new experimental samplers to be implemented in a separate library.

   llama_sampler_t sampler = llama_sampler_chain_new();
   llama_sampler_chain_add(sampler, llama_sampler_top_k_new(10));
   llama_sampler_chain_add(sampler, llama_sampler_temperature_new(0.5));
   llama_sampler_chain_add(sampler, llama_sampler_top_p_new(0.9));
   llama_sampler_chain_add(sampler, llama_sampler_softmax_new());
   llama_sampler_chain_add(sampler, llama_sampler_sample_new());

Second this. What I currently dislike about these changes is that you seem to be stuck with using llama_sampling_params, which is very annoying for wrappers imo (I've seen llama_sampling.h, but then you'd just have to deal with the full previous API again and not benefit from this). The user API from @slaren on the other hand looks great and makes experimenting with samplers much easier. I have a WIP sampler logit/probs visualizer for each sampling step that allows any chain of llama-cpp samplers and custom numpy samplers; that project would break / become very annoying with the current PR (unless llama_sampling is wrapped), whereas with the proposed API it would instead get simplified. So thank you for going through with the suggestion, and thanks to slaren for proposing it.

EDIT: Something that would be nice though is leaving a low level function to not lose functionality from the current API

...
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
llama_sampler_process(ctx, &candidates_p, llama_sampler_temperature_new(0.5))

idk how feasible that would be for the GPU implementation, but for testing GPU sampling against the CPU, the option of copying over all the logits (or at least the probs), and not just the sampled token, will afaik be needed anyway.

@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling branch from bb3d182 to f648ca2 Compare September 3, 2024 07:33
@ggerganov ggerganov marked this pull request as draft September 3, 2024 07:34
@ggerganov ggerganov removed the request for review from slaren September 3, 2024 07:34
@ggerganov
Owner Author

EDIT: Something that would be nice though is leaving a low level function to not lose functionality from the current API

...
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
llama_sampler_process(ctx, &candidates_p, llama_sampler_temperature_new(0.5))

It wouldn't make sense to create a new sampler every time. You would be able to do something like this instead:

auto smpl_temp = llama_sampler_temperature_new(0.5);
...

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
smpl_temp->sample_cpu(&candidates_p);

@ggerganov ggerganov mentioned this pull request Sep 3, 2024
@arlo-phoenix
Contributor

arlo-phoenix commented Sep 3, 2024

EDIT: Something that would be nice though is leaving a low level function to not lose functionality from the current API

...
llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
llama_sampler_process(ctx, &candidates_p, llama_sampler_temperature_new(0.5))

It wouldn't make sense to create a new sampler every time. You would be able to do something like this instead:

auto smpl_temp = llama_sampler_temperature_new(0.5);
...

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };
smpl_temp->sample_cpu(candidates);

I just meant some util method in the C interface to expose sample_cpu/sample_ggml depending on ctx; that didn't seem planned from the comments. I'd be fine with something like

void llama_sample_cpu(struct llama_sampler * smpl, struct llama_token_data_array * candidates) {
  smpl->sample_cpu(candidates);
}

I saw

LLAMA_API void llama_constraint_apply (struct llama_constraint * cnstr, llama_token_data_array * candidates);
so it seems planned anyway.


While the old way of maintaining the array of candidate tokens within the user code remains available

I just noticed the old sampling API isn't even marked as deprecated (I thought it would be; my bad, I just saw this and quickly commented since I thought it would break my project). But imo this isn't a good choice long term from a maintenance perspective (new samplers would have to update both APIs). It seems fairly easy to make the new proposed API offer the same functionality as the old API.

@ggerganov
Owner Author

Superseded by #9294

@ggerganov ggerganov closed this Sep 7, 2024