
llama : refactor sampling v2 #9294

Merged: 47 commits merged into master on Sep 7, 2024

Conversation

@ggerganov (Owner) commented Sep 3, 2024

alt: #8643

ref: #5214

Overview

  • Replace llama_sampling_ and llama_grammar_ with the new llama_sampler_ API
  • Overhaul common: replace struct llama_sampling_context with struct gpt_sampler
  • Support user-defined samplers via the struct llama_sampler_i interface

API Changes

  • Add struct llama_sampler and struct llama_sampler_i
  • Add llama_sampler_ API
  • Add llama_sampler_chain_ API for chaining multiple samplers
  • Remove LLAMA_API_INTERNAL
  • Remove Classifier-Free Guidance (CFG) support
  • Remove Prompt Penalty support
  • Add llama_perf_ API and remove old llama_print_timings and llama_reset_timings

Implementation details

  • Move common/grammar-parser into src/llama-grammar
  • The llama_context no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the llama_sampler objects as needed. As a consequence, the llama_state no longer serializes the RNG
  • The grammar code has been refactored; hopefully it is a bit easier to read. No functional changes.
  • The samplers implemented in llama-sampling.cpp can be used as examples for implementing custom samplers in user code (see the sketch below)
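As a rough orientation, here is a minimal sketch of such a custom sampler. It assumes the llama_sampler_i callback set (name/accept/apply/reset/clone/free) and the llama_sampler { iface, ctx } layout introduced in this PR; the function and variable names below are illustrative, and llama-sampling.cpp remains the authoritative reference.

// my_greedy.cpp: illustrative user-defined sampler that picks the max-logit token
#include "llama.h"

static const char * my_greedy_name(const struct llama_sampler * /*smpl*/) {
    return "my-greedy";
}

static void my_greedy_apply(struct llama_sampler * /*smpl*/, llama_token_data_array * cur_p) {
    // select the candidate with the highest logit
    size_t best = 0;
    for (size_t i = 1; i < cur_p->size; ++i) {
        if (cur_p->data[i].logit > cur_p->data[best].logit) {
            best = i;
        }
    }
    cur_p->selected = (int64_t) best;
}

static struct llama_sampler_i my_greedy_iface = {
    /* .name   = */ my_greedy_name,
    /* .accept = */ nullptr, // stateless: nothing to update on accepted tokens
    /* .apply  = */ my_greedy_apply,
    /* .reset  = */ nullptr, // nothing to reset
    /* .clone  = */ nullptr, // no state to copy
    /* .free   = */ nullptr, // no context to release
};

struct llama_sampler * my_greedy_init() {
    // heap-allocated so llama_sampler_free() can release it like the built-in samplers
    return new llama_sampler {
        /* .iface = */ &my_greedy_iface,
        /* .ctx   = */ nullptr,
    };
}

Such a sampler can then be appended to a chain with llama_sampler_chain_add(chain, my_greedy_init()).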

Example

Comparison of user sampling code before and after:

  • before
// decoding loop:
auto   n_vocab = llama_n_vocab(model);
auto * logits  = llama_get_logits_ith(ctx, i_batch[i]);

std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);

for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    candidates.emplace_back(llama_token_data{ token_id, logits[token_id], 0.0f });
}

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };

llama_sample_top_k(ctx, &candidates_p, top_k, 1);
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temp (ctx, &candidates_p, temp);

const llama_token new_token_id = llama_sample_token(ctx, &candidates_p);
  • after
// prepare the sampling chain at the start
auto sparams = llama_sampler_chain_default_params();

llama_sampler * smpl = llama_sampler_chain_init(sparams);

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(top_k));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(top_p, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp (temp));
llama_sampler_chain_add(smpl, llama_sampler_init_dist (seed));

...

// decoding loop:
const llama_token new_token_id = llama_sampler_sample(smpl, ctx, i_batch[i]);
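Not part of the original snippet, but worth noting since the sampler is now owned by user code: a typical lifecycle also resets the chain between independent generations and frees it at the end (a sketch using the same smpl handle as above).

// between independent generations: clear any accumulated sampler state
llama_sampler_reset(smpl);

// when done: freeing the chain also frees the samplers that were added to it
llama_sampler_free(smpl);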

Future plan and ideas

  • Attach samplers to the decoding graphs for better performance (e.g. llama_decode_with_sampler())
  • Extend llama_decode_ API to support multiple decoding runs (e.g. llama_decode_n())
  • Existing samplers implementation in llama-sampling.cpp could be split into separate source files
  • Expose struct llama_vocab through the public API and change calls that currently use struct llama_model to use it when appropriate
  • Deduplicate the ring_buffer code by implementing ggml_ring_buffer for fixed-size objects
  • Measure and report the performance of the grammar
  • Simplify llama_token_data (see the review comment from slaren below)

@ExtReMLapin (Contributor) commented:

Not sure this is the right place or time to talk about this, but in another issue someone had the idea: if the grammar says the next character/word can only be "xxx" and nothing else, don't bother asking the LLM what to say for the next X tokens.

As there is a refactoring going on, maybe it's the right time to implement it.

@ggerganov (Owner, Author) replied:

It's not in the scope of this change, but also it is never the case that exactly one token fits the grammar. For example, all three tokens "x", "xx" and "xxx" would fit the grammar in that case. One way would be to use the longest token. Another way would be to tokenize "xxx" and use the resulting tokens. Not sure.
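Purely as an illustration of the second option (a sketch, not part of this PR; it reuses the model handle from the example above and assumes the forced literal is "xxx"):

// sketch: when the grammar only admits the literal "xxx", skip per-token sampling;
// tokenize the literal once and feed the resulting tokens directly
std::vector<llama_token> forced(8);
int32_t n = llama_tokenize(model, "xxx", 3, forced.data(), (int32_t) forced.size(),
                           /*add_special=*/ false, /*parse_special=*/ false);
forced.resize(n > 0 ? n : 0);
// ... append `forced` to the batch and llama_decode() as usual, no sampling needed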

@github-actions bot added the testing, server, and android labels Sep 4, 2024
@ggerganov ggerganov changed the base branch from gg/llama-refactor-sampling to master September 4, 2024 12:38
@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling-v2 branch from 762e955 to 3c46719 Compare September 4, 2024 14:26
@ggerganov (Owner, Author) commented:

This is getting close to ready. Later today I will add a detailed description of the changes and some comments in the code, and do a bit more testing.

@slaren PTAL - any comments and suggestions are welcome.

include/llama.h (outdated), comment on lines 1046 to 1050:
LLAMA_API struct llama_constraint * llama_constraint_init_top_k (int32_t k, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_top_p (float p, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_min_p (float p, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_tail_free (float z, int32_t min_keep);
LLAMA_API struct llama_constraint * llama_constraint_init_typical (float p, int32_t min_keep);
@slaren (Collaborator) commented Sep 4, 2024:

I don't know the history of the min_keep parameter in all of these samplers. From what I can tell the parameter is not used in the examples except by the server, but it seems very suspect to me.

Edit: looks like it has been there since the beginning (#1126), and there was never any explanation of why it is needed.

@ggerganov (Owner, Author) replied:

Removed min_keep from the top_k sampler as it didn't make sense. For the p-based samplers, I think it makes sense to guarantee a minimum number of candidate results, regardless of the p value.
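To make that concrete, a small sketch (illustrative values, using the init signatures from this PR): min_keep only applies to the p-based samplers, while top_k no longer takes it.

// even with a very aggressive p, at least 3 candidates survive the cut
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.05f, /*min_keep=*/ 3));

// top_k takes only k after this change
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));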

@ggerganov (Owner, Author) commented:

Thanks for the review. I got sidetracked a bit with a bug in the speculative example (should be fixed now). I will apply the review tomorrow and prepare this for merging.

@ggerganov ggerganov force-pushed the gg/llama-refactor-sampling-v2 branch from 8307e96 to 11c2e46 Compare September 5, 2024 15:13
// be positioned at a character range (see `llama_grammar_advance_stack`), and
// produces the N possible stacks if the given char is accepted at those
// positions
llama_grammar_stacks llama_grammar_accept(


Hello! Why does llama_grammar_accept return the stacks? It was previously passed by reference

@ggerganov (Owner, Author) replied Sep 6, 2024:

Thanks for noticing. I changed it because I thought it improved the signature of the function, but I missed that it would lead to extra memory allocations. So I restored the original signature: f9762c6

Edit: though after one more look, I think it does not matter since we move stacks_new either way.
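For context, a generic C++ sketch (not code from this PR) of the trade-off being discussed: returning a local container is moved or elided rather than copied, but an output parameter lets the caller reuse its allocated capacity across calls.

#include <vector>

using stacks_t = std::vector<std::vector<int>>;

// return by value: the local result is moved (or elided) into the caller,
// but each call constructs a fresh container
stacks_t accept_by_value() {
    stacks_t stacks_new;
    // ... fill stacks_new ...
    return stacks_new; // NRVO or implicit move, no deep copy
}

// output parameter: the caller-owned container is filled in place and can
// keep its capacity across repeated calls
void accept_by_ref(stacks_t & stacks_new) {
    stacks_new.clear();
    // ... fill stacks_new ...
}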

@ggerganov ggerganov marked this pull request as ready for review September 6, 2024 12:32
@ggerganov ggerganov requested a review from slaren September 6, 2024 12:32
@slaren (Collaborator) left a review comment:

Looks good.

In the future we should probably simplify llama_token_data (or remove it entirely) to keep only one value per token, and add a flag to llama_token_data_array indicating whether the current values are probabilities (i.e. normalized to sum to 1) or not, so that samplers that can only operate on probabilities know whether they need to call softmax. Having two values per token is very confusing to me, because some samplers operate on one and some on the other. This can lead to situations where a sampler modifies the probabilities and the next one calls softmax, which discards all the changes to the probabilities and recomputes them from the logits. I cannot tell if there are already situations like that which end with some samplers that operate on probabilities being ignored.
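A rough sketch of what such a simplification could look like (hypothetical names, not part of this PR or of llama.h):

#include "llama.h" // for llama_token

// one value per token instead of separate logit and p fields
struct token_data_v2 {
    llama_token id;
    float       value; // interpreted according to the array-level flag below
};

struct token_data_array_v2 {
    token_data_v2 * data;
    size_t          size;
    int64_t         selected;
    bool            sorted;
    bool            is_prob; // true once the values are normalized to sum to 1
};

// a sampler that requires probabilities would then do:
//   if (!cur_p->is_prob) { softmax(cur_p); cur_p->is_prob = true; }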

@mudler (Contributor) commented Sep 12, 2024:

Somehow after this change I see breakage in llava sampling. I'm still OOO so I haven't done a deep dive yet; see mudler/LocalAI#3497 for reference on how it breaks in LocalAI.

It would be really appreciated if anyone has an idea or pointers on what's going on. Thank you!


10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stderr /home/mudler/_git/LocalAI/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml.c:13835: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed


10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout [Thread debugging using libthread_db enabled]
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout Using host libthread_db library "/lib64/libthread_db.so.1".
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout 0x00007f989b8e94a3 in ?? () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #0  0x00007f989b8e94a3 in ?? () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #1  0x00000000008222e5 in ggml_graph_compute_thread.isra ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #2  0x00007f989b8dcd16 in GOMP_parallel () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #3  0x0000000000825a2a in ggml_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #4  0x0000000000834010 in ggml_backend_cpu_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #5  0x000000000083784c in ggml_backend_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #6  0x0000000000652b63 in clip_image_batch_encode.constprop ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #7  0x0000000000653553 in clip_image_encode ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #8  0x0000000000657ac8 in llava_image_embed_make_with_clip_img ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #9  0x00000000004e2c09 in llama_server_context::update_slots() [clone .isra.0] ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #10 0x00000000004d7629 in llama_server_queue::start_loop() ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #11 0x000000000048b040 in main ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout [Inferior 1 (process 13029) detached]

@ggerganov (Owner, Author) replied:

Which commit are you using? I think you need to update to 1b28061

@mudler (Contributor) commented Sep 12, 2024:

> Which commit are you using? I think you need to update to 1b28061

Thanks for the quick reply! I was at daa9623, I'll try with that and let you know ASAP

@mudler (Contributor) commented Sep 12, 2024:

Hmm, I tried with the latest commit (e6b7801) but it is still crashing with:

6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stderr /home/mudler/_git/LocalAI/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml.c:13853: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
...
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout [Thread debugging using libthread_db enabled]
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout Using host libthread_db library "/lib64/libthread_db.so.1".
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout 0x00007fd8a45ee4a3 in ?? () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #0  0x00007fd8a45ee4a3 in ?? () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #1  0x00000000007dd4b5 in ggml_graph_compute_thread.isra ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #2  0x00007fd8a45e1d16 in GOMP_parallel () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #3  0x00000000007e0cca in ggml_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #4  0x00000000007ef340 in ggml_backend_cpu_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #5  0x00000000007f2b7c in ggml_backend_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #6  0x000000000060d8b3 in clip_image_batch_encode.constprop ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #7  0x000000000060e2a3 in clip_image_encode ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #8  0x0000000000612818 in llava_image_embed_make_with_clip_img ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #9  0x00000000004dd269 in llama_server_context::update_slots() [clone .isra.0] ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #10 0x00000000004d1ce9 in llama_server_queue::start_loop() ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #11 0x0000000000486a10 in main ()

@slaren (Collaborator) commented Sep 12, 2024:

That's not related to the sampling changes; the only difference is that get_rows operations are now bounds-checked in all builds, while previously they were only checked in debug builds. The clip implementation is broken and needs to be fixed: #9066 (comment)

@mudler (Contributor) commented Sep 12, 2024:

> That's not related to the sampling changes; the only difference is that get_rows operations are now bounds-checked in all builds, while previously they were only checked in debug builds. The clip implementation is broken and needs to be fixed: #9066 (comment)

Thanks for that bit, I totally missed it. What's weird is that for me it's now a 100% hit since I started pinning the new version of llama.cpp. I have test suites running vision tests, and this never popped up until now. It's not sporadic at all, but really consistent, and I can't get a single run to pass.

Commit still working here: 815b1fb
Commit which is not working: e6b7801 (which includes #9082 )

@slaren (Collaborator) commented Sep 12, 2024:

You would need to run the test suite with a debug build to be able to hit the assert. If you want the previous behavior you can revert #9354 in your build, but that still does not make it any less broken, it just hides the issue.

@mudler (Contributor) commented Sep 12, 2024:

> You would need to run the test suite with a debug build to be able to hit the assert. If you want the previous behavior you can revert #9354 in your build, but that still does not make it any less broken, it just hides the issue.

Thanks for the hints @slaren. I actually tried commenting out the assert as well, to double check, but as you suggested that only "hides" it, and it crashes in the same way.

Another data point from my side: it seems the suggestion in that comment, to apply this edit

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..224db9b5 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2419,7 +2419,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
             struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
             int* patches_data = (int*)malloc(ggml_nbytes(patches));
             for (int i = 0; i < num_patches; i++) {
-                patches_data[i] = i + 1;
+                patches_data[i] = i;
             }
             ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
             free(patches_data);

actually "fixes" the issue here - probably it's not ideal, but at least seems indeed something is off in the clip implementation when loading images. Sorry for making noise here - is there already an issue open for this issue or shall I open one? I can't find one for this specific issue

@slaren (Collaborator) commented Sep 12, 2024:

I don't think there is an open issue about this currently. I know it was briefly discussed in #9066, but that one is already closed.

@ngxson (Collaborator) commented Sep 24, 2024:

Maybe worth noting: the signature of llama_sampling_sample is slightly different from the new gpt_sampler_sample:

llama_token llama_sampling_sample(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        struct llama_context * ctx_cfg,
        int idx = -1);

llama_token gpt_sampler_sample(
    struct gpt_sampler * gsmpl,
    struct llama_context * ctx,
    int idx,
    bool grammar_first = false);

The compiler may not throw any error or warning, because values passed to llama_sampling_sample can also be implicitly converted to the types required by gpt_sampler_sample.

@ggerganov (Owner, Author) replied:

> The compiler may not throw any error or warning, because values passed to llama_sampling_sample can also be implicitly converted to the types required by gpt_sampler_sample.

Hm, it will generate an error. Don't think there is an issue.

@zhaoyinglia commented Oct 10, 2024:

@ggerganov Hi, I'm a bit confused. Why was Classifier-Free Guidance removed? Are there any issues with it?

@ggerganov (Owner, Author) replied:

The implementation does not need to be part of libllama because it works on the logits and is trivial to implement in user code. I removed it also from the examples, because it was much simpler this way and refactoring the sampling was with higher priority than keeping CFG functional. We can reintroduce this functionality in one of the examples or in a new dedicated example if there is interest. PRs welcome.
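For reference, a minimal sketch of the logit-level arithmetic involved (illustrative only; the variable names, the guidance scale, and the separate negative-prompt context are assumptions, not part of this PR):

// classifier-free guidance in user code:
//   guided = uncond + scale * (cond - uncond), applied per vocabulary entry
const int n_vocab = llama_n_vocab(model);

float * logits_cond   = llama_get_logits_ith(ctx_cond,   i_cond);   // positive prompt pass
float * logits_uncond = llama_get_logits_ith(ctx_uncond, i_uncond); // negative prompt pass

const float cfg_scale = 1.5f; // illustrative value

for (int i = 0; i < n_vocab; ++i) {
    logits_cond[i] = logits_uncond[i] + cfg_scale * (logits_cond[i] - logits_uncond[i]);
}
// build the candidates array from logits_cond and sample as usual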

gabe-l-hart added a commit to gabe-l-hart/ollama that referenced this pull request Oct 14, 2024
The changes here reflect the changes made in the big llama.cpp sampling PR
ggerganov/llama.cpp#9294

The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>
gabe-l-hart added a commit to gabe-l-hart/ollama that referenced this pull request Oct 15, 2024
gabe-l-hart added a commit to gabe-l-hart/ollama that referenced this pull request Oct 17, 2024
jessegross pushed a commit to ollama/ollama that referenced this pull request Oct 17, 2024
* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
ggerganov/llama.cpp@df270ef

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet have granite MoE support, but that can come in a
follow up PR

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declaration in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here reflect the changes made in the big llama.cpp sampling PR
ggerganov/llama.cpp#9294

The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses C++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow Go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* chore(gofumpt): Format on llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Remove TODO about grammar_first

This feature was not used/needed previously so should be fine without
plumbing it through now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.

Fixes #6707

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llama): Remove unnecessary patch for gguf impl header

This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
eugenehp added a commit to eugenehp/bitnet-cpp-rs that referenced this pull request Oct 26, 2024
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
MaciejMogilany pushed a commit to Maciej-Mogilany/ollama that referenced this pull request Nov 12, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: android, breaking change, examples, server, testing
Projects: none yet
9 participants