Add ring buffer to store prev tokens in sampling #8890
Conversation
kylo5aby commented on Aug 6, 2024
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
```diff
@@ -64,6 +65,105 @@ typedef struct llama_sampling_params {
     bool use_penalty_prompt_tokens = false;
 } llama_sampling_params;
 
+template<typename T>
+struct ring_buffer {
```
Small question: how does this differ from `std::queue` or `std::deque`?
Here I want to use a fixed-capacity buffer to avoid resize or copy overhead, because the number of prev tokens used for sampling is already known.
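For reference, a minimal sketch of what such a fixed-capacity ring buffer could look like (the member and method names below are assumptions for illustration, not necessarily this PR's exact code):

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of a fixed-capacity ring buffer: storage is allocated once,
// so pushing new tokens never reallocates or copies existing elements.
template<typename T>
struct ring_buffer {
    explicit ring_buffer(size_t cap) : capacity(cap), data(cap) {}

    // Overwrites the oldest element once the buffer is full.
    void push_back(const T & value) {
        if (sz == capacity) {
            first = (first + 1) % capacity; // drop the oldest element
        } else {
            sz++;
        }
        data[pos] = value;
        pos = (pos + 1) % capacity;
    }

    // i-th element counting back from the newest (rat(0) == most recent).
    const T & rat(size_t i) const {
        if (i >= sz) throw std::out_of_range("ring buffer: index out of range");
        return data[(pos + capacity - 1 - i) % capacity];
    }

    size_t capacity = 0;
    size_t sz       = 0; // number of stored elements
    size_t first    = 0; // index of the oldest element
    size_t pos      = 0; // index where the next element is written
    std::vector<T> data;
};
```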
Right, `std::deque` can't `reserve()` like `std::vector`. This seems like a valid reason. Might be worth it to write a (small) comment near `ring_buffer` to explain this.
Resolved. Thanks for the feedback!
Force-pushed from 9a34948 to 1238001
I changed the base branch to `gg/llama-refactor-sampling`, since it's better to merge this change together with the sampling refactoring.
examples/infill/infill.cpp (Outdated)

```diff
@@ -425,7 +425,7 @@ int main(int argc, char ** argv) {
 
     llama_sampling_accept(ctx_sampling, ctx, id, true);
 
-    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());
+    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
```
Let's remove these logs completely for now - will bring them back after the logger is reimplemented.

Suggested change:

```diff
-    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
+    //LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
```
examples/main/main.cpp (Outdated)

```diff
@@ -736,7 +736,7 @@ int main(int argc, char ** argv) {
 
     llama_sampling_accept(ctx_sampling, ctx, id, /* apply_grammar= */ true);
 
-    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());
+    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
```
Suggested change:

```diff
-    LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
+    //LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev.to_vector()).c_str());
```
common/sampling.cpp (Outdated)

```diff
@@ -400,7 +400,7 @@ static llama_token_data_array llama_sampling_prepare_impl(
     llama_token_data_array cur_p = { cur.data(), cur.size(), false };
 
     // apply penalties
-    const auto& penalty_tokens = params.use_penalty_prompt_tokens ? params.penalty_prompt_tokens : prev;
+    const auto& penalty_tokens = params.use_penalty_prompt_tokens ? params.penalty_prompt_tokens : prev.to_vector();
```
Should think of a way to avoid the `to_vector()` due to performance considerations.
> Should think of a way to avoid the `to_vector()` due to performance considerations.
I think one way to avoid the vector copy is to pass the `penalty_prompt_tokens` vector and a start index to `llama_sample_repetition_penalties`, and then traverse the `penalty_last_n` elements from the vector inside it, which avoids the copy. For example:

```cpp
void llama_sample_repetition_penalties(
        struct llama_context * ctx,
        llama_token_data_array * candidates,
        // const llama_token * last_tokens,
        const std::vector<llama_token> & penalty_tokens,
        // .size() - penalty_tokens_used_size, or
        // (prev.first + .size() - penalty_tokens_used_size) % prev.capacity if ring buffer
        size_t start_index,
        size_t penalty_last_n,
        float penalty_repeat,
        float penalty_freq,
        float penalty_present);
```

What do you think?
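To make the proposal concrete, here is a hypothetical call site for the linear-vector case (the names `cur_p`, `params`, and `penalty_tokens` follow the diff above; the rest is an assumption, not code from this PR):

```cpp
// Hypothetical call site: penalize only the most recent penalty_last_n
// tokens of penalty_tokens, passing a start index instead of copying.
const size_t used        = std::min(penalty_tokens.size(), (size_t) params.penalty_last_n);
const size_t start_index = penalty_tokens.size() - used;

llama_sample_repetition_penalties(ctx, &cur_p,
        penalty_tokens, start_index, used,
        params.penalty_repeat, params.penalty_freq, params.penalty_present);
```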
After the sampling refactoring, the `common/sampling.h/.cpp` stuff will be moved to `llama-sampling.cpp` and the API call will become simply:

```cpp
void llama_sampling_repetition_penalties(
        struct llama_sampling * ctx,
        llama_token_data_array * candidates);
```

All the penalty-related information (together with the ring buffer of previous tokens) will be inside the `llama_sampling` object, and we can handle it there. So for now, we can just resolve the conflict and merge, and later I'll avoid the `to_vector()`.
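For illustration, one way such an internal implementation could avoid the copy is to count the recent tokens straight out of the ring buffer. This is a sketch under the assumption that the buffer exposes something like the `rat()` accessor from the earlier sketch; none of these names are the final API:

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: count the last penalty_last_n tokens directly from
// the ring buffer, so the penalty loop never needs a to_vector() copy.
// T would be llama_token in practice; ring_buffer is the sketch above.
template<typename T>
std::unordered_map<T, int> count_recent(const ring_buffer<T> & prev, size_t penalty_last_n) {
    std::unordered_map<T, int> token_count;
    const size_t n = std::min(penalty_last_n, prev.sz);
    for (size_t i = 0; i < n; ++i) {
        token_count[prev.rat(i)]++; // rat(0) is the most recent token
    }
    // The counts then feed the repetition/frequency/presence penalties.
    return token_count;
}
```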
Force-pushed from 8603eb2 to c5734f1
Force-pushed from 1238001 to 8830fa1
Force-pushed from 8830fa1 to 3b23ea7
Merged 5763d8e into ggerganov:gg/llama-refactor-sampling