llama : llama_perf + option to disable timings during decode #9355
Conversation
Force-pushed from f7cee89 to eda507d
Force-pushed from eda507d to ade52b6
```diff
     bool offload_kqv; // whether to offload the KQV ops (including the KV cache) to GPU
     bool flash_attn;  // whether to use flash attention [EXPERIMENTAL]
-    //bool no_perf; // whether to measure performance timings, TODO: implement
+    bool no_perf;     // whether to measure performance timings

     // Abort callback
```
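For context, a minimal usage sketch of the new flag. This is hedged: it assumes `llama_new_context_with_model` as the context constructor in `llama.h` at this point, and follows the PR description's statement that `libllama` disables timings by default, so an application that wants them opts back in:

```c
#include "llama.h"

// Hedged sketch: create a context with timings enabled.
// Per this PR, libllama disables perf measurements by default,
// so callers that want them (as the llama.cpp examples do) opt in.
static struct llama_context * make_ctx_with_timings(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    cparams.no_perf = false; // measure performance timings during decode

    return llama_new_context_with_model(model, cparams);
}
```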
This is a minor `libllama` API breaking change due to the addition of the `no_perf` parameter.
I don't think this will be a breaking change, since `struct llama_context_params` is expected to be created by `llama_context_default_params()`, right?
AFAIK such changes still break external bindings, such as: https://github.com/abetlen/llama-cpp-python/blob/c032fc65b0873337ed39e5d63e15468a5d797646/llama_cpp/llama_cpp.py#L841
Co-authored-by: Xuan Son Nguyen <[email protected]>
include/llama.h (Outdated)
```c
    enum llama_perf_type {
        LLAMA_PERF_TYPE_CONTEXT       = 0,
        LLAMA_PERF_TYPE_SAMPLER_CHAIN = 1,
    };

    LLAMA_API struct llama_perf_data llama_perf_get(const void * ctx, enum llama_perf_type type);
```
I think it would be preferable to have two separate functions, just to remove the possibility of calling it with the wrong type of pointer.
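To make the hazard concrete, a sketch of the misuse and of the split. The `llama_perf_context`/`llama_perf_sampler` names and data types below are assumptions based on the later "perf : separate functions in the API" commit in this PR, not quoted from the diff:

```c
// With a single const void * entry point, nothing stops a caller from
// pairing the wrong pointer with the requested perf type - both compile:
struct llama_perf_data d0 = llama_perf_get(ctx,  LLAMA_PERF_TYPE_CONTEXT); // ok
struct llama_perf_data d1 = llama_perf_get(smpl, LLAMA_PERF_TYPE_CONTEXT); // wrong pointer, still compiles

// A typed split removes that hazard (names are assumptions):
LLAMA_API struct llama_perf_context_data llama_perf_context(const struct llama_context * ctx);
LLAMA_API struct llama_perf_sampler_data llama_perf_sampler(const struct llama_sampler * chain);
```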
src/llama.cpp (Outdated)
```cpp
    }

    const auto * p = (const struct llama_sampler_chain *) chain->ctx;
```
These casts are very error prone and should always be checked. To do so, I would suggest moving these functions to `llama-sampling.cpp`, and checking the interface pointer. The `llama_sampler_chain` struct could also be moved to `llama-sampling.cpp`.
Additionally, since this only works with the chain sampler, it should be documented somewhere, either in the function/struct names, or with an explicit comment, otherwise the natural assumption is that it should work with any sampler.
Upon passing a non-chain sampler, should it return empty data or call `GGML_ABORT()`?
I think an abort would be better here until we can return status codes from functions, since it is most definitely not intended and the important part is that the programmer notices.
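A minimal sketch of what that could look like inside `llama-sampling.cpp`, where the chain's interface object would be visible. This assumes an interface object named `llama_sampler_chain_i` and timing fields `t_sample_us`/`n_sample` on the chain struct; those names are assumptions based on the "perf : abort on invalid sampler pointer" commit, not confirmed by this thread:

```cpp
// Sketch: validate the interface pointer before casting, and abort on a
// mismatch so the programmer notices (per the discussion above).
struct llama_perf_sampler_data llama_perf_sampler(const struct llama_sampler * chain) {
    if (chain == nullptr || chain->iface != &llama_sampler_chain_i) {
        GGML_ABORT("%s: invalid sampler - requires a sampler created with llama_sampler_chain_init()\n", __func__);
    }

    const auto * ctx = (const struct llama_sampler_chain *) chain->ctx;

    struct llama_perf_sampler_data data = {};
    data.t_sample_ms = 1e-3 * ctx->t_sample_us; // microseconds -> milliseconds
    data.n_sample    = ctx->n_sample;

    return data;
}
```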
…ov#9355)

* llama : llama_perf + option to disable timings during decode (ggml-ci)
* common : add llama_arg
* Update src/llama.cpp (Co-authored-by: Xuan Son Nguyen <[email protected]>)
* perf : separate functions in the API (ggml-ci)
* perf : safer pointer handling + naming update (ggml-ci)
* minor : better local var name
* perf : abort on invalid sampler pointer (ggml-ci)

Co-authored-by: Xuan Son Nguyen <[email protected]>
Add an option to disable timings during decode (`llama_context_params.no_perf`). Performance measurements are disabled by default for `libllama`, but for the examples in `llama.cpp` they are enabled by default.

Add `llama_perf_get` (see the usage sketch below).

TODO:

* `llama_arg` after common : refactor arg parser #9308
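For reference, a hedged usage sketch of querying the collected timings. The split getter name and the field names (`llama_perf_context`, `t_eval_ms`, `n_eval`) are assumptions based on the final state of this PR rather than quoted from it:

```c
#include <stdio.h>
#include "llama.h"

// Hedged sketch: read decode timings from a context created with
// no_perf == false (see the example near the top of the thread).
static void print_eval_speed(const struct llama_context * ctx) {
    struct llama_perf_context_data pd = llama_perf_context(ctx);

    fprintf(stderr, "eval: %d tokens in %.2f ms (%.2f t/s)\n",
            pd.n_eval, pd.t_eval_ms,
            pd.t_eval_ms > 0.0 ? 1e3 * pd.n_eval / pd.t_eval_ms : 0.0);
}
```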