-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama : llama_perf + option to disable timings during decode #9355
Conversation
f7cee89
to
eda507d
Compare
eda507d
to
ade52b6
Compare
bool offload_kqv; // whether to offload the KQV ops (including the KV cache) to GPU | ||
bool flash_attn; // whether to use flash attention [EXPERIMENTAL] | ||
//bool no_perf; // whether to measure performance timings, TODO: implement | ||
bool no_perf; // whether to measure performance timings | ||
|
||
// Abort callback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is minor libllama
API breaking change due to the addition of the no_perf
parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will be a breaking change, since struct llama_context_params
is expected to be created by llama_context_default_params()
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK such changes still break external bindings, such as: https://github.com/abetlen/llama-cpp-python/blob/c032fc65b0873337ed39e5d63e15468a5d797646/llama_cpp/llama_cpp.py#L841
Co-authored-by: Xuan Son Nguyen <[email protected]>
include/llama.h
Outdated
enum llama_perf_type { | ||
LLAMA_PERF_TYPE_CONTEXT = 0, | ||
LLAMA_PERF_TYPE_SAMPLER_CHAIN = 1, | ||
}; | ||
|
||
LLAMA_API struct llama_perf_data llama_perf_get(const void * ctx, enum llama_perf_type type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be preferable to have two separate functions, just to remove the possibility of calling it with the wrong type of pointer.
src/llama.cpp
Outdated
} | ||
|
||
const auto * p = (const struct llama_sampler_chain *) chain->ctx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These casts are very error prone and should always be checked. To do so, I would suggest moving these functions to llama-sampling.cpp
, and checking the interface pointer. The llama_sampler_chain
struct could also be moved to llama-sampling.cpp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, since this only works with the chain sampler, it should be documented somewhere, either in the function/struct names, or with an explicit comment, otherwise the natural assumption is that it should work with any sampler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon passing a non-chain sampler, should it return empty data or call GGML_ABORT()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an abort would be better here until we can return status codes from functions, since it is most definitely not intended and the important part is that the programmer notices.
…ov#9355) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <[email protected]> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
…ov#9355) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <[email protected]> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
…ov#9355) * llama : llama_perf + option to disable timings during decode ggml-ci * common : add llama_arg * Update src/llama.cpp Co-authored-by: Xuan Son Nguyen <[email protected]> * perf : separate functions in the API ggml-ci * perf : safer pointer handling + naming update ggml-ci * minor : better local var name * perf : abort on invalid sampler pointer ggml-ci --------- Co-authored-by: Xuan Son Nguyen <[email protected]>
llama_context_params.no_perf
). Performance measurements are disabled by default forlibllama
, but for the examples inllama.cpp
they are enabled by defaultllama_perf_get
TODO:
llama_arg
after common : refactor arg parser #9308