[Backend] Add Llamacpp backend #2975
base: main
Conversation
Signed-off-by: Adrien Gallouët <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
backends/llamacpp/.cargo/config.toml
I would remove this, especially if we end up deploying on Grace Hopper but building containers under QEMU. Would it impact anything?
Initially, I focused on ARM CPUs, where a "native" build is mandatory for now. Removing it would still leave the Docker container failing on a different host. Some work is necessary here, and I believe in llama.cpp as well, to achieve full performance without this constraint.
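For context, a Cargo config that forces a native build typically looks like the sketch below; the thread does not show this file's actual contents, so this is an assumption about what is being debated:

```toml
# Hypothetical contents of backends/llamacpp/.cargo/config.toml: compile for
# the host CPU, which is what makes the resulting container non-portable.
[build]
rustflags = ["-C", "target-cpu=native"]
```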
```rust
#[clap(long, env)]
model_gguf: String, // TODO Option() with hf->gguf & quantize
```
I'm a bit reluctant to have two parameters (`model_id`, `model_gguf`) here, as it would de facto introduce a discrepancy between what users currently deploy (stock TGI) and this new backend.
Would it be hard to rely solely on `model-id`, to keep the advantage of backward compat and transparent loading across backends? Or is it too much effort for a v1?
Totally agree, but it was planned for a v2 if there is momentum on this backend :)
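A minimal sketch of the trade-off under discussion (an assumption, not the PR's exact code): both flags coexist for v1, and a v2 could make `model_gguf` optional by deriving a GGUF file from `model_id` (hf download + convert/quantize) when it is absent.

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Hub model id, kept for parity with the other TGI backends.
    #[clap(long, env)]
    model_id: Option<String>,

    /// Path to a local GGUF file; a v2 could default this to a file
    /// converted from `model_id` instead of requiring it.
    #[clap(long, env)]
    model_gguf: Option<String>,
}
```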
backends/llamacpp/src/main.rs
```rust
use_mlock: bool,

/// Enable offloading of KQV operations to the GPU.
#[clap(default_value = "false", long, env)]
```
Should we set this to `true` for the time being, as we mostly deploy this on GPU in this specific release?
Also, what happens if we set it to `true` but no GPU is present? Is it ignored? If that's the case, I would definitely argue for `true` by default.
It works on the CPU when enabled, but I'm worried that it might not be as safe on some model/GPU combinations.
Let's try enabling it by default; we will see 👍
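The agreed change would roughly look like this (the field name `offload_kqv` is assumed; the excerpt only shows the doc comment and attribute):

```rust
// Within the backend's clap `Args` struct; field name is an assumption.
/// Enable offloading of KQV operations to the GPU.
#[clap(default_value = "true", long, env)]
offload_kqv: bool,
```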
```rust
type_v: LlamacppGGMLType,

/// Number of tokenizer workers used for payload validation and truncation.
#[clap(default_value = "2", long, env)]
```
`2` seems arbitrary here. Should we default to `1` and let users override for more if needed?
I took the default used in the launcher 😅
backends/llamacpp/src/main.rs
```rust
#[clap(long, env)]
ngrok: bool,

/// Ngrok authentication token.
#[clap(long, env)]
ngrok_authtoken: Option<String>,

/// Ngrok edge to use for tunneling.
#[clap(long, env)]
ngrok_edge: Option<String>,
```
I think you can remove these
🫡
```rust
}

impl LlamacppSampler {
    fn new(req: &LlamacppRequest) -> Option<Self> {
```
Since almost all of the lines are marked unsafe, should the function itself be unsafe? Or should we introduce an unsafe scope inside the safe function?
Yes, it's totally ugly... but I have some ideas to hide them. 👍
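A minimal, self-contained sketch of the pattern being discussed (not the PR's actual sampler code): keep the public constructor safe and confine the FFI work to one `unsafe` block with a SAFETY comment, rather than marking the whole function `unsafe`. `ffi_make_sampler` is a stand-in for the real llama.cpp calls.

```rust
struct Sampler(*mut u8);

unsafe fn ffi_make_sampler() -> *mut u8 {
    // Stand-in for an FFI allocation; a real backend would call into llama.cpp.
    Box::into_raw(Box::new(0u8))
}

impl Sampler {
    fn new() -> Option<Self> {
        // SAFETY: `ffi_make_sampler` returns either null or a pointer we own.
        let ptr = unsafe { ffi_make_sampler() };
        if ptr.is_null() {
            None
        } else {
            Some(Sampler(ptr))
        }
    }
}

impl Drop for Sampler {
    fn drop(&mut self) {
        // SAFETY: the pointer was created by Box::into_raw in ffi_make_sampler.
        unsafe { drop(Box::from_raw(self.0)) };
    }
}
```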
This PR adds support for the llamacpp backend.

The `Dockerfile_llamacpp` enables native CPU execution as well as CUDA acceleration for GPUs. For setup and usage details, you can check the doc.
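A usage sketch under stated assumptions; the image tag, port mapping, and model path below are illustrative, and only `--model-gguf` follows the clap definitions shown above. See the PR's documentation for the actual commands.

```shell
# Hypothetical build-and-run example for the new backend.
docker build -t tgi-llamacpp -f Dockerfile_llamacpp .
docker run --gpus all -v $PWD/models:/models -p 8080:80 \
    tgi-llamacpp --model-gguf /models/model.gguf
```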