[Backend] Add Llamacpp backend #2975

Open · wants to merge 63 commits into base: main

Changes shown from 51 of the 63 commits.

Commits
95e221e  Add llamacpp backend (angt, Jan 24, 2025)
bd0cc99  Get rid of llama_batch_get_one() (angt, Jan 30, 2025)
3eb4823  Use max_batch_total_tokens (angt, Jan 30, 2025)
e7facf6  Handle max_batch_size (angt, Jan 30, 2025)
a7b4b04  Add some input validation checks (angt, Jan 30, 2025)
8d2dfdf  Handle ctx args & fix sampling (angt, Jan 30, 2025)
f388747  Add GPU args (angt, Jan 31, 2025)
e07835c  Add --defrag-threshold (angt, Jan 31, 2025)
d6ded89  Add a stupid batch mechanism (angt, Jan 31, 2025)
390f0ec  Cleanup (angt, Jan 31, 2025)
7a3ed41  Add --numa (angt, Jan 31, 2025)
3f19913  Fix args (angt, Jan 31, 2025)
ae5bb78  Enable flash attention by default (angt, Jan 31, 2025)
e88a527  Add --offload-kqv (angt, Jan 31, 2025)
f38c34a  Fix batch_pos (angt, Jan 31, 2025)
960c12b  backend(llama): add CUDA Dockerfile_llamacpp for now (mfuntowicz, Jan 31, 2025)
161280f  Only export the latest logits (angt, Feb 1, 2025)
2a51e41  Output real logprobs (angt, Feb 1, 2025)
96434a1  Fix batching (angt, Feb 1, 2025)
27534d8  Fix seq iterations (angt, Feb 1, 2025)
c8505fb  Auto-detect n_threads when not provided (angt, Feb 1, 2025)
8ed362d  Clear request cache after completion (angt, Feb 1, 2025)
104a968  Remove warmup (angt, Feb 1, 2025)
ea28332  Cleanup (angt, Feb 1, 2025)
e6a8d33  backend(llama): add CUDA architectures build argument for Dockerfile (mfuntowicz, Feb 3, 2025)
bfb8e03  Add specific args for batch (angt, Feb 3, 2025)
38b33e9  Add --type-v & --type-k (angt, Feb 3, 2025)
207041a  Bump llamacpp to b4623 (angt, Feb 3, 2025)
d883109  Disable graceful shutdown in debug mode (angt, Feb 3, 2025)
df2a4fb  Update Dockerfile_llamacpp (angt, Feb 4, 2025)
906c265  Cleanup Dockerfile (angt, Feb 4, 2025)
e007529  Update Cargo.lock (angt, Feb 4, 2025)
d3a772a  Update args (angt, Feb 5, 2025)
dbee804  Simplify batching logic (angt, Feb 5, 2025)
c52f083  Set TGI_LLAMA_PKG_CUDA from CUDA_VERSION (angt, Feb 5, 2025)
051ff2d  Rename bindings (angt, Feb 5, 2025)
09a745f  Remove n_ctx (angt, Feb 5, 2025)
5b77787  Make max_batch_total_tokens optional (angt, Feb 5, 2025)
695b129  Ensure all samplers are freed on error (angt, Feb 5, 2025)
0f62401  Initialize penalty_last_n with llamacpp default value (angt, Feb 5, 2025)
f22e2fb  Cleanup (angt, Feb 5, 2025)
b3e40c4  Improve default settings (angt, Feb 5, 2025)
1641c22  Add doc (angt, Feb 5, 2025)
e4d5fa7  Update docs (angt, Feb 6, 2025)
fb81c0d  Thanks clippy (angt, Feb 6, 2025)
2b0d99c  Thanks cargo fmt (angt, Feb 6, 2025)
8bc10d3  Update docs (angt, Feb 6, 2025)
7bff88b  Do not use HOSTNAME env (angt, Feb 6, 2025)
df723e6  Bump llama.cpp & cuda (angt, Feb 6, 2025)
5367d94  Fix requirements.txt (angt, Feb 6, 2025)
809e288  Fix fmt (angt, Feb 6, 2025)
3b1b049  Enable KQV offload by default (angt, Feb 6, 2025)
acca9c3  Remove Ngrok tunneling (angt, Feb 6, 2025)
0d27ee7  Remove .cargo/config.toml (angt, Feb 7, 2025)
4841f71  Fix Dockerfile (angt, Feb 7, 2025)
b6cfa0f  Add missing cuda prefix (angt, Feb 7, 2025)
6bdb644  Handle custom llama.cpp dir (angt, Feb 7, 2025)
0702e0b  Cleanup (angt, Feb 7, 2025)
508d47f  Add README.md (angt, Feb 7, 2025)
1401418  Add HF transfer (angt, Feb 7, 2025)
b77d05d  Fix bool args (angt, Feb 7, 2025)
d96a777  Update doc (angt, Feb 7, 2025)
5fb4afb  Update doc (angt, Feb 7, 2025)
925 changes: 504 additions & 421 deletions Cargo.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -5,6 +5,7 @@ members = [
"backends/v3",
"backends/grpc-metadata",
"backends/trtllm",
"backends/llamacpp",
"launcher",
"router"
]
75 changes: 75 additions & 0 deletions Dockerfile_llamacpp
@@ -0,0 +1,75 @@
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 AS deps

ARG llamacpp_version=b4651
ARG llamacpp_cuda=OFF
ARG cuda_arch=75-real;80-real;86-real;89-real;90-real
ENV TGI_LLAMA_PKG_CUDA=cuda-${CUDA_VERSION%.*}

WORKDIR /opt/src

ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && apt install -y \
clang \
cmake \
curl \
git \
python3-dev \
libssl-dev \
pkg-config \
tar

ADD https://github.com/ggerganov/llama.cpp/archive/refs/tags/${llamacpp_version}.tar.gz /opt/src/
RUN tar -xzf ${llamacpp_version}.tar.gz \
&& cd llama.cpp-${llamacpp_version} \
&& cmake -B build \
-DCMAKE_INSTALL_PREFIX=/usr \
-DCMAKE_INSTALL_LIBDIR=/usr/lib \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_CUDA_ARCHITECTURES=${cuda_arch} \
-DGGML_CUDA=${llamacpp_cuda} \
-DLLAMA_BUILD_COMMON=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=OFF \
&& cmake --build build --parallel --config Release \
&& cmake --install build

WORKDIR /app
COPY rust-toolchain.toml rust-toolchain.toml
RUN curl -sSf https://sh.rustup.rs | sh -s -- -y --no-modify-path --default-toolchain none
ENV PATH="/root/.cargo/bin:$PATH"
RUN cargo install cargo-chef --locked

FROM deps AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM deps AS builder
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook \
--recipe-path recipe.json \
--profile release-opt \
--package text-generation-router-llamacpp
COPY . .
RUN cargo build \
--profile release-opt \
--package text-generation-router-llamacpp --frozen

FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu24.04

RUN apt update && apt install -y \
python3-venv \
python3-pip

RUN python3 -m venv /venv
ENV PATH="/venv/bin:$PATH"

COPY backends/llamacpp/requirements.txt requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt

COPY --from=builder /usr/lib/libllama.so /usr/lib/
COPY --from=builder /usr/lib/libggml*.so /usr/lib/
COPY --from=builder /app/target/release-opt/text-generation-router-llamacpp /usr/bin/

ENTRYPOINT ["text-generation-router-llamacpp"]
2 changes: 2 additions & 0 deletions backends/llamacpp/.cargo/config.toml
@@ -0,0 +1,2 @@
[build]
rustflags = ["-C", "target-cpu=native"]
21 changes: 21 additions & 0 deletions backends/llamacpp/Cargo.toml
@@ -0,0 +1,21 @@
[package]
name = "text-generation-router-llamacpp"
version.workspace = true
edition.workspace = true
authors.workspace = true
homepage.workspace = true

[build-dependencies]
bindgen = "0.71.1"
pkg-config = "0.3.31"

[dependencies]
async-trait = "0.1.85"
clap = "4.5.27"
num_cpus = "1.16.0"
text-generation-router = { path = "../../router" }
thiserror = "2.0.11"
tokenizers.workspace = true
tokio = "1.43.0"
tokio-stream = "0.1.17"
tracing = "0.1.41"
57 changes: 57 additions & 0 deletions backends/llamacpp/build.rs
@@ -0,0 +1,57 @@
use bindgen::callbacks::{ItemInfo, ParseCallbacks};
use std::collections::HashMap;
use std::env;
use std::path::PathBuf;

fn inject_transient_dependencies(lib_search_path: Option<&str>, lib_target_hardware: &str) {
let hardware_targets = HashMap::from([("cpu", None), ("cuda", Some(vec!["cuda"]))]);

if let Some(lib_search_path) = lib_search_path {
lib_search_path.split(":").for_each(|path| {
println!("cargo:rustc-link-search=dependency={path}");
});
}

if let Some(hardware_transient_deps) = hardware_targets.get(lib_target_hardware) {
if let Some(additional_transient_deps) = hardware_transient_deps {
additional_transient_deps.iter().for_each(|dep| {
println!("cargo:rustc-link-lib={dep}");
});
}
}
}

#[derive(Debug)]
struct PrefixStripper;

impl ParseCallbacks for PrefixStripper {
fn generated_name_override(&self, item_info: ItemInfo<'_>) -> Option<String> {
item_info.name.strip_prefix("llama_").map(str::to_string)
}
}

fn main() {
let pkg_cuda = option_env!("TGI_LLAMA_PKG_CUDA");
let lib_search_path = option_env!("TGI_LLAMA_LD_LIBRARY_PATH");
let lib_target_hardware = option_env!("TGI_LLAMA_HARDWARE_TARGET").unwrap_or("cpu");

let bindings = bindgen::Builder::default()
.header("src/wrapper.h")
.prepend_enum_name(false)
.parse_callbacks(Box::new(PrefixStripper))
.parse_callbacks(Box::new(bindgen::CargoCallbacks::new()))
.generate()
.expect("Unable to generate bindings");

let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
bindings
.write_to_file(out_path.join("llamacpp.rs"))
.expect("Couldn't write bindings!");

if let Some(pkg_cuda) = pkg_cuda {
pkg_config::Config::new().probe(pkg_cuda).unwrap();
}
pkg_config::Config::new().probe("llama").unwrap();

inject_transient_dependencies(lib_search_path, lib_target_hardware);
}
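
As an illustration of what the PrefixStripper callback buys, a minimal (hypothetical) consumer of the generated bindings could look like the sketch below; the module name and the backend_init call are assumptions based on the llama_ prefix stripping and the llamacpp.rs output file above.

// Hypothetical sketch: pull in the bindings written to $OUT_DIR/llamacpp.rs by build.rs.
mod llamacpp {
    #![allow(non_upper_case_globals, non_camel_case_types, non_snake_case, dead_code)]
    include!(concat!(env!("OUT_DIR"), "/llamacpp.rs"));
}

fn main() {
    // With PrefixStripper, the C function `llama_backend_init` is generated as
    // `backend_init`, so it is called here as `llamacpp::backend_init()`.
    unsafe { llamacpp::backend_init() };
}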
2 changes: 2 additions & 0 deletions backends/llamacpp/requirements.txt
@@ -0,0 +1,2 @@
transformers==4.48.2
huggingface-hub==0.28.1