Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add llama.cpp backend #231

Merged
merged 15 commits into from
Jul 30, 2024
Merged
48 changes: 48 additions & 0 deletions .github/workflows/test_cli_llama_cpp.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
name: CLI Llama.cpp Tests

on:
workflow_dispatch:
push:
branches:
- main
paths:
- .github/workflows/test_cli_llama_cpp.yaml
- "optimum_benchmark/**"
- "docker/**"
- "tests/**"
- "setup.py"
pull_request:
branches:
- main
paths:
- .github/workflows/test_cli_llama_cpp.yaml
- "optimum_benchmark/**"
- "docker/**"
- "tests/**"
- "setup.py"

concurrency:
cancel-in-progress: true
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}

jobs:
run_cli_llama_cpp_tests:
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v3

- name: Set up Python 3.10
uses: actions/setup-python@v3
with:
python-version: "3.10"

- name: Install requirements
run: |
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -e .[testing,lamma-cpp]

- name: Run tests
run: pytest -s -k "llama_cpp"
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -172,3 +172,7 @@ work-in-progress/
experiments/
amdsmi/
amd-*

# Mac specific
.DS_Store
outputs/
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,16 +48,16 @@ If you would like to work on any of the open Issues:
6. Depending on the feature you're working on and your development environment, you can run tests locally in an isolated docker container using the [makefile](Makefile). For example, to test the CLI with CPU device and PyTorch backend, you can run the following commands:

```bash
make install_cli_cpu_pytorch_extras
make install_cli_cpu_pytorch
make test_cli_cpu_pytorch
```

For a better development experience, we recommend using isolated docker containers to run tests:

```bash
make build_docker_cpu
make run_docker_cpu
make install_cli_cpu_pytorch_extras
make build_cpu_image
make run_cpu_container
make install_cli_cpu_pytorch
make test_cli_cpu_pytorch
```

Expand Down
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,8 @@ test_cli_rocm_pytorch_single_gpu:
pytest -s -k "cli and rocm and pytorch and not (dp or ddp or device_map or deepspeed) and not (bnb or awq)"

# llm-perf
test_cli_llama_cpp:
pytest -s -k "llama_cpp"

install_llm_perf_cuda_pytorch:
pip install packaging && pip install flash-attn einops scipy auto-gptq optimum bitsandbytes autoawq codecarbon
Expand Down
25 changes: 25 additions & 0 deletions examples/llama_cpp.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
defaults:
- benchmark
- scenario: inference
- launcher: inline
- backend: llama_cpp
- _base_
- _self_

name: llama_cpp_llama_v2

backend:
device: mps
model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
task: text-generation
filename: tinyllama-1.1b-chat-v1.0.Q4_0.gguf


scenario:
input_shapes:
batch_size: 1
sequence_length: 256
vocab_size: 32000
generate_kwargs:
max_new_tokens: 100
min_new_tokens: 100
26 changes: 26 additions & 0 deletions examples/llama_cpp_embedding.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
defaults:
- benchmark
- scenario: inference
- launcher: inline
- backend: llama_cpp
- _base_
- _self_

name: llama_cpp_llama

backend:
device: mps
model: nomic-ai/nomic-embed-text-v1.5-GGUF
task: feature-extraction
filename: nomic-embed-text-v1.5.Q4_0.gguf

scenario:
input_shapes:
batch_size: 1
sequence_length: 256
vocab_size: 30000
type_vocab_size: 1
max_position_embeddings: 512
generate_kwargs:
max_new_tokens: 100
min_new_tokens: 100
25 changes: 25 additions & 0 deletions examples/llama_cpp_mps.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
defaults:
- benchmark
- scenario: inference
- launcher: inline
- backend: llama_cpp
- _base_
- _self_

name: llama_cpp_llama

backend:
device: mps
model: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
task: text-generation
filename: tinyllama-1.1b-chat-v1.0.Q4_0.gguf


scenario:
input_shapes:
batch_size: 1
sequence_length: 256
vocab_size: 32000
generate_kwargs:
max_new_tokens: 100
min_new_tokens: 100
23 changes: 23 additions & 0 deletions examples/llama_mps.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
defaults:
- benchmark
- scenario: inference
- launcher: inline
- backend: pytorch
- _base_
- _self_

name: llama_tiny_mps

backend:
device: mps
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
task: text-generation

scenario:
input_shapes:
batch_size: 4
sequence_length: 256
vocab_size: 32000
generate_kwargs:
max_new_tokens: 100
min_new_tokens: 100
23 changes: 23 additions & 0 deletions examples/llama_tiny_mps.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
defaults:
- benchmark
- scenario: inference
- launcher: inline
- backend: pytorch
- _base_
- _self_

name: llama_tiny_mps

backend:
device: mps
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
task: text-generation

scenario:
input_shapes:
batch_size: 4
sequence_length: 256
vocab_size: 32000
generate_kwargs:
max_new_tokens: 100
min_new_tokens: 100
26 changes: 26 additions & 0 deletions examples/pytorch_bert_mps.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
defaults:
- benchmark
- scenario: inference
- launcher: process # launcher: inline works,
- backend: pytorch
- _base_
- _self_

name: pytorch_bert

# launcher:
# start_method: spawn

scenario:
latency: true
memory: true
input_shapes:
batch_size: 1
sequence_length: 128

backend:
device: cpu
no_weights: true
model: bert-base-uncased


2 changes: 2 additions & 0 deletions optimum_benchmark/__init__.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from .backends import (
BackendConfig,
INCConfig,
LlamaCppConfig,
LLMSwarmConfig,
ORTConfig,
OVConfig,
Expand Down Expand Up @@ -38,4 +39,5 @@
"TrainingConfig",
"TRTLLMConfig",
"VLLMConfig",
"LlamaCppConfig",
]
2 changes: 2 additions & 0 deletions optimum_benchmark/backends/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from .config import BackendConfig
from .llama_cpp.config import LlamaCppConfig
from .llm_swarm.config import LLMSwarmConfig
from .neural_compressor.config import INCConfig
from .onnxruntime.config import ORTConfig
Expand All @@ -20,4 +21,5 @@
"LLMSwarmConfig",
"BackendConfig",
"VLLMConfig",
"LlamaCppConfig",
]
7 changes: 6 additions & 1 deletion optimum_benchmark/backends/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,12 @@ def __init__(self, config: BackendConfigT):
self.automodel_loader = get_timm_automodel_loader()
self.pretrained_processor = None
self.generation_config = None

elif self.config.library == "llama_cpp":
self.logger.info("\t+ Benchmarking a Llama.cpp model")
self.pretrained_config = get_transformers_generation_config(self.config.model, **self.config.model_kwargs)
self.model_shapes = {}
self.pretrained_processor = None
self.generation_config = None
else:
self.logger.info("\t+ Benchmarking a Transformers model")
self.generation_config = get_transformers_generation_config(self.config.model, **self.config.model_kwargs)
Expand Down
12 changes: 8 additions & 4 deletions optimum_benchmark/backends/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,14 +53,16 @@ def __post_init__(self):

# TODO: add cache_dir, token, etc. to these methods
if self.task is None:
self.task = infer_task_from_model_name_or_path(self.model, self.model_kwargs.get("revision", None))
self.task = infer_task_from_model_name_or_path(
self.model, self.model_kwargs.get("revision", None), self.library
)

if self.library is None:
self.library = infer_library_from_model_name_or_path(self.model, self.model_kwargs.get("revision", None))

if self.model_type is None:
self.model_type = infer_model_type_from_model_name_or_path(
self.model, self.model_kwargs.get("revision", None)
self.model, self.model_kwargs.get("revision", None), self.library
)

if self.device is None:
Expand Down Expand Up @@ -90,8 +92,10 @@ def __post_init__(self):
else:
raise RuntimeError("CUDA device is only supported on systems with NVIDIA or ROCm drivers.")

if self.library not in ["transformers", "diffusers", "timm"]:
raise ValueError(f"`library` must be either `transformers`, `diffusers` or `timm`, but got {self.library}")
if self.library not in ["transformers", "diffusers", "timm", "llama_cpp"]:
raise ValueError(
f"`library` must be either `transformers`, `diffusers`, `timm` or `llama_cpp`, but got {self.library}"
)

if self.inter_op_num_threads is not None:
if self.inter_op_num_threads == -1:
Expand Down
Empty file.
Loading
Loading