```bash
# Clone and install etalon in a fresh conda environment
git clone https://github.com/project-etalon/etalon.git
conda create -n etalon python=3.10
conda activate etalon
cd etalon
pip install -e .

# Optional: install the vLLM extra to serve and benchmark models locally with vLLM
pip install -e ".[vllm]"
```
First, create and set up your account at https://<your-org>.wandb.io/ or on the public wandb instance and obtain an API key. Then run the following command and enter the API key linked to your wandb account:
```bash
wandb login --host https://<your-org>.wandb.io
```
To opt out of wandb, do any of the following:
- Don't pass any wandb-related args such as `--wandb-project`, `--wandb-group` and `--wandb-run-name` when running the Python scripts. Alternatively, pass the boolean flag `--no-should-write-metrics` instead of `--should-write-metrics`.
- Run `export WANDB_MODE=disabled` in your shell, or add it to `~/.zshrc` or `~/.bashrc`. Remember to reload your shell using `source ~/.zshrc` or `source ~/.bashrc`.
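The same opt-out can also be applied from a Python wrapper script by setting the environment variable before any wandb initialization runs. This is just a convenience sketch, equivalent to the shell export above:

```python
# Equivalent to `export WANDB_MODE=disabled`: must be set before wandb
# is initialized anywhere in the process.
import os

os.environ["WANDB_MODE"] = "disabled"
```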
```bash
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE=https://api.endpoints.anyscale.com/v1
```
```bash
python -m etalon.run_benchmark \
    --model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --max-num-completed-requests 150 \
    --timeout 600 \
    --num-ray-clients 2 \
    --num-concurrent-requests-per-client 5 \
    --output-dir "result_outputs" \
    --request-interval-generator-provider "poisson" \
    --poisson-request-interval-generator-qps 0.5 \
    --request-length-generator-provider "trace" \
    --trace-request-length-generator-trace-file "./data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv" \
    --request-generator-max-tokens 8192 \
    --ttft-deadline 0.3 \
    --tbt-deadline 0.03 \
    --should-write-metrics \
    --wandb-project Project \
    --wandb-group Group \
    --wandb-run-name Run
```
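For intuition on the `poisson` request-interval generator used above: a Poisson arrival process at a given QPS draws inter-arrival times from an exponential distribution with mean 1/QPS. The sketch below only illustrates the idea and is not etalon's implementation:

```python
# Illustrative Poisson arrival process: exponential inter-arrival times
# with mean 1/qps (so qps=0.5 gives ~2 seconds between requests on average).
import random

def poisson_arrival_times(qps: float, num_requests: int, seed: int = 0):
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with rate `qps`
        times.append(t)
    return times

print(poisson_arrival_times(qps=0.5, num_requests=5))
```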
There are many more arguments for running the benchmark; run the following to learn more:
```bash
python -m etalon.run_benchmark -h
```
etalon can be run with any open-source LLM inference system. If the system does not provide OpenAI-compatible APIs, implement a new LLM client to support it, as explained below.
Here we give an example with vLLM.
```bash
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123 -tp 1 --rope-scaling '{"type":"dynamic","factor":2.0}'
```
If we need a higher context length than the model natively supports, we can extend it by passing `--rope-scaling '{"type":"dynamic","factor":2.0}'` as shown above. Adjust the type and factor as per the use case.
```bash
export OPENAI_API_KEY=token-abc123
export OPENAI_API_BASE=http://localhost:8000/v1
```
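Before running the benchmark, you can optionally confirm that the vLLM server is reachable through its OpenAI-compatible API. This sketch uses the official `openai` Python client (not part of etalon) with the key and base URL exported above:

```python
# Quick connectivity check against the OpenAI-compatible vLLM server.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],    # token-abc123 from above
    base_url=os.environ["OPENAI_API_BASE"],  # http://localhost:8000/v1
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=8,
)
print(response.choices[0].message.content)
```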
Then run the benchmark as shown earlier. Be sure to set the `--model` flag to the same model used to launch vLLM.
The benchmark results are saved in the directory specified by the `--output-dir` argument.
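The exact file names and formats written by the benchmark are not listed here; the sketch below simply enumerates whatever the run produced in the output directory, without assuming a particular schema:

```python
# List whatever files the benchmark run produced in --output-dir.
from pathlib import Path

for path in sorted(Path("result_outputs").iterdir()):
    print(path.name, f"({path.stat().st_size} bytes)")
```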
To profile the prefill times of an open-source system and create a prefill-time predictor for a given model and system, based on input prompt length, we can run `etalon.prefill_profiler`.
Launch any open-source system and set up the API key and URL as shown for vLLM.
```bash
python -m etalon.prefill_profiler \
    --model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --timeout 600 \
    --fixed-request-generator-decode-tokens 16 \
    --output-dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b" \
    --should-use-given-dir true
```
To modify the range of prompt lengths for which prefill times are profiled, use the `--prefill-lengths` flag as follows:
```bash
python -m etalon.prefill_profiler \
    --model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --timeout 600 \
    --output-dir "prefill_experiments/prefill_profiler_vllm_llama-3-8b" \
    --should-use-given-dir true \
    --prefill-lengths 256 512 1024 2048 4096 8192 16384 32768 65536
```
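As an illustration of what a prefill-time predictor can look like once profiling has produced (prompt length, prefill time) pairs, the sketch below fits a simple polynomial to such measurements. It is a conceptual example, not etalon's predictor, and the sample points are placeholders rather than real measurements:

```python
# Conceptual sketch: fit a prefill-time predictor from profiled
# (prompt_length, prefill_time_seconds) pairs. Not etalon's implementation.
import numpy as np

def fit_prefill_predictor(lengths, times, degree=2):
    coeffs = np.polyfit(lengths, times, deg=degree)
    return np.poly1d(coeffs)

# Placeholder profile data (replace with values from the profiler output).
lengths = [256, 512, 1024, 2048, 4096]
times = [0.05, 0.09, 0.20, 0.45, 1.10]

predictor = fit_prefill_predictor(lengths, times)
print("predicted prefill time for 8192 tokens:", float(predictor(8192)))
```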
Important: run the prefill profiler for a given model and open-source system before running capacity search with the deadline-based SLO type.
Refer to the README in the `etalon/capacity_search` folder to learn more about running capacity search.
To implement a new LLM client, implement the base class `etalon.llm_client.BaseLLMClient` and decorate it as a Ray actor.
```python
from typing import Tuple

import ray

from etalon.llm_client import BaseLLMClient
# Metrics and RequestConfig are provided by etalon; their import paths are omitted here.


@ray.remote
class CustomLLMClient(BaseLLMClient):
    def send_llm_request(self, request_config: RequestConfig) -> Tuple[Metrics, str, RequestConfig]:
        """Make a single completion request to a LLM API.

        Returns:
            Metrics about the performance characteristics of the request.
            The text generated by the request to the LLM API.
            The request_config used to make the request. This is mainly for logging purposes.
        """
        ...
```
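To make the metrics concrete, the sketch below shows one way a custom client could measure time-to-first-token (TTFT) and time-between-tokens (TBT) from a streaming OpenAI-compatible endpoint. It uses the `openai` client directly and plain floats instead of etalon's `Metrics` object, whose constructor is not shown here:

```python
# Illustrative timing of TTFT and TBT over a streaming response.
# This is not the BaseLLMClient contract; it only shows the measurement idea.
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_BASE"],
)

start = time.monotonic()
first_token_time = None
gaps = []
last = start

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write one sentence about llamas."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.monotonic()
        if first_token_time is None:
            first_token_time = now - start   # TTFT
        else:
            gaps.append(now - last)          # per-chunk TBT approximation
        last = now

print("TTFT (s):", first_token_time)
print("mean TBT (s):", sum(gaps) / len(gaps) if gaps else None)
```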
If you use our work, please consider citing our paper:
```bibtex
@misc{agrawal2024etalonholisticperformanceevaluation,
      title={etalon: Holistic Performance Evaluation Framework for LLM Inference Systems},
      author={Amey Agrawal and Anmol Agarwal and Nitin Kedia and Jayashree Mohan and Souvik Kundu and Nipun Kwatra and Ramachandran Ramjee and Alexey Tumanov},
      year={2024},
      eprint={2407.07000},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.07000},
}
```
This repository was originally created as a fork of the LLMPerf project.