diff --git a/docs/index.md b/docs/index.md
index ac877a4a..8bf25e3a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,25 +27,23 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin
 
 ## 🌳 Features
 
-- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters.md#huggingface-hub), [Predibase](./models/adapters.md#predibase), or [any filesystem](./models/adapters.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
-- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
-- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
-- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
-- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
-- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
-
+- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters/index.md#huggingface-hub), [Predibase](./models/adapters/index.md#predibase), or [any filesystem](./models/adapters/index.md#local) in your request; it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
+- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
+- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, and schedules request batching to optimize the aggregate throughput of the system.
+- 👬 **Optimized Inference:** high-throughput and low-latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, and token streaming.
+- 🚢 **Ready for Production:** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with OpenTelemetry. OpenAI-compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
+- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
-
 
 ## 🏠 Models
 
 Serving a fine-tuned model with LoRAX consists of two components:
 
-- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
-- [Adapter](./models/adapter.md): task-specific adapter weights dynamically loaded per request.
+- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
+- [Adapter](./models/adapters/index.md): task-specific adapter weights dynamically loaded per request.
 
 LoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](./models/base_models.md#supported-architectures) for a complete list of supported base models.
@@ -61,10 +59,10 @@ We recommend starting with our pre-built Docker image to avoid compiling custom
 
 The minimum system requirements need to run LoRAX include:
 
-- Nvidia GPU (Ampere generation or above)
-- CUDA 11.8 compatible device drivers and above
-- Linux OS
-- Docker (for this guide)
+- Nvidia GPU (Ampere generation or above)
+- CUDA 11.8 compatible device drivers and above
+- Linux OS
+- Docker (for this guide)
 
 ### Launch LoRAX Server
 
@@ -124,7 +122,7 @@ adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
 print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
 ```
 
-See [Reference - Python Client](./reference/python_client.md) for full details.
+See [Reference - Python Client](./reference/python_client/client.md) for full details.
 
 For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).
diff --git a/docs/reference/metrics.md b/docs/reference/metrics.md
new file mode 100644
index 00000000..0b211558
--- /dev/null
+++ b/docs/reference/metrics.md
@@ -0,0 +1,20 @@
+# Metrics
+
+Prometheus-compatible metrics are made available on the default port at the `/metrics` endpoint.
+
+Below is a list of the metrics that are exposed:
+
+| Metric Name                                  | Type      |
+| -------------------------------------------- | --------- |
+| `lorax_request_count`                        | Counter   |
+| `lorax_request_success`                      | Counter   |
+| `lorax_request_failure`                      | Counter   |
+| `lorax_request_duration`                     | Histogram |
+| `lorax_request_queue_duration`               | Histogram |
+| `lorax_request_validation_duration`          | Histogram |
+| `lorax_request_inference_duration`           | Histogram |
+| `lorax_request_mean_time_per_token_duration` | Histogram |
+| `lorax_request_generated_tokens`             | Histogram |
+| `lorax_request_input_length`                 | Histogram |
+
+For each histogram, two additional metrics are autogenerated: the metric name suffixed with `_sum` (the sum of all observed values for that histogram) and with `_count` (the number of observations).
diff --git a/mkdocs.yml b/mkdocs.yml
index 7f3b55c8..03b07fe8 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -50,6 +50,7 @@ nav:
         - lorax.client: reference/python_client/client.md
         # - lorax.types: reference/python_client/types.md
     - OpenAI Compatible API: reference/openai_api.md
+    - Metrics: reference/metrics.md
   - 🔬 Guides:
     - Quantization: guides/quantization.md
     - Structured Output (JSON): guides/structured_output.md
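As a quick way to exercise the new `/metrics` endpoint documented above, the sketch below fetches the Prometheus exposition output and filters it to the LoRAX metric families. It assumes a LoRAX server is already running and reachable at `http://127.0.0.1:8080` (the port mapping used in the docs' Docker quickstart — adjust for your deployment) and that the third-party `requests` package is installed.

```python
import requests

# Assumed local deployment: LoRAX exposes Prometheus metrics on the same
# default port as the REST API, at the /metrics endpoint.
METRICS_URL = "http://127.0.0.1:8080/metrics"

resp = requests.get(METRICS_URL, timeout=10)
resp.raise_for_status()

# The body is Prometheus text exposition format: `# HELP` / `# TYPE`
# comment lines followed by one sample per line. Print only the lorax_*
# families, which includes the autogenerated `_sum` / `_count` series
# that accompany each histogram.
for line in resp.text.splitlines():
    if line.startswith("lorax_"):
        print(line)
```

From there, the standard PromQL idiom for histogram averages applies, e.g. `rate(lorax_request_duration_sum[5m]) / rate(lorax_request_duration_count[5m])` to chart the mean request duration over a trailing five-minute window.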