Commit b3944ad
added metrics docs, updated links in main docs (#663)
noyoshi authored Nov 1, 2024
1 parent bd92e52 commit b3944ad
Showing 3 changed files with 33 additions and 15 deletions.
28 changes: 13 additions & 15 deletions docs/index.md
@@ -27,25 +27,23 @@ LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fin

## 🌳 Features

-- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters.md#huggingface-hub), [Predibase](./models/adapters.md#predibase), or [any filesystem](./models/adapters.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
-- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
-- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
-- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
-- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
-- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.
+- 🚅 **Dynamic Adapter Loading:** include any fine-tuned LoRA adapter from [HuggingFace](./models/adapters/index.md#huggingface-hub), [Predibase](./models/adapters/index.md#predibase), or [any filesystem](./models/adapters/index.md#local) in your request, it will be loaded just-in-time without blocking concurrent requests. [Merge adapters](./guides/merging_adapters.md) per request to instantly create powerful ensembles.
+- 🏋️‍♀️ **Heterogeneous Continuous Batching:** packs requests for different adapters together into the same batch, keeping latency and throughput nearly constant with the number of concurrent adapters.
+- 🧁 **Adapter Exchange Scheduling:** asynchronously prefetches and offloads adapters between GPU and CPU memory, schedules request batching to optimize the aggregate throughput of the system.
+- 👬 **Optimized Inference:** high throughput and low latency optimizations including tensor parallelism, pre-compiled CUDA kernels ([flash-attention](https://arxiv.org/abs/2307.08691), [paged attention](https://arxiv.org/abs/2309.06180), [SGMV](https://arxiv.org/abs/2310.18547)), quantization, token streaming.
+- 🚢 **Ready for Production** prebuilt Docker images, Helm charts for Kubernetes, Prometheus metrics, and distributed tracing with Open Telemetry. OpenAI compatible API supporting multi-turn chat conversations. Private adapters through per-request tenant isolation. [Structured Output](./guides/structured_output.md) (JSON mode).
+- 🤯 **Free for Commercial Use:** Apache 2.0 License. Enough said 😎.

<p align="center">
<img src="https://github.com/predibase/lorax/assets/29719151/f88aa16c-66de-45ad-ad40-01a7874ed8a9" />
</p>
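
The dynamic adapter loading described in the features above can also be exercised over the raw HTTP API. Below is a hedged sketch: the `/generate` payload shape follows the project's REST interface, the adapter ID is the one used in the quickstart further down, and the host/port are assumptions for a locally running server.

```python
# Sketch: per-request dynamic adapter loading over the REST API.
# Assumes a LoRAX server listening on localhost:8080.
import requests

response = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "[INST] What is the capital of France? [/INST]",  # illustrative prompt
        "parameters": {
            "max_new_tokens": 64,
            # Any supported adapter can be named per request; it is loaded
            # just-in-time without blocking concurrent requests.
            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k",
        },
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```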


## 🏠 Models

Serving a fine-tuned model with LoRAX consists of two components:

-- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
-- [Adapter](./models/adapter.md): task-specific adapter weights dynamically loaded per request.
+- [Base Model](./models/base_models.md): pretrained large model shared across all adapters.
+- [Adapter](./models/adapters/index.md): task-specific adapter weights dynamically loaded per request.

LoRAX supports a number of Large Language Models as the base model including [Llama](https://huggingface.co/meta-llama) (including [CodeLlama](https://huggingface.co/codellama)), [Mistral](https://huggingface.co/mistralai) (including [Zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta)), and [Qwen](https://huggingface.co/Qwen). See [Supported Architectures](./models/base_models.md#supported-architectures) for a complete list of supported base models.

@@ -61,10 +59,10 @@ We recommend starting with our pre-built Docker image to avoid compiling custom

The minimum system requirements needed to run LoRAX include:

-- Nvidia GPU (Ampere generation or above)
-- CUDA 11.8 compatible device drivers and above
-- Linux OS
-- Docker (for this guide)
+- Nvidia GPU (Ampere generation or above)
+- CUDA 11.8 compatible device drivers and above
+- Linux OS
+- Docker (for this guide)

### Launch LoRAX Server

@@ -124,7 +122,7 @@ adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
```
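
The hunk above shows only the tail of the quickstart snippet; the collapsed lines define the `client` and `prompt` it references. For context, here is a minimal self-contained sketch, assuming the `lorax-client` package (`pip install lorax-client`) and a server on `localhost:8080`; the prompt text is illustrative, not the original:

```python
# Minimal sketch of the surrounding quickstart code (assumed, not verbatim).
from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] What is the capital of France? [/INST]"  # illustrative

# Prompt the base model, then route the same prompt through a LoRA adapter.
print(client.generate(prompt, max_new_tokens=64).generated_text)

adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)
```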

-See [Reference - Python Client](./reference/python_client.md) for full details.
+See [Reference - Python Client](./reference/python_client/client.md) for full details.

For other ways to run LoRAX, see [Getting Started - Kubernetes](./getting_started/kubernetes.md), [Getting Started - SkyPilot](./getting_started/skypilot.md), and [Getting Started - Local](./getting_started/local.md).

19 changes: 19 additions & 0 deletions docs/reference/metrics.md
@@ -0,0 +1,19 @@
# Metrics

Prometheus-compatible metrics are exposed on the `/metrics` endpoint, on the server's default port.

Below is a list of the metrics that are exposed:
| Metric Name | Type |
| -------------------------------------------- | --------- |
| `lorax_request_count` | Counter |
| `lorax_request_success` | Counter |
| `lorax_request_failure` | Counter |
| `lorax_request_duration` | Histogram |
| `lorax_request_queue_duration` | Histogram |
| `lorax_request_validation_duration` | Histogram |
| `lorax_request_inference_duration` | Histogram |
| `lorax_request_mean_time_per_token_duration` | Histogram |
| `lorax_request_generated_tokens` | Histogram |
| `lorax_request_input_length` | Histogram |

For every histogram, two additional series are autogenerated: the metric name suffixed with `_sum` (the sum of all observed values) and with `_count` (the number of observations).
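
As a worked example of using the `_sum`/`_count` pair, the sketch below scrapes the endpoint and derives a mean request duration. The server address is an assumption and label sets are ignored for brevity; in practice a Prometheus server would scrape this endpoint and compute the same ratio with `rate()` queries.

```python
# Sketch: consuming the metrics endpoint directly, assuming a LoRAX server
# on localhost:8080.
import requests

resp = requests.get("http://127.0.0.1:8080/metrics", timeout=5)
resp.raise_for_status()

# Parse the Prometheus text format into {metric_name: value}; label sets
# are stripped for brevity, so repeated names keep their last value.
samples: dict[str, float] = {}
for line in resp.text.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blank lines and HELP/TYPE comments
    parts = line.split()
    name = parts[0].split("{")[0]
    samples[name] = float(parts[1])

# For any histogram, <name>_sum / <name>_count is the mean observation,
# e.g. the mean end-to-end request duration:
total = samples.get("lorax_request_duration_sum", 0.0)
count = samples.get("lorax_request_duration_count", 0.0)
if count:
    print(f"mean request duration: {total / count:.3f}s")
```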
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -50,6 +50,7 @@ nav:
- lorax.client: reference/python_client/client.md
# - lorax.types: reference/python_client/types.md
- OpenAI Compatible API: reference/openai_api.md
+- Metrics: reference/metrics.md
- 🔬 Guides:
- Quantization: guides/quantization.md
- Structured Output (JSON): guides/structured_output.md
