server : refactor middleware and /health endpoint #9056

Merged 6 commits on Aug 16, 2024
35 changes: 26 additions & 9 deletions examples/server/README.md
@@ -368,15 +368,16 @@ node index.js

## API Endpoints

### GET `/health`: Returns the current state of the server
### GET `/health`: Returns health check result

- 503 -> `{"status": "loading model"}` if the model is still being loaded.
- 500 -> `{"status": "error"}` if the model failed to load.
- 200 -> `{"status": "ok", "slots_idle": 1, "slots_processing": 2 }` if the model is successfully loaded and the server is ready for further requests mentioned below.
- 200 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if no slots are currently available.
- 503 -> `{"status": "no slot available", "slots_idle": 0, "slots_processing": 32}` if the query parameter `fail_on_no_slot` is provided and no slots are currently available.
**Response format**

If the query parameter `include_slots` is passed, `slots` field will contain internal slots data except if `--slots-endpoint-disable` is set.
- HTTP status code 503
  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
  - Explanation: the model is still being loaded.
- HTTP status code 200
  - Body: `{"status": "ok" }`
  - Explanation: the model is successfully loaded and the server is ready.
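
As an illustration of how a client might consume the new `/health` response format, here is a minimal Python sketch. It is not part of this PR: the function names `classify_health` and `check_health` and the `base_url` parameter are hypothetical, and the labels `"ready"`, `"loading"`, and `"unknown"` are chosen only for this example. It relies solely on the two documented responses (503 with an `unavailable_error` body, 200 with `{"status": "ok"}`).

```python
import json
import urllib.error
import urllib.request


def classify_health(status_code: int, body: str) -> str:
    """Map a /health response to 'ready', 'loading', or 'unknown'.

    Only the two documented cases are recognized; anything else is 'unknown'.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    # Documented success case: HTTP 200 with {"status": "ok"}
    if status_code == 200 and payload.get("status") == "ok":
        return "ready"
    # Documented loading case: HTTP 503 with an error object of type unavailable_error
    error = payload.get("error")
    if (
        status_code == 503
        and isinstance(error, dict)
        and error.get("type") == "unavailable_error"
    ):
        return "loading"
    return "unknown"


def check_health(base_url: str, timeout: float = 5.0) -> str:
    """Query GET /health on a running server.

    base_url is an assumption of this sketch (for example http://localhost:8080).
    Connection failures (urllib.error.URLError) are not handled here.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return classify_health(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx status codes; the 503 body is still readable.
        return classify_health(err.code, err.read().decode("utf-8"))
```

A caller could poll `check_health` until it returns `"ready"` before sending requests. Keeping the classification separate from the network call makes the mapping easy to test without a server.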

### POST `/completion`: Given a `prompt`, it returns the predicted completion.

@@ -639,10 +640,16 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
}'
```

### GET `/slots`: Returns the current slots processing state. Can be disabled with `--slots-endpoint-disable`.
### GET `/slots`: Returns the current slots processing state

This endpoint can be disabled with `--no-slots`.

If the query parameter `?fail_on_no_slot=1` is set, this endpoint responds with status code 503 when no slots are available.

**Response format**

Example:

```json
[
{
@@ -702,7 +709,13 @@ Given a ChatML-formatted json description in `messages`, it returns the predicte
]
```

### GET `/metrics`: Prometheus compatible metrics exporter endpoint if `--metrics` is enabled:
Possible values for `slot[i].state` are:
- `0`: SLOT_STATE_IDLE
- `1`: SLOT_STATE_PROCESSING
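
The state values above can be used to summarize a `/slots` response on the client side. The following Python sketch is an illustration only and not part of this PR: `SLOT_STATES`, `summarize_slots`, and `slots_url` are hypothetical names, and the sketch assumes only that the response is a JSON array of objects that each carry a numeric `state` field.

```python
import json

# Mapping taken from the documented values of slot[i].state
SLOT_STATES = {0: "SLOT_STATE_IDLE", 1: "SLOT_STATE_PROCESSING"}


def summarize_slots(body: str) -> dict:
    """Count slots per state from the JSON body of GET /slots."""
    slots = json.loads(body)
    counts = {name: 0 for name in SLOT_STATES.values()}
    for slot in slots:
        name = SLOT_STATES.get(slot.get("state"))
        if name is None:
            # A value outside the documented set; fail loudly rather than guess.
            raise ValueError(f"unexpected slot state: {slot.get('state')!r}")
        counts[name] += 1
    return counts


def slots_url(base_url: str, fail_on_no_slot: bool = False) -> str:
    """Build the /slots URL, optionally adding ?fail_on_no_slot=1.

    base_url is an assumption of this sketch (for example http://localhost:8080).
    """
    url = f"{base_url}/slots"
    if fail_on_no_slot:
        url += "?fail_on_no_slot=1"
    return url
```

With `fail_on_no_slot=True`, a client should expect a 503 status when no slot is available and handle it as a "busy" signal rather than a server failure.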

### GET `/metrics`: Prometheus compatible metrics exporter

This endpoint is only accessible if `--metrics` is set.

Available metrics:
- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
@@ -767,6 +780,10 @@ Available metrics:

### GET `/lora-adapters`: Get list of all LoRA adapters

This endpoint returns the loaded LoRA adapters. You can add adapters using `--lora` when starting the server, for example: `--lora my_adapter_1.gguf --lora my_adapter_2.gguf ...`

By default, all adapters are loaded with their scale set to 1. To initialize every adapter's scale to 0, add `--lora-init-without-apply`.

If an adapter is disabled, the scale will be set to 0.
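
To make the flag behaviour described above concrete, here is a small Python sketch that assembles the adapter-related command-line arguments and mirrors the documented initial scales. It is illustrative only: `lora_args` and `initial_scales` are hypothetical helpers, not part of the server, and the sketch deliberately leaves out the server binary and any other flags.

```python
def lora_args(adapters, init_without_apply=False):
    """Build the argument list for loading adapters: one --lora per adapter file."""
    args = []
    for path in adapters:
        args += ["--lora", path]
    if init_without_apply:
        # Documented flag: start with every adapter's scale at 0
        args.append("--lora-init-without-apply")
    return args


def initial_scales(adapters, init_without_apply=False):
    """Return the documented starting scale per adapter (1 by default, 0 with the flag)."""
    scale = 0.0 if init_without_apply else 1.0
    return {path: scale for path in adapters}
```

For example, `lora_args(["my_adapter_1.gguf", "my_adapter_2.gguf"])` yields the same flags as the command-line example above, and the result can be appended to whatever command is used to start the server.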

**Response format**