feat(docs): improve HPA documentation (#6091)
* highlight constraints and limitations of a HPA-based approach
* remove note on statefulsets being created sequentially - we are specifically configuring k8s to allow for parallel creation of statefulset pods.
* highlight importance of the `metrics-relist-interval` setting
* simplify config example to no longer use regex metric matches
* clarify example using HPA label selectors
* clarify the need to use the `AverageValue` target type
* clarify the relation between query rate window size and prometheus scrape interval
lc525 authored Dec 6, 2024
1 parent c431854 commit 6d89d57
Showing 1 changed file with 137 additions and 48 deletions.
185 changes: 137 additions & 48 deletions docs-gb/kubernetes/hpa-rps-autoscaling.md
@@ -13,6 +13,19 @@ and servers (single-model serving). This will require:
* Configuring HPA manifests to scale Models and the corresponding Server replicas based on the
custom metrics

{% hint style="warning" %}
The Core 2 HPA-based autoscaling has the following constraints/limitations:

- HPA scaling only targets single-model serving, where there is a 1:1 correspondence between models and servers. Autoscaling for multi-model serving (MMS) is supported for specific models and workloads via the Core 2 native features described [here](autoscaling.md).
- Significant improvements to MMS autoscaling are planned for future releases.

- **Only custom metrics** from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
- We are working on improvements in Core 2 to allow both servers and models to be scaled based on a single HPA manifest, targeting the Model CR.

- Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from `prometheus-adapter`, it will need to be removed before you can scale Core 2 models and servers via HPA.
- The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
{% endhint %}

## Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom
@@ -36,6 +49,14 @@ If you are running Prometheus on a different port than the default 9090, you can also pass `--set
prometheus.port=[custom_port]`. You may inspect all the options available as helm values by
running `helm show values prometheus-community/prometheus-adapter`.

{% hint style="warning" %}
Please check that the `metricsRelistInterval` helm value (which defaults to 1m) works well in your
setup, and update it if needed. This value needs to be greater than or equal to your Prometheus
scrape interval. The corresponding prometheus-adapter command-line argument is
`--metrics-relist-interval`. If the relist interval is set incorrectly, some of the custom
metrics will be intermittently reported as missing.
{% endhint %}
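
As a reference, a minimal sketch of the relevant helm values is shown below; the Prometheus URL, port and scrape interval are assumptions that you should replace with the values from your own setup:

```yaml
# Illustrative prometheus-adapter values fragment (sketch only; adjust to your environment)
prometheus:
  url: http://prometheus.monitoring.svc  # assumed Prometheus service URL
  port: 9090                             # assumed Prometheus port
# must be >= the Prometheus scrape interval (assumed to be at most 1m here)
metricsRelistInterval: 1m
```

These values can be passed at install time via `-f [values_file]` or as individual `--set` flags, as in the `--set prometheus.port=[custom_port]` example above.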

We now need to configure the adapter to look for the correct prometheus metrics and compute
per-model RPS values. On install, the adapter has created a `ConfigMap` in the same namespace as
itself, named `[helm_release_name]-prometheus-adapter`. In our case, it will be
@@ -60,46 +81,90 @@ data:
"rules":
-
"seriesQuery": |
{__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
- "isNot": "^seldon_.*_seconds_total"
- "isNot": "^seldon_.*_aggregate_.*"
{__name__="seldon_model_infer_total",namespace!=""}
"resources":
"overrides":
"model": {group: "mlops.seldon.io", resource: "model"}
"server": {group: "mlops.seldon.io", resource: "server"}
"pod": {resource: "pod"}
"namespace": {resource: "namespace"}
"name":
"matches": "^seldon_model_(.*)_total"
"as": "${1}_rps"
"matches": "seldon_model_infer_total"
"as": "infer_rps"
"metricsQuery": |
sum by (<<.GroupBy>>) (
rate (
<<.Series>>{<<.LabelMatchers>>}[1m]
<<.Series>>{<<.LabelMatchers>>}[2m]
)
)
````
{% endcode %}

In this example, a single rule is defined to fetch the `seldon_model_infer_total` metric
from Prometheus, compute its rate over a 1 minute window, and expose this to k8s as the `infer_rps`
metric, with aggregations at model, server, inference server pod and namespace level.
In this example, a single rule is defined to fetch the `seldon_model_infer_total` metric from
Prometheus, compute its per-second rate of change over a 2 minute sliding window,
and expose this to Kubernetes as the `infer_rps` metric, with aggregations available at model,
server, inference server pod and namespace level.

When HPA requests the `infer_rps` metric via the custom metrics API for a specific model,
prometheus-adapter issues a Prometheus query in line with what is defined in its config.

For the configuration in our example, the query for a model named `irisa0` in namespace
`seldon-mesh` would be:

```
sum by (model) (
rate (
seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]
)
)
```

You may want to modify the query in the example to match the one that you typically use in your
monitoring setup for RPS metrics. The example calls [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate)
with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window
before query time are used to compute the RPS.

It is important to sanity-check the query by executing it against your Prometheus instance. To
do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests
to it. Then, wait for at least twice the Prometheus scrape interval (1 minute by default), so
that two values from the series are captured and a rate can be computed. Finally, modify the
model name and namespace in the query above to match the model you've picked, and execute the
query.

If the query result is empty, adjust it until it consistently returns the expected metric
values. Pay special attention to the window size (2 minutes in the example): if it is smaller
than twice the Prometheus scrape interval, the query may return no results. The window size is
therefore a compromise: it needs to be large enough to reject noise, but small enough to keep
the result responsive to quick changes in load.

A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available [here](../metrics/operational.md),
and those may be used when customizing the configuration.
Update the `metricsQuery` in the prometheus-adapter ConfigMap to match any query changes you
have made during tests.
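
For example, if during testing you settled on a different window size (a hypothetical 3 minute window is used below purely for illustration), the `metricsQuery` in the rule would change accordingly:

```yaml
"metricsQuery": |
  sum by (<<.GroupBy>>) (
    rate (
      <<.Series>>{<<.LabelMatchers>>}[3m]
    )
  )
```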

### Understanding prometheus-adapter rule definitions
A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and
Pipelines is available [here](../metrics/operational.md), and those may be used when customizing
the configuration.

### Customizing prometheus-adapter rule definitions

The rule definition can be broken down into four parts:

* _Discovery_ (the `seriesQuery` and `seriesFilters` keys) controls what Prometheus
metrics are considered for exposure via the k8s custom metrics API.

In the example, all the Seldon Prometheus metrics of the form `seldon_model_*_total` are
considered, excluding metrics pre-aggregated across all models (`.*_aggregate_.*`) as well as
the cumulative inference time per model (`.*_seconds_total`). For RPS, we are only interested in
the model inference count (`seldon_model_infer_total`)
As an alternative to the example above, all the Seldon Prometheus metrics of the form `seldon_model.*_total`
could be considered, followed by excluding metrics pre-aggregated across all models (`.*_aggregate_.*`) as well as
the cumulative inference time per model (`.*_seconds_total`):

```yaml
"seriesQuery": |
{__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
- "isNot": "^seldon_.*_seconds_total"
- "isNot": "^seldon_.*_aggregate_.*"
...
```

For RPS, we are only interested in the model inference count (`seldon_model_infer_total`).

* _Association_ (the `resources` key) controls the Kubernetes resources that a particular
metric can be attached to or aggregated over.
@@ -125,8 +190,14 @@ The rule definition can be broken down in four parts:
`seldon_model_infer_total` and expose custom metric endpoints named `infer_rps`, which when
called return the result of a query over the Prometheus metric.

The matching over the Prometheus metric name uses regex group capture expressions (line 22),
which are then be referenced in the custom metric name (line 23).
Instead of a literal match, one could also use regex group capture expressions,
which can then be referenced in the custom metric name:

```yaml
"name":
"matches": "^seldon_model_(.*)_total"
"as": "${1}_rps"
```

* _Querying_ (the `metricsQuery` key) defines how a request for a specific k8s custom metric gets
converted into a Prometheus query.
@@ -141,16 +212,11 @@ The rule definition can be broken down in four parts:
- .GroupBy is replaced by the resource type of the requested metric (e.g. `model`,
`server`, `pod` or `namespace`).

You may want to modify the query in the example to match the one that you typically use in
your monitoring setup for RPS metrics. The example calls [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate)
with a 1 minute window.


For a complete reference on how `prometheus-adapter` can be configured via the `ConfigMap`, please
consult the docs [here](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md).



Once you have applied any necessary customizations, replace the default prometheus-adapter config
with the new one, and restart the deployment (this restart is required so that prometheus-adapter
picks up the new config):
@@ -301,8 +367,8 @@ spec:
```
{% endcode %}

In the preceding HPA manifests, the scaling metric is exactly the same, and uses the exact same
parameters. This is to ensure that both the Models and the Servers are scaled up/down at
It is important to keep both the scaling metric and any scaling policies the same across the two
HPA manifests. This is to ensure that both the Models and the Servers are scaled up/down at
approximately the same time. Small variations in the scale-up time are expected because each HPA
samples the metrics independently, at regular intervals.

@@ -317,9 +383,17 @@ In order to ensure similar scaling behaviour between Models and Servers, the number of
`minReplicas` and `maxReplicas`, as well as any other configured scaling policies should be kept
in sync across the HPA for the model and the server.
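
For illustration, a minimal fragment of the settings that should stay identical across the Model HPA and the Server HPA is sketched below, reusing the `irisa0` model from the earlier examples; the `mlops.seldon.io/v1alpha1` API version and the threshold of 3 RPS per replica are assumptions:

```yaml
# Fragment to keep identical in BOTH the Model HPA and the Server HPA (sketch only)
spec:
  minReplicas: 1     # keep in sync across the two HPA manifests
  maxReplicas: 3     # keep in sync across the two HPA manifests
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1  # assumed API version of the Model CR
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: "3"   # same per-replica RPS threshold in both manifests
```

In such a sketch, only the `scaleTargetRef` (pointing at the Model CR in one manifest and at the Server CR in the other) and the manifest metadata would differ between the two.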

### Details on custom metrics of type Object
{% hint style="danger" %}
The Object metric allows for two target value types: `AverageValue` and `Value`. Of the two,
only `AverageValue` is supported for the current Seldon Core 2 setup. The `Value` target type is
typically used for metrics describing the utilization of a resource and would not be suitable
for RPS-based scaling.
{% endhint %}


### HPA metrics of type Object

The HPA manifests use metrics of type "Object" that fetch the data used in scaling
The example HPA manifests use metrics of type "Object" that fetch the data used in scaling
decisions by querying k8s metrics associated with a particular k8s object. The endpoints that
HPA uses for fetching those metrics are the same ones that were tested in the previous section
using `kubectl get --raw ...`. Because you have configured the Prometheus Adapter to expose those
@@ -348,7 +422,7 @@ query template configured in our example would be transformed into:
```
sum by (namespace) (
rate (
seldon_model_infer_total{namespace="seldon-mesh"}[1m]
seldon_model_infer_total{namespace="seldon-mesh"}[2m]
)
)
```
@@ -360,25 +434,37 @@ identifying the namespace in which the HPA manifest resides:
```
sum by (pod) (
rate (
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[1m]
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[2m]
)
)
```

For the `target` of the Object metric you **must** use a `type` of `AverageValue`. The value
given in `averageValue` represents the per replica RPS scaling threshold of the `scaleTargetRef`
object (either a Model or a Server in our case), with the target number of replicas being
computed by HPA according to the following formula:
The `target` section establishes the thresholds used in scaling decisions. For RPS, the
`AverageValue` target type refers to the per-replica RPS threshold above which the number of
`scaleTargetRef` (Model or Server) replicas should be increased. The target number of replicas
is computed by HPA according to the following formula:

$$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{thresholdPerReplicaRPS}}$$
$$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{averageValue}}$$

As an example, if `averageValue=50` and `infer_rps=150`, the `targetReplicas` would be 3.

Importantly, computing the target number of replicas does not require knowing the number of
active pods currently associated with the Server or Model. This is what allows both the Model
and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would
attempt to take ownership of the same set of pods, and transition into a failure state.

This is also why the `Value` target type is **not currently supported**. In this case, HPA first
computes a `utilizationRatio`:

$$\texttt{utilizationRatio} = \frac{\texttt{custom\_metric\_value}}{\texttt{threshold\_value}}$$

As an example, if `threshold_value=100` and `custom_metric_value=200`, the `utilizationRatio`
would be 2. HPA deduces from this that the number of active pods associated with the
`scaleTargetRef` object should be doubled, and expects that once that target is achieved, the
`custom_metric_value` will become equal to the `threshold_value` (`utilizationRatio=1`). However,
by using the number of active pods, the HPA CRs for both the Model and the Server also try to
take exclusive ownership of the same set of pods, and fail.

{% hint style="info" %}
**Note**: Attempting other target types does not work under the current Seldon Core 2 setup, because they
use the number of active Pods associated with the Model CR (i.e. the associated Server pods) in
the `targetReplicas` computation. However, this also means that this set of pods becomes "owned"
by the Model HPA. Once a pod is owned by a given HPA it is not available for other HPAs to use,
so we would no longer be able to scale the Server CRs using HPA.
{% endhint %}

### HPA sampling of custom metrics

Expand Down Expand Up @@ -421,7 +507,8 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v

* Filtering metrics by additional labels on the prometheus metric:

The prometheus metric from which the model RPS is computed has the following labels:
The prometheus metric from which the model RPS is computed has the following labels managed
by Seldon Core 2:

```c-like
seldon_model_infer_total{
@@ -440,9 +527,11 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v
}
```

If you want the scaling metric to be computed based on inferences with a particular value
for any of those labels, you can add this in the HPA metric config, as in the example
(targeting `method_type="rest"`):
If you want the scaling metric to be computed based on a subset of the Prometheus time
series with particular label values (labels either managed by Seldon Core 2 or added
automatically within your infrastructure), you can add this as a selector in the HPA metric
config. This is shown in the following example, which scales based only on the RPS of REST
requests, as opposed to REST + gRPC:

```yaml
metrics:
@@ -461,6 +550,7 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v
type: AverageValue
averageValue: "3"
```

* Customize scale-up / scale-down rate & properties by using scaling policies as described in
the [HPA scaling policies docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior)

@@ -592,8 +682,7 @@ into account when setting the HPA policies.
within the set `periodSeconds`) is not recommended because of this.
- Perhaps more importantly, there is no reason to scale faster than the time it takes for
replicas to become available - this is the true maximum rate with which scaling up can
happen anyway. Because the underlying Server replica pods are part of a stateful set, they
are created sequentially by k8s.
happen anyway.

{% code title="hpa-custom-policy.yaml" lineNumbers="true" %}
```yaml
