feat(docs): improve HPA documentation (#6091)
* highlight constraints and limitations of a HPA-based approach
* remove note on statefulsets being created sequentially - we are specifically configuring k8s to allow for parallel creation of statefulset pods.
* highlight importance of the `metrics-relist-interval` setting
* simplify config example to no longer use regex metric matches
* clarify example using HPA label selectors
* clarify the need to use the `AverageValue` target type
* clarify the relation between query rate window size and prometheus scrape interval
lc525 authored Dec 6, 2024
1 parent c431854 commit 6d89d57
Showing 1 changed file with 137 additions and 48 deletions.
185 changes: 137 additions & 48 deletions docs-gb/kubernetes/hpa-rps-autoscaling.md
@@ -13,6 +13,19 @@ and servers (single-model serving). This will require:
* Configuring HPA manifests to scale Models and the corresponding Server replicas based on the
custom metrics

{% hint style="warning" %}
The Core 2 HPA-based autoscaling has the following constraints/limitations:

- HPA scaling only targets single-model serving, where there is a 1:1 correspondence between models and servers. Autoscaling for multi-model serving (MMS) is supported for specific models and workloads via the Core 2 native features described [here](autoscaling.md).
- Significant improvements to MMS autoscaling are planned for future releases.

- **Only custom metrics** from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
- We are working on improvements in Core 2 to allow both servers and models to be scaled based on a single HPA manifest, targeting the Model CR.

- Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from `prometheus-adapter`, it will need to be removed before you can scale Core 2 models and servers via HPA.
- The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
{% endhint %}

## Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom
@@ -36,6 +49,14 @@ If you are running Prometheus on a different port than the default 9090, you can also pass `--set
prometheus.port=[custom_port]`. You may inspect all the options available as helm values by
running `helm show values prometheus-community/prometheus-adapter`.

{% hint style="warning" %}
Please check that the `metricsRelistInterval` helm value (which defaults to 1m) works well in your
setup, and update it if needed. This value needs to be greater than or equal to your Prometheus
scrape interval. The corresponding prometheus-adapter command-line argument is
`--metrics-relist-interval`. If the relist interval is set incorrectly, some of the custom
metrics will be intermittently reported as missing.
{% endhint %}
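
As a reference, a minimal sketch of the relevant helm values is shown below; the Prometheus URL, port and scrape interval are assumptions that you should replace with the values from your own setup:

```yaml
# Illustrative prometheus-adapter values fragment (sketch only; adjust to your environment)
prometheus:
  url: http://prometheus.monitoring.svc  # assumed Prometheus service URL
  port: 9090                             # assumed Prometheus port
# must be >= the Prometheus scrape interval (assumed to be at most 1m here)
metricsRelistInterval: 1m
```

These values can be passed at install time via `-f [values_file]` or as individual `--set` flags, as in the `--set prometheus.port=[custom_port]` example above.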

We now need to configure the adapter to look for the correct prometheus metrics and compute
per-model RPS values. On install, the adapter has created a `ConfigMap` in the same namespace as
itself, named `[helm_release_name]-prometheus-adapter`. In our case, it will be
@@ -60,46 +81,90 @@ data:
"rules":
-
"seriesQuery": |
{__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
- "isNot": "^seldon_.*_seconds_total"
- "isNot": "^seldon_.*_aggregate_.*"
{__name__="seldon_model_infer_total",namespace!=""}
"resources":
"overrides":
"model": {group: "mlops.seldon.io", resource: "model"}
"server": {group: "mlops.seldon.io", resource: "server"}
"pod": {resource: "pod"}
"namespace": {resource: "namespace"}
"name":
"matches": "^seldon_model_(.*)_total"
"as": "${1}_rps"
"matches": "seldon_model_infer_total"
"as": "infer_rps"
"metricsQuery": |
sum by (<<.GroupBy>>) (
rate (
<<.Series>>{<<.LabelMatchers>>}[1m]
<<.Series>>{<<.LabelMatchers>>}[2m]
)
)
````
{% endcode %}

In this example, a single rule is defined to fetch the `seldon_model_infer_total` metric
from Prometheus, compute its rate over a 1 minute window, and expose this to k8s as the `infer_rps`
metric, with aggregations at model, server, inference server pod and namespace level.
In this example, a single rule is defined to fetch the `seldon_model_infer_total` metric from
Prometheus, compute its per-second rate of change over a 2 minute sliding window,
and expose this to Kubernetes as the `infer_rps` metric, with aggregations available at model,
server, inference server pod and namespace level.

When HPA requests the `infer_rps` metric via the custom metrics API for a specific model,
prometheus-adapter issues a Prometheus query in line with what is defined in its config.

For the configuration in our example, the query for a model named `irisa0` in namespace
`seldon-mesh` would be:

```
sum by (model) (
rate (
seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]
)
)
```

You may want to modify the query in the example to match the one that you typically use in your
monitoring setup for RPS metrics. The example calls [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate)
with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window
before query time are used to compute the RPS.

It is important to sanity-check the query by executing it against your Prometheus instance. To
do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests
to it. Then, wait for at least twice the Prometheus scrape interval (1 minute by default), so
that two values from the series are captured and a rate can be computed. Finally, modify the
model name and namespace in the query above to match the model you've picked, and execute the
query.

If the query result is empty, adjust it until it consistently returns the expected metric
values. Pay special attention to the window size (2 minutes in the example): if it is smaller
than twice the Prometheus scrape interval, the query may return no results. The window size is
therefore a compromise: it needs to be large enough to reject noise, but small enough to keep
the result responsive to quick changes in load.

A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available [here](../metrics/operational.md),
and those may be used when customizing the configuration.
Update the `metricsQuery` in the prometheus-adapter ConfigMap to match any query changes you
have made during tests.
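
For example, if during testing you settled on a different window size (a hypothetical 3 minute window is used below purely for illustration), the `metricsQuery` in the rule would change accordingly:

```yaml
"metricsQuery": |
  sum by (<<.GroupBy>>) (
    rate (
      <<.Series>>{<<.LabelMatchers>>}[3m]
    )
  )
```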

### Understanding prometheus-adapter rule definitions
A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and
Pipelines is available [here](../metrics/operational.md), and those may be used when customizing
the configuration.

### Customizing prometheus-adapter rule definitions

The rule definition can be broken down into four parts:

* _Discovery_ (the `seriesQuery` and `seriesFilters` keys) controls what Prometheus
metrics are considered for exposure via the k8s custom metrics API.

In the example, all the Seldon Prometheus metrics of the form `seldon_model_*_total` are
considered, excluding metrics pre-aggregated across all models (`.*_aggregate_.*`) as well as
the cumulative inference time per model (`.*_seconds_total`). For RPS, we are only interested in
the model inference count (`seldon_model_infer_total`)
As an alternative to the example above, all the Seldon Prometheus metrics of the form `seldon_model.*_total`
could be considered, followed by excluding metrics pre-aggregated across all models (`.*_aggregate_.*`) as well as
the cumulative inference time per model (`.*_seconds_total`):

```yaml
"seriesQuery": |
{__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
- "isNot": "^seldon_.*_seconds_total"
- "isNot": "^seldon_.*_aggregate_.*"
...
```

For RPS, we are only interested in the model inference count (`seldon_model_infer_total`).

* _Association_ (the `resources` key) controls the Kubernetes resources that a particular
metric can be attached to or aggregated over.
@@ -125,8 +190,14 @@ The rule definition can be broken down in four parts:
`seldon_model_infer_total` and expose custom metric endpoints named `infer_rps`, which when
called return the result of a query over the Prometheus metric.

The matching over the Prometheus metric name uses regex group capture expressions (line 22),
which are then be referenced in the custom metric name (line 23).
Instead of a literal match, one could also use regex group capture expressions,
which can then be referenced in the custom metric name:

```yaml
"name":
"matches": "^seldon_model_(.*)_total"
"as": "${1}_rps"
```

* _Querying_ (the `metricsQuery` key) defines how a request for a specific k8s custom metric gets
converted into a Prometheus query.
@@ -141,16 +212,11 @@ The rule definition can be broken down in four parts:
- .GroupBy is replaced by the resource type of the requested metric (e.g. `model`,
`server`, `pod` or `namespace`).

You may want to modify the query in the example to match the one that you typically use in
your monitoring setup for RPS metrics. The example calls [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate)
with a 1 minute window.


For a complete reference on how `prometheus-adapter` can be configured via the `ConfigMap`, please
consult the docs [here](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md).



Once you have applied any necessary customizations, replace the default prometheus-adapter config
with the new one, and restart the deployment (this restart is required so that prometheus-adapter
picks up the new config):
@@ -301,8 +367,8 @@ spec:
```
{% endcode %}

In the preceding HPA manifests, the scaling metric is exactly the same, and uses the exact same
parameters. This is to ensure that both the Models and the Servers are scaled up/down at
It is important to keep both the scaling metric and any scaling policies the same across the two
HPA manifests. This is to ensure that both the Models and the Servers are scaled up/down at
approximately the same time. Small variations in the scale-up time are expected because each HPA
samples the metrics independently, at regular intervals.

@@ -317,9 +383,17 @@ In order to ensure similar scaling behaviour between Models and Servers, the number of
`minReplicas` and `maxReplicas`, as well as any other configured scaling policies should be kept
in sync across the HPA for the model and the server.
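
For illustration, a minimal fragment of the settings that should stay identical across the Model HPA and the Server HPA is sketched below, reusing the `irisa0` model from the earlier examples; the `mlops.seldon.io/v1alpha1` API version and the threshold of 3 RPS per replica are assumptions:

```yaml
# Fragment to keep identical in BOTH the Model HPA and the Server HPA (sketch only)
spec:
  minReplicas: 1     # keep in sync across the two HPA manifests
  maxReplicas: 3     # keep in sync across the two HPA manifests
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1  # assumed API version of the Model CR
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: "3"   # same per-replica RPS threshold in both manifests
```

In such a sketch, only the `scaleTargetRef` (pointing at the Model CR in one manifest and at the Server CR in the other) and the manifest metadata would differ between the two.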

### Details on custom metrics of type Object
{% hint style="danger" %}
The Object metric allows for two target value types: `AverageValue` and `Value`. Of the two,
only `AverageValue` is supported for the current Seldon Core 2 setup. The `Value` target type is
typically used for metrics describing the utilization of a resource and would not be suitable
for RPS-based scaling.
{% endhint %}


### HPA metrics of type Object

The HPA manifests use metrics of type "Object" that fetch the data used in scaling
The example HPA manifests use metrics of type "Object" that fetch the data used in scaling
decisions by querying k8s metrics associated with a particular k8s object. The endpoints that
HPA uses for fetching those metrics are the same ones that were tested in the previous section
using `kubectl get --raw ...`. Because you have configured the Prometheus Adapter to expose those
@@ -348,7 +422,7 @@ query template configured in our example would be transformed into:
```
sum by (namespace) (
rate (
seldon_model_infer_total{namespace="seldon-mesh"}[1m]
seldon_model_infer_total{namespace="seldon-mesh"}[2m]
)
)
```
@@ -360,25 +434,37 @@ identifying the namespace in which the HPA manifest resides:
```
sum by (pod) (
rate (
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[1m]
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[2m]
)
)
```

For the `target` of the Object metric you **must** use a `type` of `AverageValue`. The value
given in `averageValue` represents the per replica RPS scaling threshold of the `scaleTargetRef`
object (either a Model or a Server in our case), with the target number of replicas being
computed by HPA according to the following formula:
The `target` section establishes the thresholds used in scaling decisions. For RPS, the
`AverageValue` target type refers to the per-replica RPS threshold above which the number of
`scaleTargetRef` (Model or Server) replicas should be increased. The target number of replicas
is computed by HPA according to the following formula:

$$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{thresholdPerReplicaRPS}}$$
$$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{averageValue}}$$

As an example, if `averageValue=50` and `infer_rps=150`, the `targetReplicas` would be 3.

Importantly, computing the target number of replicas does not require knowing the number of
active pods currently associated with the Server or Model. This is what allows both the Model
and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would
attempt to take ownership of the same set of pods, and transition into a failure state.

This is also why the `Value` target type is **not currently supported**. In this case, HPA first
computes a `utilizationRatio`:

$$\texttt{utilizationRatio} = \frac{\texttt{custom\_metric\_value}}{\texttt{threshold\_value}}$$

As an example, if `threshold_value=100` and `custom_metric_value=200`, the `utilizationRatio`
would be 2. HPA deduces from this that the number of active pods associated with the
`scaleTargetRef` object should be doubled, and expects that once that target is achieved, the
`custom_metric_value` will become equal to the `threshold_value` (`utilizationRatio=1`). However,
by using the number of active pods, the HPA CRs for both the Model and the Server also try to
take exclusive ownership of the same set of pods, and fail.

{% hint style="info" %}
**Note**: Attempting other target types does not work under the current Seldon Core 2 setup, because they
use the number of active Pods associated with the Model CR (i.e. the associated Server pods) in
the `targetReplicas` computation. However, this also means that this set of pods becomes "owned"
by the Model HPA. Once a pod is owned by a given HPA it is not available for other HPAs to use,
so we would no longer be able to scale the Server CRs using HPA.
{% endhint %}

### HPA sampling of custom metrics

Expand Down Expand Up @@ -421,7 +507,8 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v

* Filtering metrics by additional labels on the prometheus metric:

The prometheus metric from which the model RPS is computed has the following labels:
The prometheus metric from which the model RPS is computed has the following labels managed
by Seldon Core 2:

```c-like
seldon_model_infer_total{
@@ -440,9 +527,11 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v
}
```

If you want the scaling metric to be computed based on inferences with a particular value
for any of those labels, you can add this in the HPA metric config, as in the example
(targeting `method_type="rest"`):
If you want the scaling metric to be computed based on a subset of the Prometheus time
series with particular label values (labels either managed by Seldon Core 2 or added
automatically within your infrastructure), you can add this as a selector in the HPA metric
config. This is shown in the following example, which scales based only on the RPS of REST
requests, as opposed to REST + gRPC:

```yaml
metrics:
@@ -461,6 +550,7 @@ inspecting the corresponding Server HPA CR, or by fetching the metric directly v
type: AverageValue
averageValue: "3"
```

* Customize scale-up / scale-down rate & properties by using scaling policies as described in
the [HPA scaling policies docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior)

@@ -592,8 +682,7 @@ into account when setting the HPA policies.
within the set `periodSeconds`) is not recommended because of this.
- Perhaps more importantly, there is no reason to scale faster than the time it takes for
replicas to become available - this is the true maximum rate with which scaling up can
happen anyway. Because the underlying Server replica pods are part of a stateful set, they
are created sequentially by k8s.
happen anyway.

{% code title="hpa-custom-policy.yaml" lineNumbers="true" %}
```yaml
