feat(docs): improve HPA documentation #6091

Merged · 8 commits · Dec 6, 2024
109 changes: 86 additions & 23 deletions docs-gb/kubernetes/hpa-rps-autoscaling.md
and servers (single-model serving). This will require:
* Configuring HPA manifests to scale Models and the corresponding Server replicas based on the
custom metrics

{% hint style="warning" %}
The Core 2 HPA-based autoscaling has the following constraints/limitations:

- HPA scaling only targets single-model serving, where there is a 1:1 correspondence between models and servers. Autoscaling for multi-model serving (MMS) is supported for specific models and workloads via the Core 2 native features described [here](autoscaling.md).
- Significant improvements to MMS autoscaling are planned for future releases.

- **Only custom metrics** from Prometheus are supported; native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: in order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods that is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
- We are working on improvements in Core 2 to allow both servers and models to be scaled based on a single HPA manifest, targeting the Model CR.

- Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from `prometheus-adapter`, it will need to be removed before you can scale Core 2 models and servers via HPA.
- The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
{% endhint %}

## Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom metrics.
If you are running Prometheus on a different port than the default 9090, you can also pass `--set
prometheus.port=[custom_port]`. You may inspect all the options available as helm values by
running `helm show values prometheus-community/prometheus-adapter`.
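
For reference, the install could look like the following sketch; the release name (`hpa-metrics`), namespace, and Prometheus URL are assumptions here and should be adapted to your setup:

```bash
# Illustrative values: adjust the release name, namespace, and Prometheus URL/port to your install
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install hpa-metrics prometheus-community/prometheus-adapter \
  --namespace seldon-monitoring \
  --set prometheus.url='http://seldon-monitoring-prometheus' \
  --set prometheus.port=9090
```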

{% hint style="warning" %}
Please check that the `metricsRelistInterval` helm value (which defaults to 1m) works well in your
setup, and update it otherwise (a sketch of how to override it is shown after this hint). This
value needs to be larger than or equal to your Prometheus scrape interval. The corresponding
prometheus-adapter command-line argument is `--metrics-relist-interval`. If the relist interval is
set incorrectly, some of the custom metrics will be intermittently reported as missing.
{% endhint %}
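
If the value does need changing, a minimal sketch (reusing the assumed release name and namespace from the install example above) would be:

```bash
# Example: align the relist interval with a 2-minute Prometheus scrape interval
helm upgrade hpa-metrics prometheus-community/prometheus-adapter \
  --namespace seldon-monitoring \
  --reuse-values \
  --set metricsRelistInterval=2m
```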

We now need to configure the adapter to look for the correct Prometheus metrics and compute
per-model RPS values. On install, the adapter has created a `ConfigMap` in the same namespace as
itself, named `[helm_release_name]-prometheus-adapter`. Update the rules in this ConfigMap so that
the relevant configuration looks as follows:
```yaml
"rules":
  -
    "seriesQuery": |
      {__name__="seldon_model_infer_total",namespace!=""}
    "resources":
      "overrides":
        "model": {group: "mlops.seldon.io", resource: "model"}
        "server": {group: "mlops.seldon.io", resource: "server"}
        "pod": {resource: "pod"}
        "namespace": {resource: "namespace"}
    "name":
      "matches": "seldon_model_infer_total"
      "as": "infer_rps"
    "metricsQuery": |
      sum by (<<.GroupBy>>) (
        rate (
          <<.Series>>{<<.LabelMatchers>>}[1m]
        )
      )
```

In this example, a single rule is defined to fetch the `seldon_model_infer_total` metric
from Prometheus, compute its rate over a 1 minute window, and expose this to k8s as the `infer_rps`
metric, with aggregations available at model, server, inference server pod and namespace level.

When HPA requests the `infer_rps` metric via the custom metrics API for a specific model,
prometheus-adapter issues a Prometheus query in line with what is defined in its config.

For the configuration in our example, the query for a model named `irisa0` in namespace
`seldon-mesh` would be:

```
sum by (model) (
rate (
seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[1m]
)
)
```

Before configuring prometheus-adapter via the ConfigMap, it is important to sanity-check the query by
executing it against your Prometheus instance. To do so, pick an existing Model CR in your Seldon
Core 2 install and send some inference requests to it. Then, wait for a period equal to the
Prometheus scrape interval (1 minute by default) so that the metric values are updated. Finally,
modify the model name and namespace in the query above to match the model you picked, and
execute the query.
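
One convenient way of executing the query is through the Prometheus HTTP API. In the sketch below, the Prometheus URL is a placeholder and should point to your own instance (for example, via a port-forward):

```bash
# Placeholder URL; replace with the address of your Prometheus instance
PROM_URL="http://localhost:9090"

# Run the per-model RPS query for model "irisa0" in namespace "seldon-mesh"
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=sum by (model) (rate(seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[1m]))'
```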

If the query returns a non-empty result, you may proceed with the next steps, or customize the
query according to your needs and re-test it. If the query result is empty, adjust it until it
returns the expected metric values, and then update the `metricsQuery` in the prometheus-adapter
ConfigMap to match.

A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and
Pipelines is available [here](../metrics/operational.md), and those may be used when customizing
the configuration.

### Customizing prometheus-adapter rule definitions

The rule definition can be broken down into four parts:

* _Discovery_ (the `seriesQuery` and `seriesFilters` keys) controls what Prometheus
metrics are considered for exposure via the k8s custom metrics API.

As an alternative to the example above, all the Seldon Prometheus metrics of the form `seldon_model.*_total`
could be considered, followed by excluding metrics pre-aggregated across all models (`.*_aggregate_.*`) as well as
the cumulative inference time per model (`.*_seconds_total`):

```yaml
"seriesQuery": |
  {__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
  - "isNot": "^seldon_.*_seconds_total"
  - "isNot": "^seldon_.*_aggregate_.*"
...
```

For RPS, we are only interested in the model inference count (`seldon_model_infer_total`).

* _Association_ (the `resources` key) controls the Kubernetes resources that a particular
metric can be attached to or aggregated over.
* _Naming_ (the `name` key) configures the naming of the resulting k8s custom metric. In the example, the rule matches
`seldon_model_infer_total` and exposes custom metric endpoints named `infer_rps`, which when
called return the result of a query over the Prometheus metric.

Instead of a literal match, one could also use regex group capture expressions,
which can then be referenced in the custom metric name:

```yaml
"name":
  "matches": "^seldon_model_(.*)_total"
  "as": "${1}_rps"
```

* _Querying_ (the `metricsQuery` key) defines how a request for a specific k8s custom metric gets
converted into a Prometheus query.
For a complete reference for how `prometheus-adapter` can be configured via the `ConfigMap`,
consult the docs [here](https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/docs/config.md).



Once you have applied any necessary customizations, replace the default prometheus-adapter config
with the new one, and restart the deployment (this restart is required so that prometheus-adapter
picks up the new config):
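
A sketch of these steps, assuming the adapter was installed as a helm release named `hpa-metrics` in the `seldon-monitoring` namespace and the updated ConfigMap was saved to a local file (both names are illustrative):

```bash
# Apply the updated ConfigMap (file name is illustrative)
kubectl apply -f prometheus-adapter-config.yaml -n seldon-monitoring

# Restart prometheus-adapter so that it picks up the new configuration
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring

# Once the adapter is back up, check that the custom metric is exposed for a model
# (model name and namespace taken from the earlier example)
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps"
```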

* Filtering metrics by additional labels on the prometheus metric:

The prometheus metric from which the model RPS is computed has the following labels managed
by Seldon Core 2:

```c-like
seldon_model_infer_total{
    // ... (additional labels omitted here; they include model, server, namespace, pod and method_type)
}
```

If you want the scaling metric to be computed based on a subset of the Prometheus time
series with particular label values (labels either managed by Seldon Core 2 or added
automatically within your infrastructure), you can add this as a selector in the HPA metric
config. This is shown in the following example, which scales based only on the RPS of REST
requests, as opposed to REST + gRPC:

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
        selector:
          matchLabels:
            method_type: rest
      # describedObject (the Model CR being targeted) and other fields as in the base HPA manifest
      target:
        type: AverageValue
        averageValue: "3"
```

* Customize scale-up / scale-down rate & properties by using scaling policies as described in
the [HPA scaling policies docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior)

within the set `periodSeconds`) is not recommended because of this.
- Perhaps more importantly, there is no reason to scale faster than the time it takes for
replicas to become available - this is the true maximum rate with which scaling up can
happen anyway.

The accompanying `hpa-custom-policy.yaml` example shows such a custom scaling policy configuration.