diff --git a/docs/modules/ROOT/pages/explanations/slos.adoc b/docs/modules/ROOT/pages/explanations/slis.adoc
similarity index 86%
rename from docs/modules/ROOT/pages/explanations/slos.adoc
rename to docs/modules/ROOT/pages/explanations/slis.adoc
index c4a23a30..ccf21f11 100644
--- a/docs/modules/ROOT/pages/explanations/slos.adoc
+++ b/docs/modules/ROOT/pages/explanations/slis.adoc
@@ -1,13 +1,12 @@
-= Service Level Objectives
+= Service Level Indicator (SLI)
+:page-aliases: explanations/slos.adoc
 
-APPUiO Managed OpenShift 4 comes with a collection of https://sre.google/sre-book/service-level-objectives/[service level objectives (SLOs)].
-This document defines and explains these SLOs.
-An APPUiO Managed cluster should meet these objectives to provide the expected service level to our customers.
+APPUiO Managed OpenShift comes with a collection of https://sre.google/sre-book/service-level-objectives/[service level indicators (SLIs)].
+This document defines and explains these SLIs.
+All of the SLIs are in the scope of the https://products.vshn.ch/service_levels.html["Guaranteed Availability" Service Level].
 
-We use the SLOs and https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts[Multiwindow, Mulit-Brun-Rate Alerts] as the basis of our on-call alerting.
+We use the SLIs and https://sre.google/workbook/alerting-on-slos/#6-multiwindow-multi-burn-rate-alerts[Multiwindow, Mulit-Burn-Rate Alerts] as the basis of our on-call alerting.
 
-IMPORTANT: These are internal service level *objectives*, not service level *agreements*.
-We don't guarantee to meet these objectives at all times.
 
 == Ingress
 
@@ -17,7 +16,7 @@ If the workloads running on the cluster aren't accessible, it might as well be d
 === Canary
 
 ****
-*99.75% of all HTTP probes to a canary application succeed*
+*HTTP probes to a canary application*
 ****
 
 Probes are sent every minute from the ingress operator, inside the cluster, to the external address of the canary target.
@@ -71,11 +70,10 @@ If the API isn't available, users can't change configuration or run new workload
 A misbehaving Kubernetes API directly impacts the service level.
 
 
-
 === Request Error Rate
 
 ****
-*99.9% of all requests to the Kubernetes API server succeed or are invalid*
+*Requests to the Kubernetes API server succeed or are invalid*
 ****
 
 This is measured directly at the API server through the following metrics.
@@ -95,7 +93,7 @@ NOTE: We only look for HTTP 5xx errors, which indicate a server side error, and
 === Uptime
 
 ****
-*99.9% of all HTTP probes to the Kubernetes API server succeed*
+*HTTP probes to the Kubernetes API server succeed*
 ****
 
 Probes are sent every 10 seconds from a blackbox exporter inside the cluster to the readiness endpoint of the Kubernetes API server.
@@ -112,7 +110,7 @@ This ability is essential and directly impacts the service level.
 === Canary
 
 ****
-*99.75% of canary pods start successfully*
+*Canary pods start successfully*
 ****
 
 A controller starts a known good canary pod every minute and checks if it successfully started after 3 minutes.
@@ -128,7 +126,7 @@ Any storage issues directly impacts the service level for users.
 === CSI Operations
 
 ****
-*99.5% of all CSI operations complete successfully*
+*CSI operations complete successfully*
 ****
 
 CSI operations are any interactions of the kubelet or controller-manager with the CSI provider.
@@ -159,7 +157,7 @@ Without it, users can't reliably access their workload and even moderate packet
 === Packet Loss
 
 ****
-*99.5% of all ICMP pings between canary pods succeed*
+*ICMP pings between canary pods succeed*
 ****
 
 A network canary daemonset starts a canary pod on every node.
diff --git a/docs/modules/ROOT/partials/nav.adoc b/docs/modules/ROOT/partials/nav.adoc
index 93593d5f..f7c94d93 100644
--- a/docs/modules/ROOT/partials/nav.adoc
+++ b/docs/modules/ROOT/partials/nav.adoc
@@ -165,7 +165,7 @@
 
 * Monitoring
 ** xref:oc4:ROOT:explanations/cluster_monitoring.adoc[]
-** xref:oc4:ROOT:explanations/slos.adoc[]
+** xref:oc4:ROOT:explanations/slis.adoc[]
 ** xref:oc4:ROOT:how-tos/monitoring/global-monitoring.adoc[]
 ** xref:oc4:ROOT:how-tos/monitoring/handle_alerts.adoc[]
 ** xref:oc4:ROOT:how-tos/monitoring/remove_rules.adoc[]