Managed Upgrade Controller: Introduce Hooks (#255)
Co-authored-by: Simon Gerber <[email protected]>
bastjan and simu authored Jun 16, 2023
1 parent 8458ef4 commit 3b8e944
Showing 1 changed file with 101 additions and 72 deletions.
173 changes: 101 additions & 72 deletions docs/modules/ROOT/pages/references/architecture/upgrade_controller.adoc
@@ -33,61 +33,11 @@ It's managed through a Custom Resource Definition (CRD) called `UpgradeConfig`.

image:explanations/upgrade-controller-high-level-flow-chart.svg[]

=== The controller is extendable through webhooks [[upgrade-webhooks]]
=== The controller is extendable through hooks

The controller should be able to send notifications to a webhook.
A notification should be sent for every step of the upgrade process.
Failed notification deliveries must not block the upgrade flow.

We might want to reuse the Alertmanager webhook definition; it already covers TLS, authentication, and other necessary features.

[source,yaml]
----
url: "https://example.com/webhook"
http_config: <1>
authorization:
credentials: "token"
proxy_url: "http://proxy.example.com"
annotations: <2>
cluster_id: "bar"
tenant_id: "foo"
----
<1> https://prometheus.io/docs/alerting/latest/configuration/#http_config[Alertmanager HTTP Config]
<2> Additional annotations to send with the webhook.

The controller should send a POST request to the webhook with a JSON payload.

[source,json]
----
{
  "version": "1", <1>
  "type": "UpgradeSkipped", <2>
  "status": "True", <3>
  "reason": "ClusterUnhealthy", <4>
  "message": "Critical alerts [MultipleDefaultStorageClasses, NodeFilesystemAlmostOutOfFiles] are firing", <5>
  "desiredVersion": { <6>
    "version": "4.6.34",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  },
  "annotations": { <7>
    "cluster_id": "bar",
    "tenant_id": "foo"
  }
}
----
<1> The version of the webhook payload.
<2> The type of the notification.
Inspired by https://github.com/kubernetes/apimachinery/blob/8d1258da8f386b809d312cdda316366d5612f54e/pkg/apis/meta/v1/types.go#L1481[`metav1.Condition`].
<3> The status of the notification: `True`, `False`, or `Unknown`.
<4> The programmatic identifier of the notification indicating the reason for the notification.
<5> The human-readable message indicating the reason for the notification.
<6> The desired version of the cluster.
Only present for certain notifications.
<7> Additional annotations from the webhook configuration.
The controller can run arbitrary commands when certain events happen during the upgrade.
The commands are executed as Kubernetes jobs.
Information about the running upgrade is passed to the jobs through environment variables.
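
For a first impression, here is a minimal hook that only logs the event it was triggered by (a hypothetical sketch; the `UpgradeJobHook` CRD and the injected `EVENT_*` variables are specified below):

[source,yaml]
----
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: log-upgrade-events # hypothetical example name
spec:
  on: [start, finish]
  selector:
    matchLabels:
      upgrade-config: cluster-upgrade
  template:
    spec:
      template:
        spec:
          containers:
          - name: log
            image: busybox
            # EVENT_name and EVENT_time are injected by the controller
            command: ["sh", "-c", "echo \"upgrade event $EVENT_name at $EVENT_time\""]
          restartPolicy: Never
----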

=== The controller manages the content of the `ClusterVersion/version` object [[manage-version-object]]

@@ -105,7 +55,6 @@ The controller creates an `UpgradeJob` object at a time configured in the `Upgra
The `UpgradeJob` contains a snapshot of the most recent version in the `.status.availableUpdates` field and a timestamp when the upgrade should start.

The `UpgradeJob` rechecks the available updates at the time of the upgrade.
If the version is no longer available, the upgrade is skipped and a notification is sent to the webhook.

[source,yaml]
----
@@ -150,10 +99,8 @@ Initially supported values are `@odd` and `@even`.
The controller shouldn't try to upgrade a cluster that isn't healthy.

An `UpgradeJob` checks the cluster health before the upgrade and skips the upgrade if the cluster is unhealthy.
If an update is skipped, the controller should send a notification to the webhook.

The controller should also check the cluster health after the upgrade.
If the cluster is unhealthy, the controller should send a notification to the webhook.

Custom queries allow customers or VSHN to easily extend the checks that decide whether an upgrade is skipped.
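
For example, a custom query that blocks the upgrade while an ArgoCD component is down could look like this (a sketch; the field names follow the `UpgradeConfig` example below):

[source,yaml]
----
preUpgradeHealthChecks:
  customQueries:
  # skip the upgrade while any ArgoCD component in namespace syn is down
  - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
----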

@@ -226,7 +173,7 @@ spec:
<1> Template for the `ClusterVersion/version` object.
<2> The `desiredUpdate` is ignored and set by the `UpgradeJob` controller.

=== UpgradeConfig
=== UpgradeConfig [[upgrade-config]]

The `UpgradeConfig` CRD defines the upgrade schedule and the upgrade job template.
The reconciliation loop of the controller creates `UpgradeJob` objects based on the `UpgradeConfig` object.
@@ -246,6 +193,9 @@ spec:
  pinVersionWindow: "4h" <2>
  maxUpgradeStartDelay: "1h" <3>
  jobTemplate:
    metadata:
      labels:
        upgrade-config: cluster-upgrade <7>
    spec:
      config:
        upgradeTimeout: "2h" <4>
@@ -273,16 +223,6 @@ spec:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        webhooks: <7>
        - url: "https://example.com/webhook"
          annotations:
            cluster_id: "bar"
            tenant_id: "foo"
----
<1> The upgrade schedule as defined in <<upgrade-schedule>>.
<2> The time window before the maintenance window in which the upgrade version is pinned.
@@ -293,9 +233,8 @@ Influences the `UpgradeJob`'s `.status.upgradeBefore` field.
The upgrade is marked as failed if it takes longer than this.
<5> The health checks to perform before the upgrade as defined in <<upgrade-health-checks>>.
<6> The health checks to perform after the upgrade as defined in <<upgrade-health-checks>>.
<7> The webhook to send notifications to as defined in <<upgrade-webhooks>>.
Having multiple webhooks allows sending notifications to different systems.
Both the `UpgradeConfig` and the `UpgradeJob` have a `webhooks` field since both might send notifications.
<7> Sets a label on the `UpgradeJob`.
This allows the `UpgradeJobHook` manifest to select the created jobs, as sketched below.
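
For reference, the corresponding fragment on the hook side (a sketch; the full CRD is shown in the `UpgradeJobHook` section below):

[source,yaml]
----
# UpgradeJobHook manifest fragment selecting the jobs labeled above
selector:
  matchLabels:
    upgrade-config: cluster-upgrade
----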

=== UpgradeJob

@@ -318,7 +257,6 @@ spec:
upgradeTimeout: "2h"
preUpgradeHealthChecks: {} ...
postUpgradeHealthChecks: {} ...
webhooks: []
----
<1> The name of the `UpgradeJob` is the timestamp when the upgrade should start plus a hash of the `UpgradeConfig` object.
The timestamp is primarily used for sorting the `UpgradeJob` objects should multiple exist.
@@ -328,6 +266,97 @@ If the upgrade doesn't start within this time window, for example when the contr
<4> The version to upgrade to.
<5> The config as defined in <<upgrade-config>> and copied from the `UpgradeConfig` object.

=== UpgradeJobHook

The `UpgradeJobHook` CRD allows running arbitrary jobs before and after the upgrade.
A hook can run either once, for the next upgrade only, or for every upgrade.

Data about the upgrade is passed to the hook in environment variables.

[source,yaml]
----
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: cluster-upgrade-notify-ext
spec:
  on: <1>
  - create
  - start
  - finish
  - success
  - failure
  run: next # [next, all] <2>
  failurePolicy: Ignore # [Abort, Ignore] <3>
  selector: <4>
    matchLabels:
      upgrade-config: cluster-upgrade
  template: <5>
    spec:
      template:
        spec:
          containers:
          - name: notify
            image: curlimages/curl:8.1.2 # sponsored OSS image
            args:
            - -XPOST
            - -H
            - "Content-Type: application/json"
            - -d
            - '{"event": $(EVENT_name), "version": $(JOB_spec_desiredVersion_image)}' <6>
            - https://example.com/webhook
          restartPolicy: Never
      backoffLimit: 3
      ttlSecondsAfterFinished: 43200 # 12h <7>
      activeDeadlineSeconds: 300 # 5m <8>
----
<1> The events on which to run the hook.
`create` runs the hook when the `UpgradeJob` is created.
The version is pinned at this point and the job is waiting for `startAfter`.
This can be used to communicate the pending upgrade to other systems.
See `pinVersionWindow` in <<upgrade-config>>.
`start` runs the hook when the `UpgradeJob` starts.
`finish` runs the hook when the `UpgradeJob` finishes, regardless of the outcome.
`success` runs the hook when the `UpgradeJob` finishes successfully.
`failure` runs the hook when the `UpgradeJob` finishes with an error.
<2> Whether to run the hook for the next upgrade or for every upgrade.
<3> What to do when the hook fails.
`Ignore` is the default and continues the upgrade process.
`Abort` marks the upgrade as failed and stops the upgrade process.
+
[NOTE]
====
More advanced failure policies can be handled through the built-in https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures[Job failure handling mechanisms].
====
<4> The selector matching the `UpgradeJob` objects for which the hook runs.
<5> The https://pkg.go.dev/k8s.io/api/batch/v1#JobTemplateSpec[batchv1.JobTemplateSpec] to run.
<6> The controller injects the following environment variables:
* `EVENT`: The event that triggered the hook as JSON.
+
[NOTE]
====
The event definition isn't complete yet and will be extended in the future.
The `name`, `time`, `reason`, and `message` fields are guaranteed to be present.
====
* `EVENT_*`: The event definition is flattened into environment variables.
The values are JSON encoded; `"string"` is encoded as `"\"string\""`, `null` is encoded as `null`.
The keys are the field paths separated by `_`.
For example:
** `EVENT_name`: The name of the event that triggered the hook.
** `EVENT_reason`: The reason why the event was triggered.
* `JOB`: The full `UpgradeJob` object as JSON.
* `JOB_*`: The job definition is flattened into environment variables.
The values are JSON encoded; `"string"` is encoded as `"\"string\""`, `null` is encoded as `null`.
The keys are the field paths separated by `_`.
For example (a concrete sketch follows this list):
** `JOB_metadata_name`: The name of the `UpgradeJob` that triggered the hook.
** `JOB_metadata_labels_my_var_io_info`: The label `my-var.io/info` of the `UpgradeJob` that triggered the hook.
** `JOB_spec_desiredVersion_image`: The image of the `UpgradeJob` that triggered the hook.
<7> Jobs aren't deleted automatically.
Use `ttlSecondsAfterFinished` to delete the job after a certain time.
<8> There is no automatic timeout for jobs.
Use `activeDeadlineSeconds` to set a timeout.
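
To make the flattening rule concrete, here is a hypothetical `UpgradeJob` fragment together with the environment variables a hook container would see (names and values are illustrative):

[source,yaml]
----
# Input: UpgradeJob fragment (hypothetical values)
metadata:
  name: cluster-upgrade-1686902400-abcdef
spec:
  desiredVersion:
    version: "4.6.34"
# Output: flattened, JSON-encoded environment variables
#   JOB_metadata_name: "\"cluster-upgrade-1686902400-abcdef\""
#   JOB_spec_desiredVersion_version: "\"4.6.34\""
----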

== Resources

- https://access.redhat.com/labs/ocpupgradegraph/update_channel[RedHat OCP Upgrade Graph]
