Managed Upgrade Controller: Introduce Hooks (#255)
Co-authored-by: Simon Gerber <[email protected]>
bastjan and simu authored Jun 16, 2023
1 parent 8458ef4 commit 3b8e944
Showing 1 changed file with 101 additions and 72 deletions.
173 changes: 101 additions & 72 deletions docs/modules/ROOT/pages/references/architecture/upgrade_controller.adoc
@@ -33,61 +33,11 @@ It's managed through a Custom Resource Definition (CRD) called `UpgradeConfig`.

image:explanations/upgrade-controller-high-level-flow-chart.svg[]

=== The controller is extendable through webhooks [[upgrade-webhooks]]
=== The controller is extendable through hooks

The controller should be able to send notifications to a webhook.
A notification should be sent for every step of the upgrade process.
Failed notification deliveries must not block the upgrade flow.

We might want to reuse the Alertmanager webhook definition; it already covers TLS, authentication, and other necessary features.

[source,yaml]
----
url: "https://example.com/webhook"
http_config: <1>
authorization:
credentials: "token"
proxy_url: "http://proxy.example.com"
annotations: <2>
cluster_id: "bar"
tenant_id: "foo"
----
<1> https://prometheus.io/docs/alerting/latest/configuration/#http_config[Alertmanager HTTP Config]
<2> Additional annotations to send with the webhook.

The controller should send a POST request to the webhook with a JSON payload.

[source,json]
----
{
  "version": "1", <1>
  "type": "UpgradeSkipped", <2>
  "status": "True", <3>
  "reason": "ClusterUnhealthy", <4>
  "message": "Critical alerts [MultipleDefaultStorageClasses, NodeFilesystemAlmostOutOfFiles] are firing", <5>
  "desiredVersion": { <6>
    "version": "4.6.34",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:1234567890abcdef"
  },
  "annotations": { <7>
    "cluster_id": "bar",
    "tenant_id": "foo"
  }
}
----
<1> The version of the webhook payload.
<2> The type of the notification.
Inspired by https://github.com/kubernetes/apimachinery/blob/8d1258da8f386b809d312cdda316366d5612f54e/pkg/apis/meta/v1/types.go#L1481[`metav1.Condition`].
<3> The status of the notification: `True`, `False`, or `Unknown`.
<4> The programmatic identifier of the notification indicating the reason for the notification.
<5> The human-readable message indicating the reason for the notification.
<6> The desired version of the cluster.
Only present for certain notifications.
<7> Additional annotations from the webhook configuration.
The controller can run arbitrary commands when certain events happen during the upgrade.
The commands are executed as Kubernetes jobs.
Information about the running upgrade is passed to the jobs through environment variables.
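
For a first impression, here is a minimal hook that only logs the event it was triggered by (a hypothetical sketch; the `UpgradeJobHook` CRD and the injected `EVENT_*` variables are specified below):

[source,yaml]
----
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: log-upgrade-events # hypothetical example name
spec:
  on: [start, finish]
  selector:
    matchLabels:
      upgrade-config: cluster-upgrade
  template:
    spec:
      template:
        spec:
          containers:
          - name: log
            image: busybox
            # EVENT_name and EVENT_time are injected by the controller
            command: ["sh", "-c", "echo \"upgrade event $EVENT_name at $EVENT_time\""]
          restartPolicy: Never
----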

=== The controller manages the content of the `ClusterVersion/version` object [[manage-version-object]]

@@ -105,7 +55,6 @@ The controller creates an `UpgradeJob` object at a time configured in the `Upgra
The `UpgradeJob` contains a snapshot of the most recent version in the `.status.availableUpdates` field and a timestamp when the upgrade should start.

The `UpgradeJob` rechecks the available updates at the time of the upgrade.
If the version is no longer available, the upgrade is skipped and a notification is sent to the webhook.

[source,yaml]
----
@@ -150,10 +99,8 @@ Initially supported values are `@odd` and `@even`.
The controller shouldn't try to upgrade a cluster that isn't healthy.

An `UpgradeJob` checks the cluster health before the upgrade and skips the upgrade if the cluster is unhealthy.
If an update is skipped, the controller should send a notification to the webhook.

The controller should also check the cluster health after the upgrade.
If the cluster is unhealthy, the controller should send a notification to the webhook.

Custom queries allow customers or VSHN to easily extend the checks that decide whether an upgrade is skipped.
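
For example, a custom query that blocks the upgrade while an ArgoCD component is down could look like this (a sketch; the field names follow the `UpgradeConfig` example below):

[source,yaml]
----
preUpgradeHealthChecks:
  customQueries:
  # skip the upgrade while any ArgoCD component in namespace syn is down
  - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
----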

@@ -226,7 +173,7 @@ spec:
<1> Template for the `ClusterVersion/version` object.
<2> The `desiredUpdate` is ignored and set by the `UpgradeJob` controller.

=== UpgradeConfig
=== UpgradeConfig [[upgrade-config]]

The `UpgradeConfig` CRD defines the upgrade schedule and the upgrade job template.
The reconciliation loop of the controller creates `UpgradeJob` objects based on the `UpgradeConfig` object.
@@ -246,6 +193,9 @@ spec:
  pinVersionWindow: "4h" <2>
  maxUpgradeStartDelay: "1h" <3>
  jobTemplate:
    metadata:
      labels:
        upgrade-config: cluster-upgrade <7>
    spec:
      config:
        upgradeTimeout: "2h" <4>
@@ -273,16 +223,6 @@ spec:
          - openshift-monitoring
          customQueries:
          - query: 'up{job=~"^argocd-.+$",namespace="syn"} != 1'
        webhooks: <7>
        - url: "https://example.com/webhook"
          annotations:
            cluster_id: "bar"
            tenant_id: "foo"
----
<1> The upgrade schedule as defined in <<upgrade-schedule>>.
<2> The time window before the maintenance window in which the upgrade version is pinned.
@@ -293,9 +233,8 @@ Influences the `UpgradeJob`'s `.status.upgradeBefore` field.
The upgrade is marked as failed if it takes longer than this.
<5> The health checks to perform before the upgrade as defined in <<upgrade-health-checks>>.
<6> The health checks to perform after the upgrade as defined in <<upgrade-health-checks>>.
<7> The webhook to send notifications to as defined in <<upgrade-webhooks>>.
Having multiple webhooks allows sending notifications to different systems.
Both the `UpgradeConfig` and the `UpgradeJob` have a `webhooks` field since both might send notifications.
<7> Sets a label on the `UpgradeJob`.
This allows the `UpgradeJobHook` manifest to select the created jobs, as sketched below.
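
For reference, the corresponding fragment on the hook side (a sketch; the full CRD is shown in the `UpgradeJobHook` section below):

[source,yaml]
----
# UpgradeJobHook manifest fragment selecting the jobs labeled above
selector:
  matchLabels:
    upgrade-config: cluster-upgrade
----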

=== UpgradeJob

@@ -318,7 +257,6 @@ spec:
upgradeTimeout: "2h"
preUpgradeHealthChecks: {} ...
postUpgradeHealthChecks: {} ...
webhooks: []
----
<1> The name of the `UpgradeJob` is the timestamp when the upgrade should start plus a hash of the `UpgradeConfig` object.
The timestamp is primarily used for sorting the `UpgradeJob` objects should multiple exist.
@@ -328,6 +266,97 @@ If the upgrade doesn't start within this time window, for example when the contr
<4> The version to upgrade to.
<5> The config as defined in <<upgrade-config>> and copied from the `UpgradeConfig` object.

=== UpgradeJobHook

The `UpgradeJobHook` CRD allows running arbitrary jobs before and after the upgrade.
A hook can run either once, for the next upgrade only, or for every upgrade.

Data about the upgrade is passed to the hook in environment variables.

[source,yaml]
----
apiVersion: managedupgrade.appuio.io/v1beta1
kind: UpgradeJobHook
metadata:
  name: cluster-upgrade-notify-ext
spec:
  on: <1>
  - create
  - start
  - finish
  - success
  - failure
  run: next # [next, all] <2>
  failurePolicy: Ignore # [Abort, Ignore] <3>
  selector: <4>
    matchLabels:
      upgrade-config: cluster-upgrade
  template: <5>
    spec:
      template:
        spec:
          containers:
          - name: notify
            image: curlimages/curl:8.1.2 # sponsored OSS image
            args:
            - -XPOST
            - -H
            - "Content-Type: application/json"
            - -d
            - '{"event": $(EVENT_name), "version": $(JOB_spec_desiredVersion_image)}' <6>
            - https://example.com/webhook
          restartPolicy: Never
      backoffLimit: 3
      ttlSecondsAfterFinished: 43200 # 12h <7>
      activeDeadlineSeconds: 300 # 5m <8>
----
<1> The events on which to run the hook.
`create` runs the hook when the `UpgradeJob` is created.
The version is pinned at this point and the job is waiting for `startAfter`.
This can be used to communicate the pending upgrade to other systems.
See `pinVersionWindow` in <<upgrade-config>>.
`start` runs the hook when the `UpgradeJob` starts.
`finish` runs the hook when the `UpgradeJob` finishes, regardless of the outcome.
`success` runs the hook when the `UpgradeJob` finishes successfully.
`failure` runs the hook when the `UpgradeJob` finishes with an error.
<2> Whether to run the hook for the next upgrade or for every upgrade.
<3> What to do when the hook fails.
`Ignore` is the default and continues the upgrade process.
`Abort` marks the upgrade as failed and stops the upgrade process.
+
[NOTE]
====
More advanced failure policies can be handled through the built-in https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures[Job failure handling mechanisms].
====
<4> The selector matching the `UpgradeJob` objects for which the hook runs.
<5> The https://pkg.go.dev/k8s.io/api/batch/v1#JobTemplateSpec[batchv1.JobTemplateSpec] to run.
<6> The controller injects the following environment variables:
* `EVENT`: The event that triggered the hook as JSON.
+
[NOTE]
====
The event definition isn't complete yet and will be extended in the future.
The `name`, `time`, `reason`, and `message` fields are guaranteed to be present.
====
* `EVENT_*`: The event definition is flattened into environment variables.
The values are JSON encoded; `"string"` is encoded as `"\"string\""`, `null` is encoded as `null`.
The keys are the field paths separated by `_`.
For example:
** `EVENT_name`: The name of the event that triggered the hook.
** `EVENT_reason`: The reason why the event was triggered.
* `JOB`: The full `UpgradeJob` object as JSON.
* `JOB_*`: The job definition is flattened into environment variables.
The values are JSON encoded; `"string"` is encoded as `"\"string\""`, `null` is encoded as `null`.
The keys are the field paths separated by `_`.
For example (a concrete sketch follows this list):
** `JOB_metadata_name`: The name of the `UpgradeJob` that triggered the hook.
** `JOB_metadata_labels_my_var_io_info`: The label `my-var.io/info` of the `UpgradeJob` that triggered the hook.
** `JOB_spec_desiredVersion_image`: The image of the `UpgradeJob` that triggered the hook.
<7> Jobs aren't deleted automatically.
Use `ttlSecondsAfterFinished` to delete the job after a certain time.
<8> There is no automatic timeout for jobs.
Use `activeDeadlineSeconds` to set a timeout.
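
To make the flattening rule concrete, here is a hypothetical `UpgradeJob` fragment together with the environment variables a hook container would see (names and values are illustrative):

[source,yaml]
----
# Input: UpgradeJob fragment (hypothetical values)
metadata:
  name: cluster-upgrade-1686902400-abcdef
spec:
  desiredVersion:
    version: "4.6.34"
# Output: flattened, JSON-encoded environment variables
#   JOB_metadata_name: "\"cluster-upgrade-1686902400-abcdef\""
#   JOB_spec_desiredVersion_version: "\"4.6.34\""
----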

== Resources

- https://access.redhat.com/labs/ocpupgradegraph/update_channel[RedHat OCP Upgrade Graph]
