diff --git a/controllers/reconcile.go b/controllers/reconcile.go
index 705f6a1e..fdd8b6d5 100644
--- a/controllers/reconcile.go
+++ b/controllers/reconcile.go
@@ -325,6 +325,14 @@ func (r *FrontendReconciliation) populateCacheBustContainer(j *batchv1.Job) erro
 	// Add the restart policy
 	j.Spec.Template.Spec.RestartPolicy = v1.RestartPolicyNever
 
+	annotations := j.Spec.Template.ObjectMeta.Annotations
+	if annotations == nil {
+		annotations = make(map[string]string)
+	}
+	annotations["kube-linter.io/ignore-all"] = "we don't need no any checking"
+
+	j.Spec.Template.ObjectMeta.SetAnnotations(annotations)
+
 	// Add the akamai edgerc configmap to the deployment
 
 	return nil
diff --git a/docs/antora/modules/ROOT/pages/api_reference.adoc b/docs/antora/modules/ROOT/pages/api_reference.adoc
index 9ede59d6..2324ed75 100644
--- a/docs/antora/modules/ROOT/pages/api_reference.adoc
+++ b/docs/antora/modules/ROOT/pages/api_reference.adoc
@@ -420,6 +420,8 @@ do this in epehemeral environments but not in production +
 | | 
 | *`akamaiCacheBustImage`* __string__ | Set Akamai Cache Bust Image + 
 | | 
 | *`akamaiCacheBustURL`* __string__ | Set Akamai Cache Bust URL that the files will hang off of + 
 | | 
 | *`akamaiSecretName`* __string__ | The name of the secret we will use to get the akamai credentials + 
 | | 
+| *`targetNamespaces`* __string array__ | List of namespaces that should receive a copy of the frontend configuration as a config map + 
+By configurations we mean the fed-modules.json, navigation files, etc. + 
 | | 
 |===
diff --git a/docs/antora/slos/frontend-operator-availability.md b/docs/antora/slos/frontend-operator-availability.md
new file mode 100644
index 00000000..0721a5c3
--- /dev/null
+++ b/docs/antora/slos/frontend-operator-availability.md
@@ -0,0 +1,38 @@
+# Frontend Operator Availability SLO
+
+## Description
+
+The frontend operator availability SLO determines if the operator is functioning normally.
+This SLO tracks the number of available deployments for the frontend operator. There should always be at least
+1 deployment running for the operator.
+
+## SLI Rationale
+Availability is the most important metric we can gather for this operator. If there are no running
+pods, no operations can be conducted. Ensuring that we monitor the availability of the operator is
+critical to running ConsoleDot Frontends.
+
+## Implementation
+
+The SLI for availability is implemented through Kubernetes metrics. We can use the `kube_deployment_status_replicas_available`
+metric and filter on the `frontend-operator-system` namespace to determine if we have a running pod. Since
+the only thing running in that namespace is the operator's controller, we can match our desired pod count to our alerts; a sketch of a possible query appears at the end of this document.
+
+## SLO Rationale
+
+The operator's uptime should be at least 99%. Availability is the basis of OpenShift deployments. We cannot reconcile
+Frontend resources without a running operator, and it is a critical part of our deployment strategy for ConsoleDot.
+
+## Alerting
+
+Alerts for availability are high severity for now, but could become paging alerts in the future. When the operator becomes
+unavailable, it will not delete or remove any resources; instead, no changes can be made to CRs on the cluster.
+While no destructive processes will be invoked, no changes can be made to Frontend resources.
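+
+As a rough sketch, an alerting expression for this SLI could look like the following (the exact labels depend on the kube-state-metrics setup):
+
+```promql
+# Fire when no replicas of the operator are available in its namespace
+sum(kube_deployment_status_replicas_available{namespace="frontend-operator-system"}) < 1
+```
+
+An `absent()` check on the same metric can additionally cover the case where the metric disappears entirely.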
diff --git a/docs/antora/slos/frontend-operator-reconciliation-time.md b/docs/antora/slos/frontend-operator-reconciliation-time.md
new file mode 100644
index 00000000..6382f799
--- /dev/null
+++ b/docs/antora/slos/frontend-operator-reconciliation-time.md
@@ -0,0 +1,25 @@
+# Frontend Operator Reconciliation Time SLO
+
+## Description
+
+This SLO tracks the reconciliation time for the Frontend Operator's `frontend` controller. Reconciliations should stay
+below 4 seconds at least 95% of the time.
+
+## SLI Rationale
+
+High reconciliation times back up the queue of objects waiting to be reconciled. This could indicate an issue with the operator
+and prevent objects from being added or updated in a timely manner.
+
+## Implementation
+
+The Operator SDK exposes the `controller_runtime_reconcile_time_seconds_bucket` histogram metric to show reconciliation times. Aggregating
+it with `sum` and `avg_over_time` allows us to determine whether reconciliation times are staying under 4 seconds.
+
+## SLO Rationale
+Almost all reconciler calls should be handled without issue in a timely manner. If we are hitting reconciliation times greater than
+4 seconds, debugging should begin.
+
+## Alerting
+Alerts should be kept to a medium level. Because a myriad of issues could cause high reconciliation times, breaking
+this SLO should not result in a page. It should be addressed, but higher than normal reconciliation times alone do not indicate
+an outage.
diff --git a/docs/antora/slos/frontend-operator-reconciliation.md b/docs/antora/slos/frontend-operator-reconciliation.md
new file mode 100644
index 00000000..7d0dabd0
--- /dev/null
+++ b/docs/antora/slos/frontend-operator-reconciliation.md
@@ -0,0 +1,36 @@
+# Frontend Operator Reconciliation SLO
+
+## Description
+
+The frontend operator implements metrics to expose the error rate of reconciliations targeting its CRDs. When that
+error rate is too high, we will alert. Reconciliation errors can indicate a wide array of issues including misconfigurations,
+outages, and quota/resource constraints. In general, if the operator cannot reconcile successfully at a nominal rate,
+investigation is needed.
+
+## SLI Rationale
+
+Reconciliation error rates surface many different issues across several environments. We can use this metric to catch
+misconfigurations in production apps and find deploy-time issues with the operator.
+
+## Implementation
+
+The Operator SDK exposes the `controller_runtime_reconcile_total` metric to show the nominal reconciliation rate. Summing
+`increase` over this metric, broken out by the `result` label, allows us to determine whether the success rate has dropped below 100%; a sketch of a possible query appears at the end of this document.
+
+## SLO Rationale
+Almost all reconciler calls should be handled without issue. If we are hitting more than 10% errors on reconcile, debugging
+should begin.
+
+## Alerting
+Alerts should be kept to a medium level. Because a myriad of issues could cause a reconciliation error, breaking
+this SLO should not result in a page. It should be addressed, but error rate alone does not indicate an outage.
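+
+As a rough sketch, the error ratio could be computed as below (the `controller="frontend"` label value is an assumption and may differ in a real deployment):
+
+```promql
+# Share of reconciliations that errored over the last hour
+sum(increase(controller_runtime_reconcile_total{controller="frontend", result="error"}[1h]))
+/
+sum(increase(controller_runtime_reconcile_total{controller="frontend"}[1h]))
+> 0.10
+```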