Add more annotation, maybe cachebust will start behaving #200

Merged: 2 commits, Oct 24, 2024
8 changes: 8 additions & 0 deletions controllers/reconcile.go
@@ -325,6 +325,14 @@ func (r *FrontendReconciliation) populateCacheBustContainer(j *batchv1.Job) error
// Add the restart policy
j.Spec.Template.Spec.RestartPolicy = v1.RestartPolicyNever

annotations := j.Spec.Template.ObjectMeta.Annotations
if annotations == nil {
annotations = make(map[string]string)
}
annotations["kube-linter.io/ignore-all"] = "we don't need no any checking"

j.Spec.Template.ObjectMeta.SetAnnotations(annotations)

// Add the akamai edgerc configmap to the deployment

return nil
2 changes: 2 additions & 0 deletions docs/antora/modules/ROOT/pages/api_reference.adoc
@@ -420,6 +420,8 @@ do this in ephemeral environments but not in production + | |
| *`akamaiCacheBustImage`* __string__ | Set Akamai Cache Bust Image + | |
| *`akamaiCacheBustURL`* __string__ | Set Akamai Cache Bust URL that the files will hang off of + | |
| *`akamaiSecretName`* __string__ | The name of the secret we will use to get the Akamai credentials + | |
| *`targetNamespaces`* __string array__ | List of namespaces that should receive a copy of the frontend configuration as a config map +
By configurations we mean the fed-modules.json, navigation files, etc. + | |
|===


29 changes: 29 additions & 0 deletions docs/antora/slos/frontend-operator-availability.md
@@ -0,0 +1,29 @@
# Frontend Operator Availability SLO

## Description

The frontend operator availability SLO determines whether the operator is functioning normally.
This SLO tracks the number of available replicas for the frontend operator's deployment. There should
always be at least one replica of the operator running.

## SLI Rationale
Availability is the most important metric we can gather for this operator. If there are no running
pods, no operations can be conducted. Ensuring that we monitor the availability of the operator is
critical to running ConsoleDot Frontends.

## Implementation

The SLI for availability is derived from Kubernetes metrics. We can use the `kube_deployment_status_replicas_available`
metric, filtered on the `frontend-operator-system` namespace, to determine whether we have a running pod. Since
the only thing running in that namespace is the operator's controller, we can match our desired pod count directly in our alerts.
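
As a minimal sketch, the alert expression could look like the following, assuming kube-state-metrics is scraped by the same Prometheus instance (the exact alert window is left to the alert rule):

```promql
# Fires when no operator replica is available in the namespace;
# the operator Deployment is the only workload running there.
kube_deployment_status_replicas_available{namespace="frontend-operator-system"} < 1
```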

## SLO Rationale

The operator's uptime should be at least 99%. Availability is the foundation our OpenShift deployments rest on. We cannot reconcile
Frontend resources without a running operator, and the operator is a critical part of our deployment strategy for ConsoleDot.

## Alerting

Alerts for availability are set to high severity for now, but could become paging alerts in the future. When the operator becomes
unavailable, it will not delete any resources; instead, CRs on the cluster simply stop being reconciled.
While no destructive processes will be invoked, no changes can be made to Frontend resources until the operator recovers.
25 changes: 25 additions & 0 deletions docs/antora/slos/frontend-operator-reconciliation-time.md
@@ -0,0 +1,25 @@
# Frontend Operator Reconciliation Time SLO

## Description

This SLO tracks the reconciliation time for the Frontend Operator's `frontend` controller. Reconciliations should stay
below 4 seconds at least 95% of the time.

## SLI Rationale

High reconciliation times back up the queue of objects waiting to be reconciled. This could indicate an issue with the operator
and prevent objects from being added or updated in a timely manner.

## Implementation

The Operator SDK exposes the `controller_runtime_reconcile_time_seconds_bucket` histogram metric to show reconciliation times.
Aggregating the buckets over a time window lets us determine whether reconciliation times are staying under 4 seconds.
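
A minimal sketch of such a query, assuming the metric is scraped by Prometheus and the controller-runtime `controller` label value is `frontend` (the 5m window is an assumption):

```promql
# 95th percentile reconciliation time over the last 5 minutes;
# the SLO holds while this stays below 4 seconds.
histogram_quantile(
  0.95,
  sum by (le) (rate(controller_runtime_reconcile_time_seconds_bucket{controller="frontend"}[5m]))
) < 4
```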

## SLO Rationale
Almost all reconciler calls should be handled without issue in a timely manner. If we are hitting reconciliation times greater than
4 seconds, debugging should begin.

## Alerting
Alerts should be kept to a medium level. Because a myriad of issues could cause high reconciliation times, breaking
this SLO should not result in a page. It should be addressed, but higher-than-normal reconciliation times alone do not indicate
an outage.
26 changes: 26 additions & 0 deletions docs/antora/slos/frontend-operator-reconciliation.md
@@ -0,0 +1,26 @@
# Frontend Operator Reconciliation SLO

## Description

The frontend operator implements metrics to expose the error rate of reconciliations targeting its CRDs. When that
error rate is too high, we will alert. Reconciliation errors can indicate a wide array of issues, including misconfigurations,
outages, and quota/resource constraints. In general, if the operator cannot reconcile successfully at a nominal rate,
investigation is needed.

## SLI Rationale

Reconciliation error rates surface many different kinds of issues across our environments. We can use this metric to catch
misconfigurations in production apps and to find deploy-time issues with the operator.

## Implementation

The Operator SDK exposes the `controller_runtime_reconcile_total` metric, broken down by result, to show the reconciliation rate.
Combining `sum` with `increase` lets us determine what fraction of reconciliations are failing.
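
A minimal sketch of the error-ratio query, assuming the standard controller-runtime `result` label (the 1h window is an assumption):

```promql
# Fraction of reconciliations that errored over the last hour;
# exceeding 10% violates the SLO.
sum(increase(controller_runtime_reconcile_total{result="error"}[1h]))
  /
sum(increase(controller_runtime_reconcile_total[1h]))
  > 0.10
```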

## SLO Rationale
Almost all reconciler calls should be handled without issue. If we are hitting more than 10% errors on reconcile, debugging
should begin.

## Alerting
Alerts should be kept to a medium level. Because a myriad of issues could cause a reconciliation error, breaking
this SLO should not result in a page. It should be addressed, but the error rate alone does not indicate an outage.