diff --git a/enhancements/sig-policy/99-policy-placement-strategy/README.md b/enhancements/sig-policy/99-policy-placement-strategy/README.md new file mode 100644 index 00000000..a0e2ad66 --- /dev/null +++ b/enhancements/sig-policy/99-policy-placement-strategy/README.md @@ -0,0 +1,266 @@ +# Policy placement strategy + +## Release Signoff Checklist + +- [ ] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in + [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/) + +## Summary + +Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement` +API, policies can leverage this new logic to have a configurable and systematic way to roll out policy updates to +clusters. + +## Motivation + +Currently policies and subsequent updates are pushed out to clusters en masse based on the placement to which it has +been bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that controllers can +leverage to have segmented, configurable rollouts of resources. This will aid use cases where high availability may be a +priority or a set of test clusters should receive and verify the updates before the remaining clusters get updated. + +### Goals + +- Make `Placement` the primary API for placement (currently the governance propagator is somewhat + `PlacementRule`-centric) +- Leverage the `Placement` helper library to retrieve placement decisions +- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not) +- (stretch) Implement the `RolloutStrategy` struct for policies, including: + - `RolloutStrategy` + - "All": all clusters at once + - "Progressive": one cluster at a time + - "ProgressivePerGroup": one group of clusters at a time + - `Timeout`: Maximum amount of time to wait for success before continuing the rollout + - `MandatoryDecisionGroups`: groups that should be handled before other groups + - `MaxConcurrency`: Concurrency during a progressive rollout +- (stretch) Add an aggregated rollout status for the root policy status. + +### Non-Goals + +Any specialized behaviors outside of those provided by the `Placement` library and, by extension, the `DecisionStrategy` +enhancement, that might require additional code other than that already provided/required by the `Placement` library. + +## Proposal + +### User Stories + +#### Story 1 / Phase 1 + +As a system administrator, I want to know the status of a rollout for a policy and what clusters have been updated. + +- **Summary** + + - Add an annotation to the replicated Policy indicating whether a policy is up-to-date. + +- **Details** + + - Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-generation` + annotation to the replicated policy for each cluster. + - Update the annotation with the policy's generation at the time of the update when a policy has been replicated. + - (Optional--requirement for phase 3) Update the `governance-policy-propagator` controller to add a + `open-cluster-management.io/root-policy-timestamp` annotation to the replicated policy for each cluster indicating + when the replicated policy was last applied. + +#### Story 2 / Phase 2 + +As a system administrator, I want to use a placement `DecisionStrategy` with policies. + +- **Summary** + + - Use the `Placement` library to gather the list of clusters and iterate over them as it is today. + +- **Details** + + - Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct + instead of the `PlacementRule` struct. + - Using the `Placement` library and implementing a basic `ClusterRolloutStatusFunc`, iterate over all clusters as + usual using the `All` rollout strategy. The `RolloutStatus` will initially be either `ToApply` or `Succeed`. + - (Optional cleanup) Remove the current `Placement.Decisions` from the `status` of the `Policy` CRD. It's uncertain + why `Policy` would store something that's discoverable elsewhere, and it's not certain it's in use currently. + - (Optional) Remove the `ClusterNamespace` from the `status` of the `Policy` CRD (`Placement` doesn't have this, so it + ought to be safe to remove it at this point if it's feasible and not overly complex to do so). + +- **Snippet (untested)** + + ```golang + // common.go + type placementDecisionGetter struct { + c client.Client + } + + func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) { + pdList := &clusterv1beta1.PlacementDecisionList{} + lopts := &client.ListOptions{ + LabelSelector: selector, + Namespace: namespace, + } + + err := pd.c.List(context.TODO(), pdList, lopts) + pdPtrList := []*clusterv1beta1.PlacementDecision{} + for _, pd := range pdList.Items { + pdPtrList = append(pdPtrList, &pd) + } + + return pdPtrList, err + } + + func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) { + pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil) + + return clusterv1alpha1.NewRolloutHandler(pdTracker) + } + + var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus { + // (Phase 3 rollout logic goes here) + + return clusterv1alpha1.ClusterRolloutStatus{ + Status: "", // initially determined by comparing the generation of the policy with the particular + // cluster's generation, resolving to either "ToApply" or "Succeed" + LastTransitionTime: "", // retrieved from the last transition time in the cluster's status + } + } + ``` + + ```golang + rolloutHandler := common.GetRolloutHandler(client, placement) + strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus) + + // ... Use rolloutResults for propagation + ``` + +#### Story 3 / Phase 3 (Stretch goal) + +As a system administrator, I want policies to be rolled out only when the policy on previous clusters show as +"Compliant" and control their concurrency and priority. + +- **Summary** + + - Only continue policy deployment once the previously deployed clusters show as "Compliant" after a configurable + timeout. + +- **Details** + + - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct. (See the + [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)) + Default to `All` if a strategy is not provided. + - Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-timestamp` + annotation to the replicated policy for each cluster indicating when the replicated policy was last applied. + - Store a `lastEvaluated` in the status of the replicated policies so that the `RolloutStatus` can be determined + inside of `ClusterRolloutStatusFunc`. This should only be updated if the status changes (current behavior), or the + generation changes and/or the root policy timestamp is less than the `lastEvaluated` timestamp. + - Update the `ClusterRolloutStatusFunc`, (the `GetClusterRolloutStatus` function from Phase 2) to set up logic to work + with the `Placement` library to provide the information it needs to continue only once the previous group is + compliant (or after a configurable timeout): + + | Status | Description | + | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | + | `ToApply` | No generation set | + | `Progressing` | Current generation and `lastEvaluated` has not been updated | + | `Succeed` | Compliant and current generation | + | `Failed` | NonCompliant and current generation | + | `TimeOut` | Current generation, `lastEvaluated` is less than `root-policy-timestamp`,
and time has passed beyond timeout specified in the `RolloutStrategy` | + | `Skip` | (unused) | + + - (Stretch) Update the `Policy` CRD to contain a `RolloutStatus`. This will be used to provide an aggregated rollout + status for the root policy of `Progressing` (this would include `ToApply`), `Succeed`, or `Failed` (this would + include `TimeOut`). + +- **Snippet** + + ```golang + type PlacementBinding struct { + metav1.TypeMeta `json:",inline"` + metav1.ObjectMeta `json:"metadata,omitempty"` + // +kubebuilder:validation:Optional + BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"` + // +kubebuilder:validation:Optional + RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field + // This field provides the ability to select a subset of bound clusters + // +kubebuilder:validation:Optional + // +kubebuilder:validation:Enum=restricted + SubFilter SubFilter `json:"subFilter,omitempty"` + // +kubebuilder:validation:Required + PlacementRef PlacementSubject `json:"placementRef"` + // +kubebuilder:validation:Required + // +kubebuilder:validation:MinItems=1 + Subjects []Subject `json:"subjects"` + Status PlacementBindingStatus `json:"status,omitempty"` + } + ``` + + ```golang + // ComplianceHistory defines compliance details history + type ComplianceHistory struct { + LastEvaluated metav1.Time `json:"lastEvaluated,omitempty" protobuf:"bytes,7,opt,name=lastEvaluated"` // <-- New field + LastTimestamp metav1.Time `json:"lastTimestamp,omitempty" protobuf:"bytes,7,opt,name=lastTimestamp"` + Message string `json:"message,omitempty" protobuf:"bytes,4,opt,name=message"` + EventName string `json:"eventName,omitempty"` + } + + // PolicyStatus defines the observed state of Policy + type PolicyStatus struct { + Placement []*Placement `json:"placement,omitempty"` + Status []*CompliancePerClusterStatus `json:"status,omitempty"` + + // +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant + ComplianceState ComplianceState `json:"compliant,omitempty"` + RolloutStatus clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field + Details []*DetailsPerTemplate `json:"details,omitempty"` + } + ``` + +### Implementation Details/Notes/Constraints + +For the `Placement` library, this requires importing at least this package version (the `Placement` library is in the +`v1alpha1` version): + +``` + open-cluster-management.io/api v0.11.1-0.20230809113658-2b2399b5f6e8 +``` + +For testing, the `governance-policy-propagator` doesn't currently account for multiple managed clusters. As part of this +enhancement, the test flows would need to be enhanced (and/or a separate workflow created) that deploys multiple managed +clusters. + +### Risks and Mitigation + +- The optional proposal to remove items from the `Policy` CRD here is to prevent bloat and try to slim down the CRD, but + updating the CRD could be a breaking change for users so further consideration might need to be made and the updates + aren't required. +- The `Placement` library is relatively new and untested outside of its repo and this implementation leans heavily on + its logic. While it works in theory, there could be some tweaks/adjustments as development proceeds, lengthening time + for implementation. The phased approach intends to address this to make partial implementation feasible. +- Having a `LastEvaluated` field in the status is going to increase load on the hub cluster Kubernetes API server that + efforts originally sought to reduce by not sending updates when the status hadn't changed. This could be remedied by + only sending an updated timestamp if the generation was updated and/or the `LastEvaluated` timestamp is less than the + root policy timstamp or (the current behavior) if the status indeed changed. + +## Design Details + +### Open Questions + +1. Should `RolloutStatus` be reflected per-cluster in the root policy? (Removing `clusterNamespace` could essentially + make the CRD the same size as it currently is, remediating concerns around bloat, but the rollout status is also + reflected in each replicated policy.) +2. Is there opposition to removing `Placement.Decisions` from the `status`? (This could alternatively be handled outside + of this enhancement.) +3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? +4. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement` + library to handle such shifts, only handling the clusters that it returns.) +5. (An aside) What's with the pointer in `[]*clusterv1beta1.PlacementDecision`? + +### Test Plan + +- Unit tests in the repo +- E2E tests in the repo (would need to add additional managed clusters for this, potentially in a separate workflow, + though ideally alongside the existing tests) +- Policy integration tests in the framework (would need to add additional managed clusters for this, potentially in a + separate workflow, though ideally alongside the existing tests) + +## Drawbacks / Alternatives + +The maturity of the `Placement` library could be brought into question. This enhancement could hold at Phase 2 until we +ensure that we have a robust test environment that can test the various configurations before moving forward. diff --git a/enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml b/enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml new file mode 100644 index 00000000..4cfb9503 --- /dev/null +++ b/enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml @@ -0,0 +1,10 @@ +title: policy-placement-strategy +authors: + - "@dhaiducek" +reviewers: + - TBD +approvers: + - TBD +creation-date: 2023-08-03 +last-updated: 2023-08-03 +status: implementable