Commit 6b3e4e7: [99][sig-policy] Policy placement strategy

dhaiducek committed Aug 14, 2023 (1 parent: 11e15da)
Showing 2 changed files with 277 additions and 0 deletions.

enhancements/sig-policy/99-policy-placement-strategy/README.md (267 additions)
# Policy placement strategy

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
[website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement`
API, policies can leverage this new logic to have a configurable and systematic way to roll out policy updates to
clusters.

## Motivation

Currently, policies and their subsequent updates are pushed out to clusters en masse based on the placement to which
they are bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that controllers can
leverage for segmented, configurable rollouts of resources. This will aid use cases where high availability is a
priority or where a set of test clusters should receive and verify updates before the remaining clusters are updated.

### Goals

- Make `Placement` the primary API for placement (currently the governance propagator is somewhat
`PlacementRule`-centric)
- Leverage the `Placement` helper library to retrieve placement decisions
- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not)
- (stretch) Implement the `RolloutStrategy` struct for policies, including:
- `RolloutStrategy`
- "All": all clusters at once
- "Progressive": one cluster at a time
- "ProgressivePerGroup": one group of clusters at a time
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout
- `MandatoryDecisionGroups`: groups that should be handled before other groups
- `MaxConcurrency`: Concurrency during a progressive rollout
- (stretch) Add an aggregated rollout status for the root policy status.

### Non-Goals

Any specialized behavior beyond what the `Placement` library (and, by extension, the `DecisionStrategy` enhancement)
provides is out of scope; this enhancement should not require code beyond what the `Placement` library already
provides or requires.

## Proposal

### User Stories

#### Story 1 / Phase 1

As a system administrator, I want to know the status of a rollout for a policy and what clusters have been updated.

- **Summary**

- Add an annotation to the replicated Policy indicating whether a policy is up-to-date.

- **Details**

  - Update the `governance-policy-propagator` controller to add an `open-cluster-management.io/root-policy-generation`
    annotation to the replicated policy for each cluster.
  - Set the annotation to the root policy's generation each time the policy is replicated.
  - (Optional--requirement for phase 3) Update the `governance-policy-propagator` controller to add an
    `open-cluster-management.io/root-policy-timestamp` annotation to the replicated policy for each cluster indicating
    when the replicated policy was last applied.
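
A minimal sketch of the generation bookkeeping from the bullets above (the annotation key comes from this proposal;
the helper names are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

const rootPolicyGenerationAnnotation = "open-cluster-management.io/root-policy-generation"

// setRootGenerationAnnotation records the root policy's generation on a
// replicated policy's annotations at replication time.
func setRootGenerationAnnotation(annotations map[string]string, rootGeneration int64) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}

	annotations[rootPolicyGenerationAnnotation] = strconv.FormatInt(rootGeneration, 10)

	return annotations
}

// isUpToDate reports whether a replicated policy's recorded generation matches
// the root policy's current generation.
func isUpToDate(annotations map[string]string, rootGeneration int64) bool {
	return annotations[rootPolicyGenerationAnnotation] == strconv.FormatInt(rootGeneration, 10)
}

func main() {
	annotations := setRootGenerationAnnotation(nil, 3)

	fmt.Println(isUpToDate(annotations, 3)) // true: the replica matches the root policy
	fmt.Println(isUpToDate(annotations, 4)) // false: the root policy has moved on
}
```

In the real controller, the annotation would live on the replicated `Policy` object's `metadata.annotations`, and
this comparison would feed the rollout status in later phases.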

#### Story 2 / Phase 2

As a system administrator, I want to use a placement `DecisionStrategy` with policies.

- **Summary**

  - Use the `Placement` library to gather the list of clusters and iterate over them as is done today.

- **Details**

- Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a basic `ClusterRolloutStatusFunc`, iterate over all clusters as
usual using the `All` rollout strategy. The `RolloutStatus` will initially be either `ToApply` or `Succeed`.
  - (Optional cleanup) Remove the current `Placement.Decisions` from the `status` of the `Policy` CRD. It's unclear
    why `Policy` would store something that's discoverable elsewhere, and it doesn't appear to be in use currently.
- (Optional) Remove the `ClusterNamespace` from the `status` of the `Policy` CRD (`Placement` doesn't have this, so it
ought to be safe to remove it at this point if it's feasible and not overly complex to do so).

- **Snippet (untested)**

```golang
// common.go
import (
	"context"

	"k8s.io/apimachinery/pkg/labels"
	clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"
	clusterv1beta1 "open-cluster-management.io/api/cluster/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type placementDecisionGetter struct {
	c client.Client
}

func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) {
	pdList := &clusterv1beta1.PlacementDecisionList{}
	lopts := &client.ListOptions{
		LabelSelector: selector,
		Namespace:     namespace,
	}

	err := pd.c.List(context.TODO(), pdList, lopts)

	pdPtrList := []*clusterv1beta1.PlacementDecision{}
	for i := range pdList.Items {
		// Take the address of the slice element rather than the loop variable
		// so that each pointer refers to a distinct PlacementDecision.
		pdPtrList = append(pdPtrList, &pdList.Items[i])
	}

	return pdPtrList, err
}

func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) {
	pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil)

	return clusterv1alpha1.NewRolloutHandler(pdTracker)
}

var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus {
	// (Phase 3 rollout logic goes here)

	return clusterv1alpha1.ClusterRolloutStatus{
		// Initially determined by comparing the generation of the policy with the
		// particular cluster's generation, resolving to either ToApply or Succeed
		Status: clusterv1alpha1.ToApply,
		// Retrieved from the last transition time in the cluster's status
		LastTransitionTime: nil,
	}
}
```

```golang
rolloutHandler, err := common.GetRolloutHandler(client, placement)
// (handle err)

strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus)

// ... Use rolloutResult for propagation
```

#### Story 3 / Phase 3 (Stretch goal)

As a system administrator, I want policies to be rolled out only once the policy on previously updated clusters shows
as "Compliant", and I want to control the rollout's concurrency and priority.

- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant" after a configurable
timeout.

- **Details**

  - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct (see
    [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)),
    defaulting to `All` if a strategy is not provided.
  - Update the `governance-policy-propagator` controller to add an `open-cluster-management.io/root-policy-timestamp`
    annotation to the replicated policy for each cluster indicating when the replicated policy was last applied.
  - Store a `lastEvaluated` timestamp in the status of the replicated policies so that the `RolloutStatus` can be
    determined inside the `ClusterRolloutStatusFunc`. This should only be updated if the status changes (current
    behavior), or if the generation changes and/or the `lastEvaluated` timestamp is less than the root policy
    timestamp.
  - Update the `ClusterRolloutStatusFunc` (the `GetClusterRolloutStatus` function from Phase 2) with logic that works
    with the `Placement` library to provide the information it needs to continue only once the previous group is
    compliant (or after a configurable timeout):

| Status | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ToApply` | No generation set |
| `Progressing` | Current generation and `lastEvaluated` has not been updated |
| `Succeed` | Compliant and current generation |
| `Failed` | NonCompliant and current generation |
| `TimeOut`     | Current generation, `lastEvaluated` is less than `root-policy-timestamp`, <br/>and time has passed beyond timeout specified in the `RolloutStrategy`  |
| `Skip` | (unused) |

- (Stretch) Update the `Policy` CRD to contain a `RolloutStatus`. This will be used to provide an aggregated rollout
status for the root policy of `Progressing` (this would include `ToApply`), `Succeed`, or `Failed` (this would
include `TimeOut`).

- **Snippet**

```golang
type PlacementBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +kubebuilder:validation:Optional
	BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
	// +kubebuilder:validation:Optional
	RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
	// This field provides the ability to select a subset of bound clusters
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=restricted
	SubFilter SubFilter `json:"subFilter,omitempty"`
	// +kubebuilder:validation:Required
	PlacementRef PlacementSubject `json:"placementRef"`
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	Subjects []Subject              `json:"subjects"`
	Status   PlacementBindingStatus `json:"status,omitempty"`
}
```

```golang
// ComplianceHistory defines compliance details history
type ComplianceHistory struct {
	LastEvaluated metav1.Time `json:"lastEvaluated,omitempty" protobuf:"bytes,8,opt,name=lastEvaluated"` // <-- New field (with a new protobuf field number)
	LastTimestamp metav1.Time `json:"lastTimestamp,omitempty" protobuf:"bytes,7,opt,name=lastTimestamp"`
	Message       string      `json:"message,omitempty" protobuf:"bytes,4,opt,name=message"`
	EventName     string      `json:"eventName,omitempty"`
}

// PolicyStatus defines the observed state of Policy
type PolicyStatus struct {
	Placement []*Placement                  `json:"placement,omitempty"`
	Status    []*CompliancePerClusterStatus `json:"status,omitempty"`

	// +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant
	ComplianceState ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus   clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	Details         []*DetailsPerTemplate         `json:"details,omitempty"`
}
```

### Implementation Details/Notes/Constraints

For the `Placement` library, this requires importing at least the following version of the API package (the
`Placement` library is under the `v1alpha1` API version):

```
open-cluster-management.io/api v0.11.1-0.20230809113658-2b2399b5f6e8
```

For testing, the `governance-policy-propagator` doesn't currently account for multiple managed clusters. As part of
this enhancement, the test flows would need to be enhanced (and/or a separate workflow created) to deploy multiple
managed clusters.

### Risks and Mitigation

- The optional proposal to remove items from the `Policy` CRD is intended to prevent bloat and slim down the CRD, but
  updating the CRD could be a breaking change for users, so further consideration may be needed; these updates aren't
  required.
- The `Placement` library is relatively new and untested outside of its repo, and this implementation leans heavily on
  its logic. While it works in theory, tweaks and adjustments may be needed as development proceeds, lengthening the
  implementation time. The phased approach addresses this by making partial implementation feasible.
- Having a `LastEvaluated` field in the status will increase load on the hub cluster's Kubernetes API server, which
  previous efforts sought to reduce by not sending updates when the status hadn't changed. This could be remedied by
  only sending an updated timestamp if the generation was updated, the `LastEvaluated` timestamp is less than the
  root policy timestamp, or (the current behavior) the status indeed changed.

## Design Details

### Open Questions

1. Should the rollout handler be instantiated with the dynamic watcher instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (Removing `clusterNamespace` could essentially
make the CRD the same size as it currently is, remediating concerns around bloat, but the rollout status is also
reflected in each replicated policy.)
3. Is there opposition to removing `Placement.Decisions` from the `status`? (This could alternatively be handled outside
of this enhancement.)
4. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`?
5. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
library to handle such shifts, only handling the clusters that it returns.)
6. (An aside) What's with the pointer in `[]*clusterv1beta1.PlacementDecision`?

### Test Plan

- Unit tests in the repo
- E2E tests in the repo (would need to add additional managed clusters for this, potentially in a separate workflow,
though ideally alongside the existing tests)
- Policy integration tests in the framework (would need to add additional managed clusters for this, potentially in a
separate workflow, though ideally alongside the existing tests)

## Drawbacks / Alternatives

The maturity of the `Placement` library could be brought into question. This enhancement could hold at Phase 2 until we
ensure that we have a robust test environment that can test the various configurations before moving forward.
enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml (10 additions)
title: policy-placement-strategy
authors:
- "@dhaiducek"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2023-08-03
last-updated: 2023-08-03
status: implementable
