[99][sig-policy] Policy placement strategy
dhaiducek committed Aug 29, 2023
1 parent 11e15da commit e40482c
Showing 2 changed files with 312 additions and 0 deletions.
302 changes: 302 additions & 0 deletions enhancements/sig-policy/99-policy-placement-strategy/README.md
@@ -0,0 +1,302 @@
# Policy placement strategy

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
[website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement`
API, policies can leverage this new logic to roll out policy updates to clusters in a configurable, systematic way.

## Motivation

Currently, policies and any subsequent updates are pushed out to clusters en masse based on the placement to which they
have been bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that controllers can
leverage to roll out resources in a segmented, configurable way. This aids use cases where high availability is a
priority or where a set of test clusters should receive and verify the updates before the remaining clusters are
updated.

### Goals

- Make `Placement` the primary API for placement (currently the governance propagator is somewhat
`PlacementRule`-centric)
- Leverage the `Placement` helper library to retrieve placement decisions
- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not)
- Implement the `RolloutStrategy` struct for policies (sketched after this list), including:
- `RolloutStrategy`
- "All": all clusters at once
- "Progressive": one cluster at a time
- "ProgressivePerGroup": one group of clusters at a time
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout
- `MandatoryDecisionGroups`: groups that should be handled before other groups
- `MaxConcurrency`: Concurrency during a progressive rollout
- Add an aggregated rollout status for the root policy status.
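
For orientation, the upstream `v1alpha1` `RolloutStrategy` struct (linked in Story 3 below) has roughly the following
abbreviated shape. This is a paraphrase for context only; `types_rolloutstrategy.go` in `open-cluster-management.io/api`
is authoritative:

```golang
// Abbreviated sketch of the upstream v1alpha1 RolloutStrategy; see
// cluster/v1alpha1/types_rolloutstrategy.go in open-cluster-management.io/api
// for the authoritative definition.
type RolloutStrategy struct {
	// Type is one of "All", "Progressive", or "ProgressivePerGroup".
	Type string `json:"type,omitempty"`

	// Each type has an options struct carrying the fields listed above: a
	// Timeout and, for the progressive types, MandatoryDecisionGroups and
	// (for "Progressive" only) MaxConcurrency.
	All                 *RolloutAll                 `json:"all,omitempty"`
	Progressive         *RolloutProgressive         `json:"progressive,omitempty"`
	ProgressivePerGroup *RolloutProgressivePerGroup `json:"progressivePerGroup,omitempty"`
}
```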

### Non-Goals

Out of scope are any specialized behaviors beyond those provided by the `Placement` library (and, by extension, the
`DecisionStrategy` enhancement) that would require code other than what the `Placement` library already provides or
requires. This includes the ability to roll back a rollout: a rollback should be performed by the user by applying the
previous version of the policy, and GitOps is the recommended method for this.

## Proposal

### Background

Users create policies on the hub cluster; these are referred to as root policies. Root policies are replicated to
managed cluster namespaces based on the placement to which they have been bound by the associated `PlacementBinding`
object. Each managed cluster monitors its namespace on the hub for policy updates.

Additionally, policies have a `remediationAction` specified, which can either be `inform`, meaning the policy controller
monitors objects on the managed cluster reporting status without making changes, or `enforce`, meaning the policy
controller attempts to make changes on the managed cluster based on the policy definition and reports status accordingly
based on the success or failure of the update.

### Design

For this rollout enhancement, policy propagation by the `governance-policy-propagator` controller on the hub would take
place in a multiple-pass approach:

1. On the first pass following a root policy creation or update, the `governance-policy-propagator` controller on the
   hub cluster replicates all policies to the managed cluster namespaces, regardless of the rollout strategy specified,
   with the remediation action set to `inform`. This way, all policies are up-to-date and report a compliance status
   based on the current version of the root policy. The aggregated rollout status on the root policy is set to
   `Progressing`. If the remediation action on the root policy is already `inform`, the rollout status on each cluster
   is set to `Progressing` and no second pass occurs. If the remediation action is `enforce`, the rollout status on
   each cluster is set to `ToApply`. (See the status sketch after this list.)
2. On subsequent passes, if the `remediationAction` is `enforce`, the `governance-policy-propagator` fetches the
   managed clusters returned by the `Placement` library (based on the user-configured rollout strategy), sets the
   rollout status of each returned cluster to `Progressing`, and updates the `remediationAction` to `enforce` as
   defined in the root policy.
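
A minimal sketch of the first-pass status assignment described above (`initialRolloutStatus` is a hypothetical helper
name, not existing propagator code):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// initialRolloutStatus sketches the first-pass assignment: "inform" policies
// immediately report Progressing (and no second pass occurs), while "enforce"
// policies wait in ToApply until the Placement library selects them.
func initialRolloutStatus(remediationAction string) clusterv1alpha1.RolloutStatus {
	if remediationAction == "enforce" {
		return clusterv1alpha1.ToApply
	}

	return clusterv1alpha1.Progressing
}
```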

From here, work is picked up by the `governance-policy-framework` controllers on the managed clusters:

1. The template sync controller of the framework updates the applicable `ConfigurationPolicy` objects as usual, which
   triggers reevaluation by the `config-policy-controller`.
2. Once the `ConfigurationPolicies` have a generation that matches the `lastEvaluatedGeneration` set in the policy
   status, the status sync controller of the framework sets the rollout status to `Succeeded` if all of the
   `ConfigurationPolicies` have a compliant status, or to `Failed` if any status is noncompliant. (A sketch of this
   decision follows this list.)
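
A sketch of that decision, assuming a simplified view of the per-template status (both `templateStatus` and
`frameworkRolloutStatus` are hypothetical names for illustration):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// templateStatus is a simplified, hypothetical view of a ConfigurationPolicy
// and its reported status.
type templateStatus struct {
	Compliant               bool
	Generation              int64
	LastEvaluatedGeneration int64
}

// frameworkRolloutStatus sketches the status sync decision: wait until every
// template has been evaluated at its current generation, then report
// Succeeded only if every template is compliant.
func frameworkRolloutStatus(templates []templateStatus) clusterv1alpha1.RolloutStatus {
	for _, template := range templates {
		if template.LastEvaluatedGeneration != template.Generation {
			return clusterv1alpha1.Progressing
		}
	}

	for _, template := range templates {
		if !template.Compliant {
			return clusterv1alpha1.Failed
		}
	}

	return clusterv1alpha1.Succeeded
}
```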

### User Stories

#### Story 1

As a system administrator, I want to know the status of a rollout for a policy and what clusters have been updated.

- **Summary**

- Add a `RolloutStatus` to the `Policy` status in the CRD.

- **Details**

  - The `RolloutStatus` would be added to reflect the rollout status on the replicated policy, a per-cluster status on
    the root policy, and an aggregated status on the root policy. The aggregated status would be one of `Progressing`
    (which would include `ToApply`), `Succeeded`, or `Failed` (which would include `TimeOut`). (See the aggregation
    sketch after the snippet below.)

- **Snippet**

```golang
// CompliancePerClusterStatus defines compliance per cluster status
type CompliancePerClusterStatus struct {
	ComplianceState  ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus    clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	ClusterName      string                        `json:"clustername,omitempty"`
	ClusterNamespace string                        `json:"clusternamespace,omitempty"`
}

// PolicyStatus defines the observed state of Policy
type PolicyStatus struct {
	Placement []*Placement                  `json:"placement,omitempty"`
	Status    []*CompliancePerClusterStatus `json:"status,omitempty"`

	// +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant
	ComplianceState ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus   clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	Details         []*DetailsPerTemplate         `json:"details,omitempty"`
}
```
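
A minimal sketch of how the aggregated root-policy rollout status could be derived from the per-cluster statuses using
the grouping above (`aggregateRolloutStatus` is a hypothetical helper):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// aggregateRolloutStatus sketches the aggregation: Failed or TimeOut on any
// cluster fails the rollout, any Progressing or ToApply keeps it Progressing,
// and only all clusters having Succeeded yields Succeeded.
func aggregateRolloutStatus(perCluster []clusterv1alpha1.RolloutStatus) clusterv1alpha1.RolloutStatus {
	aggregated := clusterv1alpha1.Succeeded

	for _, status := range perCluster {
		switch status {
		case clusterv1alpha1.Failed, clusterv1alpha1.TimeOut:
			return clusterv1alpha1.Failed
		case clusterv1alpha1.Progressing, clusterv1alpha1.ToApply:
			aggregated = clusterv1alpha1.Progressing
		}
	}

	return aggregated
}
```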

#### Story 2

As a system administrator, I want to use a placement `DecisionStrategy` with policies to control the order of the
updates.

- **Summary**

- Use the `Placement` library to gather the list of clusters and iterate over them.

- **Details**

- Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a `ClusterRolloutStatusFunc`, iterate over all clusters as usual
using the `All` rollout strategy.

- **Snippet (untested)**

```golang
// common.go
type placementDecisionGetter struct {
	c client.Client
}

func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) {
	pdList := &clusterv1beta1.PlacementDecisionList{}
	lopts := &client.ListOptions{
		LabelSelector: selector,
		Namespace:     namespace,
	}

	err := pd.c.List(context.TODO(), pdList, lopts)
	if err != nil {
		return nil, err
	}

	pdPtrList := []*clusterv1beta1.PlacementDecision{}
	for i := range pdList.Items {
		// Take the address of the slice element (not the loop variable) so
		// each pointer references a distinct PlacementDecision.
		pdPtrList = append(pdPtrList, &pdList.Items[i])
	}

	return pdPtrList, nil
}

func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) {
	pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil)

	_, _, err := pdTracker.Get()
	if err != nil {
		log.Error(err, "Error retrieving PlacementDecisions from tracker")
	}

	return clusterv1alpha1.NewRolloutHandler(pdTracker)
}

var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus {
	// Fetch and return the rollout status and the last transition time from the status in the replicated policy

	return clusterv1alpha1.ClusterRolloutStatus{
		Status:             "<rollout-status>",
		LastTransitionTime: "<transition-time>",
	}
}
```

```golang
rolloutHandler, err := common.GetRolloutHandler(client, placement)
if err != nil {
	// Handle the error
}

strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus)

// ... Use rolloutResult for propagation
```

#### Story 3

As a system administrator, I want policies to be rolled out only when the policy on previously updated clusters shows
as "Compliant", and I want to control the rollout's concurrency and priority.

- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant".

- **Details**

  - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct (see
    [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)).
    It defaults to `All` if a strategy is not provided or if the remediation action is `inform`.
  - When the `remediationAction` is set to `enforce`, policies not currently being rolled out are set to `inform` so
    that they continue to report status without making changes on the managed cluster while waiting for the new version
    of the policy to be enforced. (A sketch of this gating follows the snippet below.)
  - Update the `ClusterRolloutStatusFunc` (the `GetClusterRolloutStatus` function from Story 2) to provide the
    information the `Placement` library needs to continue only once the previous group is compliant (or after a
    configurable timeout):

| Status | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ToApply` | A policy to be enforced is created/updated in the first pass of the propagator on the hub |
| `Progressing` | Policy was defined as `inform`, or selected by the `Placement` library and updated to `enforce` |
| `Succeeded` | Policy was defined as `inform` and has status, or status is compliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster |
| `Failed` | Policy is non-compliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster |
| `TimeOut` | Time has passed beyond the timeout specified in the `RolloutStrategy` without a returned status |
| `Skip` | (unused) |

- **Snippet**

```golang
type PlacementBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +kubebuilder:validation:Optional
	BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
	// +kubebuilder:validation:Optional
	RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
	// This field provides the ability to select a subset of bound clusters
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=restricted
	SubFilter SubFilter `json:"subFilter,omitempty"`
	// +kubebuilder:validation:Required
	PlacementRef PlacementSubject `json:"placementRef"`
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	Subjects []Subject              `json:"subjects"`
	Status   PlacementBindingStatus `json:"status,omitempty"`
}
```
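
A minimal sketch of the `inform` gating described in the details above (`replicatedRemediation` is a hypothetical
helper, not existing propagator code):

```golang
// replicatedRemediation sketches the gating: while a cluster is not yet
// selected by the rollout, an enforce root policy is replicated as inform so
// the cluster keeps reporting status without changes being made.
func replicatedRemediation(rootAction string, selectedForRollout bool) string {
	if rootAction == "enforce" && !selectedForRollout {
		return "inform"
	}

	return rootAction
}
```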

### Implementation Details/Notes/Constraints

For the `Placement` library, this requires at least the following version of the API module (the `Placement` rollout
helpers are still at `v1alpha1`):

```
open-cluster-management.io/api v0.11.1-0.20230828015110-b39eb9026c6e
```

For testing, the `governance-policy-propagator` repo doesn't currently account for multiple managed clusters. As part
of this enhancement, the test flows would need to be enhanced (and/or a separate workflow created) to deploy multiple
managed clusters.

The `PlacementDecisionGetter` returns an array of pointers (`[]*clusterv1beta1.PlacementDecision`) because it was
intended to retrieve from a cache, so the implementation could consider setting up a `PlacementDecision` cache instead
of using the Kubernetes client directly.

### Risks and Mitigation

- The `Placement` library is relatively new and untested outside of its repo, and this implementation leans heavily on
  its logic. While it works in theory, tweaks and adjustments could be needed as development proceeds, lengthening the
  implementation time. The phased approach intends to address this by making partial implementation feasible.

## Design Details

### Open Questions

1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
each replicated policy.)
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
case, it would also need to be added to `PolicySet`.)
4. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
library to handle such shifts, only handling the clusters that it returns.)
5. How are "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of
   time to wait for a successful state), as defined in the placement enhancement, handled for policies?

### Test Plan

- Unit tests in the repo
- E2E tests in the repo (would need to add additional managed clusters for this, potentially in a separate workflow,
though ideally alongside the existing tests)
- Policy integration tests in the framework (would need to add additional managed clusters for this, potentially in a
separate workflow, though ideally alongside the existing tests)

## Drawbacks / Alternatives

The maturity of the `Placement` library could be brought into question. This enhancement could pause after the
migration to the `Placement` library in Story 2, supporting only the `All` strategy until there is a robust test
environment that can exercise the various configurations before moving forward.

If a user only updates a single policy-template out of many in a `Policy`, the rollout would signal a status update for
all of the policy-templates. This is a drawback that this enhancement does not seek to address.
10 changes: 10 additions & 0 deletions enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml
@@ -0,0 +1,10 @@
title: policy-placement-strategy
authors:
- "@dhaiducek"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2023-08-03
last-updated: 2023-08-03
status: implementable
