[99][sig-policy] Policy placement strategy #99
Conversation
Default to `All` if a strategy is not provided.
- Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-timestamp`
  annotation to the replicated policy for each cluster indicating when the replicated policy was last applied.
- Store a `lastEvaluated` in the status of the replicated policies so that the `RolloutStatus` can be determined
Could you please elaborate on this more?
I'm thinking about the case where the root policy is updated and the ConfigurationPolicy gets evaluated but the status doesn't change--how would this work? Also, if the root policy is updated but the generation of the ConfigurationPolicy doesn't change and its evaluation interval is something like `2h` or `never` when compliant, this would not trigger an evaluation in time.
Other controllers also don't post when they last evaluated.
We do provide the `LastEvaluated` in the ConfigurationPolicy, so I was thinking we can add logic to have the status-sync controller update the evaluation time when the `root-policy-timestamp` is greater than the last evaluated time, but I'd have to dig into when the `LastEvaluated` gets sent if it's not on every reconcile...
// An ISO-8601 timestamp of the last time the policy was evaluated
LastEvaluated string `json:"lastEvaluated,omitempty"`
Good point about other controllers, though. I'm not sure--I'll have to consider it more and look at our code. Maybe wipe the status on an update to let it repopulate or something?
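Just to make the comparison concrete, a rough sketch of what that status-sync check could look like (the annotation key and the function are hypothetical, not settled API, and `LastEvaluated` is parsed here as RFC 3339, which matches the ISO-8601 comment above):

```golang
package statussync

import (
	"fmt"
	"time"
)

// evaluatedAfterReplication is a sketch only: it reports whether a template's
// LastEvaluated timestamp is newer than the proposed root-policy-timestamp
// annotation on the replicated policy. The annotation key is the one floated
// in this thread, not an existing one.
func evaluatedAfterReplication(annotations map[string]string, lastEvaluated string) (bool, error) {
	replicatedAt, err := time.Parse(time.RFC3339, annotations["open-cluster-management.io/root-policy-timestamp"])
	if err != nil {
		return false, fmt.Errorf("parsing root-policy-timestamp annotation: %w", err)
	}

	evaluated, err := time.Parse(time.RFC3339, lastEvaluated)
	if err != nil {
		return false, fmt.Errorf("parsing lastEvaluated: %w", err)
	}

	return evaluated.After(replicatedAt), nil
}
```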
Looks like you have the same thought! I'll poke at it a bit to see whether this is feasible.
On a rollout, we could consider resetting the policy status from the Policy Propagator or setting it to a new status indicating a rollout is in progress. The status sync could recognize this and reset the compliance status on the policy templates. There is already logic in the controllers to always send a compliance event when there is no status, and it would cause configuration policies to be evaluated immediately regardless of evaluation interval. Then when the policies get a new status, we know the deployment has occurred.
You'd have to think about the specific workflow but I think that may be a good option to communicate between the hub and managed clusters during a rollout.
@dhaiducek after brainstorming with @gparvin since he has a good sense of the requirements from a core user, I'd like to propose the following:
- New policies are progressively propagated to clusters regardless of whether they are inform or enforce. The status of the root policy should indicate that clusters in later groups are pending rollout or some other state.
- Existing policies that are updated should only have their `remediationAction: enforce` progressively applied. Take for instance a scenario where we have clusters A, B, and C and A is a canary cluster. An update to my policy that sets `remediationAction: enforce` would at first have `remediationAction: enforce` on cluster A and `remediationAction: inform` on clusters B and C. The root policy status needs to indicate that clusters B and C are pending enforcement. For example, their compliance state could be `NonCompliant - Pending Rollout`.
- PolicyAutomation (separate story) would have a new mode of `everyRollout` so it can be triggered when a rollout fails as opposed to each cluster becoming noncompliant.
@dhaiducek and @JustinKuli , what do you think?
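To illustrate the second bullet above (purely a sketch of this proposal, not an agreed design), an update to an enforce root policy with cluster A as the canary might leave the replicated policy for cluster B looking like this until the rollout reaches it; the policy names and namespaces are made up:

```yaml
# Replicated policy in the cluster-b namespace while only cluster A is being enforced
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: policies.my-policy   # replicated copy of the root policy "my-policy" from the "policies" namespace
  namespace: cluster-b
spec:
  disabled: false
  remediationAction: inform  # overridden from the root policy's "enforce" until the rollout reaches cluster B
status:
  compliant: NonCompliant    # the proposal would surface this as pending enforcement on the root policy status
```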
I like the idea to switch to inform to prevent unintended updates. We could add the `RolloutStatus` to each cluster's status, but I'm not sure we should change the compliance state to NonCompliant during a rollout, especially if users have alerts set up or are active users of the dashboards--that could get unintentionally noisy? Maybe we just use the `Pending` state? But overall I think this is the right direction to having a controlled rollout.
It's a fair point but at the same time, as a user, I would like to see the state of the policy on the cluster if there are inform policies applied even if they are pending rollout.
We could use the `Pending` state but include the noncompliant message so at least the information is there.
What do you think @dhaiducek and @gparvin?
I prefer the compliance being NonCompliant when that reflects reality and "maybe" have a separate Remediation Pending status (new field). Though there may be too many things we could do in this extra remediation status field:
- pending rollout
- remediation applied by policy controller
- automated remediation applied
- remediated by automation or user intervention
- no remediation needed
@gparvin I'm unclear on what you mean by "when that reflects reality". Are you saying policies should continue to report based on the content they have at the time, or are you saying that if the policy is out-of-date, it should be considered NonCompliant? For the latter, I suppose I'm thinking about a case where it might have been a minor update to the policy where the ultimate compliance wouldn't have changed due to the update. But maybe there's a "Policy" best practice I'm missing there if that's the meaning.
And, if we set it to "NonCompliant" because a rollout is happening, how would setting it to "NonCompliant" affect Ansible automation? As @gparvin points out, we might need some way to distinguish "NonCompliant--this is out of date" vs "NonCompliant--this is broken"? In that case, should Ansible automation only act when the `RolloutStatus` is `Succeed`? (I'm crossing my fingers that gets updated to `Succeeded` 😆)
Interestingly, to the "too many things we could do" point, having an `enum` for status is apparently against the Kubernetes status conventions (though it seems `RolloutStatus` is already set as such), where "Status" should simply be "True", "False", or "Unknown": https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
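For reference, the convention in that link models this with `metav1.Condition`, where `Status` stays `True`/`False`/`Unknown` and the enum-like detail lives in `Reason`. A sketch (the condition type and reason strings here are made up, not anything proposed in the enhancement):

```golang
package conditions

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch: per-cluster rollout detail carried in Reason/Message while Status
// stays within True/False/Unknown per the API conventions. The
// "RolloutComplete" type and the reason below are hypothetical examples.
var exampleCondition = metav1.Condition{
	Type:               "RolloutComplete",
	Status:             metav1.ConditionFalse,
	Reason:             "PendingRollout",
	Message:            "cluster has not yet been enforced with the current policy generation",
	LastTransitionTime: metav1.Now(),
}
```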
@dhaiducek, I think the point is that if we follow the idea that a progressive rollout to a policy update makes the enforce part progressive and nothing else, then the pending rollout clusters should still accurately report back their inform status.
Having an additional field in the root policy status, propagated policy status, or both indicating that a rollout is pending would be sufficient in my opinion.
For the PolicyAutomation front, I don't want us to get caught up on this since it's a very underutilized feature. For now, we could have an additional optional spec field on the PolicyAutomation such as `ignorePendingRollout`.
Okay, that all sounds good to me, too. I've updated the enhancement with a point about setting the `remediationAction` to `inform`. I still need to check whether removing the status might work as a better alternative to the `lastEvaluated` status.
```golang
// ComplianceHistory defines compliance details history
type ComplianceHistory struct {
	LastEvaluated metav1.Time `json:"lastEvaluated,omitempty" protobuf:"bytes,7,opt,name=lastEvaluated"` // <-- New field
```
Similar to my last comment, how is this field set?
Addressing thread in #99 (comment)
reflected in each replicated policy.)
3. Is there opposition to removing `Placement.Decisions` from the `status`? (This could alternatively be handled outside
   of this enhancement.)
4. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`?
I prefer this level.
This means it needs to be in PolicySet as well.
@mprahl I just realized that "this" was unclear. Are you saying you prefer it being in the `Policy`?
I added my comments. Thanks for the work on this!
- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant" after a configurable
Is this a specific feature request? To me it sounds like this is worded wrong--maybe. Is the idea that the policies must be compliant within the timeout to continue, given it says Compliant "after" the timeout? I was expecting the word "before" instead of "after".
In the decision strategy, there's a timeout to wait for "Success" (which in the case of Policy is "Compliant"). The rollout will continue when clusters in the current rollout have succeeded OR have timed out.
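Stated as code, the gating rule being described is roughly the following (a hypothetical helper, not the Placement library's actual implementation; the real bookkeeping would come from the library's `ClusterRolloutStatusFunc`):

```golang
package rollout

import "time"

// clusterState is stand-in bookkeeping for a cluster in the current rollout
// group: whether its replicated policy reports Compliant, and when it last
// transitioned (compared against the RolloutStrategy timeout).
type clusterState struct {
	compliant      bool
	lastTransition time.Time
}

// groupDone reports whether the current group no longer blocks the rollout:
// every cluster has either succeeded (Compliant) or has been waiting longer
// than the configured timeout.
func groupDone(group []clusterState, timeout time.Duration, now time.Time) bool {
	for _, c := range group {
		if c.compliant {
			continue
		}

		if now.Sub(c.lastTransition) < timeout {
			return false // still progressing and not yet timed out
		}
	}

	return true
}
```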
| `Progressing` | Current generation and `lastEvaluated` has not been updated |
| `Succeed` | Compliant and current generation |
| `Failed` | NonCompliant and current generation |
| `TimeOut` | Current generation, `lastEvaluated` is less than `root-policy-timestamp`, and time has passed beyond the timeout specified in the `RolloutStrategy` |
I'm trying to understand "TimeOut":
Is it possible the "Current generation" and `lastEvaluated` gets updated but is somehow less than the root policy timestamp? I'm not sure I've figured that out yet.
The `lastEvaluated` might be reworked to use the transition time instead (See #99 (comment)). The `root-policy-timestamp` would be when the replicated policy spec was created/updated with the current generation, so any time before that would have been a previous version of the policy.
annotation to the replicated policy for each cluster.
- Update the annotation with the policy's generation at the time of the update when a policy has been replicated.
- (Optional--requirement for phase 3) Update the `governance-policy-propagator` controller to add a
  `open-cluster-management.io/root-policy-timestamp` annotation to the replicated policy for each cluster indicating
So this is really a "replication time" -- not a timestamp related to something that happened on the root policy?
Correct--and that might be a better name for it 🙂
+1 on something like open-cluster-management.io/replication-timestamp
How about `last-replicated`?
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a basic `ClusterRolloutStatusFunc`, iterate over all clusters as
  usual using the `All` rollout strategy. The `RolloutStatus` will initially be either `ToApply` or `Succeed`.
- (Optional cleanup) Remove the current `Placement.Decisions` from the `status` of the `Policy` CRD. It's uncertain
This item and maybe the next item: API removal would require a new version and deprecation of the old version (I didn't look at the CRD to see how we handle this today since I know we do different things for root vs replicated policy). Edit: I see this is mentioned later, so maybe this comment isn't needed...
Right--I might make a separate issue to look at the CRD cleanup outside of this enhancement. But I'm also not sure it's used anywhere...
Re: API updates, I think we could stop populating the field, without removing it from the CRD (and therefore without requiring a version bump). Especially since it's just in the status, it doesn't feel like it's as much of a breaking change.
@JustinKuli We'll have to investigate it (and I'll probably create a new issue to not muddle this enhancement), but I don't think we populate it currently 😅
only sending an updated timestamp if the generation was updated and/or the `LastEvaluated` timestamp is less than the
root policy timestamp or (the current behavior) if the status indeed changed.

## Design Details
I don't see PolicySet mentioned; this needs to work the same way for PolicySet as it does for Policy.
5. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
   library to handle such shifts, only handling the clusters that it returns.)
6. (An aside) What's with the pointer in `[]*clusterv1beta1.PlacementDecision`?
What about PolicyAutomation? I don't think there's really any impact, other than awareness that automation can be triggered against a policy that is in the mixed state.
Odd behavior if you change object templates and drill down to the old templates on a cluster without the new generation. Going back to the root policy you would not be able to edit/see that old object template. Expected behavior I think...
Right, I'm not sure placement impacts PolicyAutomation, but definitely something to verify.
Odd indeed, but the old templates ought to be reflected in the replicated policy, and it'd hopefully be relatively straightforward to find the rollout status in the root policy to figure out that it hadn't been updated yet. That said, maybe storing a per-cluster rollout status in the root policy would be good for visibility.
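If that per-cluster visibility were added, the root policy status might gain something along these lines (the `rolloutStatus` fields are hypothetical; only `compliant`, `clustername`, and `clusternamespace` exist in the status today):

```yaml
status:
  compliant: NonCompliant
  rolloutStatus: Progressing      # hypothetical aggregated rollout status
  status:
    - clustername: cluster-a
      clusternamespace: cluster-a
      compliant: Compliant
      rolloutStatus: Succeed      # hypothetical per-cluster rollout status
    - clustername: cluster-b
      clusternamespace: cluster-b
      compliant: NonCompliant
      rolloutStatus: ToApply
```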
// +kubebuilder:validation:Optional
BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
// +kubebuilder:validation:Optional
RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
Can this be used with selective policy enforcement?
Also, will there be CRD changes to `additionalPrinterColumns` to indicate a rollout scenario is underway or complete for the binding, which can be for multiple policies and/or policysets?
Good questions. I need to figure out how selective policy enforcement works, but I think they could work together, with the propagator filtering out clusters from the clusters returned from the Placement library.
The `additionalPrinterColumns` would be a good addition to the CRD!
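For the `additionalPrinterColumns` piece, that could be as simple as kubebuilder printer-column markers on the `PlacementBinding` type (a sketch only; the `.status.rolloutStatus` path assumes a hypothetical status field that doesn't exist today):

```golang
package v1

// Sketch only: printer-column markers that would surface rollout state in
// `kubectl get placementbindings`. The status field they reference is
// hypothetical and not part of the current CRD.
// +kubebuilder:printcolumn:name="Rollout Status",type="string",JSONPath=".status.rolloutStatus"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type PlacementBinding struct {
	// existing fields elided
}
```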
@dhaiducek I worry about the rollout strategy being defined on the PlacementBinding instead of the Policy because you can have multiple PlacementBindings per Policy. If multiple set rolloutStrategy, which one wins? Could a subset of clusters roll out `Progressive` and others roll out `ProgressivePerGroup`?
Also, the PlacementBinding can be used for multiple policies and it'd be annoying to have to use a separate PlacementBinding just because your policies have different rollout strategies.
Lastly, I think the UX around PlacementBinding is challenging. I find it much more intuitive to configure how my policy is deployed by setting it on the policy.
In regards to selective policy enforcement, I personally think it should be ignored or an error returned when a rollout strategy is also defined.
The multiple placement bindings question also seems to apply to the order/timing of when the enforce is done. If two bindings for the same policy have overlapping sets of clusters included both will (as I understand it) calculate decisiongroups independently. Any given cluster may be in group 1 of binding A and group 100 of binding B. Will it get enforced twice? If the max concurrency is 10, do you end up with 11 clusters being enforced due to the different groupings? etc.
- **Details**

- Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-generation`
Just thinking out loud, but I don't know if the generation number is enough here. Maybe I'm being paranoid, but if I delete the root policy and recreate it quickly enough that the propagator hasn't removed the old ones yet, I may have changed the policy while keeping the generation the same (or maybe the generation here is actually larger than the one on the root policy). With the long-term compliance history stuff, we're already considering making a hash of the policy to identify it, I think? Maybe that could also be used here...
I thought about the hash and especially if the compliance history would make use of it, that'd definitely be the way to go. It becomes complex, however, if hub templates are involved since the root policy and replicated policy would look different, which is why I went with the generation. If the root policy generation is less than the annotation on the replicated one, that could also be a signal to delete the replicated policy and recreate it.
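For reference, the hash alternative being discussed is cheap to compute; a sketch (whether to hash the root spec or the post-hub-template replicated spec is exactly the complication mentioned above):

```golang
package propagator

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// specHash returns a hash of a policy spec that could be stored in an
// annotation instead of (or alongside) the generation. Sketch only: hashing
// the root spec and hashing the replicated spec give different results once
// hub templates are resolved, which is the trade-off noted in this thread.
func specHash(spec interface{}) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}

	sum := sha256.Sum256(raw)

	return hex.EncodeToString(sum[:]), nil
}
```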
open-cluster-management.io/api v0.11.1-0.20230809113658-2b2399b5f6e8

For testing, the `governance-policy-propagator` doesn't currently account for multiple managed clusters. As part of this
I think it does currently consider multiple clusters, just in a fake way (which is probably what we want). I'm not saying that what it does is necessarily great, it probably needs to be looked at in detail for testing this feature, but it is already there to an extent: https://github.com/stolostron/governance-policy-propagator/blob/main/test/e2e/case1_propagation_test.go#L78
strategy specified, with the remediation action set to `inform`. This way, all policies are up-to-date and reporting
a compliance status based on the current version of the root policy. The aggregated rollout status on the root policy
is set to `Progressing`. If the remediation action on the root policy is already `inform`, the rollout status on each
cluster is set to `Progressing` and no second pass occurs. If the remediation action is `enforce`, the rollout status
Does an inform policy that rolls out with some clusters being NonCompliant ever transition out of Progressing to Succeeded? I think a clarification is needed on how an inform policy rollout can be successful and still identify NonCompliant clusters.
It's an interesting question. I was thinking `inform` would follow the `enforce` behaviors, transitioning to `Succeeded` or `Failed` based on the compliance.
But for `inform`, I suppose we could transition the rollout to `Succeeded` once a status is returned regardless of compliance? It'd differ in behavior from `enforce`, but maybe that's okay?
Yes, for inform, I think as customers learn where problems "may" be, they will improve their inform policies to detect those issues. Then the rollout will help identify those problem clusters, and I don't think the rollout should fail or be limited because inform policies are working as expected. Example: initially we want our addon to just be available; in v2 we learn we want to look out for terminating config policies...
A user shouldn't define a progressive rollout for an inform policy since it will cause the policy to be enforced, unless the ConfigurationPolicy has `remediationAction: informOnly`, in which case the rollout would have no effect.
I think the rollout status should be set to `Succeeded` if any status is provided, as Dale mentioned. Syntax errors in the ConfigurationPolicy would lead to a failed rollout status.
The nice part of this is that even users that don't use progressive policy rollouts will get a single field to know if all policy templates have reported an up to date status for the cluster.
Can I not roll out a policyset that contains inform and enforce policies and the inform policies would not be enforced? I should be able to put any random policies in a policyset with some enforce and some inform and it should be honored appropriately during rollout. Enforcing some policies could have unintended consequences -- rebooting nodes, upgrading, etc
I think it needs more clarification on how the RolloutStatus will move from ToApply -> Progressing -> Succeeded/TimeOut. Here are some thoughts:

- ToApply:
  1. When the child policy has not been created yet for the managedCluster.
  2. When the child policy has been created for the managedCluster AND the policy remediation is inform and the policy is in a NonCompliant state.
- Progressing:
  1. When the child policy has been created for the managedCluster AND the policy remediation is enforce and the policy is in a NonCompliant state.
- Succeeded:
  1. When the child policy has been created for the managedCluster and it is in a Compliant state for either enforce/inform remediation.
- TimeOut:
  1. When the child policy has been created for the managedCluster AND the policy remediation is enforce and the policy has been in a NonCompliant state for longer than the time defined in the rolloutStrategy TimeOut.
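Restating that proposal as a function might make it easier to review (a sketch of the mapping only; the names follow the bullets above rather than any existing code):

```golang
package rollout

// proposedStatus sketches the per-cluster transitions proposed above.
// childExists means the replicated policy has been created in the managed
// cluster namespace; timedOut means the NonCompliant duration exceeded the
// RolloutStrategy timeout.
func proposedStatus(childExists, enforce, compliant, timedOut bool) string {
	switch {
	case !childExists:
		return "ToApply"
	case compliant:
		return "Succeeded" // Compliant counts as success for inform and enforce alike
	case !enforce:
		return "ToApply" // inform + NonCompliant, per the proposal above
	case timedOut:
		return "TimeOut"
	default:
		return "Progressing" // enforce + NonCompliant, still within the timeout
	}
}
```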
policy, and an aggregated status on the root policy. The aggregated status would specify: `Progressing` (this would
include `ToApply`), `Succeeded`, `Failed`, or (this would include `TimeOut`).
I think `ToApply` means the policy hasn't been enforced yet on the cluster.
`Progressing` would mean the policy has been enforced on the cluster but the rollout hasn't completed yet.
Is the set of "valid" status fields different on the root policy vs the replicated policies? It sounds like ToApply would be valid on a replicated policy but not the root. Similarly it sounds like progressing would be valid on the root but not the replicated set. If so it might help to define the set of valid values for root vs replicated.
Good point, @mprahl--I wrapped `ToApply` in for simplicity, but it would be clearer to use `ToApply`.
@imiller0 The `RolloutStatus` is an iota we're taking from the Placement library: https://github.com/open-cluster-management-io/api/blob/9675ab737d2086dd7094fc962c9617cc71ec7484/cluster/v1alpha1/helpers.go#L21-L38 but having these well defined for each of the root and replicated policies would be good.
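For readers without the link handy, the values in that iota are roughly the following (paraphrased, not copied verbatim; only the values discussed in this thread are listed):

```golang
package rollout

// RolloutStatus approximates the iota-based type in the Placement library's
// cluster/v1alpha1 helpers that this enhancement reuses; comments map each
// value to its Policy meaning as discussed in this thread.
type RolloutStatus int

const (
	ToApply     RolloutStatus = iota // not yet applied to the cluster
	Progressing                      // applied, waiting on a result
	Succeed                          // successful ("Compliant" for Policy)
	Failed                           // failed ("NonCompliant" for Policy)
	TimeOut                          // no success within the RolloutStrategy timeout
)
```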
### Open Questions

1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
Yes to this.
1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
   each replicated policy.)
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
The `Policy` CRD, in my opinion, based on the comment here: https://github.com/open-cluster-management-io/enhancements/pull/99/files#r1309173623
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | ||
time to wait for a successful state), as defined in the placement enhancement, handled for Policies? |
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | |
time to wait for a successful state), as defined in the placement enhancement, handled for Policies? | |
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | |
time to wait for a state), as defined in the placement enhancement, handled for Policies? |
- "Progressive": one cluster at a time | ||
- "ProgressivePerGroup": one group of clusters at a time | ||
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout | ||
- `MandatoryDecisionGroups`: groups that should be handled before other groups |
Is this solely a priority ordering or does "mandatory" mean "fail if timed out"?
This would solely be priority ordering, but that's an interesting point to have a different timeout setting for the mandatory groups and the subsequent groups--that could be an update for the Placement library in the future.
`MandatoryDecisionGroups` is described in the `DecisionStrategy` enhancement here: https://github.com/open-cluster-management-io/enhancements/tree/main/enhancements/sig-architecture/64-placementStrategy#how-workload-applier-apis-will-benefits-from-using-placementstrategy
Any specialized behaviors outside of those provided by the `Placement` library and, by extension, the `DecisionStrategy`
enhancement, that might require additional code other than that already provided/required by the `Placement` library.
This includes the ability to roll back a rollout. A rollback should be implemented by the user by applying the previous
version of the policy, and GitOps is the recommended method for this.
This may be out of scope here, but for rollback to work correctly in the general case I believe that the policy would need to be "mustonlyhave" compliance type.
@imiller0 That's true for objects on the managed cluster--this is referring specifically to the previous version of the policy itself on the hub rather than rolling back changes on the managed cluster.
- Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct. (See the
  [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go))
  Defaults to `All` if a strategy is not provided or the remediation action is `inform`.
- When the `remediationAction` is set to `enforce`, policies not currently being rolled out will be set to `inform` to
Once set to enforce do they remain in enforce mode until a subsequent change to the root policy or do they revert to inform once compliant?
They remain in `enforce` until the root policy is updated again.
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
   case, it would also need to be added to `PolicySet`.)
4. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
   library to handle such shifts, only handling the clusters that it returns.)
Does this risk clusters being skipped? Or gaps in the decision groups if clusters are unbound/rebound?
library to handle such shifts, only handling the clusters that it returns.)
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of
   time to wait for a successful state), as defined in the placement enhancement, handled for Policies?
How does the rolloutStrategy work with policy dependency? Is it possible that if the two policies are bound using separate policybindings and the policies are modified simultaneously the group ordering could be such that rollout of one policy is entirely blocked until completion of all groups of the other policy?
1. On the first pass following a root policy creation or update, the `governance-policy-propagator` controller on the
   hub cluster will replicate all policies to managed cluster namespaces and are created regardless of the rollout
   strategy specified, with the remediation action set to `inform`. This way, all policies are up-to-date and reporting
If a policy is modified/updated by the user while a rollout is ongoing, does the process start over with all policies being set back to inform and rollout beginning with group 1 again?
That's correct
- The `RolloutStatus` would be added to reflect: rollout status on the replicated policy, per-cluster on the root
  policy, and an aggregated status on the root policy. The aggregated status would specify: `Progressing` (this would
  include `ToApply`), `Succeeded`, `Failed`, or (this would include `TimeOut`).
I don't think there is a Failed status because the policy-controller will continue to enforce the policy on the managedCluster as long as it is in a NonCompliant state. So the status could be ToApply, Progressing, Succeeded, or TimeOut.
@serngawy True, but if the enforcement is blocked for some reason (like a missing API on the managed cluster or an invalid manifest), it may be good to reflect that in the root policy since there's no way for the rollout to progress?
disabled: false
remediationAction: inform # <-- Remediation action updated to prepare for rollout
rolloutStrategy:
  type: ProgressivePerGroup
I guess it would be better to include the `progressivePerGroup` fields as well:

```yaml
remediationAction: inform # <-- Remediation action updated to prepare for rollout
rolloutStrategy:
  type: ProgressivePerGroup
  progressivePerGroup:
    mandatoryDecisionGroups:
      - groupName: dev-emea # (Same group name as defined in the placement.)
    timeout: None
```
This looks great. We'll work on adding the new requirements discussed today in a separate PR.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dhaiducek, mprahl
Merged dffc91e into open-cluster-management-io:main
ref: https://issues.redhat.com/browse/ACM-6523