[99][sig-policy] Policy placement strategy #99
Conversation
Default to `All` if a strategy is not provided.
- Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-timestamp`
  annotation to the replicated policy for each cluster indicating when the replicated policy was last applied.
- Store a `lastEvaluated` in the status of the replicated policies so that the `RolloutStatus` can be determined
Could you please elaborate on this more?
I'm thinking about the case where the root policy is updated and the ConfigurationPolicy gets evaluated but the status doesn't change--how would this work? Also, if the root policy is updated but the generation of the ConfigurationPolicy doesn't change and its evaluation interval is something like `2h` or `never` when compliant, this would not trigger an evaluation in time.
Other controllers also don't post when they last evaluated.
We do provide the `LastEvaluated` in the ConfigurationPolicy, so I was thinking we can add logic to have the status-sync controller update the evaluation time when the `root-policy-timestamp` is greater than the last evaluated time, but I'd have to dig into when the `LastEvaluated` gets sent if it's not on every reconcile...
// An ISO-8601 timestamp of the last time the policy was evaluated
LastEvaluated string `json:"lastEvaluated,omitempty"`
Good point about other controllers, though. I'm not sure--I'll have to consider it more and look at our code. Maybe wipe the status on an update to let it repopulate or something?
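Just to make the comparison concrete, a rough sketch of what that status-sync check could look like (the annotation key and the function are hypothetical, not settled API, and `LastEvaluated` is parsed here as RFC 3339, which matches the ISO-8601 comment above):

```golang
package statussync

import (
	"fmt"
	"time"
)

// evaluatedAfterReplication is a sketch only: it reports whether a template's
// LastEvaluated timestamp is newer than the proposed root-policy-timestamp
// annotation on the replicated policy. The annotation key is the one floated
// in this thread, not an existing one.
func evaluatedAfterReplication(annotations map[string]string, lastEvaluated string) (bool, error) {
	replicatedAt, err := time.Parse(time.RFC3339, annotations["open-cluster-management.io/root-policy-timestamp"])
	if err != nil {
		return false, fmt.Errorf("parsing root-policy-timestamp annotation: %w", err)
	}

	evaluated, err := time.Parse(time.RFC3339, lastEvaluated)
	if err != nil {
		return false, fmt.Errorf("parsing lastEvaluated: %w", err)
	}

	return evaluated.After(replicatedAt), nil
}
```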
Looks like you have the same thought! I'll poke at it a bit to see whether this is feasible.
On a rollout, we could consider resetting the policy status from the Policy Propagator or setting it to a new status indicating a rollout is in progress. The status sync could recognize this and reset the compliance status on the policy templates. There is already logic in the controllers to always send a compliance event when there is no status, and it would cause configuration policies to be evaluated immediately regardless of evaluation interval. Then when the policies get a new status, we know the deployment has occurred.
You'd have to think about the specific workflow but I think that may be a good option to communicate between the hub and managed clusters during a rollout.
@dhaiducek after brainstorming with @gparvin since he has a good sense of the requirements from a core user, I'd like to propose the following:
- New policies are progressively propagated to clusters regardless of whether they are inform or enforce. The status of the root policy should indicate that clusters in later groups are pending rollout or some other state.
- Existing policies that are updated should only have their `remediationAction: enforce` progressively applied. Take for instance a scenario where we have clusters A, B, and C and A is a canary cluster. An update to my policy that sets `remediationAction: enforce` would at first have `remediationAction: enforce` on cluster A and `remediationAction: inform` on clusters B and C. The root policy status needs to indicate that clusters B and C are pending enforcement. For example, their compliance state could be `NonCompliant - Pending Rollout`.
- PolicyAutomation (separate story) would have a new mode of `everyRollout` so it can be triggered when a rollout fails as opposed to each cluster becoming noncompliant.
@dhaiducek and @JustinKuli , what do you think?
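To illustrate the second bullet above (purely a sketch of this proposal, not an agreed design), an update to an enforce root policy with cluster A as the canary might leave the replicated policy for cluster B looking like this until the rollout reaches it; the policy names and namespaces are made up:

```yaml
# Replicated policy in the cluster-b namespace while only cluster A is being enforced
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: policies.my-policy   # replicated copy of the root policy "my-policy" from the "policies" namespace
  namespace: cluster-b
spec:
  disabled: false
  remediationAction: inform  # overridden from the root policy's "enforce" until the rollout reaches cluster B
status:
  compliant: NonCompliant    # the proposal would surface this as pending enforcement on the root policy status
```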
I like the idea to switch to inform to prevent unintended updates. We could add the `RolloutStatus` to each cluster's status, but I'm not sure we should change the compliance state to NonCompliant during a rollout, especially if users have alerts set up or are active users of the dashboards--that could get unintentionally noisy? Maybe we just use the `Pending` state? But overall I think this is the right direction to having a controlled rollout.
It's a fair point but at the same time, as a user, I would like to see the state of the policy on the cluster if there are inform policies applied even if they are pending rollout.
We could use the `Pending` state but include the noncompliant message so at least the information is there.
What do you think @dhaiducek and @gparvin?
I prefer the compliance being NonCompliant when that reflects reality and "maybe" have a separate Remediation Pending status (new field). Though there may be too many things we could do in this extra remediation status field:
- pending rollout
- remediation applied by policy controller
- automated remediation applied
- remediated by automation or user intervention
- no remediation needed
@gparvin I'm unclear on what you mean by "when that reflects reality". Are you saying policies should continue to report based on the content they have at the time, or are you saying that if the policy is out-of-date, it should be considered NonCompliant? For the latter, I suppose I'm thinking about a case where it might have been a minor update to the policy where the ultimate compliance wouldn't have changed due to the update. But maybe there's a "Policy" best practice I'm missing there if that's the meaning.
And, if we set it to "NonCompliant" because a rollout is happening, how would setting it to "NonCompliant" affect Ansible automation? As @gparvin points out, we might need some way to distinguish "NonCompliant--this is out of date" vs "NonCompliant--this is broken"? In that case, should Ansible automation only act when the `RolloutStatus` is `Succeed`? (I'm crossing my fingers that gets updated to `Succeeded` 😆)
Interestingly, to the "too many things we could do" point, having an `enum` for status is apparently against the Kubernetes status conventions (though it seems `RolloutStatus` is already set as such), where "Status" should simply be "True", "False", or "Unknown": https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties
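For reference, the convention in that link models this with `metav1.Condition`, where `Status` stays `True`/`False`/`Unknown` and the enum-like detail lives in `Reason`. A sketch (the condition type and reason strings here are made up, not anything proposed in the enhancement):

```golang
package conditions

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch: per-cluster rollout detail carried in Reason/Message while Status
// stays within True/False/Unknown per the API conventions. The
// "RolloutComplete" type and the reason below are hypothetical examples.
var exampleCondition = metav1.Condition{
	Type:               "RolloutComplete",
	Status:             metav1.ConditionFalse,
	Reason:             "PendingRollout",
	Message:            "cluster has not yet been enforced with the current policy generation",
	LastTransitionTime: metav1.Now(),
}
```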
@dhaiducek, I think the point is that if we follow the idea that a progressive rollout to a policy update makes the enforce part progressive and nothing else, then the pending rollout clusters should still accurately report back their inform status.
Having an additional field in the root policy status, propagated policy status, or both indicating that a rollout is pending would be sufficient in my opinion.
For the PolicyAutomation front, I don't want us to get caught up on this since it's a very underutilized feature. For now, we could have an additional optional spec field on the PolicyAutomation such as `ignorePendingRollout`.
Okay, that all sounds good to me, too. I've updated the enhancement with a point about setting the `remediationAction` to `inform`. I still need to check whether removing the status might work as a better alternative to the `lastEvaluated` status.
```golang
// ComplianceHistory defines compliance details history
type ComplianceHistory struct {
	LastEvaluated metav1.Time `json:"lastEvaluated,omitempty" protobuf:"bytes,7,opt,name=lastEvaluated"` // <-- New field
```
Similar to my last comment, how is this field set?
Addressing thread in #99 (comment)
reflected in each replicated policy.)
3. Is there opposition to removing `Placement.Decisions` from the `status`? (This could alternatively be handled outside
   of this enhancement.)
4. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`?
I prefer this level.
This means it needs to be in PolicySet as well.
@mprahl I just realized that "this" was unclear. Are you saying you prefer it being in the `Policy`?
I added my comments. Thanks for the work on this!
- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant" after a configurable
Is this a specific feature request? To me it sounds like this is worded wrong--maybe. Is the idea that the policies must be compliant within the timeout to continue, given it says Compliant "after" the timeout? I was expecting the word "before" instead of "after".
In the decision strategy, there's a timeout to wait for "Success" (which in the case of Policy is "Compliant"). The rollout will continue when clusters in the current rollout have succeeded OR have timed out.
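Stated as code, the gating rule being described is roughly the following (a hypothetical helper, not the Placement library's actual implementation; the real bookkeeping would come from the library's `ClusterRolloutStatusFunc`):

```golang
package rollout

import "time"

// clusterState is stand-in bookkeeping for a cluster in the current rollout
// group: whether its replicated policy reports Compliant, and when it last
// transitioned (compared against the RolloutStrategy timeout).
type clusterState struct {
	compliant      bool
	lastTransition time.Time
}

// groupDone reports whether the current group no longer blocks the rollout:
// every cluster has either succeeded (Compliant) or has been waiting longer
// than the configured timeout.
func groupDone(group []clusterState, timeout time.Duration, now time.Time) bool {
	for _, c := range group {
		if c.compliant {
			continue
		}

		if now.Sub(c.lastTransition) < timeout {
			return false // still progressing and not yet timed out
		}
	}

	return true
}
```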
| `Progressing` | Current generation and `lastEvaluated` has not been updated |
| `Succeed` | Compliant and current generation |
| `Failed` | NonCompliant and current generation |
| `TimeOut` | Current generation, `lastEvaluated` is less than `root-policy-timestamp`, and time has passed beyond the timeout specified in the `RolloutStrategy` |
I'm trying to understand "TimeOut":
Is it possible the "Current generation" and `lastEvaluated` gets updated but is somehow less than the root policy timestamp? I'm not sure I've figured that out yet.
The `lastEvaluated` might be reworked to use the transition time instead (See #99 (comment)). The `root-policy-timestamp` would be when the replicated policy spec was created/updated with the current generation, so any time before that would have been a previous version of the policy.
annotation to the replicated policy for each cluster.
- Update the annotation with the policy's generation at the time of the update when a policy has been replicated.
- (Optional--requirement for phase 3) Update the `governance-policy-propagator` controller to add a
  `open-cluster-management.io/root-policy-timestamp` annotation to the replicated policy for each cluster indicating
So this is really a "replication time" -- not a timestamp related to something that happened on the root policy?
Correct--and that might be a better name for it 🙂
+1 on something like open-cluster-management.io/replication-timestamp
How about `last-replicated`?
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a basic `ClusterRolloutStatusFunc`, iterate over all clusters as
  usual using the `All` rollout strategy. The `RolloutStatus` will initially be either `ToApply` or `Succeed`.
- (Optional cleanup) Remove the current `Placement.Decisions` from the `status` of the `Policy` CRD. It's uncertain
This item and maybe the next item: API removal would require a new version and deprecation of the old version (I didn't look at the CRD to see how we handle this today since I know we do different things for root vs replicated policy). Edit: I see this is mentioned later, so maybe this comment isn't needed...
Right--I might make a separate issue to look at the CRD cleanup outside of this enhancement. But I'm also not sure it's used anywhere...
Re: API updates, I think we could stop populating the field, without removing it from the CRD (and therefore without requiring a version bump). Especially since it's just in the status, it doesn't feel like it's as much of a breaking change.
@JustinKuli We'll have to investigate it (and I'll probably create a new issue to not muddle this enhancement), but I don't think we populate it currently 😅
only sending an updated timestamp if the generation was updated and/or the `LastEvaluated` timestamp is less than the
root policy timestamp or (the current behavior) if the status indeed changed.

## Design Details
I don't see PolicySet mentioned; this needs to work the same way for PolicySet as it does for Policy.
5. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
   library to handle such shifts, only handling the clusters that it returns.)
6. (An aside) What's with the pointer in `[]*clusterv1beta1.PlacementDecision`?
What about PolicyAutomation? I don't think there's really any impact, other than awareness that automation can be triggered against a policy that is in the mixed state.
Odd behavior if you change object templates and drill down to the old templates on a cluster without the new generation. Going back to the root policy you would not be able to edit/see that old object template. Expected behavior I think...
Right, I'm not sure placement impacts PolicyAutomation, but definitely something to verify.
Odd indeed, but the old templates ought to be reflected in the replicated policy, and it'd hopefully be relatively straightforward to find the rollout status in the root policy to figure out that it hadn't been updated yet. That said, maybe storing a per-cluster rollout status in the root policy would be good for visibility.
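If that per-cluster visibility were added, the root policy status might gain something along these lines (the `rolloutStatus` fields are hypothetical; only `compliant`, `clustername`, and `clusternamespace` exist in the status today):

```yaml
status:
  compliant: NonCompliant
  rolloutStatus: Progressing      # hypothetical aggregated rollout status
  status:
    - clustername: cluster-a
      clusternamespace: cluster-a
      compliant: Compliant
      rolloutStatus: Succeed      # hypothetical per-cluster rollout status
    - clustername: cluster-b
      clusternamespace: cluster-b
      compliant: NonCompliant
      rolloutStatus: ToApply
```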
// +kubebuilder:validation:Optional
BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
// +kubebuilder:validation:Optional
RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
Can this be used with selective policy enforcement?
Also, will there be CRD changes to `additionalPrinterColumns` to indicate a rollout scenario is underway or complete for the binding, which can be for multiple policies and/or policysets?
Good questions. I need to figure out how selective policy enforcement works, but I think they could work together, with the propagator filtering out clusters from the clusters returned from the Placement library.
The `additionalPrinterColumns` would be a good addition to the CRD!
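For the `additionalPrinterColumns` piece, that could be as simple as kubebuilder printer-column markers on the `PlacementBinding` type (a sketch only; the `.status.rolloutStatus` path assumes a hypothetical status field that doesn't exist today):

```golang
package v1

// Sketch only: printer-column markers that would surface rollout state in
// `kubectl get placementbindings`. The status field they reference is
// hypothetical and not part of the current CRD.
// +kubebuilder:printcolumn:name="Rollout Status",type="string",JSONPath=".status.rolloutStatus"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type PlacementBinding struct {
	// existing fields elided
}
```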
@dhaiducek I worry about the rollout strategy being defined on the PlacementBinding instead of the Policy because you can have multiple PlacementBindings per Policy. If multiple set rolloutStrategy, which one wins? Could a subset of clusters roll out `Progressive` and others roll out `ProgressivePerGroup`?
Also, the PlacementBinding can be used for multiple policies and it'd be annoying to have to use a separate PlacementBinding just because your policies have different rollout strategies.
Lastly, I think the UX around PlacementBinding is challenging. I find it much more intuitive to configure how my policy is deployed by setting it on the policy.
In regards to selective policy enforcement, I personally think it should be ignored or an error returned when a rollout strategy is also defined.
The multiple placement bindings question also seems to apply to the order/timing of when the enforce is done. If two bindings for the same policy have overlapping sets of clusters included both will (as I understand it) calculate decisiongroups independently. Any given cluster may be in group 1 of binding A and group 100 of binding B. Will it get enforced twice? If the max concurrency is 10, do you end up with 11 clusters being enforced due to the different groupings? etc.
- **Details**

- Update the `governance-policy-propagator` controller to add a `open-cluster-management.io/root-policy-generation`
Just thinking out loud, but I don't know if the generation number is enough here. Maybe I'm being paranoid, but if I delete the root policy and recreate it quickly enough that the propagator hasn't removed the old ones yet, I may have changed the policy while keeping the generation the same (or maybe the generation here is actually larger than the one on the root policy). With the long-term compliance history stuff, we're already considering making a hash of the policy to identify it, I think? Maybe that could also be used here...
I thought about the hash and especially if the compliance history would make use of it, that'd definitely be the way to go. It becomes complex, however, if hub templates are involved since the root policy and replicated policy would look different, which is why I went with the generation. If the root policy generation is less than the annotation on the replicated one, that could also be a signal to delete the replicated policy and recreate it.
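For reference, the hash alternative being discussed is cheap to compute; a sketch (whether to hash the root spec or the post-hub-template replicated spec is exactly the complication mentioned above):

```golang
package propagator

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// specHash returns a hash of a policy spec that could be stored in an
// annotation instead of (or alongside) the generation. Sketch only: hashing
// the root spec and hashing the replicated spec give different results once
// hub templates are resolved, which is the trade-off noted in this thread.
func specHash(spec interface{}) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}

	sum := sha256.Sum256(raw)

	return hex.EncodeToString(sum[:]), nil
}
```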
open-cluster-management.io/api v0.11.1-0.20230809113658-2b2399b5f6e8

For testing, the `governance-policy-propagator` doesn't currently account for multiple managed clusters. As part of this
I think it does currently consider multiple clusters, just in a fake way (which is probably what we want). I'm not saying that what it does is necessarily great, it probably needs to be looked at in detail for testing this feature, but it is already there to an extent: https://github.com/stolostron/governance-policy-propagator/blob/main/test/e2e/case1_propagation_test.go#L78
strategy specified, with the remediation action set to `inform`. This way, all policies are up-to-date and reporting
a compliance status based on the current version of the root policy. The aggregated rollout status on the root policy
is set to `Progressing`. If the remediation action on the root policy is already `inform`, the rollout status on each
cluster is set to `Progressing` and no second pass occurs. If the remediation action is `enforce`, the rollout status
Does an inform policy that rolls out with some clusters being NonCompliant ever transition out of Progressing to Succeeded? I think a clarification is needed on how an inform policy rollout can be successful and still identify NonCompliant clusters.
It's an interesting question. I was thinking `inform` would follow the `enforce` behaviors, transitioning to `Succeeded` or `Failed` based on the compliance.
But for `inform`, I suppose we could transition the rollout to `Succeeded` once a status is returned regardless of compliance? It'd differ in behavior from `enforce`, but maybe that's okay?
Yes, for inform, I think as customers learn where problems "may" be, they will improve their inform policies to detect those issues. Then the rollout will help identify those problem clusters, and I don't think the rollout should fail or be limited because inform policies are working as expected. Example: initially we want our addon to just be available; in v2 we learn we want to look out for terminating config policies...
A user shouldn't define a progressive rollout for an inform policy since it will cause the policy to be enforced, unless the ConfigurationPolicy has `remediationAction: informOnly`, in which case the rollout would have no effect.
I think the rollout status should be set to `Succeeded` if any status is provided, as Dale mentioned. Syntax errors in the ConfigurationPolicy would lead to a failed rollout status.
The nice part of this is that even users that don't use progressive policy rollouts will get a single field to know if all policy templates have reported an up to date status for the cluster.
Can I not roll out a policyset that contains inform and enforce policies and the inform policies would not be enforced? I should be able to put any random policies in a policyset with some enforce and some inform and it should be honored appropriately during rollout. Enforcing some policies could have unintended consequences -- rebooting nodes, upgrading, etc
I think it needs more clarification on how the RolloutStatus will move from ToApply -> Progressing -> Succeeded/TimeOut. Here are some thoughts:

- ToApply:
  1. When the child policy has not been created yet for the managedCluster.
  2. When the child policy has been created for the managedCluster AND the policy remediation is inform and the policy is in a NonCompliant state.
- Progressing:
  1. When the child policy has been created for the managedCluster AND the policy remediation is enforce and the policy is in a NonCompliant state.
- Succeeded:
  1. When the child policy has been created for the managedCluster and it is in a Compliant state for either enforce/inform remediation.
- TimeOut:
  1. When the child policy has been created for the managedCluster AND the policy remediation is enforce and the policy has been in a NonCompliant state for longer than the time defined in the rolloutStrategy TimeOut.
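Restating that proposal as a function might make it easier to review (a sketch of the mapping only; the names follow the bullets above rather than any existing code):

```golang
package rollout

// proposedStatus sketches the per-cluster transitions proposed above.
// childExists means the replicated policy has been created in the managed
// cluster namespace; timedOut means the NonCompliant duration exceeded the
// RolloutStrategy timeout.
func proposedStatus(childExists, enforce, compliant, timedOut bool) string {
	switch {
	case !childExists:
		return "ToApply"
	case compliant:
		return "Succeeded" // Compliant counts as success for inform and enforce alike
	case !enforce:
		return "ToApply" // inform + NonCompliant, per the proposal above
	case timedOut:
		return "TimeOut"
	default:
		return "Progressing" // enforce + NonCompliant, still within the timeout
	}
}
```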
policy, and an aggregated status on the root policy. The aggregated status would specify: `Progressing` (this would
include `ToApply`), `Succeeded`, `Failed`, or (this would include `TimeOut`).
I think `ToApply` means the policy hasn't been enforced yet on the cluster.
`Progressing` would mean the policy has been enforced on the cluster but the rollout hasn't completed yet.
Is the set of "valid" status fields different on the root policy vs the replicated policies? It sounds like ToApply would be valid on a replicated policy but not the root. Similarly it sounds like progressing would be valid on the root but not the replicated set. If so it might help to define the set of valid values for root vs replicated.
Good point, @mprahl--I wrapped `ToApply` in for simplicity, but it would be clearer to use `ToApply`.
@imiller0 The `RolloutStatus` is an iota we're taking from the Placement library: https://github.com/open-cluster-management-io/api/blob/9675ab737d2086dd7094fc962c9617cc71ec7484/cluster/v1alpha1/helpers.go#L21-L38 but having these well defined for each of the root and replicated policies would be good.
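For readers without the link handy, the values in that iota are roughly the following (paraphrased, not copied verbatim; only the values discussed in this thread are listed):

```golang
package rollout

// RolloutStatus approximates the iota-based type in the Placement library's
// cluster/v1alpha1 helpers that this enhancement reuses; comments map each
// value to its Policy meaning as discussed in this thread.
type RolloutStatus int

const (
	ToApply     RolloutStatus = iota // not yet applied to the cluster
	Progressing                      // applied, waiting on a result
	Succeed                          // successful ("Compliant" for Policy)
	Failed                           // failed ("NonCompliant" for Policy)
	TimeOut                          // no success within the RolloutStrategy timeout
)
```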
### Open Questions

1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
Yes to this.
1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
   each replicated policy.)
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
The `Policy` CRD, in my opinion, based on the comment here: https://github.com/open-cluster-management-io/enhancements/pull/99/files#r1309173623
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | ||
time to wait for a successful state), as defined in the placement enhancement, handled for Policies? |
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | |
time to wait for a successful state), as defined in the placement enhancement, handled for Policies? | |
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of | |
time to wait for a state), as defined in the placement enhancement, handled for Policies? |
- "Progressive": one cluster at a time | ||
- "ProgressivePerGroup": one group of clusters at a time | ||
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout | ||
- `MandatoryDecisionGroups`: groups that should be handled before other groups |
Is this solely a priority ordering or does "mandatory" mean "fail if timed out"?
This would solely be priority ordering, but that's an interesting point to have a different timeout setting for the mandatory groups and the subsequent groups--that could be an update for the Placement library in the future.
`MandatoryDecisionGroups` is described in the `DecisionStrategy` enhancement here: https://github.com/open-cluster-management-io/enhancements/tree/main/enhancements/sig-architecture/64-placementStrategy#how-workload-applier-apis-will-benefits-from-using-placementstrategy
Any specialized behaviors outside of those provided by the `Placement` library and, by extension, the `DecisionStrategy`
enhancement, that might require additional code other than that already provided/required by the `Placement` library.
This includes the ability to roll back a rollout. A rollback should be implemented by the user by applying the previous
version of the policy, and GitOps is the recommended method for this.
This may be out of scope here, but for rollback to work correctly in the general case I believe that the policy would need to be "mustonlyhave" compliance type.
@imiller0 That's true for objects on the managed cluster--this is referring specifically to the previous version of the policy itself on the hub rather than rolling back changes on the managed cluster.
- Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct. (See the
  [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go))
  Defaults to `All` if a strategy is not provided or the remediation action is `inform`.
- When the `remediationAction` is set to `enforce`, policies not currently being rolled out will be set to `inform` to
Once set to enforce do they remain in enforce mode until a subsequent change to the root policy or do they revert to inform once compliant?
They remain in `enforce` until the root policy is updated again.
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
   case, it would also need to be added to `PolicySet`.)
4. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
   library to handle such shifts, only handling the clusters that it returns.)
Does this risk clusters being skipped? Or gaps in the decision groups if clusters are unbound/rebound?
library to handle such shifts, only handling the clusters that it returns.)
5. How is "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of
   time to wait for a successful state), as defined in the placement enhancement, handled for Policies?
How does the rolloutStrategy work with policy dependency? Is it possible that if the two policies are bound using separate policybindings and the policies are modified simultaneously the group ordering could be such that rollout of one policy is entirely blocked until completion of all groups of the other policy?
1. On the first pass following a root policy creation or update, the `governance-policy-propagator` controller on the
   hub cluster will replicate all policies to managed cluster namespaces and are created regardless of the rollout
   strategy specified, with the remediation action set to `inform`. This way, all policies are up-to-date and reporting
If a policy is modified/updated by the user while a rollout is ongoing, does the process start over with all policies being set back to inform and rollout beginning with group 1 again?
That's correct
- The `RolloutStatus` would be added to reflect: rollout status on the replicated policy, per-cluster on the root
  policy, and an aggregated status on the root policy. The aggregated status would specify: `Progressing` (this would
  include `ToApply`), `Succeeded`, `Failed`, or (this would include `TimeOut`).
I don't think there is a Failed status because the policy-controller will continue to enforce the policy on the managedCluster as long as it is in a NonCompliant state. So the status could be ToApply, Progressing, Succeeded, or TimeOut.
@serngawy True, but if the enforcement is blocked for some reason (like a missing API on the managed cluster or an invalid manifest), it may be good to reflect that in the root policy since there's no way for the rollout to progress?
disabled: false
remediationAction: inform # <-- Remediation action updated to prepare for rollout
rolloutStrategy:
  type: ProgressivePerGroup
I guess it would be better to include the `progressivePerGroup` fields as well:

```yaml
remediationAction: inform # <-- Remediation action updated to prepare for rollout
rolloutStrategy:
  type: ProgressivePerGroup
  progressivePerGroup:
    mandatoryDecisionGroups:
      - groupName: dev-emea # (Same group name as defined in the placement.)
    timeout: None
```
This looks great. We'll work on adding the new requirements discussed today in a separate PR.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dhaiducek, mprahl
Merged dffc91e into open-cluster-management-io:main
ref: https://issues.redhat.com/browse/ACM-6523