[99][sig-policy] Policy placement strategy

ref: https://issues.redhat.com/browse/ACM-6523
Signed-off-by: Dale Haiducek <[email protected]>

enhancements/sig-policy/99-policy-placement-strategy/README.md (302 additions)

# Policy placement strategy

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
  [website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement`
API, policies can leverage this new logic to have a configurable and systematic way to roll out policy updates to
clusters.

## Motivation

Currently, policies and their subsequent updates are pushed out to clusters en masse based on the placement to which
they have been bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that
controllers can leverage to have segmented, configurable rollouts of resources. This will aid use cases where high
availability is a priority or where a set of test clusters should receive and verify updates before the remaining
clusters are updated.

### Goals

- Make `Placement` the primary API for placement (currently the governance propagator is somewhat
  `PlacementRule`-centric)
- Leverage the `Placement` helper library to retrieve placement decisions
- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not)
- Implement the `RolloutStrategy` struct for policies (see the sketch following this list), including:
  - `RolloutStrategy`:
    - `All`: all clusters at once
    - `Progressive`: one cluster at a time
    - `ProgressivePerGroup`: one group of clusters at a time
  - `Timeout`: maximum amount of time to wait for success before continuing the rollout
  - `MandatoryDecisionGroups`: groups that should be handled before other groups
  - `MaxConcurrency`: maximum concurrency during a progressive rollout
- Add an aggregated rollout status to the root policy status.
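
As a point of reference, below is a minimal sketch of how these fields might be set using the `RolloutStrategy` types
from `open-cluster-management.io/api/cluster/v1alpha1`. The type, constant, and field names reflect the `v1alpha1` API
at the time of writing and should be treated as illustrative:

```golang
// Illustrative only: a Progressive rollout that handles the "canary" decision
// group first, enforces on at most two clusters at a time, and waits up to 10
// minutes for each cluster to reach a successful state before moving on.
strategy := clusterv1alpha1.RolloutStrategy{
	Type: clusterv1alpha1.Progressive,
	Progressive: &clusterv1alpha1.RolloutProgressive{
		MandatoryDecisionGroups: clusterv1alpha1.MandatoryDecisionGroups{
			MandatoryDecisionGroups: []clusterv1alpha1.MandatoryDecisionGroup{
				{GroupName: "canary"},
			},
		},
		MaxConcurrency: intstr.FromInt(2),
		Timeout:        clusterv1alpha1.Timeout{Timeout: "10m"},
	},
}
```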

### Non-Goals

Any specialized behaviors outside of those provided by the `Placement` library and, by extension, the
`DecisionStrategy` enhancement, that might require additional code beyond what the `Placement` library already
provides or requires. This includes the ability to roll back a rollout: a rollback should be implemented by the user
by applying the previous version of the policy, and GitOps is the recommended method for this.
## Proposal

### Background

Users create policies on the hub cluster (referred to as root policies). These policies are replicated to managed
cluster namespaces based on the placement to which they have been bound by the associated `PlacementBinding` object.
The managed clusters monitor their associated namespace on the hub for policy updates.

Additionally, policies have a `remediationAction` specified, which can either be `inform`, meaning the policy
controller monitors objects on the managed cluster and reports status without making changes, or `enforce`, meaning
the policy controller attempts to make changes on the managed cluster based on the policy definition and reports
status according to the success or failure of the update.

### Design

For this rollout enhancement, policy propagation by the `governance-policy-propagator` controller on the hub would
take place in a multiple-pass approach (a rough sketch follows the list):

1. On the first pass following a root policy creation or update, the `governance-policy-propagator` controller on the
   hub cluster replicates all policies to the managed cluster namespaces regardless of the rollout strategy specified,
   with the remediation action set to `inform`. This way, all policies are up to date and reporting a compliance
   status based on the current version of the root policy. The aggregated rollout status on the root policy is set to
   `Progressing`. If the remediation action on the root policy is already `inform`, the rollout status on each cluster
   is set to `Progressing` and no second pass occurs. If the remediation action is `enforce`, the rollout status on
   each cluster is set to `ToApply`.
2. On subsequent passes, if the `remediationAction` is `enforce`, the `governance-policy-propagator` fetches the
   managed clusters returned from the `Placement` library (based on the user-configured rollout strategy), sets the
   rollout status of each returned cluster to `Progressing`, and updates the `remediationAction` to `enforce` as
   defined in the root policy.
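
The following is a rough, hypothetical sketch of that per-pass decision for a single cluster. The helper name and its
inputs are illustrative rather than actual propagator code; `policyv1` stands for the policy API types:

```golang
// Hypothetical sketch: the remediation action and rollout status a replicated
// policy would get for a given cluster on a given pass of the propagator.
func replicatedActionAndStatus(
	rootAction policyv1.RemediationAction, selectedForRollout bool,
) (policyv1.RemediationAction, clusterv1alpha1.RolloutStatus) {
	// An inform policy needs no rollout: replicate it as-is in a single pass.
	if strings.EqualFold(string(rootAction), "inform") {
		return rootAction, clusterv1alpha1.Progressing
	}

	// First pass (or cluster not yet selected by the rollout): replicate as
	// inform so the cluster reports compliance without being changed.
	if !selectedForRollout {
		return "inform", clusterv1alpha1.ToApply
	}

	// Subsequent pass with the cluster selected by the Placement library:
	// enforce as defined in the root policy.
	return rootAction, clusterv1alpha1.Progressing
}
```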

From here, work is picked up by the `governance-policy-framework` controllers on the managed clusters (a sketch of the
status check follows the list):

1. The template sync controller of the framework updates the applicable `ConfigurationPolicy` objects as usual, which
   triggers reevaluation by the `config-policy-controller`.
2. Once the `ConfigurationPolicies` have a generation that matches the `lastEvaluatedGeneration` set on the policy
   status, the status sync controller of the framework sets the rollout status to `Succeeded` after all of the
   `ConfigurationPolicies` have a compliant status, or to `Failed` if any status is noncompliant.
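
Here is a minimal sketch of that status check, assuming a hypothetical helper operating on the `ConfigurationPolicy`
list (the comparison against `lastEvaluatedGeneration` follows the description above; `configv1` is an assumed import
alias for the `ConfigurationPolicy` API):

```golang
// Hypothetical sketch: derive the rollout status once every ConfigurationPolicy
// has been evaluated at its current generation.
func rolloutStatusFromTemplates(configPolicies []configv1.ConfigurationPolicy) clusterv1alpha1.RolloutStatus {
	for _, cp := range configPolicies {
		// Not yet evaluated at the latest generation: still progressing.
		if cp.Status.LastEvaluatedGeneration != cp.Generation {
			return clusterv1alpha1.Progressing
		}
	}

	for _, cp := range configPolicies {
		// Any noncompliant template fails the rollout for this cluster.
		if string(cp.Status.ComplianceState) != "Compliant" {
			return clusterv1alpha1.Failed
		}
	}

	return clusterv1alpha1.Succeeded
}
```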

### User Stories

#### Story 1

As a system administrator, I want to know the status of a rollout for a policy and which clusters have been updated.

- **Summary**

  - Add a `RolloutStatus` to the `Policy` status in the CRD.

- **Details**

  - The `RolloutStatus` would be added to reflect: the rollout status on the replicated policy, the per-cluster status
    on the root policy, and an aggregated status on the root policy. The aggregated status would be one of:
    `Progressing` (which would include `ToApply`), `Succeeded`, or `Failed` (which would include `TimeOut`).

- **Snippet**

```golang
// CompliancePerClusterStatus defines compliance per cluster status
type CompliancePerClusterStatus struct {
	ComplianceState  ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus    clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	ClusterName      string                        `json:"clustername,omitempty"`
	ClusterNamespace string                        `json:"clusternamespace,omitempty"`
}

// PolicyStatus defines the observed state of Policy
type PolicyStatus struct {
	Placement []*Placement                  `json:"placement,omitempty"`
	Status    []*CompliancePerClusterStatus `json:"status,omitempty"`

	// +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant
	ComplianceState ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus   clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	Details         []*DetailsPerTemplate         `json:"details,omitempty"`
}
```

#### Story 2

As a system administrator, I want to use a placement `DecisionStrategy` with policies to control the order of the
updates.

- **Summary**

  - Use the `Placement` library to gather the list of clusters and iterate over them.

- **Details**

  - Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct
    instead of the `PlacementRule` struct.
  - Using the `Placement` library and implementing a `ClusterRolloutStatusFunc`, iterate over all clusters as usual
    using the `All` rollout strategy.

- **Snippet (untested)**

```golang
// common.go
type placementDecisionGetter struct {
	c client.Client
}

func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) {
	pdList := &clusterv1beta1.PlacementDecisionList{}
	lopts := &client.ListOptions{
		LabelSelector: selector,
		Namespace:     namespace,
	}

	err := pd.c.List(context.TODO(), pdList, lopts)
	if err != nil {
		return nil, err
	}

	pdPtrList := []*clusterv1beta1.PlacementDecision{}
	for i := range pdList.Items {
		// Index into the slice rather than taking the address of the loop
		// variable so that each pointer references a distinct item.
		pdPtrList = append(pdPtrList, &pdList.Items[i])
	}

	return pdPtrList, nil
}

func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) {
	pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil)

	_, _, err := pdTracker.Get()
	if err != nil {
		log.Error(err, "Error retrieving PlacementDecisions from tracker")

		return nil, err
	}

	return clusterv1alpha1.NewRolloutHandler(pdTracker)
}

var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus {
	// Fetch and return the rollout status and the last transition time from the status in the replicated policy
	return clusterv1alpha1.ClusterRolloutStatus{
		Status:             "<rollout-status>",
		LastTransitionTime: "<transition-time>",
	}
}
```

```golang
rolloutHandler, err := common.GetRolloutHandler(client, placement)
if err != nil {
	// Handle the error
}

strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus)

// ... Use rolloutResult for propagation
```

#### Story 3

As a system administrator, I want policies to be rolled out only when the policy on previous clusters shows as
"Compliant", and I want to control the rollout's concurrency and priority.

- **Summary**

  - Only continue policy deployment once the previously deployed clusters show as "Compliant".

- **Details**

  - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct (see
    [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)).
    It defaults to `All` if a strategy is not provided or if the remediation action is `inform`.
  - When the `remediationAction` is set to `enforce`, policies not currently being rolled out will be set to `inform`
    to continue to report status without making changes on the managed cluster while waiting for the new version of
    the policy to be enforced.
  - Update the `ClusterRolloutStatusFunc` (the `GetClusterRolloutStatus` function from Story 2) to set up logic to
    work with the `Placement` library, providing the information it needs to continue only once the previous group is
    compliant (or after a configurable timeout):

    | Status        | Description                                                                                                                                            |
    | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
    | `ToApply`     | A policy to be enforced is created/updated in the first pass of the propagator on the hub                                                             |
    | `Progressing` | Policy was defined as `inform`, or was selected by the `Placement` library and updated to `enforce`                                                   |
    | `Succeeded`   | Policy was defined as `inform` and has status, or status is compliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster |
    | `Failed`      | Policy is noncompliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster                                                |
    | `TimeOut`     | Time has passed beyond the timeout specified in the `RolloutStrategy` without a returned status                                                       |
    | `Skip`        | (unused)                                                                                                                                               |

- **Snippet**

```golang
type PlacementBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +kubebuilder:validation:Optional
	BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
	// +kubebuilder:validation:Optional
	RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
	// This field provides the ability to select a subset of bound clusters
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=restricted
	SubFilter SubFilter `json:"subFilter,omitempty"`
	// +kubebuilder:validation:Required
	PlacementRef PlacementSubject `json:"placementRef"`
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	Subjects []Subject              `json:"subjects"`
	Status   PlacementBindingStatus `json:"status,omitempty"`
}
```
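
Tying the status table to the code, below is a hedged sketch of how the `ClusterRolloutStatusFunc` from Story 2 might
derive a cluster's rollout status from its replicated policy. The helper name and the `waitingFirstPass`/`timedOut`
inputs stand in for propagator bookkeeping not defined in this enhancement, and the `lastEvaluatedGeneration` matching
from the table is elided for brevity:

```golang
// Hypothetical sketch mapping a replicated policy to a rollout status per the
// table above (generation matching elided).
func rolloutStatusForCluster(pol *policyv1.Policy, waitingFirstPass bool, timedOut bool) clusterv1alpha1.RolloutStatus {
	switch {
	case waitingFirstPass:
		// Created/updated in the first pass and pending enforcement.
		return clusterv1alpha1.ToApply
	case timedOut:
		// No successful status within the RolloutStrategy timeout.
		return clusterv1alpha1.TimeOut
	case pol.Status.ComplianceState == policyv1.Compliant:
		return clusterv1alpha1.Succeeded
	case pol.Status.ComplianceState == policyv1.NonCompliant:
		return clusterv1alpha1.Failed
	default:
		return clusterv1alpha1.Progressing
	}
}
```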

### Implementation Details/Notes/Constraints

For the `Placement` library, this requires importing at least this version of the API package (the `Placement` library
is in the `v1alpha1` version):

```
open-cluster-management.io/api v0.11.1-0.20230828015110-b39eb9026c6e
```

For testing, the `governance-policy-propagator` repository doesn't currently account for multiple managed clusters. As
part of this enhancement, the test flows would need to be enhanced (and/or a separate workflow created) to deploy
multiple managed clusters.

The `PlacementDecisionGetter` returns an array of pointers (`[]*clusterv1beta1.PlacementDecision`) because it is
intended to retrieve from a cache, so the implementation could consider setting up a `PlacementDecision` cache instead
of using the Kubernetes client directly.
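
As a minimal note on that option: with a controller-runtime manager, the default client already serves structured
objects from the manager's cache, so the getter from Story 2 could simply be constructed with it; a dedicated
`PlacementDecision` informer would be a further refinement (`mgr` here is an assumed controller-runtime manager):

```golang
// Assumes a controller-runtime manager "mgr": mgr.GetClient() returns a
// cache-backed client for structured objects, so List calls made by the
// getter are served from the cache rather than the API server.
getter := placementDecisionGetter{c: mgr.GetClient()}
```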

### Risks and Mitigation

- The `Placement` library is relatively new and untested outside of its repo, and this implementation leans heavily on
  its logic. While it works in theory, there could be some tweaks/adjustments as development proceeds, lengthening the
  time for implementation. The phased approach intends to address this by making partial implementation feasible.

## Design Details

### Open Questions

1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
   each replicated policy.)
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
   case, it would also need to be added to `PolicySet`.)
4. How should a policy handle placement decision groups shifting? (This implementation leans on the `Placement`
   library to handle such shifts, only handling the clusters that it returns.)
5. How are "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of
   time to wait for a successful state), as defined in the placement enhancement, handled for policies?

### Test Plan

- Unit tests in the repo
- E2E tests in the repo (additional managed clusters would need to be added for this, potentially in a separate
  workflow, though ideally alongside the existing tests)
- Policy integration tests in the framework (additional managed clusters would need to be added for this, potentially
  in a separate workflow, though ideally alongside the existing tests)

## Drawbacks / Alternatives

The maturity of the `Placement` library could be brought into question. This enhancement could pause after migrating
to the `Placement` library in Story 2 and only support the `All` strategy until there is a robust test environment
that can exercise the various configurations before moving forward.

If a user only updates a single policy-template out of many in a `Policy`, the rollout would signal a status update
for all of the policy-templates. This is a drawback that this enhancement does not seek to address.

enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml (10 additions)

title: policy-placement-strategy
authors:
  - "@dhaiducek"
reviewers:
  - TBD
approvers:
  - TBD
creation-date: 2023-08-03
last-updated: 2023-08-03
status: implementable