[99][sig-policy] Policy placement strategy
dhaiducek committed Aug 29, 2023
1 parent 11e15da commit e40482c
Showing 2 changed files with 312 additions and 0 deletions.
302 changes: 302 additions & 0 deletions enhancements/sig-policy/99-policy-placement-strategy/README.md
@@ -0,0 +1,302 @@
# Policy placement strategy

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
[website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement`
API, policies can leverage this new logic to roll out policy updates to clusters in a configurable, systematic way.

## Motivation

Currently, policies and any subsequent updates are pushed out to clusters en masse based on the placement to which they
have been bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that controllers can
leverage to roll out resources in a segmented, configurable way. This aids use cases where high availability is a
priority or where a set of test clusters should receive and verify the updates before the remaining clusters are
updated.

### Goals

- Make `Placement` the primary API for placement (currently the governance propagator is somewhat
`PlacementRule`-centric)
- Leverage the `Placement` helper library to retrieve placement decisions
- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not)
- Implement the `RolloutStrategy` struct for policies (sketched after this list), including:
- `RolloutStrategy`
- "All": all clusters at once
- "Progressive": one cluster at a time
- "ProgressivePerGroup": one group of clusters at a time
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout
- `MandatoryDecisionGroups`: groups that should be handled before other groups
- `MaxConcurrency`: Concurrency during a progressive rollout
- Add an aggregated rollout status for the root policy status.
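
For orientation, the upstream `v1alpha1` `RolloutStrategy` struct (linked in Story 3 below) has roughly the following
abbreviated shape. This is a paraphrase for context only; `types_rolloutstrategy.go` in `open-cluster-management.io/api`
is authoritative:

```golang
// Abbreviated sketch of the upstream v1alpha1 RolloutStrategy; see
// cluster/v1alpha1/types_rolloutstrategy.go in open-cluster-management.io/api
// for the authoritative definition.
type RolloutStrategy struct {
	// Type is one of "All", "Progressive", or "ProgressivePerGroup".
	Type string `json:"type,omitempty"`

	// Each type has an options struct carrying the fields listed above: a
	// Timeout and, for the progressive types, MandatoryDecisionGroups and
	// (for "Progressive" only) MaxConcurrency.
	All                 *RolloutAll                 `json:"all,omitempty"`
	Progressive         *RolloutProgressive         `json:"progressive,omitempty"`
	ProgressivePerGroup *RolloutProgressivePerGroup `json:"progressivePerGroup,omitempty"`
}
```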

### Non-Goals

Out of scope are any specialized behaviors beyond those provided by the `Placement` library (and, by extension, the
`DecisionStrategy` enhancement) that would require code other than what the `Placement` library already provides or
requires. This includes the ability to roll back a rollout: a rollback should be performed by the user by applying the
previous version of the policy, and GitOps is the recommended method for this.

## Proposal

### Background

Users create policies on the hub cluster; these are referred to as root policies. Root policies are replicated to
managed cluster namespaces based on the placement to which they have been bound by the associated `PlacementBinding`
object. Each managed cluster monitors its namespace on the hub for policy updates.

Additionally, policies have a `remediationAction` specified, which can either be `inform`, meaning the policy controller
monitors objects on the managed cluster reporting status without making changes, or `enforce`, meaning the policy
controller attempts to make changes on the managed cluster based on the policy definition and reports status accordingly
based on the success or failure of the update.

### Design

For this rollout enhancement, policy propagation by the `governance-policy-propagator` controller on the hub would take
place in a multiple-pass approach:

1. On the first pass following a root policy creation or update, the `governance-policy-propagator` controller on the
   hub cluster replicates all policies to the managed cluster namespaces, regardless of the rollout strategy specified,
   with the remediation action set to `inform`. This way, all policies are up-to-date and report a compliance status
   based on the current version of the root policy. The aggregated rollout status on the root policy is set to
   `Progressing`. If the remediation action on the root policy is already `inform`, the rollout status on each cluster
   is set to `Progressing` and no second pass occurs. If the remediation action is `enforce`, the rollout status on
   each cluster is set to `ToApply`. (See the status sketch after this list.)
2. On subsequent passes, if the `remediationAction` is `enforce`, the `governance-policy-propagator` fetches the
   managed clusters returned by the `Placement` library (based on the user-configured rollout strategy), sets the
   rollout status of each returned cluster to `Progressing`, and updates the `remediationAction` to `enforce` as
   defined in the root policy.
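
A minimal sketch of the first-pass status assignment described above (`initialRolloutStatus` is a hypothetical helper
name, not existing propagator code):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// initialRolloutStatus sketches the first-pass assignment: "inform" policies
// immediately report Progressing (and no second pass occurs), while "enforce"
// policies wait in ToApply until the Placement library selects them.
func initialRolloutStatus(remediationAction string) clusterv1alpha1.RolloutStatus {
	if remediationAction == "enforce" {
		return clusterv1alpha1.ToApply
	}

	return clusterv1alpha1.Progressing
}
```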

From here, work is picked up by the `governance-policy-framework` controllers on the managed clusters:

1. The template sync controller of the framework updates the applicable `ConfigurationPolicy` objects as usual, which
   triggers reevaluation by the `config-policy-controller`.
2. Once the `ConfigurationPolicies` have a generation that matches the `lastEvaluatedGeneration` set in the policy
   status, the status sync controller of the framework sets the rollout status to `Succeeded` if all of the
   `ConfigurationPolicies` have a compliant status, or to `Failed` if any status is noncompliant. (A sketch of this
   decision follows this list.)
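
A sketch of that decision, assuming a simplified view of the per-template status (both `templateStatus` and
`frameworkRolloutStatus` are hypothetical names for illustration):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// templateStatus is a simplified, hypothetical view of a ConfigurationPolicy
// and its reported status.
type templateStatus struct {
	Compliant               bool
	Generation              int64
	LastEvaluatedGeneration int64
}

// frameworkRolloutStatus sketches the status sync decision: wait until every
// template has been evaluated at its current generation, then report
// Succeeded only if every template is compliant.
func frameworkRolloutStatus(templates []templateStatus) clusterv1alpha1.RolloutStatus {
	for _, template := range templates {
		if template.LastEvaluatedGeneration != template.Generation {
			return clusterv1alpha1.Progressing
		}
	}

	for _, template := range templates {
		if !template.Compliant {
			return clusterv1alpha1.Failed
		}
	}

	return clusterv1alpha1.Succeeded
}
```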

### User Stories

#### Story 1

As a system administrator, I want to know the status of a rollout for a policy and what clusters have been updated.

- **Summary**

- Add a `RolloutStatus` to the `Policy` status in the CRD.

- **Details**

  - The `RolloutStatus` would be added to reflect the rollout status on the replicated policy, a per-cluster status on
    the root policy, and an aggregated status on the root policy. The aggregated status would be one of `Progressing`
    (which would include `ToApply`), `Succeeded`, or `Failed` (which would include `TimeOut`). (See the aggregation
    sketch after the snippet below.)

- **Snippet**

```golang
// CompliancePerClusterStatus defines compliance per cluster status
type CompliancePerClusterStatus struct {
	ComplianceState  ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus    clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	ClusterName      string                        `json:"clustername,omitempty"`
	ClusterNamespace string                        `json:"clusternamespace,omitempty"`
}

// PolicyStatus defines the observed state of Policy
type PolicyStatus struct {
	Placement []*Placement                  `json:"placement,omitempty"`
	Status    []*CompliancePerClusterStatus `json:"status,omitempty"`

	// +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant
	ComplianceState ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus   clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	Details         []*DetailsPerTemplate         `json:"details,omitempty"`
}
```
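
A minimal sketch of how the aggregated root-policy rollout status could be derived from the per-cluster statuses using
the grouping above (`aggregateRolloutStatus` is a hypothetical helper):

```golang
import clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"

// aggregateRolloutStatus sketches the aggregation: Failed or TimeOut on any
// cluster fails the rollout, any Progressing or ToApply keeps it Progressing,
// and only all clusters having Succeeded yields Succeeded.
func aggregateRolloutStatus(perCluster []clusterv1alpha1.RolloutStatus) clusterv1alpha1.RolloutStatus {
	aggregated := clusterv1alpha1.Succeeded

	for _, status := range perCluster {
		switch status {
		case clusterv1alpha1.Failed, clusterv1alpha1.TimeOut:
			return clusterv1alpha1.Failed
		case clusterv1alpha1.Progressing, clusterv1alpha1.ToApply:
			aggregated = clusterv1alpha1.Progressing
		}
	}

	return aggregated
}
```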

#### Story 2

As a system administrator, I want to use a placement `DecisionStrategy` with policies to control the order of the
updates.

- **Summary**

- Use the `Placement` library to gather the list of clusters and iterate over them.

- **Details**

- Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a `ClusterRolloutStatusFunc`, iterate over all clusters as usual
using the `All` rollout strategy.

- **Snippet (untested)**

```golang
// common.go
type placementDecisionGetter struct {
	c client.Client
}

func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) {
	pdList := &clusterv1beta1.PlacementDecisionList{}
	lopts := &client.ListOptions{
		LabelSelector: selector,
		Namespace:     namespace,
	}

	err := pd.c.List(context.TODO(), pdList, lopts)
	if err != nil {
		return nil, err
	}

	pdPtrList := []*clusterv1beta1.PlacementDecision{}
	for i := range pdList.Items {
		// Take the address of the slice element (not the loop variable) so
		// each pointer references a distinct PlacementDecision.
		pdPtrList = append(pdPtrList, &pdList.Items[i])
	}

	return pdPtrList, nil
}

func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) {
	pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil)

	_, _, err := pdTracker.Get()
	if err != nil {
		log.Error(err, "Error retrieving PlacementDecisions from tracker")
	}

	return clusterv1alpha1.NewRolloutHandler(pdTracker)
}

var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus {
	// Fetch and return the rollout status and the last transition time from the status in the replicated policy

	return clusterv1alpha1.ClusterRolloutStatus{
		Status:             "<rollout-status>",
		LastTransitionTime: "<transition-time>",
	}
}
```

```golang
rolloutHandler, err := common.GetRolloutHandler(client, placement)
if err != nil {
	// Handle the error
}

strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus)

// ... Use rolloutResult for propagation
```

#### Story 3

As a system administrator, I want policies to be rolled out only when the policy on previously updated clusters shows
as "Compliant", and I want to control the rollout's concurrency and priority.

- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant".

- **Details**

  - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct (see
    [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)).
    It defaults to `All` if a strategy is not provided or if the remediation action is `inform`.
  - When the `remediationAction` is set to `enforce`, policies not currently being rolled out are set to `inform` so
    that they continue to report status without making changes on the managed cluster while waiting for the new version
    of the policy to be enforced. (A sketch of this gating follows the snippet below.)
  - Update the `ClusterRolloutStatusFunc` (the `GetClusterRolloutStatus` function from Story 2) to provide the
    information the `Placement` library needs to continue only once the previous group is compliant (or after a
    configurable timeout):

| Status | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ToApply` | A policy to be enforced is created/updated in the first pass of the propagator on the hub |
| `Progressing` | Policy was defined as `inform`, or selected by the `Placement` library and updated to `enforce` |
| `Succeeded` | Policy was defined as `inform` and has status, or status is compliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster |
| `Failed` | Policy is non-compliant and the `lastEvaluatedGeneration` matches the generation on the managed cluster |
| `TimeOut` | Time has passed beyond the timeout specified in the `RolloutStrategy` without a returned status |
| `Skip` | (unused) |

- **Snippet**

```golang
type PlacementBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +kubebuilder:validation:Optional
	BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
	// +kubebuilder:validation:Optional
	RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
	// This field provides the ability to select a subset of bound clusters
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=restricted
	SubFilter SubFilter `json:"subFilter,omitempty"`
	// +kubebuilder:validation:Required
	PlacementRef PlacementSubject `json:"placementRef"`
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	Subjects []Subject              `json:"subjects"`
	Status   PlacementBindingStatus `json:"status,omitempty"`
}
```
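
A minimal sketch of the `inform` gating described in the details above (`replicatedRemediation` is a hypothetical
helper, not existing propagator code):

```golang
// replicatedRemediation sketches the gating: while a cluster is not yet
// selected by the rollout, an enforce root policy is replicated as inform so
// the cluster keeps reporting status without changes being made.
func replicatedRemediation(rootAction string, selectedForRollout bool) string {
	if rootAction == "enforce" && !selectedForRollout {
		return "inform"
	}

	return rootAction
}
```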

### Implementation Details/Notes/Constraints

For the `Placement` library, this requires at least the following version of the API module (the `Placement` rollout
helpers are still at `v1alpha1`):

```
open-cluster-management.io/api v0.11.1-0.20230828015110-b39eb9026c6e
```

For testing, the `governance-policy-propagator` repo doesn't currently account for multiple managed clusters. As part
of this enhancement, the test flows would need to be enhanced (and/or a separate workflow created) to deploy multiple
managed clusters.

The `PlacementDecisionGetter` returns an array of pointers (`[]*clusterv1beta1.PlacementDecision`) because it was
intended to retrieve from a cache, so the implementation could consider setting up a `PlacementDecision` cache instead
of using the Kubernetes client directly.

### Risks and Mitigation

- The `Placement` library is relatively new and untested outside of its repo, and this implementation leans heavily on
  its logic. While it works in theory, tweaks and adjustments could be needed as development proceeds, lengthening the
  implementation time. The phased approach intends to address this by making partial implementation feasible.

## Design Details

### Open Questions

1. Should the rollout handler be instantiated with a dynamic watcher/cache instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (The rollout status would also be reflected in
each replicated policy.)
3. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`? (If that were the
case, it would also need to be added to `PolicySet`.)
4. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
library to handle such shifts, only handling the clusters that it returns.)
5. How are "soak time" (the minimum time for a policy to reach a successful state) and timeout (the maximum amount of
   time to wait for a successful state), as defined in the placement enhancement, handled for policies?

### Test Plan

- Unit tests in the repo
- E2E tests in the repo (would need to add additional managed clusters for this, potentially in a separate workflow,
though ideally alongside the existing tests)
- Policy integration tests in the framework (would need to add additional managed clusters for this, potentially in a
separate workflow, though ideally alongside the existing tests)

## Drawbacks / Alternatives

The maturity of the `Placement` library could be brought into question. This enhancement could pause after the
migration to the `Placement` library in Story 2, supporting only the `All` strategy until there is a robust test
environment that can exercise the various configurations before moving forward.

If a user only updates a single policy-template out of many in a `Policy`, the rollout would signal a status update for
all of the policy-templates. This is a drawback that this enhancement does not seek to address.
10 changes: 10 additions & 0 deletions enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml
@@ -0,0 +1,10 @@
title: policy-placement-strategy
authors:
- "@dhaiducek"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2023-08-03
last-updated: 2023-08-03
status: implementable
