Commit 6b3e4e7: [99][sig-policy] Policy placement strategy

dhaiducek committed Aug 14, 2023 (1 parent: 11e15da)
Showing 2 changed files with 277 additions and 0 deletions.

enhancements/sig-policy/99-policy-placement-strategy/README.md (267 additions)
# Policy placement strategy

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in
[website](https://github.com/open-cluster-management-io/open-cluster-management-io.github.io/)

## Summary

Following from the [`DecisionStrategy`](../../sig-architecture/64-placementStrategy/README.md) field in the `Placement`
API, policies can leverage this new logic to have a configurable and systematic way to roll out policy updates to
clusters.

## Motivation

Currently, policies and their subsequent updates are pushed out to clusters en masse based on the placement to which
they are bound. The new `DecisionStrategy` field in the `Placement` API creates cluster groupings that controllers can
leverage for segmented, configurable rollouts of resources. This will aid use cases where high availability is a
priority or where a set of test clusters should receive and verify updates before the remaining clusters are updated.

### Goals

- Make `Placement` the primary API for placement (currently the governance propagator is somewhat
`PlacementRule`-centric)
- Leverage the `Placement` helper library to retrieve placement decisions
- Reflect rollout status per cluster in the root policy for discoverability (whether a cluster is up-to-date or not)
- (stretch) Implement the `RolloutStrategy` struct for policies, including:
- `RolloutStrategy`
- "All": all clusters at once
- "Progressive": one cluster at a time
- "ProgressivePerGroup": one group of clusters at a time
- `Timeout`: Maximum amount of time to wait for success before continuing the rollout
- `MandatoryDecisionGroups`: groups that should be handled before other groups
- `MaxConcurrency`: Concurrency during a progressive rollout
- (stretch) Add an aggregated rollout status for the root policy status.

### Non-Goals

Any specialized behavior beyond what the `Placement` library (and, by extension, the `DecisionStrategy` enhancement)
provides is out of scope; this enhancement should not require code beyond what the `Placement` library already
provides or requires.

## Proposal

### User Stories

#### Story 1 / Phase 1

As a system administrator, I want to know the status of a rollout for a policy and what clusters have been updated.

- **Summary**

- Add an annotation to the replicated Policy indicating whether a policy is up-to-date.

- **Details**

  - Update the `governance-policy-propagator` controller to add an `open-cluster-management.io/root-policy-generation`
    annotation to the replicated policy for each cluster.
  - Set the annotation to the root policy's generation each time the policy is replicated.
  - (Optional--requirement for phase 3) Update the `governance-policy-propagator` controller to add an
    `open-cluster-management.io/root-policy-timestamp` annotation to the replicated policy for each cluster indicating
    when the replicated policy was last applied.
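
A minimal sketch of the generation bookkeeping from the bullets above (the annotation key comes from this proposal;
the helper names are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

const rootPolicyGenerationAnnotation = "open-cluster-management.io/root-policy-generation"

// setRootGenerationAnnotation records the root policy's generation on a
// replicated policy's annotations at replication time.
func setRootGenerationAnnotation(annotations map[string]string, rootGeneration int64) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}

	annotations[rootPolicyGenerationAnnotation] = strconv.FormatInt(rootGeneration, 10)

	return annotations
}

// isUpToDate reports whether a replicated policy's recorded generation matches
// the root policy's current generation.
func isUpToDate(annotations map[string]string, rootGeneration int64) bool {
	return annotations[rootPolicyGenerationAnnotation] == strconv.FormatInt(rootGeneration, 10)
}

func main() {
	annotations := setRootGenerationAnnotation(nil, 3)

	fmt.Println(isUpToDate(annotations, 3)) // true: the replica matches the root policy
	fmt.Println(isUpToDate(annotations, 4)) // false: the root policy has moved on
}
```

In the real controller, the annotation would live on the replicated `Policy` object's `metadata.annotations`, and
this comparison would feed the rollout status in later phases.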

#### Story 2 / Phase 2

As a system administrator, I want to use a placement `DecisionStrategy` with policies.

- **Summary**

  - Use the `Placement` library to gather the list of clusters and iterate over them as is done today.

- **Details**

- Make `Placement` the primary placement resource and parse `PlacementRule` decisions into the `Placement` struct
instead of the `PlacementRule` struct.
- Using the `Placement` library and implementing a basic `ClusterRolloutStatusFunc`, iterate over all clusters as
usual using the `All` rollout strategy. The `RolloutStatus` will initially be either `ToApply` or `Succeed`.
  - (Optional cleanup) Remove the current `Placement.Decisions` from the `status` of the `Policy` CRD. It's unclear
    why `Policy` would store something that's discoverable elsewhere, and it doesn't appear to be in use currently.
- (Optional) Remove the `ClusterNamespace` from the `status` of the `Policy` CRD (`Placement` doesn't have this, so it
ought to be safe to remove it at this point if it's feasible and not overly complex to do so).

- **Snippet (untested)**

```golang
// common.go
import (
	"context"

	"k8s.io/apimachinery/pkg/labels"
	clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"
	clusterv1beta1 "open-cluster-management.io/api/cluster/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type placementDecisionGetter struct {
	c client.Client
}

func (pd placementDecisionGetter) List(selector labels.Selector, namespace string) ([]*clusterv1beta1.PlacementDecision, error) {
	pdList := &clusterv1beta1.PlacementDecisionList{}
	lopts := &client.ListOptions{
		LabelSelector: selector,
		Namespace:     namespace,
	}

	err := pd.c.List(context.TODO(), pdList, lopts)

	pdPtrList := []*clusterv1beta1.PlacementDecision{}
	for i := range pdList.Items {
		// Take the address of the slice element rather than the loop variable
		// so that each pointer refers to a distinct PlacementDecision.
		pdPtrList = append(pdPtrList, &pdList.Items[i])
	}

	return pdPtrList, err
}

func GetRolloutHandler(c client.Client, placement *clusterv1beta1.Placement) (*clusterv1alpha1.RolloutHandler, error) {
	pdTracker := clusterv1beta1.NewPlacementDecisionClustersTracker(placement, placementDecisionGetter{c}, nil, nil)

	return clusterv1alpha1.NewRolloutHandler(pdTracker)
}

var GetClusterRolloutStatus clusterv1alpha1.ClusterRolloutStatusFunc = func(clusterName string) clusterv1alpha1.ClusterRolloutStatus {
	// (Phase 3 rollout logic goes here)

	return clusterv1alpha1.ClusterRolloutStatus{
		// Initially determined by comparing the generation of the policy with the
		// particular cluster's generation, resolving to either ToApply or Succeed
		Status: clusterv1alpha1.ToApply,
		// Retrieved from the last transition time in the cluster's status
		LastTransitionTime: nil,
	}
}
```

```golang
rolloutHandler, err := common.GetRolloutHandler(client, placement)
// (handle err)

strategy, rolloutResult, err := rolloutHandler.GetRolloutCluster(clusterv1alpha1.All, common.GetClusterRolloutStatus)

// ... Use rolloutResult for propagation
```

#### Story 3 / Phase 3 (Stretch goal)

As a system administrator, I want policies to be rolled out only once the policy on previously updated clusters shows
as "Compliant", and I want to control the rollout's concurrency and priority.

- **Summary**

- Only continue policy deployment once the previously deployed clusters show as "Compliant" after a configurable
timeout.

- **Details**

  - Update the `PlacementBinding` CRD to contain the `RolloutStrategy` struct (see
    [`v1alpha1/RolloutStrategy`](https://github.com/open-cluster-management-io/api/blob/main/cluster/v1alpha1/types_rolloutstrategy.go)),
    defaulting to `All` if a strategy is not provided.
  - Update the `governance-policy-propagator` controller to add an `open-cluster-management.io/root-policy-timestamp`
    annotation to the replicated policy for each cluster indicating when the replicated policy was last applied.
  - Store a `lastEvaluated` timestamp in the status of the replicated policies so that the `RolloutStatus` can be
    determined inside the `ClusterRolloutStatusFunc`. This should only be updated if the status changes (current
    behavior), or if the generation changes and/or the `lastEvaluated` timestamp is less than the root policy
    timestamp.
  - Update the `ClusterRolloutStatusFunc` (the `GetClusterRolloutStatus` function from Phase 2) with logic that works
    with the `Placement` library to provide the information it needs to continue only once the previous group is
    compliant (or after a configurable timeout):

| Status | Description |
| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ToApply` | No generation set |
| `Progressing` | Current generation and `lastEvaluated` has not been updated |
| `Succeed` | Compliant and current generation |
| `Failed` | NonCompliant and current generation |
| `TimeOut`     | Current generation, `lastEvaluated` is less than `root-policy-timestamp`, <br/>and time has passed beyond timeout specified in the `RolloutStrategy`  |
| `Skip` | (unused) |

- (Stretch) Update the `Policy` CRD to contain a `RolloutStatus`. This will be used to provide an aggregated rollout
status for the root policy of `Progressing` (this would include `ToApply`), `Succeed`, or `Failed` (this would
include `TimeOut`).

- **Snippet**

```golang
type PlacementBinding struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +kubebuilder:validation:Optional
	BindingOverrides BindingOverrides `json:"bindingOverrides,omitempty"`
	// +kubebuilder:validation:Optional
	RolloutStrategy clusterv1alpha1.RolloutStrategy `json:"rolloutStrategy,omitempty"` // <-- New field
	// This field provides the ability to select a subset of bound clusters
	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:Enum=restricted
	SubFilter SubFilter `json:"subFilter,omitempty"`
	// +kubebuilder:validation:Required
	PlacementRef PlacementSubject `json:"placementRef"`
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	Subjects []Subject              `json:"subjects"`
	Status   PlacementBindingStatus `json:"status,omitempty"`
}
```

```golang
// ComplianceHistory defines compliance details history
type ComplianceHistory struct {
	LastEvaluated metav1.Time `json:"lastEvaluated,omitempty" protobuf:"bytes,8,opt,name=lastEvaluated"` // <-- New field (with a new protobuf field number)
	LastTimestamp metav1.Time `json:"lastTimestamp,omitempty" protobuf:"bytes,7,opt,name=lastTimestamp"`
	Message       string      `json:"message,omitempty" protobuf:"bytes,4,opt,name=message"`
	EventName     string      `json:"eventName,omitempty"`
}

// PolicyStatus defines the observed state of Policy
type PolicyStatus struct {
	Placement []*Placement                  `json:"placement,omitempty"`
	Status    []*CompliancePerClusterStatus `json:"status,omitempty"`

	// +kubebuilder:validation:Enum=Compliant;Pending;NonCompliant
	ComplianceState ComplianceState               `json:"compliant,omitempty"`
	RolloutStatus   clusterv1alpha1.RolloutStatus `json:"rolloutStatus,omitempty"` // <-- New field
	Details         []*DetailsPerTemplate         `json:"details,omitempty"`
}
```

### Implementation Details/Notes/Constraints

For the `Placement` library, this requires importing at least the following version of the API package (the
`Placement` library is under the `v1alpha1` API version):

```
open-cluster-management.io/api v0.11.1-0.20230809113658-2b2399b5f6e8
```

For testing, the `governance-policy-propagator` doesn't currently account for multiple managed clusters. As part of
this enhancement, the test flows would need to be enhanced (and/or a separate workflow created) to deploy multiple
managed clusters.

### Risks and Mitigation

- The optional proposal to remove items from the `Policy` CRD is intended to prevent bloat and slim down the CRD, but
  updating the CRD could be a breaking change for users, so further consideration may be needed; these updates aren't
  required.
- The `Placement` library is relatively new and untested outside of its repo, and this implementation leans heavily on
  its logic. While it works in theory, tweaks and adjustments may be needed as development proceeds, lengthening the
  implementation time. The phased approach addresses this by making partial implementation feasible.
- Having a `LastEvaluated` field in the status will increase load on the hub cluster's Kubernetes API server, which
  previous efforts sought to reduce by not sending updates when the status hadn't changed. This could be remedied by
  only sending an updated timestamp if the generation was updated, the `LastEvaluated` timestamp is less than the
  root policy timestamp, or (the current behavior) the status indeed changed.

## Design Details

### Open Questions

1. Should the rollout handler be instantiated with the dynamic watcher instead of the controller client?
2. Should `RolloutStatus` be reflected per-cluster in the root policy? (Removing `clusterNamespace` could essentially
make the CRD the same size as it currently is, remediating concerns around bloat, but the rollout status is also
reflected in each replicated policy.)
3. Is there opposition to removing `Placement.Decisions` from the `status`? (This could alternatively be handled outside
of this enhancement.)
4. Should the `RolloutStrategy` be implemented in the `Policy` CRD instead of `PlacementBinding`?
5. How should a policy handle when placement decision groups shift? (This implementation leans on the `Placement`
library to handle such shifts, only handling the clusters that it returns.)
6. (An aside) What's with the pointer in `[]*clusterv1beta1.PlacementDecision`?

### Test Plan

- Unit tests in the repo
- E2E tests in the repo (would need to add additional managed clusters for this, potentially in a separate workflow,
though ideally alongside the existing tests)
- Policy integration tests in the framework (would need to add additional managed clusters for this, potentially in a
separate workflow, though ideally alongside the existing tests)

## Drawbacks / Alternatives

The maturity of the `Placement` library could be brought into question. This enhancement could hold at Phase 2 until we
ensure that we have a robust test environment that can test the various configurations before moving forward.
enhancements/sig-policy/99-policy-placement-strategy/metadata.yaml (10 additions)
title: policy-placement-strategy
authors:
- "@dhaiducek"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2023-08-03
last-updated: 2023-08-03
status: implementable
