diff --git a/keps/prod-readiness/sig-scheduling/5027.yaml b/keps/prod-readiness/sig-scheduling/5027.yaml new file mode 100644 index 00000000000..47cdd47b8f3 --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5027.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 5027 +alpha: + approver: "@johnbelamaric" diff --git a/keps/prod-readiness/sig-scheduling/5055.yaml b/keps/prod-readiness/sig-scheduling/5055.yaml new file mode 100644 index 00000000000..059a51495a5 --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5055.yaml @@ -0,0 +1,6 @@ +# The KEP must have an approver from the +# "prod-readiness-approvers" group +# of http://git.k8s.io/enhancements/OWNERS_ALIASES +kep-number: 5055 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md b/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md new file mode 100644 index 00000000000..a8d9fa07c86 --- /dev/null +++ b/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md @@ -0,0 +1,772 @@ + +# [KEP-5027](https://github.com/kubernetes/enhancements/issues/5027): DRA: admin-controlled device attributes + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [Merging ResourceSlicePatches and ResourceSlices](#merging-resourceslicepatches-and-resourceslices) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +With Dynamic Resource Allocation (DRA), DRA drivers publish information about +the devices that they manage in ResourceSlices. This information is used by the +scheduler when selecting devices for user requests in ResourceClaims. + +This KEP adds a Kubernetes API that privileged users, typically cluster +administrators or control plane controllers, can use to override or extend that information. This can be +permanent as part of the installation of a DRA driver to adapt the driver to +the cluster or temporary as part of cluster maintenance. An extension of the +API adds [taints](../5055-dra-device-taints-and-tolerations/README.md). + +## Motivation + +### Goals + +- Enable [admin-controlled](../5055-dra-device-taints-and-tolerations/README.md) device taints. + +- Enable updating how devices are seen in the cluster without having to use + driver-specific APIs which influence what a driver puts into ResourceSlices. + +### Non-Goals + +- At least for alpha: extend `kubectl` to provide a unified view of devices + together with all patches that apply to them. + +## Proposal + +The intent to patch device attributes must be recorded persistently so that +it is preserved even when a ResourceSlice gets removed or updated. To achieve +this, a new cluster-scoped ResourceSlicePatch type gets added. A single +ResourceSlicePatch object specifies device attributes that apply to all +devices matching a CEL expression (i.e. the same way as users select devices in +a ResourceClaim) and/or some additional criteria (device class, +driver/pool/device name). + +The scheduler must merge these additional attributes with the ones provided by +the DRA drivers on-the-fly while it gathers information about available +devices. + +### Notes/Constraints/Caveats + +Users who look at ResourceSlices to figure out which devices are available also +need to consider ResourceSlicePatches to get the full picture. 
Copying from +the ResourceSlicePatch spec into the ResourceSlice status could help here, +but would not be instantaneous and potentially cause write amplification (one +ResourceSlicePatch affecting many different devices) and therefore is not +part of this proposal. + +Perhaps `kubectl describe resourceslices` can be extended to include the +additional information. For now this is out of scope. + +Creating a ResourceSlicePatch is racing with on-going scheduling attempts, +which is unavoidable. Removing a device from a ResourceSlice has the same +problem. + +### Risks and Mitigations + +From a security perspective, permission to patch device attributes is +expected to be limited to privileged users who already have the ability to add +or remove DRA drivers, so there won't be a substantial difference. + +Performance in the scheduler could be an issue. This will be mitigated by +caching the patched devices and +(re-)applying patches only when they or the device definitions change, which +should be rare. + +## Design Details + +### API + +The ResourceSlicePatch is a cluster-scoped type in the `resource.k8s.io` API +group, initially in `v1alpha3` (the alpha version in Kubernetes 1.32). Because +it may be useful to clean up after disabling the feature and because the +device taint feature also uses this type, it gets served unconditionally as long as +the `v1alpha3` version is enabled. Fields related specifically to this KEP +are feature-gated. + +```Go +type ResourceSlicePatch struct { +metav1.TypeMeta + // Standard object metadata + // +optional + metav1.ObjectMeta + + // Changing the spec automatically increments the metadata.generation number. + Spec ResourceSlicePatchSpec +} + +type ResourceSlicePatchSpec struct { + // Devices defines how to patch device attributes and taints. + Devices DevicePatch +} + +// DevicePatch selects one or more devices by class, driver, pool, device names +// and/or CEL selectors. All of these criteria must be satisfied by a device, otherwise +// it is ignored by the patch. A DevicePatch with no selection criteria is +// valid and matches all devices. +type DevicePatch struct { + // Filter defines which device(s) the patch is applied to. + // + // +optional + Filter *DevicePatchFilter + + // If a ResourceSlice and a DevicePatch define the same attribute or + // capacity, the value of the DevicePatch is used. If multiple + // different DevicePatches match the same device, then the one with + // the highest priority wins. If the priorities are the same, it is non-deterministic + // which patch is used. + // + // +optional + Priority *int + + // Attributes defines the set of attributes to patch for matching devices. + // The name of each attribute must be unique in that set and + // include the domain prefix. + // + // In contrast to attributes in a ResourceSlice, entries here are allowed to + // be marked as empty by setting their null field. Such entries remove the + // corresponding attribute in a ResourceSlice, if there is one, instead of + // overriding it. Because entries get removed and are not allowed in + // slices, CEL expressions do not need need to deal with null values. + // + // The maximum number of attributes and capacities in the DevicePatch combined is 32. + // This is an alpha field and requires enabling the DRAAdminControlledDeviceAttributes + // feature gate. 
+ // + // +optional + // +featureGate:DRAAdminControlledDeviceAttributes + Attributes map[FullyQualifiedName]NullableDeviceAttribute + + // Capacity defines the set of capacities to patch for matching devices. + // The name of each capacity must be unique in that set and + // include the domain prefix. + // + // Removing a capacity is not supported. It can be reduced to 0 instead. + // + // The maximum number of attributes and capacities in the DevicePatch combined is 32. + // This is an alpha field and requires enabling the DRAAdminControlledDeviceAttributes + // feature gate. + // + // +optional + // +featureGate:DRAAdminControlledDeviceAttributes + Capacity map[FullyQualifiedName]DeviceCapacity + + // ^^^^ + // The assumption here is that all device types will have attributes and capacities, + // similar to the current BasicDevice type. Therefore the overrides are not made + // specific to certain device types. +} + +// DevicePatchFilter defines which device(s) a DevicePatch applies to. +// All criteria defined here must be satisfied for a device to be +// patched. +type DevicePatchFilter struct { + // If DeviceClassName is set, the selectors defined there must be + // satisfied by a device to be patched. This field corresponds + // to class.metadata.name. + // + // +optional + DeviceClassName *string + + // If driver is set, only devices from that driver are patched. + // This fields corresponds to slice.spec.driver. + // + // +optional + Driver *string + + // If pool is set, only devices in that pool are patched. + // This fields corresponds to slice.spec.pool.name. + // + // Setting also the driver name may be useful to avoid + // ambiguity when different drivers use the same pool name, + // but this is not required because selecting pools from + // different drivers may also be useful, for example when + // drivers with node-local devices use the node name as + // their pool name. + // + // +optional + Pool *string + + // If device is set, only devices with that name are patched. + // This field corresponds to slice.spec.devices[].name. + // + // Setting also driver and pool may be required to avoid ambiguity, + // but is not required. + // + // +optional + Device *string + + // Selectors define criteria which must be satisfied by a + // device to be patched. All selectors must be satisfied. + // + // +optional + // +listType=atomic + Selectors []DeviceSelector + + // ^^^ + // + // Selectors is a list for the same reason why a request has a list: at some point + // we might have entries which use some yet to be defined mechanism which isn't + // CEL. +} +``` + +To distinguish intentionally empty attributes from attributes which have some +future, unknown content and thus only seem to be empty to an older client, a +special "null value" gets introduced: + +```Go +// NullableDeviceAttribute must have exactly one field set. +// It has the exact same fields as a DeviceAttribute plus `null` as +// an additional alternative. +type NullableDeviceAttribute struct { + DeviceAttribute + + // NullValue, if set, marks an intentionally empty attribute. + // + // +optional + // +oneOf=ValueType + NullValue *NullValue `json:"null,omitempty" ...` +} + +// ^^^ +// `NullableDeviceAttribute` as an extension ensures that the OpenAPI +// for ResourceSlice remains unchanged. Using the same type with +// a `NullValue` that can be set only in one type is less clear. 
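+//
+// As an illustrative sketch (the JSON field name "attributes" and the exact
+// encoding are assumptions based on the usual lowerCamelCase serialization,
+// with a made-up driver domain): a patch entry that removes a driver-provided
+// attribute would be serialized as
+//
+//   attributes: {"gpu.example.com/driverVersion": {"null": {}}}
+//
+// while an override keeps the normal DeviceAttribute encoding, e.g.
+//
+//   attributes: {"gpu.example.com/maintenance": {"bool": true}}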
+ +type NullValue struct {} +``` + +### Merging ResourceSlicePatches and ResourceSlices + +Helper code which keeps an up-to-date list of devices with all patches added +to them will be provided as part of `k8s.io/dynamic-resource-allocation`. It +will be based on informers such that evaluating the filter only is +necessary when ResourceSlices or ResourceSlicePatches change. + +A CEL expression that fails to evaluate to a boolean for a device (runtime +error like looking up an attribute that isn't defined, wrong result type, etc.) +is considered faulty. The patch then does not apply to the device where it +failed and an event will be generated for the ResourceSlicePatch with the +faulty CEL expression. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None. + +##### Unit tests + + + +v1.32.0: + +- `k8s.io/dynamic-resource-allocation/structured`: 91.3% +- `k8s.io/kubernetes/pkg/apis/resource/validation`: 98.6% + +##### Integration tests + + + +Additional scenarios will be added to `test/integration/scheduler_perf`, not +just for correctness but also to evaluate a potential performance impact. + +##### e2e tests + + + +One E2E test scenario is to change attributes and then run pods which select devices +based on those modified attributes such that unmodified devices don't match. + +- : + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind a feature flag +- Initial e2e tests completed and enabled + +#### Beta + +- Gather feedback from developers and surveys +- Additional tests are in Testgrid and linked in KEP + +#### GA + +- 3 examples of real-world usage +- Allowing time for feedback +- [Conformance tests] + +[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md + +### Upgrade / Downgrade Strategy + +Patching devices gets disabled when downgrading to a release without support +for it or when disabling the feature. The effect is that pods get scheduled as +if the ResourceSlicePatches didn't exist. Because they are completely +stand-alone, there is no effect on ResourceSlices or ResourceClaims. + +### Version Skew Strategy + +During version skew where the apiserver supports the feature and the scheduler +doesn't, users can create ResourceSlicePatches without encountering errors or +warnings, but they won't have any effect. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +It is possible to disable the feature through the feature gate while leaving +the API group enabled. This enables cleanup through the API. + +Re-enabling is supported because ResourceSlicePatches remain in etcd even +if they are inaccessible. + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: DRAAdminControlledDeviceAttributes + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler +- [X] Other + - Describe the mechanism: resource.k8s.io/v1alpha3 API group + - Will enabling / disabling the feature require downtime of the control + plane? Yes, in the apiserver. + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? No. + +###### Does enabling the feature change any default behavior? + +No. 
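+
+The feature also has no effect until an administrator (or a controller acting
+on their behalf) creates a ResourceSlicePatch. As a minimal sketch, assuming
+the Go types above with conventional lowerCamelCase JSON field names and
+purely illustrative driver, attribute, and object names, such an object could
+look like this:
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha3
+kind: ResourceSlicePatch
+metadata:
+  name: gpu-maintenance-attributes      # illustrative
+spec:
+  devices:
+    filter:
+      driver: gpu.example.com           # only devices of this driver
+      selectors:
+        - cel:
+            expression: 'device.attributes["gpu.example.com"].model == "a100"'
+    priority: 10
+    attributes:
+      gpu.example.com/maintenance:      # add or override an attribute
+        bool: true
+      gpu.example.com/driverVersion:    # remove the driver-provided attribute
+        "null": {}
+```
+
+Deleting the object restores the attributes published by the DRA driver.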
+ +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. The behavior of scheduling changes when it was in use. +Running applications are not affected. + +###### What happens if we reenable the feature if it was previously rolled back? + +It takes effect again for scheduling. +Running applications are not affected. + +###### Are there any tests for feature enablement/disablement? + +This will be covered through unit tests for the apiserver and scheduler. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + +Applying patches to devices scales with `number of ResourceSlicePatches` * +`number of devices` when CEL selectors need to be evaluated. Without them, +filtering scales with `number of ResourceSlicePatches` * `number of +ResourceSlices` but then may still need to compare device names and of course +modify selected devices. + +###### Will enabling / using this feature result in any new API calls? + +A fixed, small number of clients (primarily the scheduler) need to start +watching ResourceSlicePatches. + +###### Will enabling / using this feature result in introducing new API types? + +ResourceSlicePatches must be created explicitly by admins or controller +operated by admins. Kubernetes itself does not create them. + +The number of ResourceSlicePatches is expected to be orders of +magnitude smaller than the number of ResourceSlices. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +Pod scheduling should be as fast as would be without this feature, because in +both cases it starts with listing all devices. That information is local can +comes either from an informer cache or a cache of patched devices. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +Filtering and patching are local operations, with no impact on the cluster. 
To +prevent doing the same work repeatedly, it will be implemented so that it gets +done once and then only processes changes. This increases CPU and RAM +consumption. But even all devices should get patched (which is unlikely), memory +will be shared between objects in the informer cache and in the patch cache, so +it will not be doubled. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No, because the feature is not used on nodes. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +- 1.33: first KEP revision and implementation + +## Drawbacks + +Distributing information across different objects of different types makes it +harder for users to get a complete view. + +## Alternatives + +Instead of ResourceSlicePatch as a separate type, new fields in the +ResourceSlice status could be modified by an admin. That has the problem that +the ResourceSlice object might get deleted while doing cluster maintenance like +a driver update, in which case the admin intent would get lost. diff --git a/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml b/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml new file mode 100644 index 00000000000..a19fa930779 --- /dev/null +++ b/keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/kep.yaml @@ -0,0 +1,38 @@ +title: "DRA: admin-controlled device attributes" +kep-number: 5027 +authors: + - "@pohly" +owning-sig: sig-scheduling +status: implementable +creation-date: 2024-01-10 +reviewers: + - TBD +approvers: + - TBD + +see-also: + - "/keps/sig-node/4381-dra-structured-parameters" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DRAAdminControlledDeviceAttributes + components: + - kube-apiserver + - kube-scheduler +disable-supported: true + +# The following PRR answers are required at beta release +metrics: diff --git a/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md b/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md new file mode 100644 index 00000000000..0325f6e5dea --- /dev/null +++ b/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/README.md @@ -0,0 +1,768 @@ + +# [KEP-5055](https://github.com/kubernetes/enhancements/issues/5055): DRA: device taints and tolerations + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Degraded Devices](#degraded-devices) + - [External Health Monitoring](#external-health-monitoring) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +With Dynamic Resource Allocation (DRA), DRA drivers publish information about +the devices that they manage in ResourceSlices. This information is used by the +scheduler when selecting devices for user requests in ResourceClaims. + +With this KEP, DRA drivers can mark devices as tainted such that they won't be +used for scheduling new pods. In addition, pods already running with access to +a tainted device can be stopped automatically. Cluster administrators can do +the same by creating a +[ResourceSlicePatch](../5027-dra-admin-controlled-device-attributes) with a +taint. + +Users can decide to ignore specific taints by adding tolerations to their +ResourceClaim. + +## Motivation + +### Goals + +- Enable taking devices offline for maintenance while still allowing tests pods + to request and use those devices. Being able to do this one device at a time + minimizes service level disruption. + +- Enable users to decide whether they want to keep running a workload in a degraded + mode while a device is unhealthy or prefer to get pods rescheduled. + +### Non-Goals + +- Not part of the plan for alpha: developing a kubectl command for managing device taints. + This may be reconsidered. + +## Proposal + +### User Stories + +#### Degraded Devices + +A driver itself can detect problems which may or may not be tolerable for +workloads, like degraded performance due to overheating. Removing such devices +from the ResourceSlice would unconditionally prevent using them for new +pods. Instead, publishing with a taint informs users about this degradation and +leaves them the choice whether the device is still usable enough to run pods. +It also automates stopping pods which don't tolerate such a degradation. 
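+
+As an illustration of this story (a sketch, not normative: the `taints` field
+is the one proposed in the API section below, the driver and device names are
+made up, and the remaining fields follow the existing `v1alpha3` ResourceSlice
+schema), a driver could publish an overheating device like this:
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha3
+kind: ResourceSlice
+metadata:
+  name: worker-1-gpu.example.com         # illustrative
+spec:
+  driver: gpu.example.com
+  nodeName: worker-1
+  pool:
+    name: worker-1
+    generation: 1
+    resourceSliceCount: 1
+  devices:
+    - name: gpu-0
+      basic:
+        taints:                          # proposed in this KEP
+          - key: gpu.example.com/overheating
+            value: "true"
+            effect: NoExecute            # evict pods that do not tolerate it
+```
+
+Workloads which consider the device still usable opt in with a toleration in
+their ResourceClaim, as shown in the API section below.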
+ +#### External Health Monitoring + +As cluster admin, I am deploying a vendor-provided DRA driver together with a +separate monitoring component for hardware aspects that are not available or +not supported by that DRA driver. When that component detects problems, it can +check its policy configuration and decide to take devices offline by creating +a ResourceSlicePatch with a taint for affected devices. + +### Risks and Mitigations + +A device can be identified by its names (`//`) and/or by its attributes (for example, a unique ID). It was a conscious +decision for core DRA to not require that the name is tied to one particular +hardware instance to support hot-swapping. Admins might favor using the names +whereas health monitoring might prefer to be specific and use a vendor-defined +unique ID. Both are supported, which creates additional complexity. + +Without a kubectl extension similar to `kubectl taint nodes` the user +experience for admins will be a bit challenging. They need to decide how to +identify the device (by name or with a CEL expression), manually create a +ResourceSlicePatch with a unique name, then remember to remove that +ResourceSlicePatch again. For beta, support in `kubectl` for common +operations may be needed. + +## Design Details + +The feature is following the approach and APIs taken for node taints and +applies them to devices. A new controller watches tainted devices and deletes +pods using them unless they tolerate the device taint, similar to the +[taint-eviction-controller](https://github.com/kubernetes/kubernetes/blob/32130691a4cb8a1034b999341c40e48d197f5465/pkg/controller/tainteviction/taint_eviction.go#L81-L83). A pod which is running or has finalizers will not get removed +immediately. Instead, the `DeletionTimestamp` gets set. That's okay for +the purpose of this KEP: +- The kubelet will stop any running containers and mark the pod as completed. +- The ResourceClaim controller will remove such a completed pod from the claim's + `ReservedFor` and deallocate the claim once it has no consumers. + +Taints are cumulative as long as the key and effect pairs are different: +- Taints defined by an admin in a ResourceSlicePatch get added to the + set of taints defined by the DRA driver in a ResourceSlice. +- Taints with the same key and effect get overwritten, using the same + precedence as for attributes. + +This merging will be implemented by the same code that also +overrides device attributes. + +To ensure consistency among all pods sharing a ResourceClaim, the toleration +for taints gets added to the request in a ResourceClaim, not the pod. This also +avoids conflicts like one pod tolerating a taint for scheduling and some other +pod not tolerating that. + +Device and node taints are applied independently. A node taint applies to all +pods on a node, whereas a device taint affects claim allocation and only those +pods using the claim. + +### API + +The ResourceSlice content gets extended: + +```Go +// BasicDevice defines one device instance. +type BasicDevice struct { + ... + + // If specified, the device's taints. + // + // This is an alpha field and requires enabling the DRADeviceTaints + // feature gate. + // + // +optional + // +listType=atomic + // +featureGate=DRADeviceTaints + Taints []Taint +} + +// The device this Taint is attached to has the "effect" on +// any claim and, through the claim, to pods that do not tolerate +// the Taint. +type Taint struct { + // The taint key to be applied to a device. + // Must be a label name. 
+ // + // +required + Key string + + // The taint value corresponding to the taint key. + // Must be a label value. + // + // +optional + Value string + + // The effect of the taint on claims that do not tolerate the taint + // and through such claims on the pods using them. + // Valid effects are NoSchedule and NoExecute. PreferNoSchedule as used for + // nodes is not valid here. + // + // +required + Effect TaintEffect + + // ^^^^ + // + // Implementing PreferNoSchedule would depend on a scoring solution for DRA. + // It might get added as part of that. + + // TimeAdded represents the time at which the taint was added. + // It is only written for NoExecute taints. + // + // +optional + TimeAdded *metav1.Time +} +``` + +Taint has the exact same fields as a v1.Taint, but the description is a bit +different. In particular, PreferNoSchedule is not valid. + +Tolerations get added to a DeviceRequest: + +```Go +type DeviceRequest struct { + ... + + // If specified, the request's tolerations. + // + // Tolerations for NoSchedule are required to allocate a + // device which has a taint with that effect. The same applies + // to NoExecute. + // + // In addition, should any of the allocated devices get tainted + // with NoExecute after allocation and that effect is not tolerated, + // then all pods consuming the ResourceClaim get deleted to evict + // them. The scheduler will not let new pods reserve the claim while + // it has these tainted devices. Once all pods are evicted, the + // claim will get deallocated. + // + // +optional + // +listType=atomic + Tolerations []Toleration +} + +// The ResourceClaim this Toleration is attached to tolerate any taint that matches +// the triple using the matching operator . +type Toleration struct { + // Key is the taint key that the toleration applies to. Empty means match all taint keys. + // If the key is empty, operator must be Exists; this combination means to match all values and all keys. + // Must be a label name. + // + // +optional + Key string + + // Operator represents a key's relationship to the value. + // Valid operators are Exists and Equal. Defaults to Equal. + // Exists is equivalent to wildcard for value, so that a ResourceClaim can + // tolerate all taints of a particular category. + // + // +optional + Operator TolerationOperator + + // Value is the taint value the toleration matches to. + // If the operator is Exists, the value should be empty, otherwise just a regular string. + // Must be a label value. + // + // +optional + Value string + + // Effect indicates the taint effect to match. Empty means match all taint effects. + // When specified, allowed values are NoSchedule and NoExecute. + // + // +optional + Effect TaintEffect + + // TolerationSeconds represents the period of time the toleration (which must be + // of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, + // it is not set, which means tolerate the taint forever (do not evict). Zero and + // negative values will be treated as 0 (evict immediately) by the system. + // + // +optional + TolerationSeconds *int64 +} + +// A toleration operator is the set of operators that can be used in a toleration. +// +// +enum +type TolerationOperator string + +const ( + TolerationOpExists TolerationOperator = "Exists" + TolerationOpEqual TolerationOperator = "Equal" +) +``` + +As with Taint, these structs get duplicated to document DRA specific +behavior and to ensure that future extensions do not get inherited +accidentally. 
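+
+Putting request and toleration together, a ResourceClaim that keeps using the
+overheating device from the user story could look like the following sketch
+(an assumption for illustration only: conventional lowerCamelCase serialization
+of the proposed fields, made-up driver and device class names):
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha3
+kind: ResourceClaim
+metadata:
+  name: degraded-gpu-ok
+  namespace: default
+spec:
+  devices:
+    requests:
+      - name: gpu
+        deviceClassName: gpu.example.com   # illustrative class
+        tolerations:                        # proposed in this KEP
+          - key: gpu.example.com/overheating
+            operator: Exists
+            effect: NoExecute
+```
+
+Leaving the `effect` field empty would make the toleration match both
+NoSchedule and NoExecute taints with that key.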
+ +Generated conversion code might make it possible to reuse existing helper +code. Alternatively, that code can be copied. + +The DevicePatch also gets extended. It is possible to use +admin-controlled taints without enabling attribute overrides by enabling the +`v1alpha3` API and only the `DRADeviceTaints` feature, while leaving +`DRAAdminControlledDeviceAttributes` disabled. + +```Go +type DevicePatch struct { + ... + + // If specified, the device's taints. Taints with unique key and effect + // get added to the set of taints of the device. When key and effect + // are used in multiple places, the same precedence rules as for attributes apply + // (see the priority field). + // + // This is an alpha field and requires enabling the DRADeviceTaints + // feature gate. + // + // +optional + // +listType=atomic + // +featureGate=DRADeviceTaints + Taints []Taint +``` + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None. + +##### Unit tests + + + +v1.32.0: + +- `k8s.io/dynamic-resource-allocation/structured`: 91.3% +- `k8s.io/kubernetes/pkg/apis/resource/validation`: 98.6% +- `k8s.io/kubernetes/pkg/controller/tainteviction`: 81.8% + +##### Integration tests + + + +Integration tests for the new eviction manager will be useful to ensure that +permissions are correct. + +##### e2e tests + + + +Useful E2E tests are checking that the scheduler really honors taints during +scheduling. Adding a taint in a ResourceSlice must evict a running pod. Same +for adding a taint through a ResourceSlicePatch. + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind a feature flag +- Initial e2e tests completed and enabled + +#### Beta + +- Gather feedback from developers and surveys +- Additional tests are in Testgrid and linked in KEP + +#### GA + +- 3 examples of real-world usage +- Allowing time for feedback +- [Conformance tests] + +[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md + +### Upgrade / Downgrade Strategy + +Tainting gets disabled when downgrading to a release without support for it or +when disabling the feature. The effect is as if the taints weren't set. + +### Version Skew Strategy + +During version skew where the apiserver supports the feature and the scheduler +doesn't, taints can be set without encountering errors or +warnings, but they won't have any effect. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +It is possible to disable the feature through the feature gate while leaving +the API group enabled. This enables cleanup through the API. + +Re-enabling is supported because ResourceSlicePatches remain in etcd even if +they are inaccessible and existing taints and tolerations are preserved during +updates. + +###### How can this feature be enabled / disabled in a live cluster? + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: DRADeviceTaints + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler + - kube-controller-manager +- [X] Other + - Describe the mechanism: resource.k8s.io/v1alpha3 API group + - Will enabling / disabling the feature require downtime of the control + plane? Yes, in the apiserver. + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? No. 
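+
+Like attribute patches, admin-controlled taints only take effect once a
+ResourceSlicePatch exists. A sketch of taking a single device out of service,
+with illustrative names and field serialization assumed to follow the Go types
+above:
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha3
+kind: ResourceSlicePatch
+metadata:
+  name: worker-1-gpu-0-maintenance       # illustrative
+spec:
+  devices:
+    filter:
+      driver: gpu.example.com
+      pool: worker-1
+      device: gpu-0
+    taints:                               # requires DRADeviceTaints
+      - key: example.com/maintenance
+        value: "planned"
+        effect: NoExecute                 # evict pods that do not tolerate it
+```
+
+Deleting the ResourceSlicePatch removes the taint again and makes the device
+schedulable as before.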
+ +###### Does enabling the feature change any default behavior? + +No. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. The behavior of scheduling changes when it was in use. +Running applications are not affected. + +###### What happens if we reenable the feature if it was previously rolled back? + +It takes effect again for scheduling and may evict pods. + +###### Are there any tests for feature enablement/disablement? + +This will be covered through unit tests for the apiserver and scheduler. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + +See [../5027-dra-admin-controlled-device-attributes/README.md#scalability] for a +discussion of the scalability of patching devices. The same applies to applying +taints through ResourceSlicePatch objects. + +Handling eviction scales with the number of claims and pods using those claims. + +###### Will enabling / using this feature result in any new API calls? + +A fixed, small number of clients (primarily the scheduler and controller +manager) need to start watching ResourceSlicePatches. + +Pods are already watched in the controller manager. Evicting them adds one call +per pod. + +###### Will enabling / using this feature result in introducing new API types? + +ResourceSlicePatches must be created explicitly by admins or controller +operated by admins. Kubernetes itself does not create them. + +The number of ResourceSlicePatches is expected to be orders of +magnitude smaller than the number of ResourceSlices. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +Enabling it doesn't. Using tolerations increases the size of ResourceClaims and +ResourceClaimTemplates. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +Pod scheduling may become a bit slower because of the additional checks, but +only when pods use claims. There are no SLI/SLOs for pods using claims. 
+ +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +For scheduling, tracking taints should be comparable to the overhead for +patching attributes. + +For eviction, additional data structures will be needed to track taints and +tolerations. This should not be too large. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No, because the feature is not used on nodes. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +- 1.33: first KEP revision and implementation + +## Drawbacks + +Distributing information across different objects of different types makes it +harder for users to get a complete view. + +## Alternatives + +The existing taint-eviction-controller could be extended to cover device +taints. Cloning it lowers the risk of breaking existing stable functionality. + +Tolerations for device taints could also be added to individual pods. This +seems less useful because if pods share the same claim, they are typically part +of one larger application with identical tolerations. Experimenting with a new +API in the beta ResourceClaim type is a bit easier than it would be in the GA +Pod type. diff --git a/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml b/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml new file mode 100644 index 00000000000..68ca44fdb50 --- /dev/null +++ b/keps/sig-scheduling/5055-dra-device-taints-and-tolerations/kep.yaml @@ -0,0 +1,40 @@ +title: "DRA: device taints and tolerations" +kep-number: 5055 +authors: + - "@pohly" + - "@everpeace" +owning-sig: sig-scheduling +status: implementable +creation-date: 2025-01-20 +reviewers: + - TBD +approvers: + - TBD + +see-also: + - "/keps/sig-node/5027-dra-admin-controlled-device-attributes" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DRADeviceTaints + components: + - kube-apiserver + - kube-scheduler + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: