KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints #5034

pohly · 2025-01-10T15:52:02Z

One-line PR description: DRA: admin-controlled device attributes + device taints
Issue links:
- DRA: admin-controlled device attributes #5027
- DRA: device taints and tolerations #5055
Other comments: first revision

/cc @johnbelamaric

pohly · 2025-01-12T12:31:53Z

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

k8s-ci-robot · 2025-01-12T12:31:56Z

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

pohly · 2025-01-13T08:02:06Z

/wg device-management
/sig node

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

These are two different KEPs that provide two features that can be enabled and disabled independently. However, both use the same new ResourceSliceOverride type and thus get described and implemented together.

eero-t

There was earlier discussion of common (driver independent) tool(ing) for listing, adding and removing device taints. Would it make sense to mention something about that in the tainting KEP?

keps/sig-node/5055-device-taints-and-tolerations/README.md

everpeace

I deeply appreciate your quick action on the device taints/tolerations KEP! I left some comments. PTAL.

keps/sig-node/5055-device-taints-and-tolerations/README.md

keps/sig-node/5055-device-taints-and-tolerations/kep.yaml

keps/sig-node/5027-dra-admin-controlled-device-attributes/README.md

keps/sig-node/5055-device-taints-and-tolerations/README.md

nojnhuh · 2025-01-22T19:20:46Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+If a CEL expression fails for a device, the override does not apply and an
+event will be generated for the ResourceSlicePatch with the faulty CEL
+expression.


Does "fail" in this context mean an invalid CEL expression caused by something like a syntax error, and not that it cleanly evaluates to false?

"fails to evaluate to a boolean (runtime error, wrong result type)".

Syntax errors are caught during validation, but the attribute lookup is not type safe (device.attributes[...].someField may or may not be a bool) and can cause key lookup exceptions (in this case, if someField doesn't match any attribute).

I updated the paragraph.
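
To make the three outcomes from this thread concrete (clean true/false, runtime error, wrong result type), here is a minimal sketch. It does not use the real CEL machinery: `Selector`, `lookup` and `Matches` are hypothetical names for this example only, and a plain Go function stands in for a compiled CEL expression.

```go
package main

import (
	"errors"
	"fmt"
)

// Selector stands in for a compiled CEL expression. It is a hypothetical
// type for this sketch; the real implementation evaluates CEL programs.
type Selector func(attributes map[string]any) (any, error)

// errNoSuchKey mimics the runtime error of a failed map key lookup.
var errNoSuchKey = errors.New("no such key")

// lookup returns the attribute value or a runtime error if it is not set.
func lookup(attributes map[string]any, name string) (any, error) {
	v, ok := attributes[name]
	if !ok {
		return nil, fmt.Errorf("%w: %s", errNoSuchKey, name)
	}
	return v, nil
}

// Matches reports whether the selector applies to a device.
// A clean "false" is not an error. A runtime error or a non-boolean
// result is a failure: the patch must not be applied and the caller
// is expected to report the error (for example as an event).
func Matches(sel Selector, attributes map[string]any) (bool, error) {
	result, err := sel(attributes)
	if err != nil {
		return false, fmt.Errorf("runtime error: %w", err)
	}
	b, ok := result.(bool)
	if !ok {
		return false, fmt.Errorf("wrong result type %T, expected bool", result)
	}
	return b, nil
}

func main() {
	healthy := Selector(func(a map[string]any) (any, error) {
		return lookup(a, "healthy")
	})
	for _, attrs := range []map[string]any{
		{"healthy": true},  // matches
		{"healthy": false}, // cleanly evaluates to false, no error
		{"healthy": "yes"}, // wrong result type
		{},                 // key lookup error
	} {
		ok, err := Matches(healthy, attrs)
		fmt.Println(ok, err)
	}
}
```

Only the last two cases count as "fails" in the sense of the updated paragraph.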

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

nojnhuh · 2025-01-25T21:37:40Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+    // satisfied by a device to be patched.
+    //
+    // +optional
+    DeviceClass *string


Is it worth naming this DeviceClassName to be consistent with that field in ResourceClaims?

https://github.com/kubernetes/kubernetes/blob/03bf94bac074ce43228ee906a8cadea6176873c0/staging/src/k8s.io/api/resource/v1beta1/types.go#L411-L424

Also perhaps Pool -> PoolName?

Yes, it should be consistent, and thus DeviceClassName.

But driver, pool, and device are referred to without the Name suffix (https://github.com/kubernetes/kubernetes/blob/03bf94bac074ce43228ee906a8cadea6176873c0/staging/src/k8s.io/api/resource/v1beta1/types.go#L1012-L1035).

This is based on an API guideline which says "use *Name only for API objects". DeviceClass is an API object, "pool" isn't. I personally would have preferred "DriverName" instead of "Driver" because there is a difference between a "driver" (the thing, perhaps described by a struct) and a "driver name" (one particular attribute of it). I had that in initial revisions of the API, but was told to remove the suffix for the sake of consistency with other APIs.

Hrrm. "DeviceClassName" next to "Driver/Pool/Device" looks odd. Not sure what a good solution is here. Also, suppose we do add a "resource.Driver" type similar to "storage.CSIDriver". Then "DriverName" suddenly would become more suitable than "Driver". Still not a fan of this API convention.... 🤷

I'm going with "consistent with other fields" for now, but we may have to revisit as part of the final API review.

I've also added comments that explain what those other fields are.
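
For readers skimming this thread, the naming outcome could look roughly like the sketch below. The field set, the `DeviceID` helper and the "nil matches everything" behavior are assumptions made for this example, not the final API.

```go
package main

import "fmt"

// DevicePatchFilter sketches the field naming agreed on in this thread.
// The exact fields and their semantics are assumptions of this example.
type DevicePatchFilter struct {
	// DeviceClassName references a DeviceClass API object, hence the suffix.
	// It is not evaluated in this sketch.
	DeviceClassName *string
	// Driver, Pool and Device are plain names, not API objects, and keep
	// the suffix-free spelling used elsewhere in resource.k8s.io.
	Driver *string
	Pool   *string
	Device *string
}

// DeviceID is a hypothetical helper type identifying one device.
type DeviceID struct {
	Driver, Pool, Device string
}

// matchOptional treats an unset filter field as "matches everything".
func matchOptional(want *string, got string) bool {
	return want == nil || *want == got
}

// MatchesNames checks only the name-based fields.
func (f DevicePatchFilter) MatchesNames(id DeviceID) bool {
	return matchOptional(f.Driver, id.Driver) &&
		matchOptional(f.Pool, id.Pool) &&
		matchOptional(f.Device, id.Device)
}

func main() {
	driver := "gpu.example.com"
	filter := DevicePatchFilter{Driver: &driver}
	fmt.Println(filter.MatchesNames(DeviceID{Driver: "gpu.example.com", Pool: "node-1", Device: "gpu-0"}))
	fmt.Println(filter.MatchesNames(DeviceID{Driver: "nic.example.com", Pool: "node-1", Device: "nic-0"}))
}
```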

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

asm582 · 2025-01-28T19:22:25Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+
+The other usage is to influence which devices are picked when there are
+multiple viable alternatives. This is a first step towards providing a more
+comprehensive [scoring](https://github.com/kubernetes/enhancements/issues/4970)


Can we add more information on how scoring can be achieved?

I need to remove this and the preceding paragraph. Device priority is no longer part of this KEP and health is a separate one now.

asm582 · 2025-01-29T02:24:37Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+-->
+
+One E2E test scenario is to mark all devices as offline and then verify that
+pods don't get scheduled. Another is to set different priorities and check that


Based on this comment, the E2E tests also need to be changed.

Thanks for the reminder. Fixed.

This gets added for the sake of completeness.

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

nojnhuh · 2025-01-29T17:34:13Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+    // In contrast to attributes in a ResourceSlice, entries here are allowed to
+    // be marked as empty by setting their null field. Such entries remove the
+    // corresponding attribute in a ResourceSlice, if there is one, instead of
+    // overriding it.


Is it worth enforcing that null cannot be set in a ResourceSlice? If it's allowed, that would leave the option open for drivers to do so in case they want to communicate some nuance like "I think users may expect this attribute to exist, but it intentionally does not."

I think that's too nuanced and adds one more thing that users writing CEL expressions would have to be prepared for, in addition to the "can I access this attribute without getting a lookup error".

nojnhuh · 2025-01-29T17:52:39Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+    // NullValue, if set, marks an intentionally empty attribute.
+    //
+    // May be used inside a ResourceSlicePatch to remove attributes,
+    // but not in a ResourceSlice.
+    //
+    // +optional
+    // +oneOf=ValueType
+    NullValue *NullValue `json:"null,omitempty"`


In the CEL environment, will null values be omitted entirely from the device.attributes map if we don't handle them specially? If not, enabling that might be worthwhile since I can see that being a little more ergonomic for authors of CEL expressions than if they have to handle "explicit null" and "actually undefined" differently.

I guess my question here boils down to "how does a Go map with a value of nil manifest in the CEL environment, and is that different from a Go map without that key/value pair at all?"

Because ResourceSlices contain no null values and null values in a ResourceSlicePatch cause the attribute to be removed, there's never a situation where a CEL expression gets a null value when looking up an attribute. That is deliberate: we already have "attribute not set in map", we don't need "attribute set with null value". That is semantically so close that I don't see the need.
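
A minimal sketch of that behavior, assuming simplified stand-in types (`Attribute`, `NullableAttribute` and `ApplyPatch` are hypothetical names, and only string values are modeled): an explicit null removes the attribute, a value overrides or adds it, and anything not mentioned in the patch is left alone. The patched map therefore never contains a null entry.

```go
package main

import "fmt"

// Attribute is a simplified stand-in for DeviceAttribute (strings only).
type Attribute struct {
	StringValue *string
}

// NullableAttribute mirrors the idea of NullableDeviceAttribute:
// either a value or an explicit null marker.
type NullableAttribute struct {
	Attribute
	Null bool
}

// ApplyPatch returns a patched copy of the published attributes.
// The input map is not modified.
func ApplyPatch(published map[string]Attribute, patch map[string]NullableAttribute) map[string]Attribute {
	result := make(map[string]Attribute, len(published))
	for name, attr := range published {
		result[name] = attr
	}
	for name, attr := range patch {
		if attr.Null {
			// Explicit null: remove the attribute instead of storing a null.
			delete(result, name)
			continue
		}
		result[name] = attr.Attribute
	}
	return result
}

func ptr(s string) *string { return &s }

func main() {
	published := map[string]Attribute{
		"model": {StringValue: ptr("example-model")},
		"zone":  {StringValue: ptr("a")},
	}
	patch := map[string]NullableAttribute{
		"model": {Null: true},
		"zone":  {Attribute: Attribute{StringValue: ptr("b")}},
	}
	patched := ApplyPatch(published, patch)
	_, hasModel := patched["model"]
	fmt.Println(hasModel, *patched["zone"].StringValue)
}
```

A CEL expression evaluated against `patched` only ever has to deal with "attribute not set in map".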

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

asm582 · 2025-01-29T19:06:54Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+    // ^^^^
+    // The assumption here is that all device types will have attributes and capacities,
+    // similar to the current BasicDevice type. Therefore the overrides are not made
+    // specific to certain device types.
 }

 // DevicePatchFilter defines which device(s) a [DevicePatch] applies to.
 type DevicePatchFilter struct {


This may be out of scope for this KEP, but can an external controller add a selector that can later be consumed by the machinery described in the KEP?

This touches on the question whether DevicePatchFilter is immutable: it isn't, so whenever a selector gets added or removed, it changes how the scheduler evaluates the patch.

The machinery in this KEP doesn't care who does the updating, so it could be a controller.

nojnhuh · 2025-01-29T20:40:26Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+Creating a ResourceSlicePatch is racing with on-going scheduling attempts,
+which is unavoidable.


Just for my own understanding, is the worst-case scenario here something like this?

A Pod comes into the scheduler and a ResourceSlicePatch is created at the same time.

The scheduler successfully schedules the Pod, having not yet observed the new ResourceSlicePatch.

The ResourceSlicePatch makes modifications such that the Pod's ResourceClaims no longer match the devices it was allocated (e.g. changing an attribute referenced in a selector).

The scheduled Pod continues to run with the unsuitable allocated device.

And does this same race condition already exist today when updating ResourceSlices since the scheduler's view of ResourceSlices is driven by an informer?

Is the "correct" answer to this to use only taints instead of attributes/capacity for anything that should cause a Pod to be evicted at runtime?

Your understanding is correct, on all points.

pohly · 2025-01-30T08:48:53Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+// ^^^
+// `NullableDeviceAttribute` as an extension ensures that the OpenAPI
+// for ResourceSlice remains unchanged. Using the same type with
+// a `NullValue` that can be set only in one type is less clear.


@aaron-prindle: will a NullableDeviceAttribute with some "oneOf" alternatives inside the embedded DeviceAttribute and one more outside of it work for declarative validation?

It should work right now (OpenAPI flattens embedded structs) and it is more natural in Go (can use a NullableDeviceAttribute to initialize a DeviceAttribute without manually written copy code).

But if this then poses a problem for declarative validation, then it will be difficult to switch because the embedding leads to different protobuf encoding.

Given how early declarative validation is, I feel confident saying "we can make it work". I don't think it is something we handle in the dev-branch prototype, but it seems reasonably well defined.

What you're doing here isn't obvious at first blush (even I started writing an alternative), but this comes back to a hard-learned lesson: Don't make "nothing" mean something. The absence of a value in a patch cannot mean "remove", it has to mean "don't know".

Yes, exactly, hence the explicit "null". One alternative in a prior comment thread in this PR was an explicit remove: true (but then what does remove: false mean?) or remove: {}.

For reference: this idea originated in #5034 (comment)
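
To illustrate the embedding discussed in this thread, here is a small sketch using only `encoding/json`. The types are trimmed-down copies modeled on the snippets quoted in this review, not the generated API types, so treat the JSON tags as an assumption. It shows that the embedded struct is flattened in the encoding and that the explicit "null" is a separate member next to the value fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeviceAttribute is a trimmed-down copy for this sketch.
type DeviceAttribute struct {
	IntValue     *int64  `json:"int,omitempty"`
	BoolValue    *bool   `json:"bool,omitempty"`
	StringValue  *string `json:"string,omitempty"`
	VersionValue *string `json:"version,omitempty"`
}

// NullValue is an empty marker type.
type NullValue struct{}

// NullableDeviceAttribute embeds DeviceAttribute and adds one more
// alternative outside of it.
type NullableDeviceAttribute struct {
	DeviceAttribute
	NullValue *NullValue `json:"null,omitempty"`
}

func encode(v any) string {
	data, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(data)
}

func main() {
	model := "example-model"
	set := NullableDeviceAttribute{DeviceAttribute: DeviceAttribute{StringValue: &model}}
	remove := NullableDeviceAttribute{NullValue: &NullValue{}}

	fmt.Println(encode(set))    // {"string":"example-model"}
	fmt.Println(encode(remove)) // {"null":{}}

	// The embedded struct initializes a plain DeviceAttribute
	// without manually written copy code.
	plain := set.DeviceAttribute
	fmt.Println(encode(plain)) // {"string":"example-model"}
}
```

An omitted entry stays "don't know", while `{"null":{}}` is the explicit request to remove.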

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

macsko · 2025-01-31T11:53:10Z

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

+Taints are cumulative as long as the key and effect pairs are different:
+- Taints defined by an admin in a ResourceSliceOverride get added to the
+  set of taints defined by the DRA driver in a ResourceSlice.
+- Taints with the same key and effect get overwritten, using the same
+  precedence as for attributes.


Don't we want to allow deleting taints?

Maybe? So the goal is to let the driver set a taint and then the admin decides that "no, this taint shouldn't really taint the device"?

Let's see what an API for that could look like... What if we allowed an effect called "None"? Then a driver can publish a taint {Key: dra.example.com/temperature, Value: 200, Effect: NoExecute} and an admin can replace that with {Key: dra.example.com/temperature, Value: 0, Effect: None}.

The scheduler and eviction controller ignore such taints with "no effect". The difference is that Effect: None replaces other taints with the same key during patching, instead of adding the taint to the set.

I'm not sure whether it makes sense, but I would also allow such entries in a ResourceSlice.

In general, I think actors on the taints should act only on their own taints, but naturally they should be able to remove those taints when they are not relevant any more.

E.g., after an admin has finished upgrading the device firmware and verified that it works with a test pod tolerating the FirmwareUpgrade taint, they should be able to remove that taint so that other pods can again be scheduled on the upgraded device.

The comment here is about ResourceSlicePatch and how it relates to taints in the ResourceSlice ("cumulative").

Of course each actor can and should remove their own taint. That's so normal, I didn't even bother mentioning that in the KEP 😅 Should I?

So do we not need the ability for an admin to modify some device taint set by the driver?

If the driver has not set any taint yet, how would k8s know that the admin patches a taint that the driver is "supposed" to manage? Or was your question about admin masking out ("removing") a taint set by the driver?

I would assume allowing these would be simpler. It would be nice if (some future) k8s tool for setting device taints gave a warning if the driver has already set the specified taint, but I think it's fine to allow it (-force).

Disallowing it would be fine too, though. If the driver adds a taint that the admin does not care about, the admin could always use a taint toleration. And maybe drivers could include options for disabling tainting, in case the admin wants something else to handle that.

Or was your question about admin masking out ("removing") a taint set by the driver?

This thread here is about admins deleting taints set by a driver. The way the API is defined right now, an admin can add their own taint, but they cannot remove a taint set by a driver ("cumulative"). If we want to allow that, we need a special API for it. #5034 (comment) is an attempt to define such an API.
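
The merge rule quoted at the top of this thread, combined with the "Effect: None" idea, can be sketched as follows. This is only an illustration of the proposal as discussed here: `None` is not an existing effect, and `MergeTaints` and `Effective` are hypothetical helper names.

```go
package main

import "fmt"

type Effect string

const (
	NoSchedule Effect = "NoSchedule"
	NoExecute  Effect = "NoExecute"
	// None is the effect proposed in this thread; it is an assumption here.
	None Effect = "None"
)

type Taint struct {
	Key    string
	Value  string
	Effect Effect
}

// MergeTaints adds admin taints to the driver taints. A taint with the
// same key and effect gets overwritten. An admin taint with Effect None
// replaces all taints with the same key instead of being added to the set.
func MergeTaints(driver, admin []Taint) []Taint {
	result := append([]Taint(nil), driver...)
	for _, t := range admin {
		kept := result[:0] // in-place filter on our own copy
		for _, existing := range result {
			if existing.Key == t.Key && (t.Effect == None || existing.Effect == t.Effect) {
				continue
			}
			kept = append(kept, existing)
		}
		result = append(kept, t)
	}
	return result
}

// Effective drops the taints that scheduler and eviction controller ignore.
func Effective(taints []Taint) []Taint {
	var result []Taint
	for _, t := range taints {
		if t.Effect != None {
			result = append(result, t)
		}
	}
	return result
}

func main() {
	driver := []Taint{{Key: "dra.example.com/temperature", Value: "200", Effect: NoExecute}}
	admin := []Taint{{Key: "dra.example.com/temperature", Value: "0", Effect: None}}
	merged := MergeTaints(driver, admin)
	fmt.Println(merged, len(Effective(merged)))
}
```

With the example from above, the driver's NoExecute taint is masked and no taint with an effect remains on the device.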

macsko · 2025-01-31T11:55:20Z

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

+As with node taints, the key is typically a short string. The meaning of the value
+depends on the key. It is allowed to be longer. The core v1 API does not impose
+length limitations for these fields. The `resource.k8s.io` API does.


Should the proposed length limitations be a part of the KEP?

I was wrong and already removed this: the core API uses the validation for label names and keys and thus has length limits. They are just not documented specifically for the taint API.

I think that makes sense because defining "label name" and "label key" in all APIs that use them would be very repetitive. Therefore I have done the same here and only said that the strings must be label names and keys.

What is missing in the KEP is a specification of how many taints per device and patch are allowed. I think each taint is relatively small compared to the potentially large attributes, so how about we allow 16?

While taints may differ between devices, I do not see a single device having that many different taints. Possibly a few error metric taints from the driver, a couple of admin taints, and potentially several metric-related taints. If a device collects half a dozen taints before it gets decommissioned, that seems a rather problematic device to keep in use...

Taint names may change over time, though, and some devices in large, old clusters could eventually have collected a set of taints with obsolete names. Such things should not prevent a device from getting relevant taints. 16 should be way more than enough for that, though.

I agree that a few (3-4 at most) seems likely, perhaps even just one. 16 is meant to provide some buffer for unexpected usages.

keps/sig-scheduling/5055-device-taints-and-tolerations/README.md

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

pohly · 2025-01-31T14:21:36Z

I just finished a full pass over both README.md while @macsko was reviewing. I need to double-check whether I addressed some of his comments already.

@johnbelamaric: I also filled out the required PRR sections.

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

nojnhuh · 2025-02-03T17:48:04Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+both cases it starts with listing all devices. That information is local can
+comes either from an informer cache or a cache of patched devices.


Should this read as something like this?

Suggested change

both cases it starts with listing all devices. That information is local can

comes either from an informer cache or a cache of patched devices.

both cases it starts with listing all devices. That information is local and

comes either from an informer cache or a cache of patched devices.

nojnhuh · 2025-02-03T17:49:04Z

keps/sig-scheduling/5027-dra-admin-controlled-device-attributes/README.md

+Filtering and patching are local operations, with no impact on the cluster. To
+prevent doing the same work repeatedly, it will be implemented so that it gets
+done once and then only processes changes. This increases CPU and RAM
+consumption. But even all devices should get patched (which is unlikely), memory


Similarly here?

Suggested change

consumption. But even all devices should get patched (which is unlikely), memory

consumption. But even if all devices should get patched (which is unlikely), memory

dom4ha · 2025-02-03T22:08:49Z

Patching attributes dynamically sounds like a good idea, but I wonder if there is a similar mechanism already implemented in k8s? I haven't found any, so I wonder whether it wasn't needed, or just no one did it this way.

My only worry is that this approach brings quite a lot of complexity, especially since these overrides have to be consumed in an unknown number of places. Do you think that the benefit of updating multiple objects at once is a sufficient argument? Why is patching the objects in the background not acceptable (even when keeping the ResourceSlicePatch as a centralized way of specifying patches)?

Alternatively, shouldn't such a mechanism become a part of the framework, so that components get objects with already applied patches in informers?

k8s-ci-robot requested a review from johnbelamaric January 10, 2025 15:52

k8s-ci-robot requested a review from byako January 12, 2025 12:31

pohly commented Jan 12, 2025

k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 13, 2025

pohly mentioned this pull request Jan 13, 2025

DRA: admin-controlled device attributes #5027

Open

4 tasks

pohly commented Jan 13, 2025

everpeace mentioned this pull request Jan 14, 2025

Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) NVIDIA/k8s-dra-driver#213

Open

everpeace reviewed Jan 16, 2025

pohly mentioned this pull request Jan 20, 2025

DRA: device taints and tolerations #5055

Open

4 tasks

pohly force-pushed the dra-device-attribute-overrides branch from 531a905 to cddc84f Compare January 20, 2025 14:31

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025

pohly force-pushed the dra-device-attribute-overrides branch from cddc84f to c4a6f66 Compare January 20, 2025 14:40

DRA: add admin controlled device attributes + device taints

41cdbf5

These are two different KEPs that provide two features that can be enabled and disabled independently. However, both use the same new ResourceSliceOverride type and thus get described and implemented together.

pohly force-pushed the dra-device-attribute-overrides branch from c4a6f66 to 41cdbf5 Compare January 20, 2025 14:44

eero-t reviewed Jan 20, 2025

pohly changed the title ~~KEP-5027: DRA: admin-controlled device attributes~~ KEP-5027: DRA: admin-controlled device attributes + device taints Jan 20, 2025

everpeace reviewed Jan 21, 2025

pohly added 3 commits January 21, 2025 11:50

fixup! review feedback and updates

9db5e40

fixup! review feedback and updates

7c1b5df

fixup! review feedback and updates

378e4ad

pohly commented Jan 21, 2025

pohly commented Jan 21, 2025

fixup! move PRR yamls

bcbd75f

nojnhuh reviewed Jan 22, 2025

fixup! review feedback

935ee37

pohly changed the title ~~KEP-5027: DRA: admin-controlled device attributes + device taints~~ KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints Jan 24, 2025

nojnhuh reviewed Jan 27, 2025

asm582 reviewed Jan 28, 2025

asm582 reviewed Jan 28, 2025

asm582 reviewed Jan 28, 2025

asm582 reviewed Jan 29, 2025

pohly added 2 commits January 29, 2025 15:17

fixup! review feedback

9e966d6

DRA: support removing attributes

276b4c9

This gets added for the sake of completeness.

eero-t reviewed Jan 29, 2025

fixup! formatting of taints/tolerations API

8801928

nojnhuh reviewed Jan 29, 2025

asm582 reviewed Jan 29, 2025

asm582 reviewed Jan 29, 2025

nojnhuh reviewed Jan 29, 2025

pohly added 2 commits January 30, 2025 09:22

fixup! review feedback

f9849d9

fixup! DRA: support removing attributes

f06ce8a

pohly commented Jan 30, 2025

aaron-prindle mentioned this pull request Jan 30, 2025

Seeing inconsistency/error when using +k8s:unionMember with embedded struct jpbetz/kubernetes#94

Open

macsko reviewed Jan 31, 2025

fixup! editorial changes, PRR

786fa6c

fixup! review feedback

ac6c681

nojnhuh reviewed Jan 31, 2025

nojnhuh reviewed Feb 3, 2025