
KEP-4800: Split Uncore Cache Topology Awareness Alignment #5110

Merged · 2 commits · Feb 12, 2025

Conversation

@ajcaldelas (Contributor) commented Jan 30, 2025

  • One-line PR description: Update to the KEP adding clarity to the documentation and visual aids to help users understand the problem we are trying to solve.

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/node (Categorizes an issue or PR as relevant to SIG Node) labels Jan 30, 2025
@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Jan 30, 2025
@k8s-ci-robot (Contributor)

Hi @ajcaldelas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) label Jan 30, 2025
@ffromani (Contributor)

/ok-to-test

@k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) label and removed the needs-ok-to-test label Jan 31, 2025
@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) label and removed the size/XS label Feb 6, 2025
@ajcaldelas marked this pull request as ready for review February 6, 2025 13:04
@k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 6, 2025
@ffromani (Contributor) left a comment:

Other reviewers: I asked for clarifications to the text and for explanatory diagrams before moving further with any changes, because during the review some concepts were unnecessarily hard to read. There is no new content to be expected, only attempts to further clarify and streamline the design which was already agreed upon in the previous cycle.

It has already happened a few times that we had to clarify some KEP concepts and iterate over time, even after the fact (somewhat related examples: #3570 #2625 with the FG conversations), so I'm trying to be proactive this time.

keps/sig-node/4800-cpumanager-split-uncorecache/kep.yaml (outdated, resolved)
@ffromani (Contributor) commented Feb 6, 2025

thanks, at a glance these are very nice improvements. We will need to clean up some paperwork and I will do a complete review ASAP, but this really seems like a nice cleanup which paves the road for a much smoother future graduation.

@ajcaldelas changed the title from "Cleanup L3 Cache Alignment" to "KEP-4800: Split Topology Uncore Cache Alignment" Feb 6, 2025
@ajcaldelas changed the title from "KEP-4800: Split Topology Uncore Cache Alignment" to "KEP-4800: Split Uncore Cache Topology Awareness Alignment" Feb 6, 2025
@ffromani (Contributor) left a comment:

Thanks for this welcome cleanup. This addresses most of the open points which emerged during the implementation of the KEP and AFAICT some of the early partial feedback.

There are still some areas we can clarify and refine further in order to solidify the design and make for smooth sailing in the next stages.

In particular, let's make sure we discuss all the possible scope/design changes so that once this cycle ends we have a solid alpha feature-wise and we can proceed on the stabilization path with beta and GA in due time.

Finally, let's review the graduation criteria and make sure they are accurate with the information we have now.


- prefer-align-cpus-by-uncorecache will be compatible with the default CPU allocation logic with future plans to be compatible with the options distribute-cpus-across-numa and distribute-cpus-across-cores.
- `prefer-align-cpus-by-uncorecache` will be compatible with the default CPU allocation logic.
Contributor:

We talked during the code review about adding a required alignment option (cc @wongchar). I assume we don't want to do that though. If we want to introduce that additional option, now's the time, before we move to beta, because alpha is the stage in which we adapt depending on the feedback.

Contributor:

My thoughts were that a "strict" policy implementation would be different from the best-effort/preferred implementation.

prefer-align-cpus-by-uncorecache would be considered for beta graduation in the future while a strict policy is a new feature introduced as alpha.

This is also why I think the static policy options should be passed to takeByTopologyNUMAPacked so there can be checks for these planned features at the appropriate level. I'll make this clearer in the KEP

Contributor:

I'm fine with this approach and it makes sense. Please however be aware that we want to slow down the addition of policy options over time (#5108 (comment) and following). We don't need to decide this now, though.

Comment on lines 146 to 147
- The allocatable CPU pool and CPU resource requirement are then passed to `takeByTopology` where a NUMAPacked method or a NUMADistributed method is determined. This feature will only be compatible with the NUMAPacked method and not compatible with NUMADistributed. Distributing CPU allocations evenly across NUMAs for containers requiring more resources than a single NUMA can undermine the objective of minimizing CPU distributions across uncore caches. Workloads that are sensitive to uncore cache latency and benefit from this feature may experience increased latency due to the added cross-NUMA latency among the uncore caches.
- The CPU topology, allocatable CPU pool, and required CPUs for the container are then passed to the `takeByTopologyNUMAPacked` method. This method begins to schedule CPU resources according to the CPU topology hierarchy. We will add the Static Policy Options as a variable for this method to determine the `preferAlignCPUsByUncoreCache` boolean and `cpuSortingStrategy` boolean to be used in the allocation logic below. Passing all Static Policy options reduces the amount of variables for the method and allows for future extensions to the UncoreCache alignment policy.
Contributor:

While this is not wrong, I would prefer to add details about the algorithms and the design tradeoffs rather than being so precise about how we are going to implement the design.
If we elaborate the design but stay generic, the design has a higher chance of being future proof; OTOH, if we lean so heavily on the current implementation, a future refactoring of the code can make the design details obsolete.

Contributor:

Makes sense. Updated to remove this portion and described more of the general CPU allocation.

- The CPU topology, allocatable CPU pool, and required CPUs for the container are then passed to the `takeByTopologyNUMAPacked` method. This method begins to schedule CPU resources according to the CPU topology hierarchy. We will add the Static Policy Options as a variable for this method to determine the `preferAlignCPUsByUncoreCache` boolean and `cpuSortingStrategy` boolean to be used in the allocation logic below. Passing all Static Policy options reduces the amount of variables for the method and allows for future extensions to the UncoreCache alignment policy.
1. If the number of CPUs required for the container is equal to or larger than a total NUMA/Socket's-worth of CPUs (depending on which is larger in the hierarchy), allocate whole NUMA/sockets (`takeFullFirstLevel`).
2. If the remaining CPUs required or original requirement of CPUs for the container is equal to or larger than a total NUMA/Socket's worth of CPUs (depending on which is smaller in the hierarchy), allocate whole NUMA/sockets (`takeFullSecondLevel`).
3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
Contributor:

if the numerical order (a detail present also in the old text) is a key design detail, let's explain why; otherwise I think it is better to omit implementation details (which are free to change) from the basic design (which should change rarely, if ever)

Contributor Author:

I would think the order is a key detail because it highlights that we follow a hierarchy

Contributor:

so we need to clarify what this numerical order is and why we chose this specific ordering

Contributor:

Ordering is done based upon the CPU domain sizes from largest to smallest. I updated and provided justification for where uncore cache sits in the hierarchy

1. If the number of CPUs required for the container is equal to or larger than a total NUMA/Socket's-worth of CPUs (depending on which is larger in the hierarchy), allocate whole NUMA/sockets (`takeFullFirstLevel`).
2. If the remaining CPUs required or original requirement of CPUs for the container is equal to or larger than a total NUMA/Socket's worth of CPUs (depending on which is smaller in the hierarchy), allocate whole NUMA/sockets (`takeFullSecondLevel`).
3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
- If the required CPUs are greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unscheduled** within the group. Subtract the amount of scheduled CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore`.
Contributor:

please let's use consistent naming: scheduled vs allocated (or better define the terms)

Contributor Author:

Something like "Allocatable"?

@ffromani (Contributor) commented Feb 10, 2025:

Any consistent term is fine. In the kubelet resource allocation area we tend not to use "scheduling" to mean "resource allocation" (i.e., it's not that we "schedule" which CPUs a container can use, we "allocate" them).

Contributor:

removed all instances of "schedule"

3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
- If the required CPUs are greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unscheduled** within the group. Subtract the amount of scheduled CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore`.
- Continue scanning uncore caches in numerical order and assigning full uncore cache CPU groups until the required quantity is less than the **total** number of CPUs to a uncore cache group (`takeFullUncore`).
- If the required CPUs are less than the size of the **total** amount of CPUs to an uncore cache, scan through the remaining uncore cache index and assign CPUs if there are enough **available** CPUs within the uncore cache group. This submethod will be called `takePartialUncore`.
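
For readers who prefer code, here is a minimal Go sketch of the quoted `takeFullUncore`/`takePartialUncore` flow. The `uncoreGroup` type and helper layout are illustrative assumptions, not the kubelet's actual data structures; only the overall "whole groups first, then first group with enough free CPUs" flow follows the KEP text.

```go
package main

import "fmt"

// uncoreGroup models one uncore cache domain: its ID and which CPUs are still free.
// Illustrative type only, not the kubelet's real data structure.
type uncoreGroup struct {
	id       int
	freeCPUs []int
}

// takeByUncoreCache walks uncore caches in ascending cache-ID order.
// It first claims whole, fully-unallocated uncore groups (takeFullUncore),
// then falls back to the first group with enough free CPUs (takePartialUncore).
func takeByUncoreCache(groups []uncoreGroup, groupSize, numCPUs int) []int {
	assigned := []int{}

	// takeFullUncore: claim entire unallocated uncore groups while the
	// remaining request is at least one full group.
	for i := range groups {
		if numCPUs < groupSize {
			break
		}
		if len(groups[i].freeCPUs) == groupSize { // group is fully unallocated
			assigned = append(assigned, groups[i].freeCPUs...)
			numCPUs -= groupSize
			groups[i].freeCPUs = nil
		}
	}

	// takePartialUncore: first-fit scan for a group that can hold the remainder.
	if numCPUs > 0 {
		for i := range groups {
			if len(groups[i].freeCPUs) >= numCPUs {
				assigned = append(assigned, groups[i].freeCPUs[:numCPUs]...)
				groups[i].freeCPUs = groups[i].freeCPUs[numCPUs:]
				numCPUs = 0
				break
			}
		}
	}
	// If numCPUs is still > 0, the caller would fall back to the default
	// (non-aligned) allocation path, since the option is best-effort.
	return assigned
}

func main() {
	groups := []uncoreGroup{
		{id: 0, freeCPUs: []int{2, 3, 4, 5, 6, 7}}, // CPUs 0-1 reserved
		{id: 1, freeCPUs: []int{8, 9, 10, 11, 12, 13, 14, 15}},
	}
	// A 10-CPU request takes all of group 1, then 2 CPUs from group 0.
	fmt.Println(takeByUncoreCache(groups, 8, 10))
}
```

The sketch deliberately leaves the fallback path out: per the KEP, when no uncore-aligned placement exists, the request is still served by the default packed logic.
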
Contributor:

let's clarify if we prefer to do packed allocation within uncore groups or not, and why.
IOW, uncore group size is 8 (random number). We have 4 uncore groups. Uncore group 0 is partially allocated (4 cores), all the others are free.
We need to allocate 4 cores. Do we pick uncore group 0? (that would be packed allocation within uncores) or do we pick another uncore group like, say, 1?

These design decisions help the reader understand the overall approach. Both approaches have pros and cons; we need to record why we pick one.

Contributor:

Added a new example with different sized containers that should help answer these questions. Let me know if this is still not clear! :)


A new CPUManager policy option, prefer-align-cpus-by-uncorecache, will be introduced to enable this feature. When enabled, the existing default scheduling function will be modified to account for UncoreCache as a preference. The allocation process will first attempt to allocate CPUs from an entire socket. Next, it will attempt to allocate CPUs within the logical boundary of a NUMA node within the socket. Finally, the allocation will consider the subset of CPUs associated with each UncoreCache within the previously selected CPUs. While UncoreCache allocation is preferred, it is not strictly required; this policy will not fail if alignment is not possible. Since this policy extends the CPUManager’s Static policy, it will only apply to guaranteed pods with whole CPU requests.
In the case where the NUMA boundary is larger than a socket (setting NPS0 on a dual-socket system), the full node will be scheduled to the container if it requires the total NUMA amount of CPUs. Otherwise, if the CPU requirement for the container is less than the total CPUs of the NUMA node, the logic will begin with `takeFullSecondLevel`. The node cannot be overcommitted.
Contributor:

did we define what NPS is (and shortly after Sub-NUMA clustering) in this document?


In the case where the NUMA boundary is smaller than an uncore cache (enabling Sub-NUMA Clustering on a monolithic cache system), the allocatable pool of CPUs does not expand beyond the respective NUMA boundary when uncore cache hints are provided in the above scheduling policy. The filtered subset of CPUs will still remain within the NUMA boundary.
The following shows the two different CPU assignment outcomes for the default static policy allocation and the `preferAlignCPUsByUncoreCache` option. Assuming both containers are running uncore cache sensitive workloads, Container 1 and Container 2 will suffer from the noisy neighbor effect due to both workloads competing to access Uncore Cache 0. Container 2 will suffer from cross-cache latency as CPUs 6&7 need to access Uncore Cache 1 and CPUs 8-15 need to access Uncore Cache 0.
Contributor:

let's use the external name users can set for the option, not the internal one

Contributor Author:

Using prefer-align-cpus-by-uncorecache instead

Comment on lines 171 to 176
Uncore Cache sensitive applications can be deployed on the edge where lower core count and more efficient systems are utilized. These applications often require fewer CPUs than the total CPUs of an Uncore Cache group. Newer split architecture processors have increased uncore cache size and bandwidth, which has helped reduce the noisy neighbor problem. However, the workloads can still be sensitive to cross cache latency. The example below helps illustrate why `takePartialUncore` is important to fully maximize nodes that have fewer uncore caches available. For the example below, a 16-core split uncore cache processor is scheduling the following Guaranteed containers using the static policy with `preferAlignCPUsByUncoreCache` enabled and disabled.
- Reserved CPUs: 0-1
- Allocatable CPUs: 2-15
- Container 1: 4 CPUs
- Container 2: 4 CPUs
- Container 3: 4 CPUs
Contributor:

this seems relevant to the previous explanation about packed vs distributed allocation at the uncore level. I'd move that above and refer to it here.


This scheduling policy will minimize the distribution of containers across uncore caches to improve performance while still maintaining the default packed logic. The scope will initially be narrowed to implementing uncore cache alignment for the default static scheduling behavior. The table below summarizes future enhancement plans to make uncore cache alignment compatible with the distributed scheduling policies to reduce contention/noisy neighbor effects.


| Compatibility | alpha | beta | GA |
| --- | --- | --- | --- |
| full-pcpus-only | x | x | x |
| distribute-cpus-across-numa | | x | x |
| align-by-socket | x | x | x |
| distribute-cpus-across-numa | | | |
Contributor:

this is a change wrt the previous alpha iteration. Which is fine. But we should split the functional changes (design changes) from the clarifications of the existing design.

Ideally:

  1. commit(s) to clarify the existing design, adding examples, diagrams
  2. commit(s) to adjust the design and the scope e.g. removing compatibility with distribute-cpus-across-numa with the rationale for the design/scope changes

Contributor:

Restored the old compatibility table. I will submit another PR for this KEP that trails this one to justify the scope change.

Contributor:

thanks, but no need for another PR. It's more than enough to just split the compatibility change in its own commit.

- 1P Intel Core i7-12850HX
- 1P ARM Ampere Altra 128c
- AWS Graviton
Since this feature is a best-effort policy and will not evict containers that are unable to be restricted to CPUs that belong to a single uncore cache, a metric will need to be provided to let the user identify if the container is uncore cache aligned as well.
Contributor:

note that cpumanager never evicts workloads in general

Contributor Author:

Since this feature is a best-effort policy, a metric will need to be provided to let the user identify if the container uncore cache alignment was successful.

@ajcaldelas force-pushed the split_l3_cache branch 2 times, most recently from a088d0e to 2743ed1, February 7, 2025 15:52
@k8s-ci-robot added the sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery), sig/apps (Categorizes an issue or PR as relevant to SIG Apps), and sig/cli (Categorizes an issue or PR as relevant to SIG CLI) labels Feb 11, 2025
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files) label Feb 11, 2025
@k8s-ci-robot added the size/L label and removed the size/XXL label Feb 11, 2025
@ffromani (Contributor)

/remove-sig api-machinery
/remove-sig apps
/remove-sig cli
/remove-sig network
/remove-sig scheduling
/remove-sig storage
/remove-sig ui

please rebase less aggressively; it makes our review harder and, as a side effect, we can have bad pushes which unnecessarily involve unrelated SIGs.

@k8s-ci-robot removed the sig/api-machinery, sig/apps, sig/cli, sig/network (Categorizes an issue or PR as relevant to SIG Network), sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling), sig/storage (Categorizes an issue or PR as relevant to SIG Storage), and sig/ui (Categorizes an issue or PR as relevant to SIG UI) labels Feb 11, 2025
@ffromani (Contributor) left a comment:

Thanks a lot folks, the last iteration looks much nicer.
Things we lack:
1. the KEP right in its summary talks about "uncore cache"; we should spend 1-2 lines, tops, on a succinct explanation of what an "uncore cache" is. It's fine (actually better) to defer the detailed documentation to a 3rd-party, preferably vendor-neutral, source
2. the design details look much improved. I still need to do a proper pass, but at a glance LGTM
3. it's fine to drop compatibility with distribute-cpus-across-numa. We don't need a separate KEP, we just need to do that small scope change in a separate commit in addition to this one.

PTAL below for updated pending items.
provisional LGTM

@ffromani (Contributor) left a comment:

ok, very good. Much improved, thanks a lot. We're now in pretty good shape. There are just a few missing adjustments to discuss and then LGTM.


The algorithm for prefer-align-cpus-by-uncorecache will be implemented to follow the default packed behavior of the existing static CPUManager allocation with the introduction of an uncorecache hierarchy. This means that when a guaranteed container requires CPUs that are equal to or greater than the size of a NUMA or socket, the CPU allocation will behave as usual and schedule the full socket or NUMA. The scheduled CPUs will be subtracted from the quantity of CPUs required for the container.
- Scan through every uncore cache index in numerical order, for each uncore cache block:
- First, check if the current required total number of CPUs for the container is greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unallocated** within the group. Subtract the amount of assigned CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore` and follows the same allocation principle as `takeFullFirstLevel` and `takeFullSecondLevel` where CPU allocations will be done based on the domain-level. Repeat `takeFullUncore` until the required number of total CPUs is satisfied or less than the amount of CPUs available on an Uncore Cache.
Contributor:

this means an allocation can span multiple uncore groups. Which brings up the question of how the workload can optimally use the allocated CPUs, because a naive allocation can still very much cause cross-uncore-group communication, reintroducing all the issues we are trying to solve.

There's no way to solve this problem, only to avoid it by failing allocations which are not uncore-aligned, much like the single-numa-node topology manager option.
But going that direction can open a very big pandora's box. So I think the best course of action is to acknowledge this and defer the solution to the workload: the workload is expected to query the system (e.g. using sysfs) to learn the detailed topology of the CPUs it got assigned, and perform its own thread pinning as it sees fit.
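
As a concrete illustration of the "defer to the workload" approach, here is a small Go sketch that reads the CPU-to-last-level-cache mapping from sysfs. It assumes the uncore (L3) cache is exposed at `cache/index3`, which is typical on Linux but not guaranteed on every platform; it is an editor-provided example, not part of the KEP.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// lastLevelCacheGroups maps each shared_cpu_list string (one per uncore/L3
// cache) to the CPUs that report it, as seen in sysfs. index3 is the usual
// location of the L3 cache on Linux; adjust if the topology differs.
func lastLevelCacheGroups() (map[string][]string, error) {
	groups := map[string][]string{}
	cpuDirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, dir := range cpuDirs {
		data, err := os.ReadFile(filepath.Join(dir, "cache/index3/shared_cpu_list"))
		if err != nil {
			continue // CPU offline or no L3 cache exposed
		}
		shared := strings.TrimSpace(string(data)) // e.g. "0-7"
		groups[shared] = append(groups[shared], filepath.Base(dir))
	}
	return groups, nil
}

func main() {
	groups, err := lastLevelCacheGroups()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A workload whose assigned CPUs span two keys here knows it crosses
	// uncore caches and can pin its threads accordingly.
	for shared, cpus := range groups {
		fmt.Printf("uncore cache covering CPUs %s: %v\n", shared, cpus)
	}
}
```
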

Contributor:

updated KEP with some justifications:

  • feature is intended more for "smaller" cpu requirements but won't restrict "larger" cpu requirements
  • optimization potential increases with fewer uncore cache CPU requirements
  • Defer responsibility to the workload if the container needs a "large" CPU requirement yet still wants to affine a subset of threads to an uncore cache

Comment on lines 148 to 154
- `takePartialUncore` diverges from the allocation principles previously used where full CPU domains are allocated at a time. This is necessary since this is a best-effort implementation intended to balance maximizing node performance while minimizing uncore cache distribution. Uncore Cache sensitive applications can be deployed on the edge where lower core count systems are utilized. These applications often require fewer CPUs than the total number of CPUs of an uncore cache domain. Newer split architecture processors have increased uncore cache size and bandwidth, which has helped reduce the noisy neighbor problem. However, the workloads can still be sensitive to cross-cache latency. The example below helps illustrate why `takePartialUncore` is important to fully maximize nodes that have fewer uncore caches available. For the example below, a 16-core split uncore cache processor is allocated to the following Guaranteed containers using the static policy with `prefer-align-cpus-by-uncorecache` enabled and disabled.
- Reserved CPUs: 0-1
- Allocatable CPUs: 2-15
- Container 1: 4 CPUs
- Container 2: 4 CPUs
- Container 3: 4 CPUs
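
To make the expected outcome of this example explicit, the short Go snippet below hard-codes the two allocations under the assumption that the 16-core part exposes two uncore caches of 8 CPUs each (cache 0 = CPUs 0-7, cache 1 = CPUs 8-15), with no SMT and the default static policy packing CPUs in ascending ID order. The exact CPU numbers are an editor's illustration, not taken verbatim from the KEP.

```go
package main

import "fmt"

func main() {
	// Assumed topology: uncore cache 0 = CPUs 0-7, uncore cache 1 = CPUs 8-15,
	// CPUs 0-1 reserved, containers 1-3 each request 4 CPUs.

	// Default static policy (packed, cache-unaware): container 2 straddles
	// both uncore caches.
	defaultAlloc := map[string]string{
		"container1": "2-5",   // uncore cache 0
		"container2": "6-9",   // spans uncore caches 0 and 1
		"container3": "10-13", // uncore cache 1
	}

	// With prefer-align-cpus-by-uncorecache: takePartialUncore keeps each
	// container within a single uncore cache; CPUs 6-7 stay free until a
	// request small enough to fit in cache 0 arrives.
	alignedAlloc := map[string]string{
		"container1": "2-5",   // uncore cache 0
		"container2": "8-11",  // uncore cache 1
		"container3": "12-15", // uncore cache 1
	}

	fmt.Println("default:", defaultAlloc)
	fmt.Println("aligned:", alignedAlloc)
}
```
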
Contributor:

thanks, this is good. Can we emphasize more that takePartialUncore wants to sort uncore cache groups in descending order according to most available (= all CPUs free) to least available (= all CPUs allocated) and do a first-fit allocation taking from this ordered sequence?

We can infer this from the text but I'd like to make it explicit. The text you currently have does a good job explaining the why and the tradeoffs; I'd just like to have 1-2 explicit lines about this key decision to make it very obvious to the reader.
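
A couple of lines of Go make the suggested ordering concrete: sort the uncore groups from most free CPUs to least, then take the first group that fits. This is a sketch of the reviewer's proposal only, not the current kubelet implementation (which, as noted in the reply below, iterates in cache-ID order); the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pickUncoreGroup returns the index of the uncore group that should serve a
// request of numCPUs under the proposed ordering: scan groups from most free
// CPUs to least and return the first one that can hold the whole request.
// Returns -1 if no single group fits (caller falls back to best-effort).
func pickUncoreGroup(freeCPUsPerGroup []int, numCPUs int) int {
	order := make([]int, len(freeCPUsPerGroup))
	for i := range order {
		order[i] = i
	}
	// Most-available first; stable sort keeps lower cache IDs first on ties.
	sort.SliceStable(order, func(a, b int) bool {
		return freeCPUsPerGroup[order[a]] > freeCPUsPerGroup[order[b]]
	})
	for _, id := range order {
		if freeCPUsPerGroup[id] >= numCPUs {
			return id
		}
	}
	return -1
}

func main() {
	// Uncore group 0 is half used (4 free), groups 1-3 are fully free (8 each).
	free := []int{4, 8, 8, 8}
	fmt.Println(pickUncoreGroup(free, 4)) // 1: prefer an untouched group over packing into group 0
}
```

The worked call shows the tradeoff discussed earlier in the thread: availability-ordered first-fit leaves partially used groups alone, whereas cache-ID-ordered iteration would pack the new container into group 0.
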

Contributor:

Just to clarify, the current code implementation iterates through each uncore cache in numerical order by cacheID. This somewhat follows the "packed" concept. The current implementation does not sort uncore caches from most available to least available (CPU-wise).

I'm not opposed to trying to make this more efficient in the future. Let me know your thoughts.

Contributor:

uhm, I thought I understood but it seems this keeps slipping away. Not good. Let's summarize.

  1. even in the context of best-effort allocation, sorting the uncore groups by availability (most free first, least free last) and then doing a first-match scan makes sense to me and has benefits as the very explanation being added clearly shows.
  2. but this is not what the current code does, we take the default uncore ordering and iterate naively over it (disclaimer: I didn't review the code, while I probably should). Not great, but OK
  3. so in the context of the KEP, what should we do? Sort+iterate? If so, we need to emphasize the design per my initial suggestion. Just iterate like the current code does? If so, we need to explain in the design why we do that and outline future improvements in the future improvements section. But let me stress: the KEP is the design. What we SHOULD do, with plans to converge the implementation towards the design.

My take is there is no reason to leave a better algorithm on the table, and the design (= the KEP) should aim for it unless there are clear reasons not to, which I'd love to see clarified here.

As for adapting the implementation, it can be either an alpha or a beta item.


HW Test Matrix
As this feature operates on a best-effort basis and does not prevent CPU allocation to containers that cannot be restricted to a single uncore cache, a metric will need to be provided to help users determine whether a container is uncore cache aligned. The PR corresponding to this work can be referenced here: [PR 129529: Uncore Cache Alignment Metric](https://github.com/kubernetes/kubernetes/pull/129529)
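
For illustration only, a generic Prometheus-style counter like the sketch below could expose alignment outcomes. The metric name and label here are made up and are not the metric added by the referenced kubelet PR; the kubelet itself registers metrics through its component-base/metrics wrappers rather than raw client_golang.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// containerAlignedTotal counts container CPU allocations, labelled by whether
// the allocation ended up fully within a single uncore cache. Hypothetical
// metric for illustration only.
var containerAlignedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "container_uncore_cache_aligned_total",
		Help: "Container CPU allocations by uncore cache alignment outcome.",
	},
	[]string{"aligned"},
)

// recordAllocation would be called after each container CPU allocation.
func recordAllocation(aligned bool) {
	if aligned {
		containerAlignedTotal.WithLabelValues("true").Inc()
	} else {
		containerAlignedTotal.WithLabelValues("false").Inc()
	}
}

func main() {
	prometheus.MustRegister(containerAlignedTotal)
	recordAllocation(true) // e.g. a successful uncore-aligned allocation
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```
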
Contributor:

no need to reference PRs: these will have a lifecycle much shorter than the KEP itself, so the reference will become obsolete soon. Just mention that this work will be done in the context of the alpha graduation.

Contributor:

Updated, removed PR and mentioned metric scope as graduation criteria as well

@ffromani (Contributor)

Thanks a lot folks, the last iteration looks much nicer.
Things we lack:

  1. the KEP right in its summary talks about "uncore cache"; we should spend 1-2 lines, tops, on a succinct explanation of what an "uncore cache" is. It's fine (actually better) to defer the detailed documentation to a 3rd-party, preferably vendor-neutral, source
  2. PTAL to the inline comments. Some are just discussion items, some require attention. These are all pretty minor though.
  3. it's fine to drop compatibility with distribute-cpus-across-numa. We don't need a separate KEP, we just need to do that small scope change in a separate commit in addition to this one.

provisional LGTM

@ffromani (Contributor)

/lgtm

the open point is about an optimization which can be deferred to beta, mostly because we are out of time.
I think what we need to do and why are now clarified enough.

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Feb 12, 2025
@ffromani (Contributor) commented Feb 12, 2025

/assign @haircommander
/assign @SergeyKanzhelev

PTAL, it would be nice to have these clarifications in for 1.33, but if they slip it's not too big of a deal; we will build on top of these for the beta graduation, which will then be a very small changeset.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajcaldelas, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Feb 12, 2025
@k8s-ci-robot merged commit db3a595 into kubernetes:master Feb 12, 2025
2 of 4 checks passed
@k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
Labels: approved, cncf-cla: yes, kind/kep, lgtm, ok-to-test, sig/instrumentation, sig/node, size/L