
KEP-4800: Split Uncore Cache Topology Awareness Alignment #5110

Merged · 2 commits · Feb 12, 2025

Conversation

@ajcaldelas (Contributor) commented Jan 30, 2025

  • One-line PR description: Update to the KEP adding clarity to the documentation and visual aids to help users understand the problem we are trying to solve.

@k8s-ci-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/node (Categorizes an issue or PR as relevant to SIG Node) labels Jan 30, 2025
@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) label Jan 30, 2025
@k8s-ci-robot (Contributor)

Hi @ajcaldelas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) label Jan 30, 2025
@ffromani (Contributor)

/ok-to-test

@k8s-ci-robot added the ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) label and removed the needs-ok-to-test label Jan 31, 2025
@k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) label and removed the size/XS label Feb 6, 2025
@ajcaldelas marked this pull request as ready for review February 6, 2025 13:04
@k8s-ci-robot removed the do-not-merge/work-in-progress label Feb 6, 2025
@ffromani (Contributor) left a comment:

Other reviewers: I asked for clarifications to the text and for explanatory diagrams before moving further with any changes, because during the review some concepts were unnecessarily hard to read. There is no new content to be expected, only attempts to further clarify and streamline the design which was already agreed upon in the previous cycle.

It has already happened a few times that we had to clarify some KEP concepts and iterate over time, even after the fact (somewhat related examples: #3570 #2625 with the FG conversations), so I'm trying to be proactive this time.

keps/sig-node/4800-cpumanager-split-uncorecache/kep.yaml (outdated, resolved)
@ffromani (Contributor) commented Feb 6, 2025

thanks, at a glance these are very nice improvements. We will need to clean up some paperwork and I will do a complete review ASAP, but this really seems like a nice cleanup which paves the road for a much smoother future graduation.

@ajcaldelas changed the title from "Cleanup L3 Cache Alignment" to "KEP-4800: Split Topology Uncore Cache Alignment" Feb 6, 2025
@ajcaldelas changed the title from "KEP-4800: Split Topology Uncore Cache Alignment" to "KEP-4800: Split Uncore Cache Topology Awareness Alignment" Feb 6, 2025
@ffromani (Contributor) left a comment:

Thanks for this welcome cleanup. This addresses most of the open points which emerged during the implementation of the KEP and AFAICT some of the early partial feedback.

There are still some areas we can clarify and refine further in order to solidify the design and make for smooth sailing in the next stages.

In particular, let's make sure we discuss all the possible scope/design changes so that once this cycle ends we have a solid alpha feature-wise and we can proceed on the stabilization path with beta and GA in due time.

Finally, let's review the graduation criteria and make sure they are accurate with the information we have now.


- prefer-align-cpus-by-uncorecache will be compatible with the default CPU allocation logic with future plans to be compatible with the options distribute-cpus-across-numa and distribute-cpus-across-cores.
- `prefer-align-cpus-by-uncorecache` will be compatible with the default CPU allocation logic.
Contributor:

We talked during the code review about adding a required alignment option (cc @wongchar). I assume we don't want to do that though. If we want to introduce that additional option, now's the time, before we move to beta, because alpha is the stage in which we adapt depending on the feedback.

Contributor:

My thoughts were that a "strict" policy implementation would be different from the best-effort/preferred implementation.

prefer-align-cpus-by-uncorecache would be considered for beta graduation in the future while a strict policy is a new feature introduced as alpha.

This is also why I think the static policy options should be passed to takeByTopologyNUMAPacked so there can be checks for these planned features at the appropriate level. I'll make this clearer in the KEP

Contributor:

I'm fine with this approach and it makes sense. Please however be aware that we want to slow down the addition of policy options over time (#5108 (comment) and following). We don't need to decide this now, though.

Comment on lines 146 to 147
- The allocatable CPU pool and CPU resource requirement are then passed to `takeByTopology` where a NUMAPacked method or a NUMADistributed method is determined. This feature will only be compatible with the NUMAPacked method and not compatible with NUMADistributed. Distributing CPU allocations evenly across NUMAs for containers requiring more resources than a single NUMA can undermine the objective of minimizing CPU distributions across uncore caches. Workloads that are sensitive to uncore cache latency and benefit from this feature may experience increased latency due to the added cross-NUMA latency among the uncore caches.
- The CPU topology, allocatable CPU pool, and required CPUs for the container are then passed to the `takeByTopologyNUMAPacked` method. This method begins to schedule CPU resources according to the CPU topology hierarchy. We will add the Static Policy Options as a variable for this method to determine the `preferAlignCPUsByUncoreCache` boolean and `cpuSortingStrategy` boolean to be used in the allocation logic below. Passing all Static Policy options reduces the amount of variables for the method and allows for future extensions to the UncoreCache alignment policy.
Contributor:

While this is not wrong, I would prefer to add details about the algorithms and the design tradeoffs rather than being so precise about how we are going to implement the design.
If we elaborate the design but stay generic, the design has a higher chance of being future proof; OTOH, if we lean so heavily on the current implementation, a future refactoring of the code can make the design details obsolete.

Contributor:

Makes sense. Updated to remove this portion and described more of the general CPU allocation.

- The CPU topology, allocatable CPU pool, and required CPUs for the container are then passed to the `takeByTopologyNUMAPacked` method. This method begins to schedule CPU resources according to the CPU topology hierarchy. We will add the Static Policy Options as a variable for this method to determine the `preferAlignCPUsByUncoreCache` boolean and `cpuSortingStrategy` boolean to be used in the allocation logic below. Passing all Static Policy options reduces the amount of variables for the method and allows for future extensions to the UncoreCache alignment policy.
1. If the number of CPUs required for the container is equal to or larger than a total NUMA/Socket's-worth of CPUs (depending on which is larger in the hierarchy), allocate whole NUMA/sockets (`takeFullFirstLevel`).
2. If the remaining CPUs required or original requirement of CPUs for the container is equal to or larger than a total NUMA/Socket's worth of CPUs (depending on which is smaller in the hierarchy), allocate whole NUMA/sockets (`takeFullSecondLevel`).
3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
Contributor:

if the numerical order (a detail present also in the old text) is a key design detail, let's explain why; otherwise I think it is better to omit implementation details (which are free to change) from the basic design (which should change rarely, if ever)

Contributor Author:

I would think the order is a key detail because it highlights that we follow a hierarchy

Contributor:

so we need to clarify what this numerical order is and why we chose this specific ordering

Contributor:

Ordering is done based upon the CPU domain sizes from largest to smallest. I updated and provided justification for where uncore cache sits in the hierarchy

1. If the number of CPUs required for the container is equal to or larger than a total NUMA/Socket's-worth of CPUs (depending on which is larger in the hierarchy), allocate whole NUMA/sockets (`takeFullFirstLevel`).
2. If the remaining CPUs required or original requirement of CPUs for the container is equal to or larger than a total NUMA/Socket's worth of CPUs (depending on which is smaller in the hierarchy), allocate whole NUMA/sockets (`takeFullSecondLevel`).
3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
- If the required CPUs are greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unscheduled** within the group. Subtract the amount of scheduled CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore`.
Contributor:

please let's use consistent naming: scheduled vs allocated (or better define the terms)

Contributor Author:

Something like "Allocatable"?

@ffromani (Contributor) commented Feb 10, 2025:

Any consistent term is fine. In the kubelet resource allocation area we tend not to use "scheduling" to mean "resource allocation" (i.e., it's not that we "schedule" which CPUs a container can use, we "allocate" them).

Contributor:

removed all instances of "schedule"

3. **Added feature**: If `preferAlignCPUsByUncoreCache` is enabled, use the `takeUncoreCache` method. Scan through every uncore cache index in numerical order:
- If the required CPUs are greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unscheduled** within the group. Subtract the amount of scheduled CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore`.
- Continue scanning uncore caches in numerical order and assigning full uncore cache CPU groups until the required quantity is less than the **total** number of CPUs to a uncore cache group (`takeFullUncore`).
- If the required CPUs are less than the size of the **total** amount of CPUs to an uncore cache, scan through the remaining uncore cache index and assign CPUs if there are enough **available** CPUs within the uncore cache group. This submethod will be called `takePartialUncore`.
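
For readers who prefer code, here is a minimal Go sketch of the quoted `takeFullUncore`/`takePartialUncore` flow. The `uncoreGroup` type and helper layout are illustrative assumptions, not the kubelet's actual data structures; only the overall "whole groups first, then first group with enough free CPUs" flow follows the KEP text.

```go
package main

import "fmt"

// uncoreGroup models one uncore cache domain: its ID and which CPUs are still free.
// Illustrative type only, not the kubelet's real data structure.
type uncoreGroup struct {
	id       int
	freeCPUs []int
}

// takeByUncoreCache walks uncore caches in ascending cache-ID order.
// It first claims whole, fully-unallocated uncore groups (takeFullUncore),
// then falls back to the first group with enough free CPUs (takePartialUncore).
func takeByUncoreCache(groups []uncoreGroup, groupSize, numCPUs int) []int {
	assigned := []int{}

	// takeFullUncore: claim entire unallocated uncore groups while the
	// remaining request is at least one full group.
	for i := range groups {
		if numCPUs < groupSize {
			break
		}
		if len(groups[i].freeCPUs) == groupSize { // group is fully unallocated
			assigned = append(assigned, groups[i].freeCPUs...)
			numCPUs -= groupSize
			groups[i].freeCPUs = nil
		}
	}

	// takePartialUncore: first-fit scan for a group that can hold the remainder.
	if numCPUs > 0 {
		for i := range groups {
			if len(groups[i].freeCPUs) >= numCPUs {
				assigned = append(assigned, groups[i].freeCPUs[:numCPUs]...)
				groups[i].freeCPUs = groups[i].freeCPUs[numCPUs:]
				numCPUs = 0
				break
			}
		}
	}
	// If numCPUs is still > 0, the caller would fall back to the default
	// (non-aligned) allocation path, since the option is best-effort.
	return assigned
}

func main() {
	groups := []uncoreGroup{
		{id: 0, freeCPUs: []int{2, 3, 4, 5, 6, 7}}, // CPUs 0-1 reserved
		{id: 1, freeCPUs: []int{8, 9, 10, 11, 12, 13, 14, 15}},
	}
	// A 10-CPU request takes all of group 1, then 2 CPUs from group 0.
	fmt.Println(takeByUncoreCache(groups, 8, 10))
}
```

The sketch deliberately leaves the fallback path out: per the KEP, when no uncore-aligned placement exists, the request is still served by the default packed logic.
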
Contributor:

let's clarify if we prefer to do packed allocation within uncore groups or not, and why.
IOW, uncore group size is 8 (random number). We have 4 uncore groups. Uncore group 0 is partially allocated (4 cores), all the others are free.
We need to allocate 4 cores. Do we pick uncore group 0? (that would be packed allocation within uncores) or do we pick another uncore group like, say, 1?

These design decisions help the reader understand the overall approach. Both approaches have pros and cons; we need to record why we pick one.

Contributor:

Added a new example with different sized containers that should help answer these questions. Let me know if this is still not clear! :)


A new CPUManager policy option, prefer-align-cpus-by-uncorecache, will be introduced to enable this feature. When enabled, the existing default scheduling function will be modified to account for UncoreCache as a preference. The allocation process will first attempt to allocate CPUs from an entire socket. Next, it will attempt to allocate CPUs within the logical boundary of a NUMA node within the socket. Finally, the allocation will consider the subset of CPUs associated with each UncoreCache within the previously selected CPUs. While UncoreCache allocation is preferred, it is not strictly required; this policy will not fail if alignment is not possible. Since this policy extends the CPUManager’s Static policy, it will only apply to guaranteed pods with whole CPU requests.
In the case where the NUMA boundary is larger than a socket (setting NPS0 on a dual-socket system), the full node will be scheduled to the container if it requires the total NUMA amount of CPUs. Otherwise, if the CPU requirement for the container is less than the total CPUs of the NUMA node, the logic will begin with `takeFullSecondLevel`. The node cannot be overcommitted.
Contributor:

did we define what NPS is (and shortly after Sub-NUMA clustering) in this document?


In the case where the NUMA boundary is smaller than an uncore cache (enabling Sub-NUMA Clustering on a monolithic cache system), the allocatable pool of CPUs does not expand beyond the respective NUMA boundary when uncore cache hints are provided in the above scheduling policy. The filtered subset of CPUs will still remain within the NUMA boundary.
The following shows the two different CPU assignment outcomes for the default static policy allocation and the `preferAlignCPUsByUncoreCache` option. Assuming both containers are running uncore cache sensitive workloads, Container 1 and Container 2 will suffer from the noisy neighbor effect due to both workloads competing to access Uncore Cache 0. Container 2 will suffer from cross-cache latency as CPUs 6&7 need to access Uncore Cache 1 and CPUs 8-15 need to access Uncore Cache 0.
Contributor:

let's use the external name users can set for the option, not the internal one

Contributor Author:

Using prefer-align-cpus-by-uncorecache instead

Comment on lines 171 to 176
Uncore Cache sensitive applications can be deployed on the edge where lower core count and more efficient systems are utilized. These applications often require fewer CPUs than the total CPUs of an Uncore Cache group. Newer split architecture processors have increased uncore cache size and bandwidth, which has helped reduce the noisy neighbor problem. However, the workloads can still be sensitive to cross cache latency. The example below helps illustrate why `takePartialUncore` is important to fully maximize nodes that have fewer uncore caches available. For the example below, a 16-core split uncore cache processor is scheduling the following Guaranteed containers using the static policy with `preferAlignCPUsByUncoreCache` enabled and disabled.
- Reserved CPUs: 0-1
- Allocatable CPUs: 2-15
- Container 1: 4 CPUs
- Container 2: 4 CPUs
- Container 3: 4 CPUs
Contributor:

this seems relevant to the previous explanation about packed vs distributed allocation at the uncore level. I'd move that above and refer to it here.


This scheduling policy will minimize the distribution of containers across uncore caches to improve performance while still maintaining the default packed logic. The scope will initially be narrowed to implementing uncore cache alignment for the default static scheduling behavior. The table below summarizes future enhancement plans to make uncore cache alignment compatible with the distributed scheduling policies to reduce contention/noisy neighbor effects.


| Compatibility | alpha | beta | GA |
| --- | --- | --- | --- |
| full-pcpus-only | x | x | x |
| distribute-cpus-across-numa | | x | x |
| align-by-socket | x | x | x |
| distribute-cpus-across-numa | | | |
Contributor:

this is a change wrt the previous alpha iteration. Which is fine. But we should split the functional changes (design changes) from the clarifications of the existing design.

Ideally:

  1. commit(s) to clarify the existing design, adding examples, diagrams
  2. commit(s) to adjust the design and the scope e.g. removing compatibility with distribute-cpus-across-numa with the rationale for the design/scope changes

Contributor:

Restored the old compatibility table. I will submit another PR for this KEP that trails this one to justify the scope change.

Contributor:

thanks, but no need for another PR. It's more than enough to just split the compatibility change in its own commit.

- 1P Intel Core i7-12850HX
- 1P ARM Ampere Altra 128c
- AWS Graviton
Since this feature is a best-effort policy and will not evict containers that are unable to be restricted to CPUs that belong to a single uncore cache, a metric will need to be provided to let the user identify if the container is uncore cache aligned as well.
Contributor:

note that cpumanager never evicts workloads in general

Contributor Author:

Since this feature is a best-effort policy, a metric will need to be provided to let the user identify if the container uncore cache alignment was successful.

@ajcaldelas force-pushed the split_l3_cache branch 2 times, most recently from a088d0e to 2743ed1, February 7, 2025 15:52
@k8s-ci-robot added the sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery), sig/apps (Categorizes an issue or PR as relevant to SIG Apps), and sig/cli (Categorizes an issue or PR as relevant to SIG CLI) labels Feb 11, 2025
@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files) label Feb 11, 2025
@k8s-ci-robot added the size/L label and removed the size/XXL label Feb 11, 2025
@ffromani (Contributor)

/remove-sig api-machinery
/remove-sig apps
/remove-sig cli
/remove-sig network
/remove-sig scheduling
/remove-sig storage
/remove-sig ui

please rebase less aggressively; it makes our review harder and, as a side effect, we can have bad pushes which unnecessarily involve unrelated SIGs.

@k8s-ci-robot removed the sig/api-machinery, sig/apps, sig/cli, sig/network (Categorizes an issue or PR as relevant to SIG Network), sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling), sig/storage (Categorizes an issue or PR as relevant to SIG Storage), and sig/ui (Categorizes an issue or PR as relevant to SIG UI) labels Feb 11, 2025
@ffromani (Contributor) left a comment:

Thanks a lot folks, the last iteration looks much nicer.
Things we lack:
1. the KEP right in its summary talks about "uncore cache"; we should spend 1-2 lines, tops, on a succinct explanation of what an "uncore cache" is. It's fine (actually better) to defer the detailed documentation to a 3rd-party, preferably vendor-neutral, source
2. the design details look much improved. I still need to do a proper pass, but at a glance LGTM
3. it's fine to drop compatibility with distribute-cpus-across-numa. We don't need a separate KEP, we just need to do that small scope change in a separate commit in addition to this one.

PTAL below for updated pending items.
provisional LGTM

@ffromani (Contributor) left a comment:

ok, very good. Much improved, thanks a lot. We're now in pretty good shape. There are just a few missing adjustments to discuss and then LGTM.


The algorithm for prefer-align-cpus-by-uncorecache will be implemented to follow the default packed behavior of the existing static CPUManager allocation with the introduction of an uncorecache hierarchy. This means that when a guaranteed container requires CPUs that are equal to or greater than the size of a NUMA or socket, the CPU allocation will behave as usual and schedule the full socket or NUMA. The scheduled CPUs will be subtracted from the quantity of CPUs required for the container.
- Scan through every uncore cache index in numerical order, for each uncore cache block:
- First, check if the current required total number of CPUs for the container is greater than or equal the size of the **total** amount of CPUs to an uncore cache, assign the full uncore cache CPUs if they are all **unallocated** within the group. Subtract the amount of assigned CPUs from the quantity of required CPUs. This submethod will be called `takeFullUncore` and follows the same allocation principle as `takeFullFirstLevel` and `takeFullSecondLevel` where CPU allocations will be done based on the domain-level. Repeat `takeFullUncore` until the required number of total CPUs is satisfied or less than the amount of CPUs available on an Uncore Cache.
Contributor:

this means an allocation can span multiple uncore groups. Which brings up the question of how the workload can optimally use the allocated CPUs, because a naive allocation can still very much cause cross-uncore-group communication, reintroducing all the issues we are trying to solve.

There's no way to solve this problem, only to avoid it by failing allocations which are not uncore-aligned, much like the single-numa-node topology manager option.
But going that direction can open a very big pandora's box. So I think the best course of action is to acknowledge this and defer the solution to the workload: the workload is expected to query the system (e.g. using sysfs) to learn the detailed topology of the CPUs it got assigned, and perform its own thread pinning as it sees fit.
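
As a concrete illustration of the "defer to the workload" approach, here is a small Go sketch that reads the CPU-to-last-level-cache mapping from sysfs. It assumes the uncore (L3) cache is exposed at `cache/index3`, which is typical on Linux but not guaranteed on every platform; it is an editor-provided example, not part of the KEP.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// lastLevelCacheGroups maps each shared_cpu_list string (one per uncore/L3
// cache) to the CPUs that report it, as seen in sysfs. index3 is the usual
// location of the L3 cache on Linux; adjust if the topology differs.
func lastLevelCacheGroups() (map[string][]string, error) {
	groups := map[string][]string{}
	cpuDirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, dir := range cpuDirs {
		data, err := os.ReadFile(filepath.Join(dir, "cache/index3/shared_cpu_list"))
		if err != nil {
			continue // CPU offline or no L3 cache exposed
		}
		shared := strings.TrimSpace(string(data)) // e.g. "0-7"
		groups[shared] = append(groups[shared], filepath.Base(dir))
	}
	return groups, nil
}

func main() {
	groups, err := lastLevelCacheGroups()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A workload whose assigned CPUs span two keys here knows it crosses
	// uncore caches and can pin its threads accordingly.
	for shared, cpus := range groups {
		fmt.Printf("uncore cache covering CPUs %s: %v\n", shared, cpus)
	}
}
```
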

Contributor:

updated KEP with some justifications:

  • feature is intended more for "smaller" cpu requirements but won't restrict "larger" cpu requirements
  • optimization potential increases with fewer uncore cache CPU requirements
  • Defer responsibility to the workload if the container needs a "large" CPU requirement yet still wants to affine a subset of threads to an uncore cache

Comment on lines 148 to 154
- `takePartialUncore` diverges from the allocation principles previously used where full CPU domains are allocated at a time. This is necessary since this is a best-effort implementation intended to balance maximizing node performance while minimizing uncore cache distribution. Uncore Cache sensitive applications can be deployed on the edge where lower core count systems are utilized. These applications often require fewer CPUs than the total number of CPUs of an uncore cache domain. Newer split architecture processors have increased uncore cache size and bandwidth, which has helped reduce the noisy neighbor problem. However, the workloads can still be sensitive to cross-cache latency. The example below helps illustrate why `takePartialUncore` is important to fully maximize nodes that have fewer uncore caches available. For the example below, a 16-core split uncore cache processor is allocated to the following Guaranteed containers using the static policy with `prefer-align-cpus-by-uncorecache` enabled and disabled.
- Reserved CPUs: 0-1
- Allocatable CPUs: 2-15
- Container 1: 4 CPUs
- Container 2: 4 CPUs
- Container 3: 4 CPUs
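
To make the expected outcome of this example explicit, the short Go snippet below hard-codes the two allocations under the assumption that the 16-core part exposes two uncore caches of 8 CPUs each (cache 0 = CPUs 0-7, cache 1 = CPUs 8-15), with no SMT and the default static policy packing CPUs in ascending ID order. The exact CPU numbers are an editor's illustration, not taken verbatim from the KEP.

```go
package main

import "fmt"

func main() {
	// Assumed topology: uncore cache 0 = CPUs 0-7, uncore cache 1 = CPUs 8-15,
	// CPUs 0-1 reserved, containers 1-3 each request 4 CPUs.

	// Default static policy (packed, cache-unaware): container 2 straddles
	// both uncore caches.
	defaultAlloc := map[string]string{
		"container1": "2-5",   // uncore cache 0
		"container2": "6-9",   // spans uncore caches 0 and 1
		"container3": "10-13", // uncore cache 1
	}

	// With prefer-align-cpus-by-uncorecache: takePartialUncore keeps each
	// container within a single uncore cache; CPUs 6-7 stay free until a
	// request small enough to fit in cache 0 arrives.
	alignedAlloc := map[string]string{
		"container1": "2-5",   // uncore cache 0
		"container2": "8-11",  // uncore cache 1
		"container3": "12-15", // uncore cache 1
	}

	fmt.Println("default:", defaultAlloc)
	fmt.Println("aligned:", alignedAlloc)
}
```
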
Contributor:

thanks, this is good. Can we emphasize more that takePartialUncore wants to sort uncore cache groups in descending order according to most available (= all CPUs free) to least available (= all CPUs allocated) and do a first-fit allocation taking from this ordered sequence?

We can infer this from the text but I'd like to make it explicit. The text you currently have does a good job explaining the why and the tradeoffs; I'd just like to have 1-2 explicit lines about this key decision to make it very obvious to the reader.
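
A couple of lines of Go make the suggested ordering concrete: sort the uncore groups from most free CPUs to least, then take the first group that fits. This is a sketch of the reviewer's proposal only, not the current kubelet implementation (which, as noted in the reply below, iterates in cache-ID order); the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pickUncoreGroup returns the index of the uncore group that should serve a
// request of numCPUs under the proposed ordering: scan groups from most free
// CPUs to least and return the first one that can hold the whole request.
// Returns -1 if no single group fits (caller falls back to best-effort).
func pickUncoreGroup(freeCPUsPerGroup []int, numCPUs int) int {
	order := make([]int, len(freeCPUsPerGroup))
	for i := range order {
		order[i] = i
	}
	// Most-available first; stable sort keeps lower cache IDs first on ties.
	sort.SliceStable(order, func(a, b int) bool {
		return freeCPUsPerGroup[order[a]] > freeCPUsPerGroup[order[b]]
	})
	for _, id := range order {
		if freeCPUsPerGroup[id] >= numCPUs {
			return id
		}
	}
	return -1
}

func main() {
	// Uncore group 0 is half used (4 free), groups 1-3 are fully free (8 each).
	free := []int{4, 8, 8, 8}
	fmt.Println(pickUncoreGroup(free, 4)) // 1: prefer an untouched group over packing into group 0
}
```

The worked call shows the tradeoff discussed earlier in the thread: availability-ordered first-fit leaves partially used groups alone, whereas cache-ID-ordered iteration would pack the new container into group 0.
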

Contributor:

Just to clarify, the current code implementation iterates through each uncore cache in numerical order by cacheID. This somewhat follows the "packed" concept. The current implementation does not sort uncore caches from most available to least available (CPU-wise).

I'm not opposed to trying to make this more efficient in the future. Let me know your thoughts.

Contributor:

uhm, I thought I understood but it seems this keeps slipping away. Not good. Let's summarize.

  1. even in the context of best-effort allocation, sorting the uncore groups by availability (most free first, least free last) and then doing a first-match scan makes sense to me and has benefits as the very explanation being added clearly shows.
  2. but this is not what the current code does, we take the default uncore ordering and iterate naively over it (disclaimer: I didn't review the code, while I probably should). Not great, but OK
  3. so in the context of the KEP, what should we do? Sort+iterate? If so, we need to emphasize the design per my initial suggestion. Just iterate like the current code does? If so, we need to explain in the design why we do that and outline future improvements in the future improvements section. But let me stress: the KEP is the design. What we SHOULD do, with plans to converge the implementation towards the design.

My take is there is no reason to leave a better algorithm on the table, and the design (= the KEP) should aim for it unless there are clear reasons not to, which I'd love to see clarified here.

As for adapting the implementation, it can be either an alpha or a beta item.


HW Test Matrix
As this feature operates on a best-effort basis and does not prevent CPU allocation to containers that cannot be restricted to a single uncore cache, a metric will need to be provided to help users determine whether a container is uncore cache aligned. The PR corresponding to this work can be referenced here: [PR 129529: Uncore Cache Alignment Metric](https://github.com/kubernetes/kubernetes/pull/129529)
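
For illustration only, a generic Prometheus-style counter like the sketch below could expose alignment outcomes. The metric name and label here are made up and are not the metric added by the referenced kubelet PR; the kubelet itself registers metrics through its component-base/metrics wrappers rather than raw client_golang.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// containerAlignedTotal counts container CPU allocations, labelled by whether
// the allocation ended up fully within a single uncore cache. Hypothetical
// metric for illustration only.
var containerAlignedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "container_uncore_cache_aligned_total",
		Help: "Container CPU allocations by uncore cache alignment outcome.",
	},
	[]string{"aligned"},
)

// recordAllocation would be called after each container CPU allocation.
func recordAllocation(aligned bool) {
	if aligned {
		containerAlignedTotal.WithLabelValues("true").Inc()
	} else {
		containerAlignedTotal.WithLabelValues("false").Inc()
	}
}

func main() {
	prometheus.MustRegister(containerAlignedTotal)
	recordAllocation(true) // e.g. a successful uncore-aligned allocation
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```
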
Contributor:

no need to reference PRs: these will have a lifecycle much shorter than the KEP itself, so the reference will become obsolete soon. Just mention that this work will be done in the context of the alpha graduation.

Contributor:

Updated, removed PR and mentioned metric scope as graduation criteria as well

@ffromani (Contributor)

Thanks a lot folks, the last iteration looks much nicer.
Things we lack:

  1. the KEP right in its summary talks about "uncore cache"; we should spend 1-2 lines, tops, on a succinct explanation of what an "uncore cache" is. It's fine (actually better) to defer the detailed documentation to a 3rd-party, preferably vendor-neutral, source
  2. PTAL to the inline comments. Some are just discussion items, some require attention. These are all pretty minor though.
  3. it's fine to drop compatibility with distribute-cpus-across-numa. We don't need a separate KEP, we just need to do that small scope change in a separate commit in addition to this one.

provisional LGTM

@ffromani (Contributor)

/lgtm

the open point is about an optimization which can be deferred to beta, mostly because we are out of time.
I think what we need to do and why are now clarified enough.

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Feb 12, 2025
@ffromani (Contributor) commented Feb 12, 2025

/assign @haircommander
/assign @SergeyKanzhelev

PTAL, it would be nice to have these clarifications in for 1.33, but if they slip it's not too big of a deal; we will build on top of these for the beta graduation, which will then be a very small changeset.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajcaldelas, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Feb 12, 2025
@k8s-ci-robot merged commit db3a595 into kubernetes:master Feb 12, 2025
2 of 4 checks passed
@k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
Labels: approved, cncf-cla: yes, kind/kep, lgtm, ok-to-test, sig/instrumentation, sig/node, size/L