While writing and verifying the e2e tests for k8s/okd topology manager, I stumbled on a scenario which may be relevant and interesting for the performance-addon-operator.
Let's consider a cluster whose workers are multi-NUMA with, say, 2 NUMA nodes each and 72 CPUs (but this also works with 80 CPUs, 64 CPUs, ...)
numa node 0 cpus: 0,2,4,6,8,10...70
numa node 1 cpus: 1,3,5,7,9,11...71
NOTE: I need to check if and how HyperThreading affects this picture
oftentimes the PCI devices are connected to numa node #0
oftentimes the kubelet reserves cpus 0-3 (aka 0,1,2,3) for system purposes
In this scenario, if we want to start a workload which requires 1+ devices and all the cores of a NUMA node, the only suitable node is #0.
but because of the default configuration:
you cannot allocate a full NUMA node to the workload. This is just an accident of the default settings, which we can avoid with a smarter configuration.
even if you allocate the remaining cores, and even if you isolate them, you will still get some interference, because mixed workloads (system + user) run on the same NUMA node.
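To make the scenario concrete, here is a minimal, self-contained sketch (not operator code, just an illustration) of how one could read the relevant layout from the standard kernel sysfs files: the per-NUMA CPU lists and the NUMA node each PCI device is attached to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// numaCPUList returns the kernel's cpulist string (e.g. "0-35") for a NUMA node.
func numaCPUList(node int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/sys/devices/system/node/node%d/cpulist", node))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// pciNUMANodes maps each PCI device address to the NUMA node it is attached to.
// The kernel reports -1 when the affinity is unknown.
func pciNUMANodes() (map[string]int, error) {
	paths, err := filepath.Glob("/sys/bus/pci/devices/*/numa_node")
	if err != nil {
		return nil, err
	}
	res := make(map[string]int)
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		node, err := strconv.Atoi(strings.TrimSpace(string(data)))
		if err != nil {
			continue
		}
		res[filepath.Base(filepath.Dir(path))] = node
	}
	return res, nil
}

func main() {
	for node := 0; node < 2; node++ {
		if cpus, err := numaCPUList(node); err == nil {
			fmt.Printf("numa node %d cpus: %s\n", node, cpus)
		}
	}
	devs, _ := pciNUMANodes()
	for addr, node := range devs {
		fmt.Printf("pci device %s -> numa node %d\n", addr, node)
	}
}
```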
A possible fix: the operator should offer the option to reserve system resources on NUMA nodes which don't have PCI devices connected to them. For CPU-only workloads (aka workloads which don't need PCI devices at all) this makes no difference, and it frees up resources for PCI-requiring workloads. If all the NUMA nodes have PCI devices attached to them, the operator can happily do nothing and trust the cluster admin.
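A rough sketch of that heuristic, assuming the operator already knows the per-NUMA CPU sets and which NUMA nodes have PCI devices attached (all names here are made up for illustration, this is not the operator's API):

```go
package reservation

import "sort"

// pickReservedCPUs suggests which CPUs the kubelet could reserve for system
// purposes: it prefers a NUMA node with no PCI devices attached, so that
// device-requiring workloads can still get a full device-local NUMA node.
// It returns nil when every NUMA node has devices; in that case the operator
// should do nothing and trust the cluster admin's explicit reservation.
func pickReservedCPUs(cpusPerNode map[int][]int, nodesWithDevices map[int]bool, count int) []int {
	var candidates []int
	for node := range cpusPerNode {
		if !nodesWithDevices[node] {
			candidates = append(candidates, node)
		}
	}
	if len(candidates) == 0 {
		return nil // all NUMA nodes have devices attached: defer to the admin
	}
	sort.Ints(candidates)

	cpus := append([]int(nil), cpusPerNode[candidates[0]]...)
	sort.Ints(cpus)
	if count > len(cpus) {
		count = len(cpus)
	}
	return cpus[:count]
}
```

With the layout above, this would pick the reservation from NUMA node 1 (e.g. 1,3,5,7), leaving node 0 whole for the device-requiring workload.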
ffromani changed the title from "make sure the kubelet reservation are multi-NUMA aware" to "RFE: make sure the kubelet reservation are multi-NUMA aware" on Apr 1, 2020
The operator expects the list of CPUs to reserve, so the user can set reserved to 1,3,5,7 in a situation like this. There is no automation; we rely on the sysadmin's knowledge of the hardware in question.
Automated CPU discovery was in the initial plan, but it was too hard to implement in the first phase. Also, every user might have different preferences (some want 1 housekeeping core per NUMA node, some want a totally isolated NUMA node for the workload, ...).
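For completeness, a small self-contained sketch of parsing a reserved CPU list like 1,3,5,7 (or a range like 0-3) and checking that all of the reserved CPUs sit on one NUMA node; the real kubelet/operator code uses its own cpuset helpers, this is just an illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses a kubelet-style CPU list such as "1,3,5,7" or "0-3,40".
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			start, err1 := strconv.Atoi(lo)
			end, err2 := strconv.Atoi(hi)
			if err1 != nil || err2 != nil || start > end {
				return nil, fmt.Errorf("invalid range %q", part)
			}
			for c := start; c <= end; c++ {
				cpus = append(cpus, c)
			}
			continue
		}
		c, err := strconv.Atoi(part)
		if err != nil {
			return nil, fmt.Errorf("invalid cpu %q", part)
		}
		cpus = append(cpus, c)
	}
	return cpus, nil
}

// allOnNode reports whether every reserved CPU belongs to the given NUMA node.
func allOnNode(reserved, nodeCPUs []int) bool {
	onNode := make(map[int]bool, len(nodeCPUs))
	for _, c := range nodeCPUs {
		onNode[c] = true
	}
	for _, c := range reserved {
		if !onNode[c] {
			return false
		}
	}
	return true
}

func main() {
	reserved, err := parseCPUList("1,3,5,7")
	if err != nil {
		panic(err)
	}
	node1CPUs := []int{1, 3, 5, 7, 9, 11} // abbreviated NUMA node 1 CPU list
	fmt.Println(reserved, allOnNode(reserved, node1CPUs))
}
```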
Makes sense. We can perhaps offer, in a future release, the option for the operator to figure out a smart reservation, or at least the smartest one it can compute on its own. There are cases which are safe and relatively easy to figure out. The cluster admin must always have the option to override the reservation.
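Something along these lines, with hypothetical names, would keep that override guarantee: the admin's explicit list always wins, and the automatic suggestion is only a fallback.

```go
package reservation

// ReservedCPUs decides the final reservation. An explicit admin-provided list
// always wins; the automatic suggestion is only a fallback, and may itself be
// empty when no safe automatic choice exists.
// (Hypothetical helper for illustration, not the operator's actual API.)
func ReservedCPUs(adminReserved, autoSuggested []int) []int {
	if len(adminReserved) > 0 {
		return adminReserved
	}
	return autoSuggested
}
```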