While writing and verifying the e2e tests for k8s/okd topology manager, I stumbled on a scenario which may be relevant and interesting for the performance-addon-operator.
Let's consider a cluster whose workers are multi-NUMA with, say, 2 NUMA nodes each and 72 CPUs (but this also works with 80 CPUs, 64 CPUs, ...)
numa node 0 cpus: 0,2,4,6,8,10...70
numa node 1 cpus: 1,3,5,7,9,11...71
NOTE: I need to check if and how HyperThreading affects this picture
oftentimes the PCI devices are connected to numa node #0
oftentimes the kubelet reserves cpus 0-3 (aka 0,1,2,3) for system purposes
In this scenario, if we want to start a workload which requires 1+ devices and all the cores of a NUMA node, the only suitable node is #0.
but because of the default configuration:
you cannot allocate a full NUMA node to the workload. This is just an accident of the default settings, which we can avoid with a smarter configuration.
even if you allocate the remaining cores, and even if you isolate them, you will still get some interference, because mixed workloads (system + user) run on the same NUMA node.
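To make the scenario concrete, here is a minimal, self-contained sketch (not operator code, just an illustration) of how one could read the relevant layout from the standard kernel sysfs files: the per-NUMA CPU lists and the NUMA node each PCI device is attached to.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// numaCPUList returns the kernel's cpulist string (e.g. "0-35") for a NUMA node.
func numaCPUList(node int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/sys/devices/system/node/node%d/cpulist", node))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

// pciNUMANodes maps each PCI device address to the NUMA node it is attached to.
// The kernel reports -1 when the affinity is unknown.
func pciNUMANodes() (map[string]int, error) {
	paths, err := filepath.Glob("/sys/bus/pci/devices/*/numa_node")
	if err != nil {
		return nil, err
	}
	res := make(map[string]int)
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		node, err := strconv.Atoi(strings.TrimSpace(string(data)))
		if err != nil {
			continue
		}
		res[filepath.Base(filepath.Dir(path))] = node
	}
	return res, nil
}

func main() {
	for node := 0; node < 2; node++ {
		if cpus, err := numaCPUList(node); err == nil {
			fmt.Printf("numa node %d cpus: %s\n", node, cpus)
		}
	}
	devs, _ := pciNUMANodes()
	for addr, node := range devs {
		fmt.Printf("pci device %s -> numa node %d\n", addr, node)
	}
}
```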
A possible fix: the operator should offer the option to reserve system resources on NUMA nodes which don't have PCI devices connected to them. For CPU-only workloads (aka workloads which don't need PCI devices at all) this makes no difference, and it frees up resources for PCI-requiring workloads. If all the NUMA nodes have PCI devices attached to them, the operator can happily do nothing and trust the cluster admin.
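A rough sketch of that heuristic, assuming the operator already knows the per-NUMA CPU sets and which NUMA nodes have PCI devices attached (all names here are made up for illustration, this is not the operator's API):

```go
package reservation

import "sort"

// pickReservedCPUs suggests which CPUs the kubelet could reserve for system
// purposes: it prefers a NUMA node with no PCI devices attached, so that
// device-requiring workloads can still get a full device-local NUMA node.
// It returns nil when every NUMA node has devices; in that case the operator
// should do nothing and trust the cluster admin's explicit reservation.
func pickReservedCPUs(cpusPerNode map[int][]int, nodesWithDevices map[int]bool, count int) []int {
	var candidates []int
	for node := range cpusPerNode {
		if !nodesWithDevices[node] {
			candidates = append(candidates, node)
		}
	}
	if len(candidates) == 0 {
		return nil // all NUMA nodes have devices attached: defer to the admin
	}
	sort.Ints(candidates)

	cpus := append([]int(nil), cpusPerNode[candidates[0]]...)
	sort.Ints(cpus)
	if count > len(cpus) {
		count = len(cpus)
	}
	return cpus[:count]
}
```

With the layout above, this would pick the reservation from NUMA node 1 (e.g. 1,3,5,7), leaving node 0 whole for the device-requiring workload.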
ffromani changed the title from "make sure the kubelet reservation are multi-NUMA aware" to "RFE: make sure the kubelet reservation are multi-NUMA aware" on Apr 1, 2020
The operator expects the list of CPUs to reserve, so the user can set reserved to 1,3,5,7 in a situation like this. There is no automation; we rely on the sysadmin's knowledge of the hardware in question.
Automated CPU discovery was in the initial plan, but it was too hard to implement in the first phase. Also, every user might have different preferences (some want 1 housekeeping core per NUMA node, some want a totally isolated NUMA node for the workload, ...).
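For completeness, a small self-contained sketch of parsing a reserved CPU list like 1,3,5,7 (or a range like 0-3) and checking that all of the reserved CPUs sit on one NUMA node; the real kubelet/operator code uses its own cpuset helpers, this is just an illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList parses a kubelet-style CPU list such as "1,3,5,7" or "0-3,40".
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			start, err1 := strconv.Atoi(lo)
			end, err2 := strconv.Atoi(hi)
			if err1 != nil || err2 != nil || start > end {
				return nil, fmt.Errorf("invalid range %q", part)
			}
			for c := start; c <= end; c++ {
				cpus = append(cpus, c)
			}
			continue
		}
		c, err := strconv.Atoi(part)
		if err != nil {
			return nil, fmt.Errorf("invalid cpu %q", part)
		}
		cpus = append(cpus, c)
	}
	return cpus, nil
}

// allOnNode reports whether every reserved CPU belongs to the given NUMA node.
func allOnNode(reserved, nodeCPUs []int) bool {
	onNode := make(map[int]bool, len(nodeCPUs))
	for _, c := range nodeCPUs {
		onNode[c] = true
	}
	for _, c := range reserved {
		if !onNode[c] {
			return false
		}
	}
	return true
}

func main() {
	reserved, err := parseCPUList("1,3,5,7")
	if err != nil {
		panic(err)
	}
	node1CPUs := []int{1, 3, 5, 7, 9, 11} // abbreviated NUMA node 1 CPU list
	fmt.Println(reserved, allOnNode(reserved, node1CPUs))
}
```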
Makes sense. We can perhaps offer, in a future release, the option for the operator to figure out a smart reservation, or at least the smartest one it can compute on its own. There are cases which are safe and relatively easy to figure out. The cluster admin must always have the option to override the reservation.
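Something along these lines, with hypothetical names, would keep that override guarantee: the admin's explicit list always wins, and the automatic suggestion is only a fallback.

```go
package reservation

// ReservedCPUs decides the final reservation. An explicit admin-provided list
// always wins; the automatic suggestion is only a fallback, and may itself be
// empty when no safe automatic choice exists.
// (Hypothetical helper for illustration, not the operator's actual API.)
func ReservedCPUs(adminReserved, autoSuggested []int) []int {
	if len(adminReserved) > 0 {
		return adminReserved
	}
	return autoSuggested
}
```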