Node Selector, Node Affinity, Pod Affinity, Taints, Tolerations
For an example pod config:
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    image: api:v1
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
When we apply this config, the scheduler in the control plane will:
- Filter out nodes that cannot satisfy the resource requests -> the remaining ones are the feasible nodes
- Score the feasible nodes and choose the highest-scoring one
- Bind the pod to the selected node, where the kubelet then creates the containers
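Once the pod is bound, you can check which node was chosen with kubectl get pod api -o wide (the NODE column), or look at the Scheduled event shown by kubectl describe pod api.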
In a cluster, there might be:
- Windows nodes
- high-CPU nodes
- high-memory nodes
- spot nodes
- ARM nodes
- GPU nodes
- storage (SSD) nodes
And in a default cluster, the scheduler might not be able to decide the best place to run the pods on its own, so we can give k8s some extra instructions to help it.
For example, label the node in its node config file: kubeletExtraArgs.node-labels = "key0=value0,environment=production,region=us-west"
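A minimal sketch of the relevant part of a kubeadm JoinConfiguration for this (only the nodeRegistration piece is shown; the label values are the ones from the line above):

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    node-labels: "key0=value0,environment=production,region=us-west"

On a node that has already joined, you can also add labels directly: kubectl label nodes <node-name> environment=production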
And in Deployment/StatefulSet/Job/other objects (inside the pod template spec): nodeSelector: environment: production
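For example, a Deployment pinned to production nodes might look like this (the Deployment/app names are illustrative; the nodeSelector sits in the pod template spec):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      nodeSelector:
        environment: production
      containers:
      - name: api
        image: api:v1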
In the node config, taints are specified as key/value pairs plus an effect (key=value:effect) [ref] https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta3/#kubeadm-k8s-io-v1beta3-NodeRegistrationOptions
And in Deployment/StatefulSet/Job/other objects, pods opt in to tainted nodes via spec.tolerations (a list of key/value/effect entries).
For node affinity, there are:
affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
- won't evict pods that are already running if node labels change later (the "IgnoredDuringExecution" part)
- won't schedule pods if the rule is not matched (they stay Pending)
affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
- won't evict pods that are already running if node labels change later
- still schedules pods even if the preference is not matched (see the sketch after this list)
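A sketch of a Pod spec that combines both (the disktype label is illustrative and assumes you have labeled SSD nodes accordingly):

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  affinity:
    nodeAffinity:
      # hard requirement: only production nodes are feasible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: environment
            operator: In
            values:
            - production
      # soft preference: prefer SSD nodes if any are available
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: api
    image: api:v1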
Pod affinity: deploy pods on the same node (or topology domain) as related pods where possible, to reduce latency when they need to communicate with each other. In Deployment/StatefulSet/Job/other objects: affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution, affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution. Warning: the label selector only matches pods in the same namespace unless you set namespaces/namespaceSelector explicitly.
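A sketch of a preferred pod affinity, placed under the pod (template) spec, that tries to co-locate with pods labeled app: cache on the same node (the app: cache label is illustrative):

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: cache
        topologyKey: "kubernetes.io/hostname"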
Pod anti-affinity: deploy pods on different nodes where possible, to reduce the impact of a node going down; frequently used when deploying Nginx and other ingress controllers. In Deployment/StatefulSet/Job/other objects: affinity.podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution, affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution. Warning: all nodes should carry the label used as topologyKey, e.g. topologyKey: "kubernetes.io/hostname".
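And a sketch of required pod anti-affinity, also under the pod (template) spec, that spreads ingress-controller replicas across nodes (the app: ingress-nginx label is illustrative):

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: ingress-nginx
      topologyKey: "kubernetes.io/hostname"

With the required form, replicas beyond the number of matching nodes stay Pending; the preferred form degrades gracefully instead.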
For example, when scheduling non-control-plane pods on a control-plane node, you may get a warning like:
Warning FailedScheduling 4m30s (x4 over 19m) default-scheduler 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
This indicates that the scheduler is unable to schedule the pod because the only available node has a taint that prevents pods from being scheduled on it. The taint is:
# Taints: key=value:effect
taints:
- key: "node-role.kubernetes.io/control-plane"
  value: null
  effect: NoSchedule
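You can confirm this with kubectl describe node <control-plane-node-name> and checking the Taints field.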
To resolve this, you can:
- remove the taint: kubectl taint nodes <control-plane-node-name> node-role.kubernetes.io/control-plane:NoSchedule-
- add a new node, or
- edit the pod spec to add a toleration for that NoSchedule taint (see the example below)
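For the last option, a toleration like this in the pod spec matches that taint (a minimal sketch):

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"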
In the case of deploying ArgoCD, the best practice is to deploy the ArgoCD pods on worker nodes, which keeps the control-plane components isolated with dedicated resources.
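One way to do that (a sketch, assuming you add a label such as node-type=worker to your worker nodes yourself, since kubeadm does not label workers by default) is to give the ArgoCD workloads a nodeSelector and simply not add any toleration for the control-plane taint:

spec:
  template:
    spec:
      nodeSelector:
        node-type: worker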