[non-official Guide] Recover from an erroneous machine config changing kernel options (in this case, cgroups) #2056

glowing-axolotl · 2024-11-12T15:45:42Z

glowing-axolotl
Nov 12, 2024

The following is my experience with OKD 4.15 baremetal, it by no means should be applied without second-thoughs or testing in your environment.

Use this guide at your own risk!

You might have come across this machineconfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=0
    - systemd.legacy_systemd_cgroup_controller=1

In versions before 4.13, this config changed cgroups support from v2 to v1.

When running the configuration in a 4.15 OKD cluster, the node on which the config is applied will get stuck:

okd-worker01    NotReady,SchedulingDisabled   worker                 95d   v1.28.7+6e2789b
okd-worker02    Ready                         worker                 95d   v1.28.7+6e2789b
okd-worker03    Ready                         worker                 95d   v1.28.7+6e2789b

No containers will start on the node, and journalctl will keep showing the following error:

Nov 11 16:11:52 okd-worker02 bash[2020]: Your kernel does not support pids limit capabilities or the cgroup is not mounted. PIDs limit discarded.

If we ssh on the node and switch to root:

ssh core@<node ip>
sudo su -

And show the kernel arguments:

rpm-ostree kargs

We will find the following options:

systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1 systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0

Since OKD 4.13, these options have been added as kernel arguments (assuming from https://access.redhat.com/solutions/7049418 ).

To recover the node and the machine config pool, you should delete the machineconfig:

oc delete machineconfig 99-openshift-machineconfig-worker-kargs

The "Stuck" node will not automatically recover, since cri-o didn't start and the pod machine-config-daemon-**** for the node couldn't start:

$ oc -n openshift-machine-config-operator get pods -o wide | grep -i worker01 # cannot see logs since you can't connect to the node
[root@okd-worker01 ~]# systemctl status crio
○ crio.service - Container Runtime Interface for OCI (CRI-O)
     Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; preset: disabled)
    Drop-In: /etc/systemd/system/crio.service.d
             └─01-kubens.conf, 05-mco-ordering.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf
             /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
             /etc/systemd/system/crio.service.d
             └─20-nodenet.conf
     Active: inactive (dead)
       Docs: https://github.com/cri-o/cri-o

You must therefore manually modify the kernel arguments on the node while in ssh as root:

# rpm-ostree kargs --editor

You will get something like:

mitigations=auto,nosmt ignition.platform.id=metal $ignition_firstboot root=UUID=AnUUID rw rootflags=prjquota boot=UUID=AnUUID systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1

You must modify it removing systemd.unified_cgroup_hierarchy=0 and systemd.legacy_systemd_cgroup_controller=1 , like so:

mitigations=auto,nosmt ignition.platform.id=metal $ignition_firstboot root=UUID=AnUUID rw rootflags=prjquota boot=UUID=AnUUID systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0

Reboot the node with systemctl:

systemctl reboot

On reboot, the logs of the config daemon will be visible and will show a failure:

oc -n openshift-machine-config-operator get pods -o wide | grep -i worker01
oc -n openshift-machine-config-operator logs machine-config-daemon-***** -f --tail=100

I1112 14:52:30.502941    2664 daemon.go:1433] Previous boot ostree-finalize-staged.service appears successful
I1112 14:52:30.502959    2664 daemon.go:1557] Current config: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9
I1112 14:52:30.502964    2664 daemon.go:1558] Desired config: rendered-worker-243a23269203f7389b07b6d4296d6694
I1112 14:52:30.502967    2664 daemon.go:1566] state: Degraded
I1112 14:52:30.502986    2664 update.go:2326] Running: rpm-ostree cleanup -r
Deployments unchanged.
I1112 14:52:30.533687    2664 update.go:2341] Disk currentConfig "rendered-worker-243a23269203f7389b07b6d4296d6694" overrides node's currentConfig annotation "rendered-worker-23194325fb32292db1d8f8cbc3eeaff9"
I1112 14:52:30.535322    2664 daemon.go:1966] Validating against current config rendered-worker-243a23269203f7389b07b6d4296d6694
I1112 14:52:30.535359    2664 daemon.go:1879] SSH key location ("/home/core/.ssh/authorized_keys.d/ignition") up-to-date!
I1112 14:52:33.098960    2664 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1112 14:52:33.123772    2664 daemon.go:2369] Booted command line: BOOT_IMAGE=(hd0,gpt3)/ostree/fedora-coreos-a3df9400bf00bf7eb28ccc8b1e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/vmlinuz-6.7.4-200.fc39.x86_64 mitigations=auto,nosmt ignition.platform.id=metal root=UUID=AnUUID rw rootflags=prjquota boot=UUID=fb44790e-f13a-4f5b-bf92-b898e74988b0 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 ostree=/ostree/boot.1/fedora-coreos/a3df9400bf00bf7eb28ccc8b1e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/0
I1112 14:52:33.123791    2664 daemon.go:2371] Current ostree kargs: mitigations=auto,nosmt ignition.platform.id=metal $ignition_firstboot root=UUID=AnUUID rw rootflags=prjquota boot=UUID=fb44790e-f13a-4f5b-bf92-b898e74988b0 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 ostree=/ostree/boot.0/fedora-coreos/a3df9400bf00bf7eb28ccc8b1e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/0
I1112 14:52:33.123796    2664 daemon.go:2372] Expected MachineConfig kargs: [systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1]
E1112 14:52:33.123830    2664 writer.go:226] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-243a23269203f7389b07b6d4296d6694: missing expected kernel arguments: [systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1]

Drift detection will scold us for modifying the node manually. At the same time, if we add back the kernel parameters the Machine Configuration Daemon won't be able to rollback our erroneous changes, since cri-o doesn't start and no container is run for the daemon.

To fix this, we will manually rollback the desiredConfig for the node.

You can use the following one-liner to get the correct/incorrect machineconfig names:

# for i in $(oc get nodes | awk {'print $1'}); do echo $i; oc get node/$i -o yaml | grep -Ei "currentConfig|desiredConfig"; done

okd-worker01
    machineconfiguration.openshift.io/currentConfig: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9 <- Correct config
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-243a23269203f7389b07b6d4296d6694 <- Incorrect config
okd-worker02
    machineconfiguration.openshift.io/currentConfig: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9 <- Correct config
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9 <- Correct config
...

In our case, we want to make the node believe he already "Updated" but we'll make it use the previous configuration.

Make sure the config the machine config pool is moving the nodes to is the correct one:

# oc get mcp worker -o jsonpath='{.spec.configuration.name}' && echo
rendered-worker-23194325fb32292db1d8f8cbc3eeaff9

If it isn't, make sure you deleted the correct machineconfig you originally applied (in our case, 99-openshift-machineconfig-worker-kargs). Otherwise, there might be something else broken, since in the newest versions deleting a wrong machine config should automatically restore the previous rendered-config, as per https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html-single/machine_configuration/index#checking-mco-status_machine-config-overview "If something goes wrong with a machine config that you apply, you can always back out that change. For example, if you had run oc create -f ./myconfig.yaml to apply a machine config, you could remove that machine config".

Once verified, patch it onto the node's annotations:

$ oc patch node okd-worker01 --type merge --patch "{\"metadata\": {\"annotations\": {\"machineconfiguration.openshift.io/currentConfig\": \"rendered-worker-23194325fb32292db1d8f8cbc3eeaff9\"}}}"
$ oc patch node okd-worker01 --type merge --patch "{\"metadata\": {\"annotations\": {\"machineconfiguration.openshift.io/desiredConfig\": \"rendered-worker-23194325fb32292db1d8f8cbc3eeaff9\"}}}"
$ oc patch node okd-worker01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/reason": ""}}}'
$ oc patch node okd-worker01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/state": "Done"}}}'

The error in the daemon will now change to:

I1112 15:14:50.826460    2664 daemon.go:1433] Previous boot ostree-finalize-staged.service appears successful
I1112 15:14:50.826478    2664 daemon.go:1550] Current+desired config: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9
I1112 15:14:50.826482    2664 daemon.go:1566] state: Degraded
I1112 15:14:50.826498    2664 update.go:2326] Running: rpm-ostree cleanup -r
Deployments unchanged.
I1112 15:14:50.853678    2664 update.go:2341] Disk currentConfig "rendered-worker-243a23269203f7389b07b6d4296d6694" overrides node's cur
rentConfig annotation "rendered-worker-23194325fb32292db1d8f8cbc3eeaff9"
I1112 15:14:50.854907    2664 daemon.go:1966] Validating against current config rendered-worker-243a23269203f7389b07b6d4296d6694
I1112 15:14:50.854937    2664 daemon.go:1879] SSH key location ("/home/core/.ssh/authorized_keys.d/ignition") up-to-date!
I1112 15:14:53.314074    2664 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1112 15:14:53.340866    2664 daemon.go:2369] Booted command line: BOOT_IMAGE=(hd0,gpt3)/ostree/fedora-coreos-a3df9400bf00bf7eb28ccc8b1
e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/vmlinuz-6.7.4-200.fc39.x86_64 mitigations=auto,nosmt ignition.platform.id=metal root=UUID=67ee2
379-99f7-4f86-86ac-883ad3664951 rw rootflags=prjquota boot=UUID=AnUUID systemd.unified_cgroup_hierarchy=1
 cgroup_no_v1=all psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 ostree=/ostree/bo
ot.1/fedora-coreos/a3df9400bf00bf7eb28ccc8b1e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/0
I1112 15:14:53.340887    2664 daemon.go:2371] Current ostree kargs: mitigations=auto,nosmt ignition.platform.id=metal $ignition_firstbo
ot root=UUID=67ee2379-99f7-4f86-86ac-883ad3664951 rw rootflags=prjquota boot=UUID=AnUUID systemd.unified_
cgroup_hierarchy=1 cgroup_no_v1="all" psi=1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller
=0 ostree=/ostree/boot.0/fedora-coreos/a3df9400bf00bf7eb28ccc8b1e4a6fc3c2d3051164d7a7afcc21a47e7cd4f5d6/0
I1112 15:14:53.340893    2664 daemon.go:2372] Expected MachineConfig kargs: [systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" psi=
1 mitigations=off systemd.unified_cgroup_hierarchy=1 systemd.legacy_systemd_cgroup_controller=0 systemd.unified_cgroup_hierarchy=0 syst
emd.legacy_systemd_cgroup_controller=1]
E1112 15:14:53.340935    2664 writer.go:226] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-243a23
269203f7389b07b6d4296d6694: missing expected kernel arguments: [systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_contro
ller=1]

We still need to change the currentConfig saved in /etc/machine-config-daemon/currentconfig .
We can update it like this, make sure to make a backup of the original file and change the MCP_NAME according to your configuration, be extremely careful to use the correct mcp name, as the command below will still give you an output even if you use a wrong name:

# cp /etc/machine-config-daemon/currentconfig /etc/machine-config-daemon/currentconfig.bkp_date_of_today
# oc login -u kubeadmin -p <PASSWORD> https://api.<CLUSTER_NAME>.<DOMAIN_NAME>:6443
# oc get mc $(oc get mcp <MCP_NAME> -o json | jq ".status.configuration.name" -r) -o json | jq -rc | less # make sure you are using the correct mcp
# oc get mc $(oc get mcp <MCP_NAME> -o json | jq ".status.configuration.name" -r) -o json | jq -rc > /etc/machine-config-daemon/currentconfig

After about a minute, the machine config daemon pod should tell you the on-disk state is now valid:

# oc -n openshift-machine-config-operator logs machine-config-daemon-s4c64 -f --tail=100
I1112 15:23:17.388558    2664 daemon.go:1433] Previous boot ostree-finalize-staged.service appears successful
I1112 15:23:17.388585    2664 daemon.go:1550] Current+desired config: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9
I1112 15:23:17.388596    2664 daemon.go:1566] state: Degraded
I1112 15:23:17.388612    2664 update.go:2326] Running: rpm-ostree cleanup -r
Deployments unchanged.
I1112 15:23:17.415866    2664 daemon.go:1966] Validating against current config rendered-worker-23194325fb32292db1d8f8cbc3eeaff9
I1112 15:23:17.415903    2664 daemon.go:1879] SSH key location ("/home/core/.ssh/authorized_keys.d/ignition") up-to-date!
I1112 15:23:19.640425    2664 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1112 15:23:19.682339    2664 update.go:2341] Validated on-disk state
I1112 15:23:19.684998    2664 daemon.go:2063] Completing update to target MachineConfig: rendered-worker-23194325fb32292db1d8f8cbc3eeaff9

Do try a reboot of the node to verify that everything works correctly once again.

Drain the node:

# oc adm drain node/okd-worker01 --ignore-daemonsets

As root on the node, force it to re-validate their configuration template:

# touch /run/machine-config-daemon-force
# systemctl reboot

Once the node returns ready and all the pods are running:

# oc get pods -A -o wide | grep -i worker01 | grep -v Running # should return empty output

Verify the file was deleted, then launch another reboot (better safe than sorry):

# ls -lhart /run/machine-config-daemon-force
ls: cannot access '/run/machine-config-daemon-force': No such file or directory
# systemctl reboot

Your node should once again work.
You might also have to do some additional steps, as initially I tried messing with the node's annotations of desiredConfig/currentConfig and forcing an update with /run/machine-config-daemon-force , but I tested this on nodes of another MCP and it worked in restoring the nodes.

Hope this helps someone with the same problem!

This issue was similar to openshift/machine-config-operator#1443 and openshift/machine-config-operator#2705 .

Edit:
This exact issue was reported on https://access.redhat.com/solutions/7069660 and https://issues.redhat.com/browse/OCPBUGS-19352

The correct solution seems to be this documentation https://docs.okd.io/4.15/nodes/clusters/nodes-cluster-cgroups-2.html , and therefore to modify nodes.config/cluster adding spec.cgroupMode: "v1" .

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[non-official Guide] Recover from an erroneous machine config changing kernel options (in this case, cgroups) #2056

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

[non-official Guide] Recover from an erroneous machine config changing kernel options (in this case, cgroups) #2056

glowing-axolotl Nov 12, 2024

Replies: 0 comments

glowing-axolotl
Nov 12, 2024