
[release-1.27] Add additional static pod cleanup during cluster reset #4724

Merged

Conversation

@brandond (Member) commented Sep 1, 2023

Proposed Changes

Addresses an issue with hangs or crashes when starting up servers following a cluster-reset, caused by etcd and/or the apiserver being restarted in unexpected sequences.

Background is discussed at #4707 (comment):

  • Shut down the etcd static pod (and the kubelet, to keep the kubelet from restarting it) at the end of the cluster-reset process, so that etcd doesn't have to be restarted and reconfigured midway through the next start. Etcd is already shut down explicitly at the end of the cluster-reset process on k3s; we just haven't wired up the context on RKE2 (see the context-wiring sketch below this list).
  • Remove the apiserver static pod manifest during rke2 startup, so that the kubelet doesn't start it before it has been written with the current config - after etcd starts.
    Need to confirm that this doesn't do anything unexpected during normal restarts of the rke2 service.
  • Use the absence of etcd db files on a node with etcd enabled as an indicator of cluster-reset, and force cleanup of the etcd and apiserver static pods early in startup. This prevents them from being restarted later, while the kubelet and embedded controllers are trying to talk to them (a sketch of this check follows the list).
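
A minimal sketch of the reset-detection check from the last bullet. The helper names (`etcdDBExists`, `removeStaticPods`) and the hard-coded default paths are illustrative assumptions, not rke2's actual code:

```go
// Sketch only: treat a missing etcd database on an etcd-enabled node as
// evidence of a recent cluster-reset, and clean up stale static pod
// manifests early in startup. Paths and helper names are hypothetical.
package main

import (
	"errors"
	"log"
	"os"
	"path/filepath"
)

// etcdDBExists reports whether the etcd bbolt database is present.
// etcd keeps it under member/snap/db inside its data directory.
func etcdDBExists(dataDir string) bool {
	db := filepath.Join(dataDir, "server", "db", "etcd", "member", "snap", "db")
	_, err := os.Stat(db)
	return !errors.Is(err, os.ErrNotExist)
}

// removeStaticPods deletes the etcd and apiserver static pod manifests so
// the kubelet cannot restart stale pods while the new config is written.
func removeStaticPods(manifestsDir string) error {
	for _, name := range []string{"etcd.yaml", "kube-apiserver.yaml"} {
		if err := os.Remove(filepath.Join(manifestsDir, name)); err != nil && !errors.Is(err, os.ErrNotExist) {
			return err
		}
	}
	return nil
}

func main() {
	dataDir := "/var/lib/rancher/rke2" // default data dir; configurable in practice
	manifests := filepath.Join(dataDir, "agent", "pod-manifests")

	// No etcd database on an etcd-enabled node implies a recent cluster-reset,
	// so force cleanup before the kubelet or embedded controllers come up.
	if !etcdDBExists(dataDir) {
		log.Println("etcd database not found; removing etcd and apiserver static pod manifests")
		if err := removeStaticPods(manifests); err != nil {
			log.Fatalf("static pod cleanup failed: %v", err)
		}
	}
}
```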

Also updates k3s.
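
As a rough illustration of the context wiring mentioned in the first bullet, the sketch below runs the kubelet under a cancellable context and cancels it at the end of the reset. The command path, flags, and `resetCluster` helper are placeholders, not rke2's real supervisor code; the kill-on-cancel behavior is what would produce the "signal: killed" message shown under Verification:

```go
// Sketch only: run the kubelet under a shared cancellable context so it can
// be stopped at the end of cluster-reset and cannot restart the etcd static
// pod midway through the next startup's reconfiguration.
package main

import (
	"context"
	"log"
	"os/exec"
)

func resetCluster(ctx context.Context) {
	// ... snapshot restore / etcd cluster membership reset happens here ...
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// exec.CommandContext kills the process when ctx is cancelled, which is
	// why the logs report "Kubelet exited: signal: killed".
	kubelet := exec.CommandContext(ctx, "kubelet", "--config=/etc/kubelet.yaml")
	if err := kubelet.Start(); err != nil {
		log.Fatalf("failed to start kubelet: %v", err)
	}
	done := make(chan struct{})
	go func() {
		log.Printf("Kubelet exited: %v", kubelet.Wait())
		close(done)
	}()

	resetCluster(ctx)

	// End of cluster-reset: cancel the shared context so the kubelet (and with
	// it, the etcd static pod) shuts down instead of being restarted.
	log.Println("Shutting down kubelet and etcd")
	cancel()
	<-done // wait for the kubelet process to actually exit
}
```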

Types of Changes

bugfix

Verification

  • See linked issue
  • In addition to the steps in the linked issue, you should also see some new messages at the end of the cluster-reset process:
    INFO[0067] Shutting down kubelet and etcd
    ERRO[0067] Kubelet exited: signal: killed
    INFO[0072] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
    
  • Confirm that there is no etcd process running after rke2 exits at the end of the cluster-reset.

Testing

Linked Issues

User-Facing Change

Further Comments

Addresses issue with hangs or crashes when starting up servers following
a cluster-reset, caused by etcd and/or the apiserver being restarted in
unexpected sequences.

Signed-off-by: Brad Davidson <[email protected]>
@brandond brandond requested a review from a team as a code owner September 1, 2023 17:37
@brandond brandond force-pushed the staticpod-sync-fix_release-1.27 branch from b64617d to e2740c1 September 1, 2023 18:33
@brandond brandond merged commit 1de9953 into rancher:release-1.27 Sep 1, 2023
@brandond brandond deleted the staticpod-sync-fix_release-1.27 branch June 6, 2024 23:11