
[release-1.25] Add additional static pod cleanup during cluster reset #4726

Merged

@brandond (Member) commented Sep 1, 2023

Proposed Changes

Addresses an issue with hangs or crashes when starting up servers following a cluster-reset, caused by etcd and/or the apiserver being restarted in unexpected sequences.

Background is discussed at #4707 (comment):

  • Shut down the etcd static pod (and the kubelet, to keep the kubelet from restarting it) at the end of the cluster-reset process, so that etcd doesn't have to be restarted and reconfigured midway through the next start. Etcd is already shut down explicitly at the end of the cluster-reset process on k3s; we just hadn't wired up the context on RKE2. A minimal sketch of this wiring follows the list.
  • Remove the apiserver static pod manifest during rke2 startup, so that the kubelet doesn't start it before the manifest has been rewritten with the current config, after etcd starts. We still need to confirm that this doesn't do anything weird during normal restarts of the rke2 service.
  • Use the absence of etcd db files on a node with etcd enabled as an indicator of a cluster-reset, and force cleanup of the etcd and apiserver static pods early in startup; a detection sketch appears below. This prevents them from being restarted later, while the kubelet and embedded controllers are trying to talk to them.
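To make the first bullet concrete, here is a minimal Go sketch, under the assumption that "wiring up the context" means cancelling a context shared by the kubelet and etcd supervisors at the end of the reset. The function names (runKubelet, runEtcd) and overall structure are invented for illustration and are not the actual rke2 code.

```go
// Hypothetical sketch only -- not the actual rke2 implementation.
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// runKubelet and runEtcd stand in for the real supervisors: each blocks
// until the shared context is cancelled, then stops its process.
func runKubelet(ctx context.Context, wg *sync.WaitGroup) {
	defer wg.Done()
	<-ctx.Done()
	log.Println("kubelet stopped")
}

func runEtcd(ctx context.Context, wg *sync.WaitGroup) {
	defer wg.Done()
	<-ctx.Done()
	log.Println("etcd static pod stopped")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup
	wg.Add(2)
	go runKubelet(ctx, &wg)
	go runEtcd(ctx, &wg)

	time.Sleep(100 * time.Millisecond) // stand-in for the cluster-reset work

	// End of cluster-reset: stop the kubelet (so it cannot restart the etcd
	// static pod) and etcd itself before the process exits, matching the new
	// log lines shown under Verification.
	log.Println("Shutting down kubelet and etcd")
	cancel()
	wg.Wait()
}
```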

Also updates k3s.
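The reset-detection heuristic from the last bullet could look roughly like the sketch below. The manifest directory, manifest file names, and function name are assumptions for illustration; only the ${datadir}/server/db location is taken from the reset message quoted under Verification.

```go
// Hypothetical sketch -- names and paths are illustrative, not rke2's.
package staticpod

import (
	"os"
	"path/filepath"
)

// cleanupStaticPodsAfterReset treats "etcd enabled but no etcd database on
// disk" as the signal that a cluster-reset just ran, and removes the stale
// etcd and kube-apiserver static pod manifests so the kubelet cannot
// restart the old pods mid-startup.
func cleanupStaticPodsAfterReset(dataDir string, etcdEnabled bool) error {
	if !etcdEnabled {
		return nil
	}
	dbDir := filepath.Join(dataDir, "server", "db")
	if _, err := os.Stat(dbDir); err == nil {
		return nil // database present: normal start, leave manifests alone
	} else if !os.IsNotExist(err) {
		return err
	}
	// No etcd database: assume cluster-reset and drop the stale manifests.
	manifestDir := filepath.Join(dataDir, "agent", "pod-manifests") // assumed layout
	for _, name := range []string{"etcd.yaml", "kube-apiserver.yaml"} {
		err := os.Remove(filepath.Join(manifestDir, name))
		if err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```

The point of running a check like this early in startup is ordering: the manifests are gone before the kubelet comes up, so the old pods can never race the rewrite of the manifests with the current config.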

Types of Changes

bugfix

Verification

  • See linked issue
  • In addition to the steps in the linked issue, you should also see some new messages at the end of the cluster-reset process:
    INFO[0067] Shutting down kubelet and etcd
    ERRO[0067] Kubelet exited: signal: killed
    INFO[0072] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
    
  • Confirm that there is no etcd process running after rke2 exits at the end of the cluster-reset.

Testing

Linked Issues

User-Facing Change

Further Comments

@brandond requested a review from a team as a code owner on September 1, 2023 17:38
@brandond force-pushed the staticpod-sync-fix_release-1.25 branch from 9f829e1 to 77030f3 on September 1, 2023 18:33
@brandond merged commit 785512e into rancher:release-1.25 on Sep 1, 2023
@brandond deleted the staticpod-sync-fix_release-1.25 branch on June 6, 2024