Unrecoverable error when joining node attempts to retrieve etcd member list from itself #9661
Is it possible to get this prioritised and backported to 1.26? We are running into this issue on all our LTS releases, and right now it is being masked by the rke2-server retry loop we have put in place. Ideally, the node join should succeed on the first attempt; we have seen it sometimes take 5 to 10 restarts (40-60 minutes).
I don't have a timeline for when this will be resolved; certainly not for the March releases, as code freeze is today. I'm also confused by what you mean by LTS; we do not offer LTS releases for k3s or rke2. The only way I'm aware of to reproduce this requires an incorrectly configured environment where a server is sent its own requests when joining the cluster. If you are seeing this on a regular basis, then you need to change the way you are configuring your fixed registration endpoints and ensure that you do not send registration traffic to servers before they are ready.
We do have a health probe defined at the load balancer for port 9345, used to route traffic destined for that port. It appears the load balancer is forwarding requests back to the same node, since rke2 brings up port 9345 the moment the rke2 service starts. Should we change the health probe to a different port, i.e. either the etcd port or the API server port? Would that solve the issue, and are there any repercussions to using a different port for the health probe on the LB?
Same answer as I gave at rancher/rke2#5557 (comment), except for k3s the health-check would look like:
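(The exact command from the linked comment is not captured above. Purely as a rough illustration of the shape of such a check, a token-authenticated probe of the k3s supervisor might look like the snippet below; the endpoint path, port, and token handling here are assumptions, not quoted from that comment.)

```bash
# Illustrative sketch only -- see rancher/rke2#5557 for the actual suggestion.
# Assumes the k3s supervisor (which shares port 6443 on k3s) exposes a
# readiness path that accepts the cluster join token as a bearer credential.
K3S_TOKEN="$(cat /var/lib/rancher/k3s/server/token)"   # join token on an existing server
curl -ks \
  -H "Authorization: Bearer ${K3S_TOKEN}" \
  "https://127.0.0.1:6443/v1-k3s/readyz"
```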
@brandond Most load balancers only support a health probe on a single port, so it is not practically possible to have health probes on multiple ports. The curl-based check you suggested is also not feasible via the LB, as we won't know the token at the time we configure the health probe on the backend pool. Is it possible to use a health probe on a single port, and if so, which port/component should it be?
You can set the token manually; you don't have to let the server generate a random one for you. If all of your servers are control-plane+etcd, I would health check the apiserver on 6443. |
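A minimal sketch of what that could look like, assuming all servers are control-plane+etcd and anonymous auth on the apiserver has not been disabled; the token value, hostname, and file paths below are placeholders, and the paths follow the standard rke2 layout (adjust for k3s):

```bash
# Pre-set the cluster token instead of letting the first server generate one,
# so every node (and any tooling) knows it up front.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
# Hypothetical values -- substitute your own.
token: my-shared-cluster-secret
# Only on joining nodes, not on the first server:
server: https://rke2-servers.example.com:9345
EOF

# LB health probe target: the kube-apiserver on 6443. /readyz is served to
# unauthenticated clients by default (via system:public-info-viewer), so a
# plain HTTPS probe is enough.
curl -sk https://127.0.0.1:6443/readyz
```

A probe against 6443 only reports healthy once the apiserver is actually serving, which avoids the window where the supervisor port is already listening but the node has not finished joining.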
Yeah, all our servers are control-plane + etcd. We will try with 6443 and get back here with the outcome. Thanks Brandon
@brandond With 6443 as the health probe port for rke2, we are not seeing any issues. Thanks for your help 🙏
Unable to replicate / validate in k3s, but was able to replicate with rke2. Validation was done on the rke2 commit with the k3s pull-through and is tracked here: rancher/rke2#5804 (comment)
Tracking issue for the sequence of events discussed in rancher/rke2#5557 (comment)
If users deploy an external load-balancer or DNS round-robin address list to provide the fixed registration endpoint, and add server nodes to the target pool before they have finished joining the cluster, nodes may attempt to join themselves. This can leave the joining node in a permanently broken state that requires manual cleanup to resolve.
We should enhance the cluster join process to allow detecting cases where a server is attempting to join itself, and either retry or return an error, rather than continuing on with partial information.
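As a stopgap on the operator side, a pre-join guard along these lines could catch the DNS round-robin variant of the misconfiguration before the service starts; the hostname is a placeholder and this script is not part of k3s itself:

```bash
#!/usr/bin/env bash
# Illustrative pre-join guard (not part of k3s): refuse to start the join if
# the fixed registration endpoint currently resolves to this node's own
# addresses.
set -euo pipefail

REGISTRATION_HOST="k3s-servers.example.com"   # hypothetical fixed registration endpoint

# Addresses the registration endpoint resolves to right now.
endpoint_ips=$(getent ahosts "$REGISTRATION_HOST" | awk '{print $1}' | sort -u)

# Addresses assigned to this node.
local_ips=$(hostname -I | tr ' ' '\n' | sort -u)

if [ -n "$(comm -12 <(echo "$endpoint_ips") <(echo "$local_ips"))" ]; then
  echo "Registration endpoint $REGISTRATION_HOST points at this node; refusing to join ourselves." >&2
  exit 1
fi

echo "Registration endpoint does not resolve to this node; safe to start the join."
```

This only covers the case where the endpoint resolves directly to the joining node; in-process detection as described above would also cover a load balancer that forwards a joining server's requests back to itself.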