Unrecoverable error when joining node attempts to retrieve etcd member list from itself #9661
Is it possible to get this prioritised and backported to 1.26? We are running into this issue on all our LTS releases, and right now it is being masked by the rke2-server retry loop we have put in place. Ideally, the node join should succeed on the first attempt; we have seen it sometimes take 5 to 10 restarts (40-60 minutes).
I don't have a timeline for when this will be resolved; certainly not for the March releases, as code freeze is today. I'm also confused by what you mean by LTS; we do not offer LTS releases for k3s or rke2. The only way I'm aware of to reproduce this requires an incorrectly configured environment where a server is sent its own requests when joining the cluster. If you are seeing this on a regular basis, then you need to change the way you are configuring your fixed registration endpoints and ensure that you do not send registration traffic to servers before they are ready.
We do have a health probe defined at the load balancer for port 9345, used to route traffic destined for that port. It appears the load balancer is forwarding requests back to the same node, since rke2 brings up port 9345 the moment the rke2 service starts. Should we change the health probe to a different port, i.e. either the etcd port or the API server port? Would that solve the issue, and are there any repercussions to using a different port for the health probe on the LB?
Same answer as I gave at rancher/rke2#5557 (comment), except for k3s the health-check would look like:
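(The exact command from the linked comment is not captured above. Purely as a rough illustration of the shape of such a check, a token-authenticated probe of the k3s supervisor might look like the snippet below; the endpoint path, port, and token handling here are assumptions, not quoted from that comment.)

```bash
# Illustrative sketch only -- see rancher/rke2#5557 for the actual suggestion.
# Assumes the k3s supervisor (which shares port 6443 on k3s) exposes a
# readiness path that accepts the cluster join token as a bearer credential.
K3S_TOKEN="$(cat /var/lib/rancher/k3s/server/token)"   # join token on an existing server
curl -ks \
  -H "Authorization: Bearer ${K3S_TOKEN}" \
  "https://127.0.0.1:6443/v1-k3s/readyz"
```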
@brandond Most load balancers only support a health probe on a single port, so it is not practically possible to have health probes on multiple ports. The curl-based check you suggested is also not feasible via the LB, as we won't know the token at the time we configure the health probe on the backend pool. Is it possible to use a health probe on a single port, and if so, which port/component should it be?
You can set the token manually; you don't have to let the server generate a random one for you. If all of your servers are control-plane+etcd, I would health check the apiserver on 6443. |
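A minimal sketch of what that could look like, assuming all servers are control-plane+etcd and anonymous auth on the apiserver has not been disabled; the token value, hostname, and file paths below are placeholders, and the paths follow the standard rke2 layout (adjust for k3s):

```bash
# Pre-set the cluster token instead of letting the first server generate one,
# so every node (and any tooling) knows it up front.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
# Hypothetical values -- substitute your own.
token: my-shared-cluster-secret
# Only on joining nodes, not on the first server:
server: https://rke2-servers.example.com:9345
EOF

# LB health probe target: the kube-apiserver on 6443. /readyz is served to
# unauthenticated clients by default (via system:public-info-viewer), so a
# plain HTTPS probe is enough.
curl -sk https://127.0.0.1:6443/readyz
```

A probe against 6443 only reports healthy once the apiserver is actually serving, which avoids the window where the supervisor port is already listening but the node has not finished joining.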
Yeah, all our servers are control-plane + etcd. We will try with 6443 and get back here with the outcome. Thanks Brandon
@brandond With 6443 as the health probe port for rke2, we are not seeing any issues. Thanks for your help 🙏
Unable to replicate / validate in k3s, but was able to replicate with rke2. Validation was done on the rke2 commit with the k3s pull-through and is tracked here: rancher/rke2#5804 (comment)
Tracking issue for the sequence of events discussed in rancher/rke2#5557 (comment)
If users deploy an external load-balancer or DNS round-robin address list to provide the fixed registration endpoint, and add server nodes to the target pool before they have finished joining the cluster, nodes may attempt to join themselves. This can leave the joining node in a permanently broken state that requires manual cleanup to resolve.
We should enhance the cluster join process to allow detecting cases where a server is attempting to join itself, and either retry or return an error, rather than continuing on with partial information.
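As a stopgap on the operator side, a pre-join guard along these lines could catch the DNS round-robin variant of the misconfiguration before the service starts; the hostname is a placeholder and this script is not part of k3s itself:

```bash
#!/usr/bin/env bash
# Illustrative pre-join guard (not part of k3s): refuse to start the join if
# the fixed registration endpoint currently resolves to this node's own
# addresses.
set -euo pipefail

REGISTRATION_HOST="k3s-servers.example.com"   # hypothetical fixed registration endpoint

# Addresses the registration endpoint resolves to right now.
endpoint_ips=$(getent ahosts "$REGISTRATION_HOST" | awk '{print $1}' | sort -u)

# Addresses assigned to this node.
local_ips=$(hostname -I | tr ' ' '\n' | sort -u)

if [ -n "$(comm -12 <(echo "$endpoint_ips") <(echo "$local_ips"))" ]; then
  echo "Registration endpoint $REGISTRATION_HOST points at this node; refusing to join ourselves." >&2
  exit 1
fi

echo "Registration endpoint does not resolve to this node; safe to start the join."
```

This only covers the case where the endpoint resolves directly to the joining node; in-process detection as described above would also cover a load balancer that forwards a joining server's requests back to itself.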