Can't join the 3rd master to new cluster #5080
Comments
Check the etcd pod logs under /var/log/pods. If you can't find anything interesting in there, attach all the pod logs, and the complete rke2-server logs from journald on the 3rd node.
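(For reference, a rough way to pull those up on the node itself - a minimal sketch assuming the default RKE2 log layout, where the exact pod directory name varies per node:)

# List the static pod log directories and pick out the etcd one
ls /var/log/pods/ | grep etcd
# Tail the most recent etcd container log
tail -n 200 /var/log/pods/kube-system_etcd-*/etcd/*.log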
I have made some progress and discovered that the 3rd node is not listening on TCP 2380, but I haven't had any luck determining why. faulty_node_journal_log.txt
Update:
I checked the folder /etc/cni/net.d on the faulty node, and it's completely empty, unlike the other working nodes. Can I somehow generate the config files?
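(Regarding the port check mentioned above - a quick way to confirm which etcd ports are actually bound on a node, assuming ss is available:)

# 2379 is the etcd client port, 2380 the peer port
ss -tlnp | grep -E ':(2379|2380)'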
Did you check the etcd pod logs as suggested? It's not listening on the etcd port, and there are errors about connecting to etcd... but it sounds like you haven't checked the actual etcd logs yet.
Sorry for leaving that part out. I wanted to check the etcd logs under /var/logs/pods, but the folder is completely empty:
So I searched for that and found this post, and believed I had the same issue. I am new to K8s, so I might have misunderstood the question. If so, please let me know :)
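(If /var/log/pods is empty, the etcd static pod may never have started at all; one way to check is to ask RKE2's embedded containerd directly - a sketch assuming the default RKE2 binary and crictl config paths:)

export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
# Is there an etcd container at all, even an exited one?
/var/lib/rancher/rke2/bin/crictl ps -a | grep etcd
# If so, dump its logs by container ID
/var/lib/rancher/rke2/bin/crictl logs <container-id>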
From the journald log:
The 3rd node can't join because the 2nd node hasn't finished joining yet. Make sure that all of the correct ports are open between your nodes - see https://docs.rke2.io/install/requirements#inbound-network-rules for the list, taking particular note of the etcd ports. The etcd pod logs on the 1st and 2nd nodes probably have more information about the state of the cluster.
edit: Was this node (dk1k8s07) already joined to the cluster at some point? You can't join a 4th node to a 3-node cluster while one of them is unhealthy.
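(A quick connectivity spot-check from the joining node towards the existing servers, assuming nc is installed; the full list of required ports is in the linked requirements page:)

# 9345 = RKE2 supervisor, 6443 = kube-apiserver, 2379/2380 = etcd
for host in 192.168.20.51 192.168.20.52; do
  for port in 9345 6443 2379 2380; do
    nc -zv -w 3 "$host" "$port"
  done
done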
Hi - thanks for the reply. dk1k8s07 is not present in the kubectl get nodes output: dk1k8s03 was originally intended to be the 3rd master, but I never succeeded in getting it into the cluster, so I reinstalled it with a new name (dk1k8s07) but the same IP address (192.168.20.53). UFW and AppArmor are disabled on all the nodes. I really appreciate your time on this! Let me know what to do now :)
Hmm. It would be exceedingly difficult for the node to be present in etcd without also joining the Kubernetes cluster, but it sounds like you've got some pretty screwy things going on with this cluster. You could use etcdctl to remove the stale node, but easier than that would probably be to stop rke2 on all nodes, run a cluster-reset on one of the healthy servers, and then re-join the others.
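(For the etcdctl route, a rough sketch of listing and removing a stale member from one of the healthy servers. RKE2 does not ship etcdctl on the host, so this runs it inside the etcd static pod; the pod name and certificate paths below assume RKE2's defaults:)

# List the current etcd members from a healthy server (pod name follows etcd-<nodename>)
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system exec etcd-dk1k8s01 -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  member list
# Then remove the stale member using the ID from the first column:
#   ... etcdctl member remove <member-id>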
And just to be sure - it means I should delete everything under "/var/lib/rancher" on the 2nd and 3rd node, correct? :)
No, just the server/db directory. The cluster-reset will confirm that for you when it completes.
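(Put together, the reset flow being described looks roughly like this; which node you keep is up to you, and the cluster-reset output itself tells you what to clean up on the other servers - treat this as a sketch, not a copy-paste procedure:)

# On every server node: stop the service first
systemctl stop rke2-server

# On the servers you intend to re-join: clear only the etcd data
rm -rf /var/lib/rancher/rke2/server/db

# On the one server you keep: reset etcd to a single-member cluster
rke2 server --cluster-reset

# Start the kept server again, then re-join the others one at a time
systemctl start rke2-server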
I performed these steps:
Then I tried to run rke2 server --cluster-reset on dk1k8s02, but keep getting this error message:
Update:
Any suggestions?
Environmental Info:
RKE2 Version:
rke2 version v1.26.10+rke2r2 (21e3a8c)
go version go1.20.10 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Linux dk1k8s01 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
The cluster consists of 2 etcd nodes, and I want to add a third one before I add the agent nodes.
dk1k8sclu01 / 192.168.20.50 - vip / keepalived
dk1k8s01 / 192.168.20.51 - working (current holder of the Keepalived VIP)
dk1k8s02 / 192.168.20.52 - working
dk1k8s07 / 192.168.20.53 - faulty
root@dk1k8s01:~# /var/lib/rancher/rke2/bin/kubectl get nodes --kubeconfig /etc/rancher/rke2/rke2.yaml
NAME STATUS ROLES AGE VERSION
dk1k8s01 Ready control-plane,etcd,master 2d22h v1.26.10+rke2r2
dk1k8s02 Ready control-plane,etcd,master 2d21h v1.26.10+rke2r2
Describe the bug:
So I created 3 identical Ubuntu 22.04 VMs on Proxmox and followed the same simple guide from Rancher on how to install the platform. The first 2 nodes worked as expected, but the third one always fails to join the cluster.
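(For context, the quickstart flow amounts to roughly the following on each server; whether the keepalived VIP or a node IP goes into server: is an assumption here, and the token placeholder obviously needs the real value from the first server:)

# First server (dk1k8s01)
curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server

# Additional servers: point config.yaml at an existing server before starting
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://192.168.20.50:9345
token: <contents of /var/lib/rancher/rke2/server/node-token on the first server>
EOF
curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server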
I have performed the following actions:
When I join the 3rd node to the cluster I can see that it starts consuming a lot of CPU and memory, and after about 5 minutes it fails.
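(Following the supervisor and kubelet logs live while the join attempt runs is the quickest way to see what it is stuck on, assuming the default RKE2 paths:)

journalctl -u rke2-server -f
tail -f /var/lib/rancher/rke2/agent/logs/kubelet.log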
The log on the 3rd node just gets flooded with this type of message:
Additional context / logs:
faulty_node.txt
I'm a bit lost here, and any help would be greatly appreciated! :)