Cannot join nodes back after cluster-reset when not restoring from snapshot #2991

Closed
rancher-max opened this issue May 25, 2022 · 5 comments

@rancher-max
Contributor

Forwardport for: #2857

@caroline-suse-rancher
Contributor

@brandond Do you know when we could plan to have this done? Or should it be put in the Backlog?

@brandond
Member

This appears to be an issue with etcd, but we've been unable to reproduce it in a simple configuration such that upstream can diagnose and fix it. I would like QA to validate that it is still an issue with newer releases of etcd (such as the version shipped with 1.26), and if so we can put another spike of work into coming up with a simple reproducer for upstream.

@VestigeJ self-assigned this Mar 14, 2023

@VestigeJ
Contributor

Note: I diffed the two etcdctl get k8s outputs and they differ, but they are too noisy to paste into a comment. I have both stashed if you want to diff them yourself.

Reproduced using VERSION=v1.23.6+rke2r2

$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 119824c81c28314b | started | ip-172-31-19-252-4fd2809b | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-0d8f4741 | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+

ec2-user@ip-172-31-19-252:~$ kgn

NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-19-252   Ready      control-plane,etcd,master   28m   v1.23.6+rke2r2
ip-172-31-20-246   NotReady   control-plane,etcd,master   30m   v1.23.6+rke2r2

$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 119824c81c28314b | started | ip-172-31-19-252-4fd2809b | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-0d8f4741 | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+

ec2-user@ip-172-31-20-246:~$ kgn

NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-19-252   NotReady   control-plane,etcd,master   28m   v1.23.6+rke2r2
ip-172-31-20-246   Ready      control-plane,etcd,master   29m   v1.23.6+rke2r2

$ get_figs rke2

=========== rke2 config =========== 
write-kubeconfig-mode: 644
debug: true
token: garlicdinosaurs
profile: cis-1.6
selinux: true
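
The disagreement between the two node listings above can be checked mechanically. The following is a minimal sketch, assuming `kgn` is a local alias for `kubectl get nodes` and that its output has been captured as text from each node. The function names `parse_node_status` and `disagreeing_nodes` are hypothetical helpers, not part of RKE2 or kubectl.

```python
def parse_node_status(text):
    """Map node name -> STATUS from `kubectl get nodes` style output."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    name_col, status_col = header.index("NAME"), header.index("STATUS")
    return {row[name_col]: row[status_col] for row in body}


def disagreeing_nodes(view_a, view_b):
    """Return the node names whose STATUS differs between the two views."""
    shared = view_a.keys() & view_b.keys()
    return sorted(name for name in shared if view_a[name] != view_b[name])


# Output copied from the v1.23.6+rke2r2 reproduction above.
VIEW_FROM_19_252 = """
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-19-252   Ready      control-plane,etcd,master   28m   v1.23.6+rke2r2
ip-172-31-20-246   NotReady   control-plane,etcd,master   30m   v1.23.6+rke2r2
"""

VIEW_FROM_20_246 = """
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-19-252   NotReady   control-plane,etcd,master   28m   v1.23.6+rke2r2
ip-172-31-20-246   Ready      control-plane,etcd,master   29m   v1.23.6+rke2r2
"""

view_a = parse_node_status(VIEW_FROM_19_252)
view_b = parse_node_status(VIEW_FROM_20_246)
print(disagreeing_nodes(view_a, view_b))
# -> ['ip-172-31-19-252', 'ip-172-31-20-246']
```

Each node reports itself as Ready and its peer as NotReady, so both names appear in the result.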

Attempting to reproduce on v1.24.11+rke2r1 was more problematic

first node
ec2-user@ip-172-31-20-246:~$ kgn

NAME               STATUS     ROLES                       AGE     VERSION
ip-172-31-19-252   NotReady   control-plane,etcd,master   9m27s   v1.24.11+rke2r1
ip-172-31-20-246   Ready      control-plane,etcd,master   12m     v1.24.11+rke2r1

second node
ec2-user@ip-172-31-19-252:~$ kgn

NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-19-252   Ready      control-plane,etcd,master   10m   v1.24.11+rke2r1
ip-172-31-20-246   NotReady   control-plane,etcd,master   13m   v1.24.11+rke2r1

first node
ec2-user@ip-172-31-20-246:~$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 56542824e1640916 | started | ip-172-31-19-252-63f9e008 | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-d9247a40 | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+

second node
ec2-user@ip-172-31-19-252:~$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 56542824e1640916 | started | ip-172-31-19-252-63f9e008 | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-d9247a40 | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
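
The two member tables above look identical by eye. The sketch below confirms that by parsing them, assuming `get_etcd` is a local wrapper that prints something like `etcdctl member list -w table`. `parse_member_table` and `member_keys` are hypothetical helper names introduced only for this illustration.

```python
def parse_member_table(text):
    """Parse an ASCII table like the ones above into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def member_keys(members):
    """Reduce each member to the fields that identify it."""
    return {(m["ID"], m["NAME"], m["PEER ADDRS"]) for m in members}


# Output copied from the v1.24.11+rke2r1 reproduction above.
FIRST_NODE_TABLE = """
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 56542824e1640916 | started | ip-172-31-19-252-63f9e008 | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-d9247a40 | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
"""

# The second node printed the same table in the transcript above.
SECOND_NODE_TABLE = FIRST_NODE_TABLE

first = parse_member_table(FIRST_NODE_TABLE)
second = parse_member_table(SECOND_NODE_TABLE)
print(member_keys(first) == member_keys(second))  # -> True
print(sorted(m["ID"] for m in first))
# -> ['56542824e1640916', 'd9ff197e0dcb86f5']
```

Both nodes list the same two started members, even though their node listings disagree about which node is Ready.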

Attempting to reproduce on v1.26.2+rke2r1

first node

$ kgn

NAME               STATUS     ROLES                       AGE     VERSION
ip-172-31-19-252   NotReady   control-plane,etcd,master   9m39s   v1.26.2+rke2r1
ip-172-31-20-246   Ready      control-plane,etcd,master   11m     v1.26.2+rke2r1

ec2-user@ip-172-31-20-246:~$ kgp -A

NAMESPACE     NAME                                                    READY   STATUS      RESTARTS        AGE
kube-system   cloud-controller-manager-ip-172-31-19-252               1/1     Running     0               9m39s
kube-system   cloud-controller-manager-ip-172-31-20-246               1/1     Running     3 (3m34s ago)   11m
kube-system   etcd-ip-172-31-19-252                                   1/1     Running     0               9m4s
kube-system   etcd-ip-172-31-20-246                                   1/1     Running     1 (3m45s ago)   10m
kube-system   helm-install-rke2-canal-wkvfv                           0/1     Completed   0               11m
kube-system   helm-install-rke2-coredns-bjzjr                         0/1     Completed   0               11m
kube-system   helm-install-rke2-ingress-nginx-wxgnx                   0/1     Completed   0               11m
kube-system   helm-install-rke2-metrics-server-ht572                  0/1     Completed   0               11m
kube-system   helm-install-rke2-snapshot-controller-7j7dg             0/1     Completed   2               11m
kube-system   helm-install-rke2-snapshot-controller-crd-pdzcr         0/1     Completed   0               11m
kube-system   helm-install-rke2-snapshot-validation-webhook-pws8v     0/1     Completed   0               11m
kube-system   kube-apiserver-ip-172-31-19-252                         1/1     Running     0               9m39s
kube-system   kube-apiserver-ip-172-31-20-246                         1/1     Running     0               11m
kube-system   kube-controller-manager-ip-172-31-19-252                1/1     Running     0               9m39s
kube-system   kube-controller-manager-ip-172-31-20-246                1/1     Running     3 (3m34s ago)   11m
kube-system   kube-proxy-ip-172-31-19-252                             1/1     Running     0               9m36s
kube-system   kube-proxy-ip-172-31-20-246                             1/1     Running     1 (7m13s ago)   11m
kube-system   kube-scheduler-ip-172-31-19-252                         1/1     Running     0               9m39s
kube-system   kube-scheduler-ip-172-31-20-246                         1/1     Running     2 (3m33s ago)   11m
kube-system   rke2-canal-brrnf                                        2/2     Running     0               9m41s
kube-system   rke2-canal-r2vvp                                        2/2     Running     2 (7m13s ago)   10m
kube-system   rke2-coredns-rke2-coredns-854779488f-bl9zg              1/1     Running     1 (7m13s ago)   11m
kube-system   rke2-coredns-rke2-coredns-854779488f-m4qt6              1/1     Running     0               9m39s
kube-system   rke2-coredns-rke2-coredns-autoscaler-75b5699cf4-glz22   1/1     Running     1 (7m13s ago)   11m
kube-system   rke2-ingress-nginx-controller-6gnmm                     1/1     Running     0               9m19s
kube-system   rke2-ingress-nginx-controller-w5hwl                     1/1     Running     1 (7m13s ago)   10m
kube-system   rke2-metrics-server-778467dc76-gng6g                    1/1     Running     1 (7m13s ago)   10m
kube-system   rke2-snapshot-controller-7f55854fcc-x56l7               1/1     Running     1 (7m13s ago)   10m
kube-system   rke2-snapshot-validation-webhook-75bdb9f759-s9496       1/1     Running     1 (7m13s ago)   10m

ec2-user@ip-172-31-20-246:~$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-b39c9ecb | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
| eb5736460076e9a2 | started | ip-172-31-19-252-dd23b050 | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+

second node

ec2-user@ip-172-31-19-252:~$ kgn

NAME               STATUS     ROLES                       AGE     VERSION
ip-172-31-19-252   Ready      control-plane,etcd,master   9m37s   v1.26.2+rke2r1
ip-172-31-20-246   NotReady   control-plane,etcd,master   11m     v1.26.2+rke2r1

ec2-user@ip-172-31-19-252:~$ kgp -A

NAMESPACE     NAME                                                    READY   STATUS      RESTARTS        AGE
kube-system   cloud-controller-manager-ip-172-31-19-252               1/1     Running     2 (90s ago)     9m41s
kube-system   cloud-controller-manager-ip-172-31-20-246               0/1     Running     2 (6m36s ago)   11m
kube-system   etcd-ip-172-31-19-252                                   1/1     Running     1 (96s ago)     9m6s
kube-system   etcd-ip-172-31-20-246                                   0/1     Running     1 (3m47s ago)   10m
kube-system   helm-install-rke2-canal-wkvfv                           0/1     Completed   0               11m
kube-system   helm-install-rke2-coredns-bjzjr                         0/1     Completed   0               11m
kube-system   helm-install-rke2-ingress-nginx-wxgnx                   0/1     Completed   0               11m
kube-system   helm-install-rke2-metrics-server-ht572                  0/1     Completed   0               11m
kube-system   helm-install-rke2-snapshot-controller-7j7dg             0/1     Completed   2               11m
kube-system   helm-install-rke2-snapshot-controller-crd-pdzcr         0/1     Completed   0               11m
kube-system   helm-install-rke2-snapshot-validation-webhook-pws8v     0/1     Completed   0               11m
kube-system   kube-apiserver-ip-172-31-19-252                         1/1     Running     2 (90s ago)     9m41s
kube-system   kube-apiserver-ip-172-31-20-246                         0/1     Running     0               11m
kube-system   kube-controller-manager-ip-172-31-19-252                1/1     Running     2 (90s ago)     9m41s
kube-system   kube-controller-manager-ip-172-31-20-246                0/1     Running     2 (6m36s ago)   11m
kube-system   kube-proxy-ip-172-31-19-252                             1/1     Running     1 (96s ago)     9m38s
kube-system   kube-proxy-ip-172-31-20-246                             1/1     Running     1 (7m15s ago)   11m
kube-system   kube-scheduler-ip-172-31-19-252                         1/1     Running     2 (85s ago)     9m41s
kube-system   kube-scheduler-ip-172-31-20-246                         1/1     Running     1 (7m15s ago)   11m
kube-system   rke2-canal-brrnf                                        2/2     Running     2 (96s ago)     9m43s
kube-system   rke2-canal-r2vvp                                        0/2     Unknown     0               11m
kube-system   rke2-coredns-rke2-coredns-854779488f-bl9zg              0/1     Unknown     0               11m
kube-system   rke2-coredns-rke2-coredns-854779488f-m4qt6              1/1     Running     1 (96s ago)     9m41s
kube-system   rke2-coredns-rke2-coredns-autoscaler-75b5699cf4-glz22   0/1     Unknown     0               11m
kube-system   rke2-ingress-nginx-controller-6gnmm                     1/1     Running     1 (96s ago)     9m21s
kube-system   rke2-ingress-nginx-controller-w5hwl                     0/1     Unknown     0               10m
kube-system   rke2-metrics-server-778467dc76-gng6g                    0/1     Unknown     0               10m
kube-system   rke2-snapshot-controller-7f55854fcc-x56l7               0/1     Unknown     0               10m
kube-system   rke2-snapshot-validation-webhook-75bdb9f759-s9496       0/1     Unknown     0               10m

ec2-user@ip-172-31-19-252:~$ get_etcd

+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| d9ff197e0dcb86f5 | started | ip-172-31-20-246-b39c9ecb | https://172.31.20.246:2380 | https://172.31.20.246:2379 |      false |
| eb5736460076e9a2 | started | ip-172-31-19-252-dd23b050 | https://172.31.19.252:2380 | https://172.31.19.252:2379 |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
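
To pull out the pods that the second node sees as not ready, the pod listing can be filtered on its READY column. This is a minimal sketch, assuming `kgp -A` is a local alias for `kubectl get pods -A`. `not_ready_pods` is a hypothetical helper, and only a subset of the rows from the second node's listing above is included to keep the example short.

```python
def not_ready_pods(text):
    """Return names of pods whose READY count is below the total, skipping Completed jobs."""
    names = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        # Split only the first four columns; RESTARTS may contain spaces.
        namespace, name, ready, status = line.split(None, 4)[:4]
        have, want = (int(n) for n in ready.split("/"))
        if status != "Completed" and have < want:
            names.append(name)
    return names


# Subset of the second node's view on v1.26.2+rke2r1, copied from above.
SECOND_NODE_VIEW = """
NAMESPACE     NAME                                READY   STATUS      RESTARTS        AGE
kube-system   etcd-ip-172-31-19-252               1/1     Running     1 (96s ago)     9m6s
kube-system   etcd-ip-172-31-20-246               0/1     Running     1 (3m47s ago)   10m
kube-system   helm-install-rke2-canal-wkvfv       0/1     Completed   0               11m
kube-system   kube-apiserver-ip-172-31-20-246     0/1     Running     0               11m
kube-system   rke2-canal-brrnf                    2/2     Running     2 (96s ago)     9m43s
kube-system   rke2-canal-r2vvp                    0/2     Unknown     0               11m
"""

pods = not_ready_pods(SECOND_NODE_VIEW)
print(pods)
# -> ['etcd-ip-172-31-20-246', 'kube-apiserver-ip-172-31-20-246', 'rke2-canal-r2vvp']
```

Completed helm-install jobs are skipped because 0/1 is their normal final state.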

@brandond
Member

brandond commented Mar 17, 2023

Yes, this seems to be fairly easy to reproduce using embedded etcd. I need to be able to provide the etcd maintainers with a set of steps to reproduce on vanilla etcd, without using Kubernetes, in order for them to diagnose it.

@VestigeJ
Contributor

As of RKE2 v1.27.4+rke2r1 I'm no longer seeing this behavior and am electing to close this issue as stale and no longer relevant.
