
no nodes have reconciled ETCDSnapshotFile resources, requeuing error on clusters with no etcd snapshots #4906

Closed
brandond opened this issue Oct 17, 2023 · 1 comment

Comments

@brandond
Member

RKE2 tracking issue for the corresponding upstream K3s issue.

@mdrahman-suse
Contributor

mdrahman-suse commented Oct 24, 2023

Validated with RC version 1.28.3-rc1+rke2r1

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64)

Cluster Configuration:

Split roles: 1 etcd-only server, 1 control-plane-only server, 1 agent

Config.yaml:

# server1 (etcd only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<etcdonly-ip>"
node-name: etcdonly
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
node-taint:
  - node-role.kubernetes.io/etcd:NoExecute

# server2 (cp only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<cponly-ip>"
node-name: cponly
server: "https://<etcdonly-ip>:9345"
disable-etcd: true
node-taint:
  - node-role.kubernetes.io/control-plane:NoSchedule

# agent1
server: "https://<etcdonly-ip>:9345"
token: <TOKEN>
node-external-ip: "<agent-ip>"
node-name: agent

Testing Steps

  1. Copy config.yaml into place:
$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install rke2
  3. Verify that the cp-only server log does not contain the error:
level=error msg="error syncing '_reconcile_': handler managed-etcd-snapshots-controller: no nodes have reconciled ETCDSnapshotFile resources, requeuing"
  4. Ensure the cluster is up and running
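The log check in the steps above can be scripted. This is a minimal sketch of the grep logic only; the sample line below is a hypothetical stand-in for live `sudo journalctl -u rke2-server` output, which requires a running cluster.

```shell
# Hypothetical sample journal line (replace with real output from:
#   sudo journalctl -u rke2-server --no-pager)
log='level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"'

# Fail the check if the reconcile error from the issue title appears.
if printf '%s\n' "$log" | grep -q 'no nodes have reconciled ETCDSnapshotFile'; then
  echo "FAIL: reconcile error present"
else
  echo "PASS: reconcile error not present"
fi
```

On a fixed build the error line should be absent, so the check prints the PASS branch.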

Replication Results:

  • Attempted to replicate with master commit 86ca5f4 but was unable to observe the error
  • With that commit, the cluster does not even come up properly

Validation Results:

  • rke2 version used for validation:
rke2 version v1.28.3-rc1+rke2r1 (13bc384ad5a3eb070acc975e4a551b654bba9e42)
go version go1.20.10 X:boringcrypto
  • Observed that the error msg is NOT visible
$ sudo journalctl -u rke2-server | grep 'ETCDSnapshotFile'
Oct 24 02:06:06 ip-xxx-xx-7-64 rke2[1769]: time="2023-10-24T02:06:06Z" level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"
ubuntu@ip-xxx-xx-7-64:~$
  • Cluster is up and running
NAME                                              STATUS   ROLES                  AGE    VERSION          INTERNAL-IP    EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
node/ip-xxx-xx-2-81.us-east-2.compute.internal    Ready    <none>                 16m    v1.28.3+rke2r1   xxx.xx.2.81    x.xxx.xx.166     Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-7-64.us-east-2.compute.internal    Ready    control-plane,master   17m    v1.28.3+rke2r1   xxx.xx.7.64    x.xxx.xxx.169    Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-9-224.us-east-2.compute.internal   Ready    etcd                   17m    v1.28.3+rke2r1   xxx.xx.9.224   xx.xxx.xxx.122   Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1

NAMESPACE     NAME                                                                      READY   STATUS      RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
kube-system   pod/cloud-controller-manager-ip-xxx-xx-7-64.us-east-2.compute.internal    1/1     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/cloud-controller-manager-ip-xxx-xx-9-224.us-east-2.compute.internal   1/1     Running     0          17m    xxx.xx.9.224   ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/etcd-ip-xxx-xx-9-224.us-east-2.compute.internal                       1/1     Running     0          17m    xxx.xx.9.224   ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-canal-zfttd                                         0/1     Completed   0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-coredns-jbcqq                                       0/1     Completed   0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-ingress-nginx-shcc9                                 0/1     Completed   0          17m    xx.xx.2.6      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-metrics-server-fmjdw                                0/1     Completed   0          17m    xx.xx.2.5      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-crd-gvnhz                       0/1     Completed   0          17m    xx.xx.2.3      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-xt7tp                           0/1     Completed   1          17m    xx.xx.2.2      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-trx9n                   0/1     Completed   0          17m    xx.xx.2.4      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-apiserver-ip-xxx-xx-7-64.us-east-2.compute.internal              1/1     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-controller-manager-ip-xxx-xx-7-64.us-east-2.compute.internal     1/1     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-2-81.us-east-2.compute.internal                  1/1     Running     0          17m    xxx.xx.2.81    ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-7-64.us-east-2.compute.internal                  1/1     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-9-224.us-east-2.compute.internal                 1/1     Running     0          17m    xxx.xx.9.224   ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-scheduler-ip-xxx-xx-7-64.us-east-2.compute.internal              1/1     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-canal-4rprq                                                      2/2     Running     0          17m    xxx.xx.7.64    ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-canal-9mlsn                                                      2/2     Running     0          17m    xxx.xx.2.81    ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-canal-mpfwf                                                      2/2     Running     0          17m    xxx.xx.9.224   ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-6b795db654-q7487                            1/1     Running     0          16m    xx.xx.1.2      ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-6b795db654-tb6cn                            1/1     Running     0          17m    xx.xx.0.3      ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-945fbd459-hnfrw                  1/1     Running     0          17m    xx.xx.0.2      ip-xxx-xx-7-64.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-ingress-nginx-controller-plh86                                   1/1     Running     0          13m    xx.xx.2.9      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-metrics-server-544c8c66fc-tlz85                                  1/1     Running     0          13m    xx.xx.2.7      ip-xxx-xx-2-81.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-snapshot-controller-59cc9cd8f4-hpb2t                             1/1     Running     0          13m    xx.xx.1.4      ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-snapshot-validation-webhook-54c5989b65-qs7d4                     1/1     Running     0          13m    xx.xx.1.3      ip-xxx-xx-9-224.us-east-2.compute.internal   <none>           <none>

NOTE: The issue was not observed on an all-roles cluster setup
