CrashLoopBackOff status on helm-install-rke2-coredns pod after upgrade #5492

Closed
mdrahman-suse opened this issue Feb 21, 2024 · 1 comment
Labels
kind/bug Something isn't working

mdrahman-suse commented Feb 21, 2024

Environmental Info:
RKE2 Version:

v1.29.2-rc1+rke2r1 and below RCs

Node(s) CPU architecture, OS, and Version:

Linux ip-xx 5.15.0-1031-aws #35-Ubuntu SMP Fri Feb 10 02:07:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server / HA setup (3 servers + 1 agent)

Describe the bug:

After a simple upgrade, the helm-install-rke2-coredns pod goes into CrashLoopBackOff status.
The pod logs show this error:

Error: INSTALLATION FAILED: 1 error occurred:
	* Service "rke2-coredns" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.10"}: failed to allocate IP 10.43.0.10: provided IP is already allocated
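
To see which Service currently holds 10.43.0.10, the Service list can be filtered by ClusterIP. The following is a minimal sketch, not part of the original report: `find_ip_owner` is a hypothetical helper name, and it assumes `kubectl` is configured for the affected cluster. The post-upgrade pod list in this report shows both `rke2-coredns-rke2-coredns-*` and `rke2-coredns-*` pods, so a Service left over from the earlier release may be what still holds the address; that is an inference, not something the logs confirm.

```shell
# Hypothetical helper: print "namespace/name" for every Service whose
# ClusterIP equals $1. Expects rows of "NAMESPACE NAME CLUSTER-IP" on stdin.
find_ip_owner() {
  awk -v ip="$1" '$3 == ip { print $1 "/" $2 }'
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get svc -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,IP:.spec.clusterIP \
#     | find_ip_owner 10.43.0.10
```

If this prints a Service other than `rke2-coredns`, that Service is the likely source of the "already allocated" error.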

Steps To Reproduce:

  • Install RKE2 v1.29.1+rke2r1
  • Wait until the cluster is up and running
  • Upgrade RKE2 to v1.29.2-rc1+rke2r1 (manual upgrade or SUC)
  • Wait until the cluster is up again

Expected behavior:

  • All pods should be in either Running or Completed state
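
The expectation above can be checked mechanically by filtering the pod list on its STATUS column. This is a minimal sketch under stated assumptions: `unhealthy_pods` is a hypothetical helper name, and the input is assumed to be the output of `kubectl get pods -A --no-headers`, where STATUS is the fourth whitespace-separated field.

```shell
# Hypothetical helper: print "namespace/pod STATUS" for every pod whose
# STATUS is neither Running nor Completed. Reads pod rows on stdin.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Usage against a live cluster (assumes kubectl access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

Empty output means every pod is in one of the two expected states.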

Actual behavior:

  • The helm-install-rke2-coredns pod goes into CrashLoopBackOff state

  • Before upgrade

NAME           STATUS   ROLES                       AGE   VERSION          INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
node/server1   Ready    control-plane,etcd,master   10m   v1.29.1+rke2r1   172.31.29.117   <none>        Ubuntu 22.04.2 LTS   5.15.0-1031-aws   containerd://1.7.11-k3s2

NAMESPACE         NAME                                                        READY   STATUS      RESTARTS      AGE     IP              NODE      NOMINATED NODE   READINESS GATES
calico-system     pod/calico-kube-controllers-f9666c8f4-rrw8z                 1/1     Running     0             9m5s    10.42.116.130   server1   <none>           <none>
calico-system     pod/calico-node-h2slv                                       1/1     Running     0             9m5s    172.31.29.117   server1   <none>           <none>
calico-system     pod/calico-typha-c55fcd44c-vcn97                            1/1     Running     0             9m5s    172.31.29.117   server1   <none>           <none>
kube-system       pod/cloud-controller-manager-server1                        1/1     Running     1 (53s ago)   10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/etcd-server1                                            1/1     Running     0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/helm-install-rke2-calico-crd-jk5tc                      0/1     Completed   0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/helm-install-rke2-calico-qdtb9                          0/1     Completed   2             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/helm-install-rke2-coredns-fc8tn                         0/1     Completed   0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/helm-install-rke2-ingress-nginx-5gssc                   0/1     Completed   0             10m     10.42.116.134   server1   <none>           <none>
kube-system       pod/helm-install-rke2-metrics-server-wptm5                  0/1     Completed   0             10m     10.42.116.133   server1   <none>           <none>
kube-system       pod/helm-install-rke2-snapshot-controller-crd-zjxwn         0/1     Completed   0             10m     10.42.116.132   server1   <none>           <none>
kube-system       pod/helm-install-rke2-snapshot-controller-jfdt6             0/1     Completed   0             10m     10.42.116.135   server1   <none>           <none>
kube-system       pod/helm-install-rke2-snapshot-validation-webhook-qw5c8     0/1     Completed   0             10m     10.42.116.136   server1   <none>           <none>
kube-system       pod/kube-apiserver-server1                                  1/1     Running     0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/kube-controller-manager-server1                         1/1     Running     1 (53s ago)   10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/kube-proxy-server1                                      1/1     Running     0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/kube-scheduler-server1                                  1/1     Running     0             10m     172.31.29.117   server1   <none>           <none>
kube-system       pod/rke2-coredns-rke2-coredns-9849d5ddb-9bjb9               1/1     Running     0             9m33s   10.42.116.129   server1   <none>           <none>
kube-system       pod/rke2-coredns-rke2-coredns-autoscaler-64b867c686-d96q9   1/1     Running     0             9m33s   10.42.116.131   server1   <none>           <none>
kube-system       pod/rke2-ingress-nginx-controller-crstg                     1/1     Running     0             6m20s   10.42.116.141   server1   <none>           <none>
kube-system       pod/rke2-metrics-server-544c8c66fc-f8j7h                    1/1     Running     0             6m58s   10.42.116.137   server1   <none>           <none>
kube-system       pod/rke2-snapshot-controller-59cc9cd8f4-kpmth               1/1     Running     1 (53s ago)   6m57s   10.42.116.138   server1   <none>           <none>
kube-system       pod/rke2-snapshot-validation-webhook-54c5989b65-x9584       1/1     Running     0             6m55s   10.42.116.139   server1   <none>           <none>
tigera-operator   pod/tigera-operator-59d6c9b46-g4w6v                         1/1     Running     1 (53s ago)   9m15s   172.31.29.117   server1   <none>           <none>

  • After upgrade

NAME      STATUS   ROLES                       AGE   VERSION
server1   Ready    control-plane,etcd,master   40m   v1.29.2+rke2r1

NAMESPACE         NAME                                                    READY   STATUS             RESTARTS        AGE
calico-system     calico-kube-controllers-6c47dbdcd8-jh9h9                1/1     Running            0               24m
calico-system     calico-node-b6cvj                                       1/1     Running            0               24m
calico-system     calico-typha-7f88b9b9c8-6fwlm                           1/1     Running            0               24m
kube-system       cloud-controller-manager-server1                        1/1     Running            3 (26m ago)     39m
kube-system       etcd-server1                                            1/1     Running            0               26m
kube-system       helm-install-rke2-calico-crd-kxn45                      0/1     Completed          0               26m
kube-system       helm-install-rke2-calico-r7vqq                          0/1     Completed          0               26m
kube-system       helm-install-rke2-coredns-lsfg5                         0/1     CrashLoopBackOff   9 (3m10s ago)   26m
kube-system       helm-install-rke2-ingress-nginx-4dn74                   0/1     Completed          0               26m
kube-system       helm-install-rke2-metrics-server-fn6kt                  0/1     Completed          0               26m
kube-system       helm-install-rke2-snapshot-controller-crd-c67zq         0/1     Completed          0               26m
kube-system       helm-install-rke2-snapshot-controller-dzh7g             0/1     Completed          0               26m
kube-system       helm-install-rke2-snapshot-validation-webhook-d7kpv     0/1     Completed          0               26m
kube-system       kube-apiserver-server1                                  1/1     Running            0               26m
kube-system       kube-controller-manager-server1                         1/1     Running            1 (26m ago)     26m
kube-system       kube-proxy-server1                                      1/1     Running            0               26m
kube-system       kube-scheduler-server1                                  1/1     Running            0               26m
kube-system       rke2-coredns-784786f455-k5mcj                           0/1     Pending            0               3m10s
kube-system       rke2-coredns-autoscaler-f559b6d84-dsq2c                 1/1     Running            0               3m10s
kube-system       rke2-coredns-rke2-coredns-9849d5ddb-9bjb9               1/1     Running            0               38m
kube-system       rke2-coredns-rke2-coredns-autoscaler-64b867c686-d96q9   1/1     Running            0               38m
kube-system       rke2-ingress-nginx-controller-crstg                     1/1     Running            0               35m
kube-system       rke2-metrics-server-544c8c66fc-f8j7h                    1/1     Running            0               35m
kube-system       rke2-snapshot-controller-59cc9cd8f4-kpmth               1/1     Running            3 (26m ago)     35m
kube-system       rke2-snapshot-validation-webhook-54c5989b65-x9584       1/1     Running            0               35m
tigera-operator   tigera-operator-69d9db6f79-v89d9                        1/1     Running            0               24m

Additional context / logs:

Pod describe
k describe -n kube-system pod/helm-install-rke2-coredns-lsfg5
Name:             helm-install-rke2-coredns-lsfg5
Namespace:        kube-system
Priority:         0
Service Account:  helm-rke2-coredns
Node:             server1/172.31.29.117
Start Time:       Wed, 21 Feb 2024 15:26:12 +0000
Labels:           batch.kubernetes.io/controller-uid=0d1d1fa2-d2f6-44bb-8b86-44a40b688185
                  batch.kubernetes.io/job-name=helm-install-rke2-coredns
                  controller-uid=0d1d1fa2-d2f6-44bb-8b86-44a40b688185
                  helmcharts.helm.cattle.io/chart=rke2-coredns
                  job-name=helm-install-rke2-coredns
Annotations:      helmcharts.helm.cattle.io/configHash: SHA256=864F9D90876143023C65BF4A9E53D976919560678260637168E04F38FFE29048
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               172.31.29.117
IPs:
  IP:           172.31.29.117
Controlled By:  Job/helm-install-rke2-coredns
Containers:
  helm:
    Container ID:  containerd://c50217f2269319e544f40ac93259215fe61d25c69fd2793820dccc1f1c2bd3f4
    Image:         rancher/klipper-helm:v0.8.2-build20230815
    Image ID:      docker.io/rancher/klipper-helm@sha256:b0b0c4f73f2391697edb52adffe4fc490de1c8590606024515bb906b2813554a
    Port:          <none>
    Host Port:     <none>
    Args:
      install
      --set-string
      global.clusterCIDR=10.42.0.0/16
      --set-string
      global.clusterCIDRv4=10.42.0.0/16
      --set-string
      global.clusterDNS=10.43.0.10
      --set-string
      global.clusterDomain=cluster.local
      --set-string
      global.rke2DataDir=/var/lib/rancher/rke2
      --set-string
      global.serviceCIDR=10.43.0.0/16
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   Uninstalling failed helm_v3 chart
Installing helm_v3 chart

      Exit Code:    1
      Started:      Wed, 21 Feb 2024 15:48:15 +0000
      Finished:     Wed, 21 Feb 2024 15:48:17 +0000
    Ready:          False
    Restart Count:  9
    Environment:
      NAME:                     rke2-coredns
      VERSION:
      REPO:
      HELM_DRIVER:              secret
      CHART_NAMESPACE:          kube-system
      CHART:
      HELM_VERSION:
      TARGET_NAMESPACE:         kube-system
      AUTH_PASS_CREDENTIALS:    false
      KUBERNETES_SERVICE_HOST:  127.0.0.1
      KUBERNETES_SERVICE_PORT:  6443
      BOOTSTRAP:                true
      NO_PROXY:                 .svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
      FAILURE_POLICY:           reinstall
    Mounts:
      /chart from content (rw)
      /config from values (rw)
      /home/klipper-helm/.cache from klipper-cache (rw)
      /home/klipper-helm/.config from klipper-config (rw)
      /home/klipper-helm/.helm from klipper-helm (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qw2ct (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  klipper-helm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  klipper-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  klipper-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  values:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  chart-values-rke2-coredns
    Optional:    false
  content:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      chart-content-rke2-coredns
    Optional:  false
  kube-api-access-qw2ct:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
                             node-role.kubernetes.io/control-plane=true
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                             node.kubernetes.io/not-ready:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned kube-system/helm-install-rke2-coredns-lsfg5 to server1
  Normal   Pulled     20m (x5 over 22m)     kubelet            Container image "rancher/klipper-helm:v0.8.2-build20230815" already present on machine
  Normal   Created    20m (x5 over 22m)     kubelet            Created container helm
  Normal   Started    20m (x5 over 22m)     kubelet            Started container helm
  Warning  BackOff    2m48s (x87 over 22m)  kubelet            Back-off restarting failed container helm in pod helm-install-rke2-coredns-lsfg5_kube-system(afc7eb3a-daf9-42ba-a830-c518144e0620)
Pod logs
k logs -n kube-system pod/helm-install-rke2-coredns-lsfg5
if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
	echo "KUBERNETES_SERVICE_HOST is using IPv6"
	CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
	CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-coredns.tgz.base64
+ CHART_PATH=/tmp/rke2-coredns.tgz
+ [[ ! -f /chart/rke2-coredns.tgz.base64 ]]
+ base64 -d /chart/rke2-coredns.tgz.base64
+ CHART=/tmp/rke2-coredns.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-coredns.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-coredns$' --namespace kube-system --output json
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=1.11.1,failed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ 1.11.1 =~ ^(|null)$ ]]
+ [[ failed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ failed == \d\e\p\l\o\y\e\d ]]
+ [[ failed =~ ^(deleted|failed|null|unknown)$ ]]
+ [[ reinstall == \r\e\i\n\s\t\a\l\l ]]
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ echo 'Uninstalling failed helm_v3 chart'
+ helm_v3 uninstall rke2-coredns --namespace kube-system --wait
release "rke2-coredns" uninstalled
+ echo Deleted
Deleted
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=10.42.0.0/16 --set-string global.clusterCIDRv4=10.42.0.0/16 --set-string global.clusterDNS=10.43.0.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.43.0.0/16 rke2-coredns /tmp/rke2-coredns.tgz
Error: INSTALLATION FAILED: 1 error occurred:
	* Service "rke2-coredns" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.10"}: failed to allocate IP 10.43.0.10: provided IP is already allocated


+ exit
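
The trace above shows the release in `failed` status with `FAILURE_POLICY=reinstall`, which leads to an uninstall followed by a fresh install. The uninstall reports success, but the install then fails because 10.43.0.10 is still allocated, so every container restart repeats the same cycle. The sketch below reconstructs only the branch order visible in the trace. It is not the real klipper-helm script: `decide_action` is a hypothetical name, and branches whose handling the trace does not show are labelled as such.

```shell
# Hypothetical reconstruction of the status checks seen in the trace.
# $1 = release status as reported by `helm ls`, $2 = FAILURE_POLICY
decide_action() {
  case "$1" in
    pending-install|pending-upgrade|pending-rollback)
      echo "pending (handling not shown in the trace)" ;;
    deployed)
      echo "deployed (handling not shown in the trace)" ;;
    deleted|failed|null|unknown)
      if [ "$2" = "reinstall" ]; then
        # Matches the "Uninstalling failed helm_v3 chart" path in the logs.
        echo "uninstall-then-install"
      else
        echo "other-policy (handling not shown in the trace)"
      fi ;;
    *)
      echo "unhandled (not shown in the trace)" ;;
  esac
}

# The case from this report:
decide_action failed reinstall   # prints: uninstall-then-install
```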

mdrahman-suse commented

Validated with RC v1.29.2-rc2+rke2r1

  • After upgrade from v1.29.1+rke2r1
$ rke2 -v
rke2 version v1.29.2-rc2+rke2r1 (03437e4a6942c2db1cba728fbee5ac9f264a28fb)
go version go1.21.7 X:boringcrypto

$ k get nodes
NAME      STATUS   ROLES                       AGE   VERSION
server1   Ready    control-plane,etcd,master   23m   v1.29.2+rke2r1

$ k get pods -A
NAMESPACE     NAME                                                    READY   STATUS      RESTARTS        AGE
kube-system   cloud-controller-manager-server1                        1/1     Running     4 (9m29s ago)   22m
kube-system   etcd-server1                                            1/1     Running     0               10m
kube-system   helm-install-rke2-canal-h6jbj                           0/1     Completed   0               8m29s
kube-system   helm-install-rke2-coredns-7btgj                         0/1     Completed   0               8m29s
kube-system   helm-install-rke2-ingress-nginx-bk75g                   0/1     Completed   0               8m29s
kube-system   helm-install-rke2-metrics-server-g2f4k                  0/1     Completed   0               8m29s
kube-system   helm-install-rke2-snapshot-controller-crd-jxrxc         0/1     Completed   0               8m28s
kube-system   helm-install-rke2-snapshot-controller-kvq4l             0/1     Completed   0               8m29s
kube-system   helm-install-rke2-snapshot-validation-webhook-vctpv     0/1     Completed   0               8m28s
kube-system   kube-apiserver-server1                                  1/1     Running     0               10m
kube-system   kube-controller-manager-server1                         1/1     Running     0               10m
kube-system   kube-proxy-server1                                      1/1     Running     0               8m36s
kube-system   kube-scheduler-server1                                  1/1     Running     0               10m
kube-system   rke2-canal-n6pz8                                        2/2     Running     0               7m5s
kube-system   rke2-coredns-rke2-coredns-6fd7bb5597-z28gt              1/1     Running     0               7m10s
kube-system   rke2-coredns-rke2-coredns-autoscaler-55fb4bbbcf-l45vt   1/1     Running     0               7m10s
kube-system   rke2-ingress-nginx-controller-6tkxz                     1/1     Running     0               16m
kube-system   rke2-metrics-server-544c8c66fc-qvmxg                    1/1     Running     0               17m
kube-system   rke2-snapshot-controller-59cc9cd8f4-ppd9r               1/1     Running     3 (9m24s ago)   17m
kube-system   rke2-snapshot-validation-webhook-54c5989b65-jddcw       1/1     Running     0               17m
