
Pods stuck in Pending state while using latest commit #4894

Closed
ShylajaDevadiga opened this issue Oct 16, 2023 · 5 comments
Labels
kind/bug Something isn't working status/release-blocker

Comments

@ShylajaDevadiga
Contributor

Environmental Info:
RKE2 Version:
rke2 version v1.28.2+dev.45c21222
rke2 version v1.27.6+dev.7c3bb478

Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04, SLES 15 SP4

Cluster Configuration:
Single node

Describe the bug:
Pods are stuck in the Pending state and are never scheduled.

$ kubectl describe pod -n kube-system rke2-coredns-rke2-coredns-767db6b7f9-xt95z
Name:                 rke2-coredns-rke2-coredns-767db6b7f9-xt95z
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      coredns
Node:                 <none>
Labels:               app.kubernetes.io/instance=rke2-coredns
                      app.kubernetes.io/name=rke2-coredns
                      k8s-app=kube-dns
                      pod-template-hash=767db6b7f9
Annotations:          checksum/config: bcf80a9706efc4765afe05995cfe0a3c8b7fe8028589bc17da9524fa0c97f51a
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/rke2-coredns-rke2-coredns-767db6b7f9
Containers:
  coredns:
    Image:       rancher/hardened-coredns:v1.10.1-build20230607
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:        100m
      memory:     128Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=30s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9hdj2 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rke2-coredns-rke2-coredns
    Optional:  false
  kube-api-access-9hdj2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/etcd:NoExecute op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  67s   default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..

Steps To Reproduce:
Install rke2 using commit 45c2122

Expected behavior:
All pods should be scheduled and reach the Running state

Actual behavior:
Pods remain Pending, waiting to be scheduled

Additional context / logs:
rke2.log

@brandond
Member

brandond commented Oct 17, 2023

The rke2 cloud-controller is still running code from 1.26.4, which predates the recent dual-stack changes. The following message is seen in the cloud-controller-manager logs:

I1016 18:03:37.256417       1 node_controller.go:415] Initializing node rke2-server-2 with cloud provider
E1016 18:03:37.256479       1 node_controller.go:229] error syncing 'rke2-server-2': failed to get node modifiers from cloud provider: failed to parse node IP "172.17.0.4,fd7c:53a5:aef5::242:ac11:4" for node "rke2-server-2", requeuing

After updating K3s to latest in rancher/image-build-rke2-cloud-provider#30, we are seeing a different error:

I1017 00:02:46.187875       1 node_controller.go:431] Initializing node rke2-server-1 with cloud provider
E1017 00:02:46.188287       1 node_controller.go:240] error syncing 'rke2-server-1': failed to get node modifiers from cloud provider: provided node ip for node "rke2-server-1" is not valid: failed to parse node IP "172.17.0.3,fd7c:53a5:aef5::242:ac11:3": dual-stack not supported in this configuration, requeuing
I1017 00:03:25.286910       1 node_controller.go:431] Initializing node rke2-server-2 with cloud provider
E1017 00:03:25.287354       1 node_controller.go:240] error syncing 'rke2-server-2': failed to get node modifiers from cloud provider: provided node ip for node "rke2-server-2" is not valid: failed to parse node IP "172.17.0.4,fd7c:53a5:aef5::242:ac11:4": dual-stack not supported in this configuration, requeuing

This appears to be related to recent changes in dual-stack node-ip behavior on the K3s side - possibly k3s-io/k3s#8581

@brandond
Member

brandond commented Oct 17, 2023

It looks like K3s is handling this by enabling the CloudDualStackNodeIPs=true feature-gate if the local node has dual-stack node IPs. I am honestly not sure how this is even working in K3s, as it appears to only be enabling that feature-gate on the kubelet, and not the cloud-controller-manager. I think there may be some leakage of feature-gate enablement on K3s due to all the components running in the same process that is allowing this to work there, but not here.

I also suspect this is a breaking change, as setting the node-ip to a dual-stack list will require the feature-gate to be enabled on ALL cloud-providers used with RKE2, or new nodes will fail to initialize and will remain tainted.

This is related to:

@brandond
Member

brandond commented Oct 17, 2023

Confirmed that setting the feature-gate on the cloud-controller-manager fixes this (4cb0de3), but I think this should be done on the K3s side, not RKE2.

I am also concerned that this will be a breaking change for other cloud providers, unless for some reason the node-ip behavior is gated on use of the built-in stub cloud provider?

@brandond
Member

Just confirming that the feature-gate should be enabled for both kubelet AND cloud-controller-manager, starting with 1.27:

@ShylajaDevadiga
Contributor Author

Validated on rke2 version v1.28.3-rc2+rke2r1 (0d0d0e4)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:

> cat /etc/os-release
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"

Cluster Configuration:
3 server 1 agent

Config.yaml:

write-kubeconfig-mode: "0644"
tls-san:
  - fake.fqdn.value
node-name: ip-172-31-12-203.us-east-2.compute.internal
profile: cis-1.23
node-external-ip: IP

Steps to reproduce the issue and validate

  1. Copy config.yaml
  2. Install rke2
ec2-user@ip-172-31-12-203:~> rke2 -v
rke2 version v1.28.3-rc2+rke2r1 (0d0d0e4879fdf95254461e3a49224f75d7b2dc3d)
go version go1.20.10 X:boringcrypto
ec2-user@ip-172-31-12-203:~> kubectl get nodes
NAME                                          STATUS   ROLES                       AGE     VERSION
ip-172-31-10-71.us-east-2.compute.internal    Ready    control-plane,etcd,master   5m2s    v1.28.3+rke2r1
ip-172-31-12-203.us-east-2.compute.internal   Ready    control-plane,etcd,master   7m49s   v1.28.3+rke2r1
ip-172-31-14-43.us-east-2.compute.internal    Ready    <none>                      4m5s    v1.28.3+rke2r1
ip-172-31-8-250.us-east-2.compute.internal    Ready    control-plane,etcd,master   4m15s   v1.28.3+rke2r1
ec2-user@ip-172-31-12-203:~> kubectl get pods -A
NAMESPACE     NAME                                                                   READY   STATUS      RESTARTS   AGE
dnsutils      dnsutils                                                               1/1     Running     0          6s
kube-system   cloud-controller-manager-ip-172-31-10-71.us-east-2.compute.internal    1/1     Running     0          4m42s
kube-system   cloud-controller-manager-ip-172-31-12-203.us-east-2.compute.internal   1/1     Running     0          7m47s
kube-system   cloud-controller-manager-ip-172-31-8-250.us-east-2.compute.internal    1/1     Running     0          4m14s
kube-system   etcd-ip-172-31-10-71.us-east-2.compute.internal                        1/1     Running     0          4m17s
kube-system   etcd-ip-172-31-12-203.us-east-2.compute.internal                       1/1     Running     0          7m21s
kube-system   etcd-ip-172-31-8-250.us-east-2.compute.internal                        1/1     Running     0          3m43s
kube-system   helm-install-rke2-canal-lnh4t                                          0/1     Completed   0          7m28s
kube-system   helm-install-rke2-coredns-sknh6                                        0/1     Completed   0          7m28s
kube-system   helm-install-rke2-ingress-nginx-ml8hc                                  0/1     Completed   0          7m28s
kube-system   helm-install-rke2-metrics-server-xn8g6                                 0/1     Completed   0          7m27s
kube-system   helm-install-rke2-snapshot-controller-crd-6vbrw                        0/1     Completed   0          7m27s
kube-system   helm-install-rke2-snapshot-controller-zfc9k                            0/1     Completed   1          7m26s
kube-system   helm-install-rke2-snapshot-validation-webhook-vkgj7                    0/1     Completed   0          7m25s
kube-system   kube-apiserver-ip-172-31-10-71.us-east-2.compute.internal              1/1     Running     0          4m14s
kube-system   kube-apiserver-ip-172-31-12-203.us-east-2.compute.internal             1/1     Running     0          7m49s
kube-system   kube-apiserver-ip-172-31-8-250.us-east-2.compute.internal              1/1     Running     0          4m11s
kube-system   kube-controller-manager-ip-172-31-10-71.us-east-2.compute.internal     1/1     Running     0          4m42s
kube-system   kube-controller-manager-ip-172-31-12-203.us-east-2.compute.internal    1/1     Running     0          7m49s
kube-system   kube-controller-manager-ip-172-31-8-250.us-east-2.compute.internal     1/1     Running     0          4m14s
kube-system   kube-proxy-ip-172-31-10-71.us-east-2.compute.internal                  1/1     Running     0          4m48s
kube-system   kube-proxy-ip-172-31-12-203.us-east-2.compute.internal                 1/1     Running     0          7m25s
kube-system   kube-proxy-ip-172-31-14-43.us-east-2.compute.internal                  1/1     Running     0          4m7s
kube-system   kube-proxy-ip-172-31-8-250.us-east-2.compute.internal                  1/1     Running     0          4m12s
kube-system   kube-scheduler-ip-172-31-10-71.us-east-2.compute.internal              1/1     Running     0          4m42s
kube-system   kube-scheduler-ip-172-31-12-203.us-east-2.compute.internal             1/1     Running     0          7m49s
kube-system   kube-scheduler-ip-172-31-8-250.us-east-2.compute.internal              1/1     Running     0          4m14s
kube-system   rke2-canal-97tgl                                                       2/2     Running     0          4m17s
kube-system   rke2-canal-gklxq                                                       2/2     Running     0          4m7s
kube-system   rke2-canal-jhmr8                                                       2/2     Running     0          7m18s
kube-system   rke2-canal-s8k2g                                                       2/2     Running     0          5m4s
kube-system   rke2-coredns-rke2-coredns-6b795db654-6m9dr                             1/1     Running     0          4m59s
kube-system   rke2-coredns-rke2-coredns-6b795db654-jzjm9                             1/1     Running     0          7m18s
kube-system   rke2-coredns-rke2-coredns-autoscaler-945fbd459-jdsjs                   1/1     Running     0          7m18s
kube-system   rke2-ingress-nginx-controller-56gjl                                    1/1     Running     0          6m30s
kube-system   rke2-ingress-nginx-controller-ncr4x                                    1/1     Running     0          3m22s
kube-system   rke2-ingress-nginx-controller-v75j2                                    1/1     Running     0          3m17s
kube-system   rke2-ingress-nginx-controller-vpnhm                                    1/1     Running     0          4m19s
kube-system   rke2-metrics-server-544c8c66fc-7lwzz                                   1/1     Running     0          6m43s
kube-system   rke2-snapshot-controller-59cc9cd8f4-kzmwr                              1/1     Running     0          6m36s
kube-system   rke2-snapshot-validation-webhook-54c5989b65-x6ddr                      1/1     Running     0          6m36s
ec2-user@ip-172-31-12-203:~> 
