VMWare CPI/CSI Fails on RKE2 v1.28.14+rke2r1 #7439

Closed
Daemonslayer2048 opened this issue Dec 18, 2024 · 1 comment

Comments

@Daemonslayer2048

Environmental Info:
RKE2 Version:

rke2 version v1.28.14+rke2r1 (05928c524ec436f7d854c68dea34f3e3bf4d5287)
go version go1.22.6 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
CPU: x86_64
OS: Rocky Linux 9.4
Version: Linux server-01 5.14.0-427.13.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May 1 19:11:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
A simple 1 server and 1 agent test deployment.

Describe the bug:
Deploying the VMware CPI/CSI appears to not install the CRDs needed to run.

Steps To Reproduce:

  1. Install RKE2 via ansible
  2. Set the cloud-provider-name in /etc/rancher/rke2/config.yaml
    Example server config:
audit-policy-file: /etc/rancher/rke2/audit-policy.yaml
cloud-provider-name: rancher-vsphere
kube-apiserver-arg:
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- authorization-mode=RBAC,Node
- anonymous-auth=false
- audit-policy-file=/etc/rancher/rke2/audit-policy.yaml
- audit-log-mode=blocking-strict
- audit-log-maxage=30
- audit-log-path=/var/lib/rancher/rke2/server/logs/audit.log
kube-controller-manager-arg:
- bind-address=127.0.0.1
- use-service-account-credentials=true
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
kube-scheduler-arg:
- tls-min-version=VersionTLS12
- tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
kubelet-arg:
- kube-reserved=cpu=400m,memory=1Gi
- system-reserved=cpu=400m,memory=1Gi
- protect-kernel-defaults=true
- read-only-port=0
- authorization-mode=Webhook
- streaming-connection-idle-timeout=5m
- max-pods=400
node-name: server-01
pod-security-admission-config-file: /etc/rancher/rke2/pod-security-admission-config.yaml
profile: cis
secrets-encryption: true
selinux: true
server: https://10.7.2.118:9345
tls-san:
- 10.7.2.118
token: efrklerkveormvokermgrem
use-service-account-credentials: true
write-kubeconfig-mode: '0600'
  3. Configure the rancher-vsphere helm charts by adding a vsphere.yaml file to /var/lib/rancher/rke2/server/manifests/vsphere.yaml
    Example config:
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      credentialsSecret:
        generate: true
        name: vsphere-cpi-creds
      datacenters: Lab Datacenter
      host: vcenter-1.lab
      insecureFlag: true
      labels:
        generate: false
        region: k8s-region
        zone: k8s-zone
      password: asupersecurepassword
      port: 443
      username: ausername
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      clusterId: clusterb
      configSecret:
        configTemplate: >
          [Global]
          cluster-id = {{ required ".Values.vCenter.clusterId must be provided"
          (default .Values.vCenter.clusterId .Values.global.cattle.clusterId) |
          quote }}
          user = {{ .Values.vCenter.username | quote }}
          password = {{ .Values.vCenter.password | quote }}
          port = {{ .Values.vCenter.port | quote }}
          insecure-flag = {{ .Values.vCenter.insecureFlag | quote }}
          [VirtualCenter {{ .Values.vCenter.host | quote }}]
          datacenters = {{ .Values.vCenter.datacenters | quote }}
        generate: true
        name: vsphere-config-secret
      datacenters: Lab Datacenter
      host: vcenter-1.lab
      insecureFlag: "1"
      password: asupersecurepassword
      port: 443
      username: ausername
    csiController:
      tolerations:
        - key: node.cloudprovider.kubernetes.io/uninitialized
          value: "true"
          effect: NoSchedule
  4. Re/start the RKE2 server (see the restart and verification commands sketched below)
  5. Observe CSI failures
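For reference, a minimal sketch of the restart and verification steps, assuming the standard rke2-server/rke2-agent systemd units and the default RKE2 kubeconfig path:

# On server-01: pick up the new config.yaml and manifests
systemctl restart rke2-server
# On agent-01
systemctl restart rke2-agent

# Watch the vSphere CPI/CSI pods come up (or crash)
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
kubectl get pods -n kube-system -w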

Expected behavior:
The VMware CPI and CSI both install correctly and make PVs/PVCs available to the cluster.

Actual behavior:
The CPI installation appears to go fine; however, the CSI node throws an error implying a CRD has not been installed.

Additional context / logs:
A list of all pods:

[root@server-01 jhanafin]# kubectl get po -A
NAMESPACE     NAME                                                    READY   STATUS             RESTARTS        AGE
kube-system   etcd-server-01                                          1/1     Running            0               6m35s
kube-system   helm-install-rancher-vsphere-cpi-vnp8p                  0/1     Completed          0               6m53s
kube-system   helm-install-rancher-vsphere-csi-pscmq                  0/1     Completed          0               6m53s
kube-system   helm-install-rke2-canal-67gvr                           0/1     Completed          0               6m53s
kube-system   helm-install-rke2-coredns-wsptj                         0/1     Completed          0               6m53s
kube-system   helm-install-rke2-ingress-nginx-gl5g6                   0/1     Completed          0               6m50s
kube-system   helm-install-rke2-metrics-server-rh4l4                  0/1     Completed          1               6m49s
kube-system   helm-install-rke2-snapshot-controller-crd-f2bcb         0/1     Completed          1               6m49s
kube-system   helm-install-rke2-snapshot-controller-w8s29             0/1     Completed          1               6m49s
kube-system   helm-install-rke2-snapshot-validation-webhook-5f8dp     0/1     Completed          1               6m49s
kube-system   kube-apiserver-server-01                                1/1     Running            0               7m10s
kube-system   kube-controller-manager-server-01                       1/1     Running            0               7m14s
kube-system   kube-proxy-agent-01                                     1/1     Running            0               5m4s
kube-system   kube-proxy-server-01                                    1/1     Running            0               7m10s
kube-system   kube-scheduler-server-01                                1/1     Running            0               7m14s
kube-system   rancher-vsphere-cpi-cloud-controller-manager-cjv47      1/1     Running            0               6m44s
kube-system   rke2-canal-787m4                                        2/2     Running            0               6m41s
kube-system   rke2-canal-9kph2                                        2/2     Running            0               5m5s
kube-system   rke2-coredns-rke2-coredns-7875c9c6b7-cmkdf              1/1     Running            0               6m42s
kube-system   rke2-coredns-rke2-coredns-7875c9c6b7-gkb8n              1/1     Running            0               5m1s
kube-system   rke2-coredns-rke2-coredns-autoscaler-564964dcd5-msmbb   1/1     Running            0               6m42s
kube-system   rke2-ingress-nginx-controller-sjqsc                     1/1     Running            0               5m24s
kube-system   rke2-ingress-nginx-controller-vdrxw                     1/1     Running            0               4m49s
kube-system   rke2-metrics-server-6cd986844b-wb4tm                    1/1     Running            0               5m36s
kube-system   rke2-snapshot-controller-59cc9cd8f4-984kl               1/1     Running            0               5m37s
kube-system   rke2-snapshot-validation-webhook-54c5989b65-wrlm6       1/1     Running            0               5m55s
kube-system   vsphere-csi-controller-8578cc867c-ctwdn                 2/5     CrashLoopBackOff   19 (33s ago)    6m42s
kube-system   vsphere-csi-controller-8578cc867c-drbhl                 3/5     CrashLoopBackOff   20 (3s ago)     6m42s
kube-system   vsphere-csi-controller-8578cc867c-nck6q                 3/5     CrashLoopBackOff   19 (79s ago)    6m42s
kube-system   vsphere-csi-node-m8svk                                  2/3     CrashLoopBackOff   5 (86s ago)     4m49s
kube-system   vsphere-csi-node-v9fqh                                  2/3     CrashLoopBackOff   6 (2m37s ago)   6m17s

Logs from vsphere-csi-controller-8578cc867c-ctwdn:

[root@server-01 jhanafin]# kubectl logs vsphere-csi-controller-8578cc867c-ctwdn -n kube-system
Defaulted container "csi-attacher" out of: csi-attacher, vsphere-csi-controller, liveness-probe, vsphere-syncer, csi-provisioner
I1218 01:49:35.776326       1 main.go:97] Version: v4.5.1
W1218 01:49:45.780505       1 connection.go:234] Still connecting to unix:///csi/csi.sock
W1218 01:49:55.780974       1 connection.go:234] Still connecting to unix:///csi/csi.sock
W1218 01:50:05.780611       1 connection.go:234] Still connecting to unix:///csi/csi.sock
E1218 01:50:05.782949       1 main.go:136] context deadline exceeded

Logs from vsphere-csi-node-m8svk:

[root@server-01 jhanafin]# kubectl logs vsphere-csi-node-m8svk -n kube-system
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I1218 01:50:56.850288       1 main.go:135] Version: v2.10.1
I1218 01:50:56.850462       1 main.go:136] Running node-driver-registrar in mode=
I1218 01:50:56.850494       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1218 01:50:56.850571       1 connection.go:215] Connecting to unix:///csi/csi.sock
I1218 01:50:56.852921       1 main.go:164] Calling CSI driver to discover driver name
I1218 01:50:56.852986       1 connection.go:244] GRPC call: /csi.v1.Identity/GetPluginInfo
I1218 01:50:56.852999       1 connection.go:245] GRPC request: {}
I1218 01:50:56.858682       1 connection.go:251] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.3.0"}
I1218 01:50:56.858738       1 connection.go:252] GRPC error: <nil>
I1218 01:50:56.858768       1 main.go:173] CSI driver name: "csi.vsphere.vmware.com"
I1218 01:50:56.858968       1 node_register.go:55] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I1218 01:50:56.859543       1 node_register.go:64] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I1218 01:50:56.859740       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
I1218 01:50:58.834058       1 main.go:90] Received GetInfo call: &InfoRequest{}
I1218 01:50:58.855380       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "agent-01". Error: failed to get API group resources: unable to retrieve the complete list of server APIs: cns.vmware.com/v1alpha1: the server could not find the requested resource,}
E1218 01:50:58.855520       1 main.go:103] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "agent-01". Error: failed to get API group resources: unable to retrieve the complete list of server APIs: cns.vmware.com/v1alpha1: the server could not find the requested resource, restarting registration container.
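One way to check whether the cns.vmware.com/v1alpha1 types the registrar is asking for exist on this cluster at all (a minimal sketch; the group and version are taken from the error above):

kubectl get crd | grep -i 'cns.vmware.com'
kubectl api-resources --api-group=cns.vmware.com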
brandond (Member) commented Dec 18, 2024

I'm going to close this, as 1.28 is technically end of life.

That said, the version of the vsphere charts you're using was validated in #6338 and worked fine.

The error you're seeing suggests that an aggregated API resource is not available because the service that provides it is not available. It does not appear to be related to missing CRDs.
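For example, the availability of aggregated API services (as opposed to CRD-served groups) can be checked with a standard kubectl query; a sketch:

# APIServices that are not served locally (i.e. aggregated) and their Available condition
kubectl get apiservices | grep -v Local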

I'm not sure exactly where that comes from, but you might take a look at this issue and check for other errors with your configuration:

There are many other containers in the vsphere pods; have you checked them for errors as well?
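For instance, using the pod and container names from the output above (a sketch):

# List every container in the failing controller pod
kubectl get pod vsphere-csi-controller-8578cc867c-ctwdn -n kube-system -o jsonpath='{.spec.containers[*].name}'

# Pull logs per container, e.g. the controller itself rather than the defaulted csi-attacher
kubectl logs vsphere-csi-controller-8578cc867c-ctwdn -n kube-system -c vsphere-csi-controller
kubectl logs vsphere-csi-node-m8svk -n kube-system -c vsphere-csi-node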
