SUC upgrade is failing on SLEMicro 5.5 with v1.28 #5230

Closed
mdrahman-suse opened this issue Jan 9, 2024 · 1 comment
Labels
kind/bug, kind/os-validation

Comments

mdrahman-suse (Contributor) commented Jan 9, 2024

Environmental Info:
RKE2 Version:

Installed v1.28.4+rke2r1, upgrading to v1.28.5+rke2r1

Node(s) CPU architecture, OS, and Version:

Linux ip-172-31-29-170 5.14.21-150500.55.28-default #1 SMP PREEMPT_DYNAMIC Fri Sep 22 10:04:29 UTC 2023 (c11336f) x86_64 x86_64 x86_64 GNU/Linux
~> cat /etc/os-release
NAME="SLE Micro"
VERSION="5.5"
VERSION_ID="5.5"
PRETTY_NAME="SUSE Linux Enterprise Micro 5.5"
ID="sle-micro"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sle-micro:5.5"
VARIANT_ID="sle-micro"
VARIANT_VERSION="20231011"

Cluster Configuration:

3 servers, 1 agent; SELinux enabled on all the nodes

$ sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     requested (insecure)
Max kernel policy version:      33
  • /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: 644
token: <token>
node-name: server1
node-external-ip: <public-ip>
node-ip: <private-ip>
selinux: true
Plan: rke2-upgrade.yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-cp
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  version: v1.28.5+rke2r1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"]}
  tolerations:
    - operator: Exists
  serviceAccountName: system-upgrade
  cordon: true
  upgrade:
    image: rancher/rke2-upgrade
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-etcd
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  version: v1.28.5+rke2r1
  nodeSelector:
    matchExpressions:
        - key: "rke.cattle.io/etcd-role"
          operator: Exists
        - key: node-role.kubernetes.io/etcd
          operator: In
          values: [ "true" ]
        - key: node-role.kubernetes.io/control-plane
          operator: NotIn
          values: [ "true" ]
  tolerations:
    - operator: Exists
  serviceAccountName: system-upgrade
  prepare:
    image: rancher/rke2-upgrade
    args: ["prepare", "rke2-server-cp"]
  cordon: true
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-agent
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 2
  version: v1.28.5+rke2r1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/etcd, operator: NotIn, values: ["true"]}
      - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  serviceAccountName: system-upgrade
  prepare:
    image: rancher/rke2-upgrade
    args: ["prepare", "rke2-server-etcd"]
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
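
For reference, the nodeSelector expressions above can be sanity-checked before applying the plans. The following label selectors (standard kubectl syntax, roughly equivalent to the single-value In/NotIn expressions used here; not part of the original report) show which nodes each plan will target:

$ kubectl get nodes -l 'node-role.kubernetes.io/control-plane=true'
$ kubectl get nodes -l 'rke.cattle.io/etcd-role,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/control-plane!=true'
$ kubectl get nodes -l 'node-role.kubernetes.io/etcd!=true,node-role.kubernetes.io/control-plane!=true'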

Describe the bug:

When trying to upgrade RKE2 using SUC on SLE Micro 5.5, it was observed that the upgrade does not complete successfully. The same behavior was reproduced on SLE Micro 5.4.

Steps To Reproduce:

  • Install RKE2:
    • servers: curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.4+rke2r1 INSTALL_RKE2_METHOD=rpm sh -
    • agent: curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.4+rke2r1 INSTALL_RKE2_METHOD=rpm INSTALL_RKE2_TYPE=agent sh -
  • Ensure the cluster is up and running and all the nodes joined successfully
  • Deploy some workloads
  • Deploy SUC using: kubectl apply -f https://raw.githubusercontent.com/rancher/system-upgrade-controller/master/manifests/system-upgrade-controller.yaml
  • Apply the plan rke2-upgrade.yaml (see the example commands after this list)
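
For context, this is roughly how the plan is applied and the rollout watched (generic kubectl commands based on the files shown above; not copied from the original report):

$ kubectl apply -f rke2-upgrade.yaml
$ kubectl -n system-upgrade get plans,jobs,pods -o wide
$ kubectl get nodes -o wide --watch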

Expected behavior:

The cluster is upgraded successfully with the defined version and is up and running.

Actual behavior:

  • The initial server is cordoned and shows Ready,SchedulingDisabled status
  • The apply-server-* pods remain in Error state
Cluster details
NAME           STATUS                     ROLES                       AGE     VERSION          INTERNAL-IP     EXTERNAL-IP      OS-IMAGE                          KERNEL-VERSION                 CONTAINER-RUNTIME
node/agent1    Ready                      <none>                      142m    v1.28.4+rke2r1   172.31.21.234   18.116.59.215    SUSE Linux Enterprise Micro 5.5   5.14.21-150500.55.28-default   containerd://1.7.7-k3s1
node/server1   Ready                      control-plane,etcd,master   3h22m   v1.28.4+rke2r1   172.31.29.170   18.224.139.223   SUSE Linux Enterprise Micro 5.5   5.14.21-150500.55.28-default   containerd://1.7.7-k3s1
node/server2   Ready                      control-plane,etcd,master   174m    v1.28.4+rke2r1   172.31.25.25    18.217.155.250   SUSE Linux Enterprise Micro 5.5   5.14.21-150500.55.28-default   containerd://1.7.7-k3s1
node/server3   Ready,SchedulingDisabled   control-plane,etcd,master   167m    v1.28.4+rke2r1   172.31.29.168   13.59.174.109    SUSE Linux Enterprise Micro 5.5   5.14.21-150500.55.28-default   containerd://1.7.7-k3s1

NAMESPACE        NAME                                                                  READY   STATUS      RESTARTS       AGE     IP              NODE      NOMINATED NODE   READINESS GATES
auto-clusterip   pod/test-clusterip-8496c7779d-bglb5                                   1/1     Running     0              30m     10.42.3.3       agent1    <none>           <none>
auto-clusterip   pod/test-clusterip-8496c7779d-xjzk5                                   1/1     Running     0              30m     10.42.2.3       server3   <none>           <none>
auto-daemonset   pod/test-daemonset-2tm7p                                              1/1     Running     0              30m     10.42.2.4       server3   <none>           <none>
auto-daemonset   pod/test-daemonset-gdxfx                                              1/1     Running     0              30m     10.42.0.15      server1   <none>           <none>
auto-daemonset   pod/test-daemonset-nk9c9                                              1/1     Running     0              30m     10.42.1.4       server2   <none>           <none>
auto-daemonset   pod/test-daemonset-pls2g                                              1/1     Running     0              30m     10.42.3.4       agent1    <none>           <none>
auto-dns         pod/dnsutils                                                          1/1     Running     0              30m     10.42.3.5       agent1    <none>           <none>
auto-ingress     pod/test-ingress-c5jqt                                                1/1     Running     0              30m     10.42.3.7       agent1    <none>           <none>
auto-ingress     pod/test-ingress-gd45z                                                1/1     Running     0              30m     10.42.1.5       server2   <none>           <none>
auto-nodeport    pod/test-nodeport-644767cc74-2h2qr                                    1/1     Running     0              30m     10.42.2.5       server3   <none>           <none>
auto-nodeport    pod/test-nodeport-644767cc74-fv8kx                                    1/1     Running     0              30m     10.42.3.6       agent1    <none>           <none>
kube-system      pod/cloud-controller-manager-server1                                  1/1     Running     1 (173m ago)   3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/cloud-controller-manager-server2                                  1/1     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/cloud-controller-manager-server3                                  1/1     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/etcd-server1                                                      1/1     Running     0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/etcd-server2                                                      1/1     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/etcd-server3                                                      1/1     Running     0              166m    172.31.29.168   server3   <none>           <none>
kube-system      pod/helm-install-rke2-canal-km5lf                                     0/1     Completed   0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/helm-install-rke2-coredns-7jc9x                                   0/1     Completed   0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/helm-install-rke2-ingress-nginx-m84tf                             0/1     Completed   0              3h21m   10.42.0.6       server1   <none>           <none>
kube-system      pod/helm-install-rke2-metrics-server-85f7r                            0/1     Completed   0              3h21m   10.42.0.5       server1   <none>           <none>
kube-system      pod/helm-install-rke2-snapshot-controller-8j6f8                       0/1     Completed   1              3h21m   10.42.0.8       server1   <none>           <none>
kube-system      pod/helm-install-rke2-snapshot-controller-crd-m6nhz                   0/1     Completed   0              3h21m   10.42.0.2       server1   <none>           <none>
kube-system      pod/helm-install-rke2-snapshot-validation-webhook-gslsq               0/1     Completed   0              3h21m   10.42.0.3       server1   <none>           <none>
kube-system      pod/kube-apiserver-server1                                            1/1     Running     0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/kube-apiserver-server2                                            1/1     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/kube-apiserver-server3                                            1/1     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/kube-controller-manager-server1                                   1/1     Running     1 (173m ago)   3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/kube-controller-manager-server2                                   1/1     Running     1 (171m ago)   174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/kube-controller-manager-server3                                   1/1     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/kube-proxy-agent1                                                 1/1     Running     0              142m    172.31.21.234   agent1    <none>           <none>
kube-system      pod/kube-proxy-server1                                                1/1     Running     0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/kube-proxy-server2                                                1/1     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/kube-proxy-server3                                                1/1     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/kube-scheduler-server1                                            1/1     Running     1 (173m ago)   3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/kube-scheduler-server2                                            1/1     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/kube-scheduler-server3                                            1/1     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/rke2-canal-cgltx                                                  2/2     Running     0              174m    172.31.25.25    server2   <none>           <none>
kube-system      pod/rke2-canal-gcnld                                                  2/2     Running     0              3h21m   172.31.29.170   server1   <none>           <none>
kube-system      pod/rke2-canal-sjxnj                                                  2/2     Running     0              167m    172.31.29.168   server3   <none>           <none>
kube-system      pod/rke2-canal-tpp2g                                                  2/2     Running     0              142m    172.31.21.234   agent1    <none>           <none>
kube-system      pod/rke2-coredns-rke2-coredns-6b795db654-kvlg8                        1/1     Running     0              173m    10.42.1.2       server2   <none>           <none>
kube-system      pod/rke2-coredns-rke2-coredns-6b795db654-qssl6                        1/1     Running     0              3h21m   10.42.0.4       server1   <none>           <none>
kube-system      pod/rke2-coredns-rke2-coredns-autoscaler-945fbd459-zj62v              1/1     Running     0              3h21m   10.42.0.7       server1   <none>           <none>
kube-system      pod/rke2-ingress-nginx-controller-8lfrb                               1/1     Running     0              3h19m   10.42.0.13      server1   <none>           <none>
kube-system      pod/rke2-ingress-nginx-controller-lwjdv                               1/1     Running     0              172m    10.42.1.3       server2   <none>           <none>
kube-system      pod/rke2-ingress-nginx-controller-nwgdf                               1/1     Running     0              139m    10.42.3.2       agent1    <none>           <none>
kube-system      pod/rke2-ingress-nginx-controller-sz7ns                               1/1     Running     0              166m    10.42.2.2       server3   <none>           <none>
kube-system      pod/rke2-metrics-server-544c8c66fc-5gx2h                              1/1     Running     0              3h19m   10.42.0.9       server1   <none>           <none>
kube-system      pod/rke2-snapshot-controller-59cc9cd8f4-j9nmv                         1/1     Running     0              3h19m   10.42.0.12      server1   <none>           <none>
kube-system      pod/rke2-snapshot-validation-webhook-54c5989b65-dml2x                 1/1     Running     0              3h19m   10.42.0.10      server1   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-6l9jc   0/1     Error       0              12m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-9f95g   0/1     Error       0              17m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-d75sm   0/1     Error       0              18m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-rqhv4   0/1     Error       0              15m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-s8pnv   0/1     Error       0              7m30s   172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-vhrsm   0/1     Error       0              17m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-zh5z2   0/1     Error       0              18m     172.31.29.168   server3   <none>           <none>
system-upgrade   pod/system-upgrade-controller-5f646b9445-qdbgk                        1/1     Running     0              25m     10.42.1.6       server2   <none>           <none>

Additional context / logs:

  • No errors were observed in journalctl logs
  • Upon checking the logs of one of the apply-server pods, the following was observed
$ kubectl logs -n system-upgrade pod/apply-rke2-server-cp-on-server3-with-15330b6f4704566ac81c-zh5z2
Defaulted container "upgrade" out of: upgrade, cordon (init)
+ upgrade
+ get_rke2_process_info
+ ps -ef
+ grep -E -v '(init|grep)'
+ grep -E '(/usr|/usr/local|/opt/rke2)/bin/rke2 .*(server|agent)'
+ awk '{print $1}'
+ RKE2_PID=1484
+ '[' -z 1484 ]
+ info 'rke2 binary is running with pid 1484'
+ echo '[INFO] ' 'rke2 binary is running with pid 1484'
[INFO]  rke2 binary is running with pid 1484
+ cat /host/proc/1484/cmdline
+ awk '{print $1}'
+ head -n 1
+ RKE2_BIN_PATH=/usr/bin/rke2
+ '[' -z /usr/bin/rke2 ]
+ return
+ replace_binary
+ NEW_BINARY=/opt/rke2
+ FULL_BIN_PATH=/host/usr/bin/rke2
+ '[' '!' -f /opt/rke2 ]
[INFO]  Comparing old and new binaries
+ info 'Comparing old and new binaries'
+ echo '[INFO] ' 'Comparing old and new binaries'
+ sha256sum /opt/rke2 /host/usr/bin/rke2
+ cut '-d ' -f1
+ wc -l
+ uniq
+ BIN_COUNT=2
+ '[' 2 '==' 1 ]
+ getfilecon+  /host/usr/bin/rke2
awk '{print $2}'
[INFO]  Deploying new rke2 binary to /usr/bin/rke2
+ RKE2_CONTEXT=system_u:object_r:container_runtime_exec_t:s0
+ info 'Deploying new rke2 binary to /usr/bin/rke2'
+ echo '[INFO] ' 'Deploying new rke2 binary to /usr/bin/rke2'
+ cp /opt/rke2 /host/usr/bin/rke2
cp: can't create '/host/usr/bin/rke2': File exists
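
The trace fails at the final cp of the new binary over /host/usr/bin/rke2. My assumption about the root cause (not confirmed by the error message itself, which only reports "File exists"): with the RPM install method the binary lives in /usr/bin, which on SLE Micro sits on the read-only, transactionally managed root filesystem and is owned by the RPM, so the upgrade container cannot replace it in place. A quick check from the affected node:

$ findmnt -no OPTIONS /
$ rpm -qf /usr/bin/rke2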
mdrahman-suse added the kind/bug and kind/os-validation labels on Jan 9, 2024
mdrahman-suse (Contributor, Author) commented:

RKE2 SUC upgrade on top of an RPM-based install is not supported/recommended. This should be documented at https://docs.rke2.io/upgrade/automated_upgrade/ or somewhere else for reference.
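
For RPM-based installs on SLE Micro the upgrade would instead go through the package manager. A rough sketch of that path (the rke2-server/rke2-agent package names and the transactional-update workflow are assumptions here, verify against the RKE2 and SLE Micro docs):

# server nodes, one at a time
$ sudo transactional-update pkg update rke2-server
$ sudo reboot
# agent nodes
$ sudo transactional-update pkg update rke2-agent
$ sudo reboot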
