
Calico Helm Chart upgrade fails after upgrade from rke2 v1.28.8+rke2r1 to v1.28.12+rke2r1 / v1.29.6+rke2r1 #6633

Closed
shindebshekhar opened this issue Aug 26, 2024 · 5 comments


shindebshekhar commented Aug 26, 2024

Environmental Info:
RKE2 Version: v1.28.8+rke2r1

:~ # rke2 -v
rke2 version v1.28.8+rke2r1 (42cab2f)
go version go1.21.8 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux hostname 5.3.18-150300.59.161-default #1 SMP Thu May 9 06:59:05 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

3 master nodes, 3 worker nodes

Describe the bug:

We are trying to upgrade rke2 from v1.28.8+rke2r1 (fresh install) to v1.28.12+rke2r1 / v1.29.6+rke2r1.

After the upgrade the rke2 service comes up, but all the Helm jobs for the Calico system component fail. The Helm jobs are retriggered in a continuous loop (presumably trying to upgrade the above components).

For some reason, instead of upgrading the Calico chart, it tries to uninstall the Tigera operator CRDs and Calico CRDs. In the process it hangs because resources are still present. See the log output below for the Calico CRD job.

kubectl get crds | grep -i calico --> No result

kubectl logs job/helm-install-rke2-calico-crd -n kube-system -f

if [[ ${KUBERNETES_SERVICE_HOST} =~ .*:.* ]]; then
        echo "KUBERNETES_SERVICE_HOST is using IPv6"
        CHART="${CHART//%\{KUBERNETES_API\}%/[${KUBERNETES_SERVICE_HOST}]:${KUBERNETES_SERVICE_PORT}}"
else
        CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
fi

set +v -x
+ [[ true != \t\r\u\e ]]
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ shopt -s nullglob
+ [[ -f /config/ca-file.pem ]]
+ [[ -f /tmp/ca-file.pem ]]
+ [[ -n '' ]]
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-calico-crd.tgz.base64
+ CHART_PATH=/tmp/rke2-calico-crd.tgz
+ [[ ! -f /chart/rke2-calico-crd.tgz.base64 ]]
+ base64 -d /chart/rke2-calico-crd.tgz.base64
+ CHART=/tmp/rke2-calico-crd.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-calico-crd.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-calico-crd$' --namespace kube-system --output json
++ jq -r '"\(.[0].chart),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=rke2-calico-crd-v3.27.002,uninstalling
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ rke2-calico-crd-v3.27.002 =~ ^(|null)$ ]]
+ [[ uninstalling =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ uninstalling == \d\e\p\l\o\y\e\d ]]
+ [[ uninstalling =~ ^(deleted|failed|null|unknown)$ ]]
+ echo 'Installing helm_v3 chart'
+ helm_v3 install --set-string global.clusterCIDR=192.168.128.0/17 --set-string global.clusterCIDRv4=192.168.128.0/17 --set-string global.clusterDNS=192.168.64.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=192.168.64.0/18 rke2-calico-crd /tmp/rke2-calico-crd.tgz
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
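
For reference, the release state the job is reacting to can be checked directly from a server node (a rough equivalent of the helm_v3 ls call in the trace above; the kubeconfig path assumes a default RKE2 install):

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
# show the release in all states, including "uninstalling"
helm ls --all -n kube-system -f '^rke2-calico-crd$' -o json
# because the status is "uninstalling" rather than "deployed", the job falls through
# to a plain "helm install", which fails since the release name still exists
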
@brandond
Member

It looks like the Helm job that was upgrading the chart was interrupted partway through. The helm controller responded by trying to uninstall and reinstall the chart, but the uninstall job was also interrupted, so the chart is now stuck in the "uninstalling" status.

You might try deleting the Helm secrets for the rke2-calico-crd release, and rke2-calico as well if necessary. This should allow it to successfully reinstall the chart.
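
For example (a rough sketch; Helm v3 keeps release state in Secrets named sh.helm.release.v1.<release>.v<revision> in the release namespace, kube-system here):

# list the release secrets for the stuck chart
kubectl get secrets -n kube-system -l owner=helm,name=rke2-calico-crd
# delete them so the helm controller can reinstall from scratch
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico-crd
# repeat for rke2-calico if that release is stuck as well
kubectl delete secrets -n kube-system -l owner=helm,name=rke2-calico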

What process did you use to upgrade your cluster? We do not generally see issues with the Helm jobs being interrupted while upgrading, unless the upgrade is interrupted partway through, leaving nodes deploying conflicting component versions.

@rjchicago

Was there any recovery from this? We ran into this issue yesterday and had to restore the controller VM and etcd from snapshots.

The symptoms and logs match exactly what was posted above. We initially attempted to install the CRDs and recreate the required resources, but the calico controller continued to crash-loop.

Ultimately, the restore from snapshots worked, but we had to do it twice: after adding additional controllers, the Helm upgrade was re-triggered and we had to restart the process. We're currently running with just the one controller, which is not an ideal state.
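
For anyone recovering the same way, the etcd restore is roughly the following (a sketch only; the snapshot path and name are placeholders for whatever is on the node):

# on the server node being restored
systemctl stop rke2-server
# reset the cluster from an on-disk etcd snapshot
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
systemctl start rke2-server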

@wzrdtales

This is probably projectcalico/calico#9068, which was fixed upstream, but it will likely be quite some time before the fix becomes available in RKE2 and Rancher.

@brandond is there any way in rke2 to override the Calico version being deployed?

@brandond
Member

Calico 3.28.2 should go into next month's releases: rancher/rke2-charts#524

The issue is in the chart itself, so no, you can't just bump the version of Calico that the chart deploys. You'll need to wait for us to update the chart in RKE2.
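
For context, values of the bundled chart (not its templates) can be overridden with a HelmChartConfig manifest, which is why this particular fix has to ship as a chart update rather than a local override. A rough sketch, with an illustrative value only:

# drop a HelmChartConfig on a server node; RKE2 picks it up automatically
cat > /var/lib/rancher/rke2/server/manifests/rke2-calico-config.yaml <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-calico
  namespace: kube-system
spec:
  valuesContent: |-
    installation:
      calicoNetwork:
        mtu: 9000
EOF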

Contributor

github-actions bot commented Nov 7, 2024

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

github-actions bot closed this as not planned on Nov 22, 2024.