
wrangler.cattle.io/cisnetworkpolicy-node finalizer being left behind on newer RKE2 versions when deleting RKE2 nodes #5855

Closed
Oats87 opened this issue Apr 26, 2024 · 6 comments
@Oats87
Contributor

Oats87 commented Apr 26, 2024

Environmental Info:
RKE2 Version: v1.26.15+rke2r1

Node(s) CPU architecture, OS, and Version:
Not Applicable

Cluster Configuration:
1 Server, 2 Agents

Describe the bug:
When using RKE2 with cni: none and profile: cis-1.23 as options (and bringing your own CNI), it is no longer possible to foreground-delete nodes from the cluster after upgrading past v1.26.14+rke2r1.

Steps To Reproduce:

  • Install RKE2 v1.26.8+rke2r1

On the server node:

curl https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.8+rke2r1 sh - 
sudo cp -f /usr/local/share/rke2/rke2-cis-sysctl.conf /etc/sysctl.d/60-rke2-cis.conf
sudo systemctl restart systemd-sysctl
sudo useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
mkdir -p /etc/rancher/rke2
cat << EOF > /etc/rancher/rke2/config.yaml
cni: none
profile: cis-1.23
EOF
systemctl start rke2-server

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml; export PATH=$PATH:/var/lib/rancher/rke2/bin
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

On the agent nodes:

curl https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.8+rke2r1 sh - 
sudo cp -f /usr/local/share/rke2/rke2-cis-sysctl.conf /etc/sysctl.d/60-rke2-cis.conf
sudo systemctl restart systemd-sysctl
sudo useradd -r -c "etcd user" -s /sbin/nologin -M etcd -U
mkdir -p /etc/rancher/rke2
cat << EOF > /etc/rancher/rke2/config.yaml
server: https://<server>:9345
token: token
profile: cis-1.23
EOF
systemctl start rke2-agent

Observe that the nodes go Ready, and that deleting a node at this point works, i.e. kubectl delete node <my-node> succeeds.

Next, upgrade to v1.26.15+rke2r1. On the server node:

curl https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.15+rke2r1 sh - 
systemctl restart rke2-server

and on the agent nodes:

curl https://get.rke2.io | INSTALL_RKE2_VERSION=v1.26.15+rke2r1 sh - 
systemctl restart rke2-agent

After the cluster is back and all nodes are Ready from a node/kubelet perspective, attempt to delete a node and observe that the deletion never completes, due to an orphaned finalizer: wrangler.cattle.io/cisnetworkpolicy-node

Expected behavior:
My node deletes

Actual behavior:
Node hangs in deletion

Additional context / logs:
Looks like this regression was added with this PR: #5461

@brandond
Member

after upgrade past v1.26.14+rke2r1, it is no longer possible to foreground delete nodes from the cluster.
Looks like this regression was added with this PR: #5461

Right: previously the controller ran when it shouldn't have and added the finalizer; now it no longer runs, but the finalizer is still there. This is a fun issue with conditionally enabled controllers that add finalizers; they're hard to clean up after.

I guess we should run a quick startup check to remove node finalizers in an else case here:

rke2/pkg/rke2/rke2.go

Lines 118 to 121 in d30ec2a

cnis := clx.StringSlice("cni")
if cisMode && (len(cnis) == 0 || slice.ContainsString(cnis, "canal")) {
leaderControllers = append(leaderControllers, cisnetworkpolicy.Controller)
}
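A minimal sketch of what that else-case cleanup could look like, reduced to pure list manipulation (the function name here is hypothetical and not the rke2 implementation; an actual startup cleanup would write the result back to each node via the Kubernetes API):

```go
package main

import "fmt"

const cisFinalizer = "wrangler.cattle.io/cisnetworkpolicy-node"

// stripFinalizer returns finalizers without name, preserving order.
// This is only the slice logic; a real cleanup would list the nodes
// and patch metadata.finalizers on each one during startup.
func stripFinalizer(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	node := []string{cisFinalizer, "example.io/other-finalizer"}
	fmt.Println(stripFinalizer(node, cisFinalizer))
	// [example.io/other-finalizer]
}
```

Running this once in the else branch would make upgrades from the affected versions converge, since nodes tagged by the old controller get cleaned even though the controller itself no longer runs.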

@Oats87
Contributor Author

Oats87 commented Apr 26, 2024

On the Rancher side we traditionally either no-op an OnRemove handler (but still run the informers/"controller") or, as you've said, run logic on startup that cleans the finalizers up.

The issue here is that it is not exactly clear to a user why their node deletion is hanging, and this furthermore causes issues with provisioning, as the cluster-api core controllers are unable to remove the node.

Regardless, it's a regression and not something I think should be handled on the provisioning side, which is why I filed this issue.

@brandond
Member

brandond commented Apr 26, 2024

A PR to remove the finalizer has been opened; it should land in the May release cycle.

Unfortunately, 1.26 has been EOL for a couple of months, so the fix will only be available for v1.27+.

@caroline-suse-rancher
Contributor

We're moving this out to June due to 1.30 delays and a tight code freeze window. Please let us know if that conflicts with any plans you had, @Oats87

@riuvshyn
Copy link

riuvshyn commented May 20, 2024

Not sure if this is related, but I've noticed that after upgrading from v1.27.6+rke2r1 to v1.28.9+rke2r1, rotated workers are not being removed and are left stuck in the NotReady,SchedulingDisabled state.

The node events show that the CCM is continuously trying to remove the node:

Events:
  Type    Reason        Age                    From                             Message
  ----    ------        ----                   ----                             -------
  Normal  DeletingNode  45s (x30834 over 45h)  cloud-node-lifecycle-controller  Deleting node ip-172-23-105-147.eu-central-1.compute.internal because it does not exist in the cloud provider

I have tried to remove the node manually with kubectl delete node ..., but that also gets stuck.
Apparently it is blocked by:

  finalizers:
  - wrangler.cattle.io/cisnetworkpolicy-node

When I remove this finalizer, the node is removed automatically.

cni: cilium
profile: cis-1.23

this looks similar to #1895
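For anyone hitting this before the fix lands, the manual removal described above amounts to rewriting the node's metadata.finalizers list. A sketch of building that merge-patch body in Go (an illustrative helper, not part of rke2; in practice a kubectl edit or kubectl patch against the stuck node does the same thing):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildFinalizerPatch returns a merge-patch body that sets the node's
// finalizer list to the given list minus drop. The resulting JSON is
// what would be sent to the API server to unblock deletion.
func buildFinalizerPatch(finalizers []string, drop string) (string, error) {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != drop {
			kept = append(kept, f)
		}
	}
	body := map[string]map[string][]string{
		"metadata": {"finalizers": kept},
	}
	b, err := json.Marshal(body)
	return string(b), err
}

func main() {
	p, _ := buildFinalizerPatch(
		[]string{"wrangler.cattle.io/cisnetworkpolicy-node"},
		"wrangler.cattle.io/cisnetworkpolicy-node",
	)
	fmt.Println(p) // {"metadata":{"finalizers":[]}}
}
```

Once the finalizer list no longer contains wrangler.cattle.io/cisnetworkpolicy-node, the pending delete proceeds on its own, which matches the behavior observed above.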

@fmoral2
Contributor

fmoral2 commented Jun 12, 2024

Validated on Version:

$ rke2 version v1.30.1+dev.3aaa16c9 (3aaa16c9b17da45e9f3475ba5011ed90a49a2e42)

Environment Details

Infrastructure: Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu, AMD64

Cluster Configuration:
- 1 server node
- 2 agent nodes

Steps to validate the fix

  1. Install RKE2 with:
profile: cis-1.23
cni: none
  2. Install a CNI, e.g.: kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
  3. Delete one agent node
  4. Upgrade
  5. Delete another agent node
  6. Validate that it is deleted and not hanging
  7. Validate nodes and pods

Reproduction Issue:
$ kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

~$ k delete node ip-0.us-east-2.compute.internal 
node "ip-0.us-east-2.compute.internal" deleted

$ k get nodes
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-0.us-east-2.compute.internal    Ready    <none>                      14m   v1.26.15+rke2r1
ip-172-31-1-169.us-east-2.compute.internal    Ready    <none>                      15m   v1.26.15+rke2r1
ip-172-31-12-190.us-east-2.compute.internal   Ready    control-plane,etcd,master   18m   v1.26.15+rke2r1


The node is still there, with:

finalizers:
  - wrangler.cattle.io/cisnetworkpolicy-node

Validation Results:

Install: v1.27.6+rke2r1

 $ kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml


Upgrade to: v1.30.1+rke2r1


~$ k delete node ip-0.us-east-2.compute.internal 
node "ip-0.us-east-2.compute.internal" deleted


$ k get nodes
NAME                                          STATUS   ROLES                       AGE   VERSION
ip-1 .us-east-2.compute.internal   Ready    control-plane,etcd,master   18m    v1.30.1+rke2r1




 

@fmoral2 fmoral2 closed this as completed Jun 12, 2024