Race condition when rke2-windows-calico removes and creates an HNS network #5335

Closed
manuelbuil opened this issue Jan 30, 2024 · 1 comment
Labels
kind/bug Something isn't working priority/high

Comments

@manuelbuil
Contributor

manuelbuil commented Jan 30, 2024

Environmental Info:
RKE2 Version:

Node(s) CPU architecture, OS, and Version:

Cluster Configuration:

Describe the bug:

An HNS overlay network is always bound to a physical interface. When that network is created or deleted, the IPs of that interface "disappear" for several seconds. It is hard to say exactly how long, but in my experience it is somewhere between 6 and 20 seconds.

When the Windows node starts, we generate all the required config files for the Calico CNI plugin and create the "External" Calico network, which is the one with the VXLAN overlay. However, before creating that network, we delete all existing networks:

https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L284

The deleteAllNetworks() function (https://github.com/rancher/rke2/blob/master/pkg/windows/utils.go#L99) first issues a "GET" query and then removes all the returned networks. Immediately afterwards, the code creates the network. If the interface was not specified, it searches for the interface that holds the nodeIP: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L301. When that search runs, the interface may still be unavailable because the HNS overlay network was just deleted.

As a consequence, Calico fails to start and we get the error:

no interface has the ip: a.b.c.d

Internal reference: SURE-7509

Steps To Reproduce:

  • Installed RKE2:

Expected behavior:

When we create the network, we are 100% sure that the interfaces are ready.

Actual behavior:

We do not check that the interfaces are ready, which can cause problems when creating the network.

Additional context / logs:

@ShylajaDevadiga
Contributor

Validated using rke2 version v1.29.2-rc1+rke2r1

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04
Windows Server 2022

Steps to reproduce and validate, following the steps in the PR

1 - Deploy rke2 server with Calico
2 - Deploy rke2-agent on Windows
3 - Once everything is up, run Stop-Service rke2 and C:\usr\local\bin\rke2.exe agent service --delete
4 - Verify that there is at least one HNS network: get-hnsnetwork
5 - Start the rke2-agent on Windows again with debug: true (remember to remove the node first, or it will complain that the password is already there)

You should at least see the messages:

Deleting network: XXXXXX before starting calico
and

Calico is waiting for the interface with ip: XXXXXX to come back

Created a cluster with 1 server, 1 Linux agent, and 1 Windows agent

Reproduction results on rke2 version v1.29.1+rke2r1:

ubuntu@ip-172-31-6-199:~$ rke2 -v
rke2 version v1.29.1+rke2r1 (e6a0d1b4d779b0a0c73ecb2ee8ae2703d0025e6f)

ubuntu@ip-172-31-6-199:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-14-81.us-east-2.compute.internal   Ready    <none>                      3h33m   v1.29.1+rke2r1
ip-172-31-6-199.us-east-2.compute.internal   Ready    control-plane,etcd,master   3h35m   v1.29.1+rke2r1
ip-ac1f2610                                  Ready    <none>                      3h29m   v1.29.1
ubuntu@ip-172-31-6-199:~$ kubectl get nodes
NAME                                         STATUS     ROLES                       AGE     VERSION
ip-172-31-14-81.us-east-2.compute.internal   Ready      <none>                      3h33m   v1.29.1+rke2r1
ip-172-31-6-199.us-east-2.compute.internal   Ready      control-plane,etcd,master   3h35m   v1.29.1+rke2r1
ip-ac1f2610                                  NotReady   <none>                      3h29m   v1.29.1
ubuntu@ip-172-31-6-199:~$ kubectl delete nodes ip-ac1f2610
node "ip-ac1f2610" deleted

ubuntu@ip-172-31-6-199:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-14-81.us-east-2.compute.internal   Ready    <none>                      3h39m   v1.29.1+rke2r1
ip-172-31-6-199.us-east-2.compute.internal   Ready    control-plane,etcd,master   3h41m   v1.29.1+rke2r1

Unable to join the node back to the cluster
Logs

time="2024-02-17T08:32:45Z" level=info msg="Generating HNS networks, please wait"
time="2024-02-17T08:32:45Z" level=debug msg="[GET]=>[/networks/] Request : "
...
...
time="2024-02-17T08:32:45Z" level=debug msg="hcsshim::HNSNetwork::Delete id=5A56E56F-129E-4E61-A994-3F80FB72E4C6"
time="2024-02-17T08:32:45Z" level=debug msg="[DELETE]=>[/networks/5A56E56F-129E-4E61-A994-3F80FB72E4C6] Request : "
time="2024-02-17T08:32:45Z" level=debug msg="hcsshim::HNSNetwork::Delete id=61A75503-4844-4105-AE15-52A7FAA52B0C"
time="2024-02-17T08:32:45Z" level=debug msg="[DELETE]=>[/networks/61A75503-4844-4105-AE15-52A7FAA52B0C] Request : "
time="2024-02-17T08:32:46Z" level=debug msg="evaluating if the interface: Ethernet with addresses [fe80::b8c0:914b:6b99:9542/64], contains ip: 172.31.5.96"
time="2024-02-17T08:32:46Z" level=debug msg="evaluating if the interface: Loopback Pseudo-Interface 1 with addresses [::1/128 127.0.0.1/8], contains ip: 172.31.5.96"
time="2024-02-17T08:32:46Z" level=debug msg="evaluating if the interface: vEthernet (nat) with addresses [fe80::a952:3d05:c187:af97/64 192.168.80.1/20], contains ip: 172.31.5.96"
time="2024-02-17T08:32:46Z" level=fatal msg="no interface has the ip: 172.31.5.96"

Validation results on v1.29.2-rc1+rke2r1

$ rke2 -v
rke2 version v1.29.2-rc1+rke2r1 (2bb7020162174863547a0b4773b74acf6fdab71c)
go version go1.21.7 X:boringcrypto

$ kubectl get nodes
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-172-31-0-14.us-east-2.compute.internal   Ready    <none>                      6h38m   v1.29.2+rke2r1
ip-172-31-7-58.us-east-2.compute.internal   Ready    control-plane,etcd,master   6h41m   v1.29.2+rke2r1
ip-ac1f2610                                 Ready    <none>                      6h35m   v1.29.2

Stopped the service, then deleted the Windows node from the cluster following the verification steps

ubuntu@ip-172-31-7-58:~$ kubectl get nodes 
NAME                                        STATUS     ROLES                       AGE     VERSION
ip-172-31-0-14.us-east-2.compute.internal   Ready      <none>                      7h39m   v1.29.2+rke2r1
ip-172-31-7-58.us-east-2.compute.internal   Ready      control-plane,etcd,master   7h42m   v1.29.2+rke2r1
ip-ac1f2610                                 NotReady   <none>                      7h36m   v1.29.2
ubuntu@ip-172-31-7-58:~$ kubectl delete node ip-ac1f2610
node "ip-ac1f2610" deleted

Joined the Windows node back to the cluster

$ kubectl get nodes 
NAME                                        STATUS   ROLES                       AGE     VERSION
ip-172-31-0-14.us-east-2.compute.internal   Ready    <none>                      7h42m   v1.29.2+rke2r1
ip-172-31-7-58.us-east-2.compute.internal   Ready    control-plane,etcd,master   7h45m   v1.29.2+rke2r1
ip-ac1f2610                                 Ready    <none>                      103s    v1.29.2
ubuntu@ip-172-31-7-58:~$ 

Logs while joining the node to the cluster

time="2024-02-17T04:03:06Z" level=debug msg="Deleting network: Calico before starting calico"
time="2024-02-17T04:03:06Z" level=debug msg="hcsshim::HNSNetwork::Delete id=712F537D-49A9-4E34-BF3D-9D04477FC915"
time="2024-02-17T04:03:06Z" level=debug msg="[DELETE]=>[/networks/712F537D-49A9-4E34-BF3D-9D04477FC915] Request : "
time="2024-02-17T04:03:06Z" level=debug msg="Deleting network: External before starting calico"
time="2024-02-17T04:03:06Z" level=debug msg="hcsshim::HNSNetwork::Delete id=FEF14789-815C-4D3C-AB17-0B6A451245A1"
time="2024-02-17T04:03:06Z" level=debug msg="[DELETE]=>[/networks/FEF14789-815C-4D3C-AB17-0B6A451245A1] Request : "
time="2024-02-17T04:03:09Z" level=debug msg="Calico is waiting for the interface with ip: 172.31.3.120 to come back"
time="2024-02-17T04:03:09Z" level=debug msg="evaluating if the interface: Ethernet with addresses [2600:1f16:1d38:1c00:717c:8729:6160:1005/128 fe80::a932:1f02:1b24:cc8d/64 172.31.3.120/20], contains ip: 172.31.3.120"
time="2024-02-17T04:03:09Z" level=debug msg="Calico is waiting for the interface with ip: 172.31.3.120 to come back"
time="2024-02-17T04:03:09Z" level=debug msg="evaluating if the interface: Ethernet with addresses [2600:1f16:1d38:1c00:717c:8729:6160:1005/128 fe80::a932:1f02:1b24:cc8d/64 172.31.3.120/20], contains ip: 172.31.3.120"
time="2024-02-17T04:03:09Z" level=debug msg="evaluating if the interface: Ethernet with addresses [2600:1f16:1d38:1c00:717c:8729:6160:1005/128 fe80::a932:1f02:1b24:cc8d/64 172.31.3.120/20], contains ip: 172.31.3.120"
time="2024-02-17T04:03:09Z" level=info msg="Creating VXLAN network using the vxlanAdapter: Ethernet"
time="2024-02-17T04:03:09Z" level=debug msg="hcsshim::HNSNetwork::Create id="
...
...
time="2024-02-17T04:03:22Z" level=info msg="Node ip-ac1f2610 registered. Calico can start"
time="2024-02-17T04:03:22Z" level=info msg="Calico Envs: [KUBE_NETWORK=Calico.* KUBECONFIG=c:\\var\\lib\\rancher\\rke2\\agent\\calico.kubeconfig NODENAME=ip-ac1f2610 CALICO_K8S_NODE_REF=ip-ac1f2610 IP=172.31.3.120 USE_POD_CIDR=false CALICO_NODENAME_FILE=c:\\var\\lib\\rancher\\rke2\\agent\\calico_node_name CALICO_NETWORKING_BACKEND=vxlan CALICO_DATASTORE_TYPE=kubernetes IP_AUTODETECTION_METHOD=first-found VXLAN_VNI=4096]"
time="2024-02-17T04:03:22Z" level=debug msg="[GET]=>[/endpoints/] Request : "
time="2024-02-17T04:03:22Z" level=warning msg="can't find Calico_ep HNS endpoint, retrying" error="Endpoint Calico_ep not found"
time="2024-02-17T04:03:26Z" level=debug msg="Wrote ping"
time="2024-02-17T04:03:27Z" level=debug msg="[GET]=>[/endpoints/] Request : "
time="2024-02-17T04:03:27Z" level=warning msg="can't find Calico_ep HNS endpoint, retrying" error="Endpoint Calico_ep not found"
time="2024-02-17T04:03:28Z" level=error msg="Calico exited: exit status 1. Retrying"
time="2024-02-17T04:03:31Z" level=debug msg="Wrote ping"
time="2024-02-17T04:03:32Z" level=info msg="Node ip-ac1f2610 registered. Calico can start"
time="2024-02-17T04:03:32Z" level=info msg="Calico Envs: [KUBE_NETWORK=Calico.* KUBECONFIG=c:\\var\\lib\\rancher\\rke2\\agent\\calico.kubeconfig NODENAME=ip-ac1f2610 CALICO_K8S_NODE_REF=ip-ac1f2610 IP=172.31.3.120 USE_POD_CIDR=false CALICO_NODENAME_FILE=c:\\var\\lib\\rancher\\rke2\\agent\\calico_node_name CALICO_NETWORKING_BACKEND=vxlan CALICO_DATASTORE_TYPE=kubernetes IP_AUTODETECTION_METHOD=first-found VXLAN_VNI=4096]"
time="2024-02-17T04:03:32Z" level=debug msg="[GET]=>[/endpoints/] Request : "
time="2024-02-17T04:03:32Z" level=debug msg="Network Response : [{\"ID\":\"66e66236-d5ba-47ae-8e02-fc2eea252124\",\"Name\":\"Calico_ep\",\"Version\":55834574851,\"AdditionalParams\":{},\"Resources\":{\"AdditionalParams\":{},\"AllocationOrder\":1,\"Allocators\":[{\"AdditionalParams\":{},\"AllocationOrder\":0,\"CA\":\"10.42.96.130\",\"CAV6\":\"\",\"Flags\":0,\"Health\":{\"LastErrorCode\":0,\"LastUpdateTime\":133526162124092237},\"ID\":\"151E82A0-D0F0-4AFD-886A-4CF8805FB8C3\",\"IsLocal\":false,\"IsPolicy\":true,\"PA\":\"172.31.3.120\",\"PAV6\":\"\",\"State\":3,\"Tag\":\"VNET Policy\"}],\"CompartmentOperationTime\":0,\"Flags\":0,\"Health\":{\"LastErrorCode\":0,\"LastUpdateTime\":133526162124092237},\"ID\":\"4B64DBDE-08CF-410F-85AD-31EED25D7036\",\"PortOperationTime\":0,\"State\":1,\"SwitchOperationTime\":0,\"VfpOperationTime\":0,\"parentId\":\"AA5F04B2-C3AD-4090-96DB-2C42C91D0388\"},\"State\":1,\"VirtualNetwork\":\"98e8d347-7a29-4e1c-8296-457140fad1b3\",\"VirtualNetworkName\":\"Calico\",\"Policies\":[{\"PA\":\"172.31.3.120\",\"Type\":\"PA\"}],\"IsRemoteEndpoint\":true,\"MacAddress\":\"<REDACTED>\",\"IPAddress\":\"10.42.96.130\",\"EncapOverhead\":50,\"SharedContainers\":[]}]"
time="2024-02-17T04:03:32Z" level=info msg="Reserved VIP for kube-proxy: 10.42.96.130"
