Race condition when rke2-windows-calico removes and creates an HNS network #5335
Comments
Validated using rke2 version v1.29.2-rc1+rke2r1
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Steps to Reproduce and Validate: followed the steps in the PR
Created a cluster with 1 server, 1 linux agent, 1 windows agent
Reproduction results on rke2 version v1.29.1+rke2r1:
Unable to join the node back to the cluster
Validation results on v1.29.2-rc1+rke2r1
Deleted the windows node from the cluster following the verification steps after stopping the service
Joined the windows node back to the cluster
Logs while joining the node to the cluster
Environmental Info:
RKE2 Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Describe the bug:
An HNS overlay network is always bound to a physical interface. When that network is created or deleted, the IPs of that interface "disappear" for some seconds. It is hard to say exactly how long, but in my experience it is somewhere between 6 and 20 seconds.
When the windows node is starting, we generate all the required config files for the Calico CNI plugin and create the "External" Calico network, which is the one with the vxlan overlay. However, before creating that network, we delete all existing networks:
https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L284
The deleteAllNetworks() function (https://github.com/rancher/rke2/blob/master/pkg/windows/utils.go#L99) first issues a "GET" query and then removes all the returned networks. Immediately afterwards, the code creates the network. If the interface was not specified, it searches for the interface with the nodeIP: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L301. It is possible that, when the search for the interface happens, the interface is still unavailable because the HNS overlay network was just deleted. As a consequence, calico fails to start and we get the error:
Internal reference: SURE-7509
Steps To Reproduce:
Expected behavior:
When we create the network, we are 100% sure that the interfaces are ready.
Actual behavior:
We are not checking that the interfaces are ready before creating the network, so the interface lookup can fail while the IPs are still unavailable.
Additional context / logs: