Remote Pod getting deleted when Virtual Kubelet Pod restarts #2748

Open
Sharathmk99 opened this issue Sep 30, 2024 · 2 comments
Labels
kind/bug Something isn't working

Comments

@Sharathmk99
Contributor

What happened:

The Virtual Kubelet restarts because of an exception (I will open a separate issue for that), and during the Virtual Kubelet startup we see that some pods in some namespaces get deleted in the remote cluster. It looks like a race condition.
Please note: not all pods in the affected namespaces are deleted.

For example, below are the Virtual Kubelet logs, grepped for the pod name (pod-123) in the namespace (namespace-123):

I0930 08:18:19.923187       1 reflector.go:317] Pod fallback reflection not yet completely initialized (item: "namespace-123/pod-123")
I0930 08:18:19.924729       1 reflector.go:317] ServiceAccount fallback reflection not yet completely initialized (item: "namespace-123/pod-123")
I0930 08:18:23.308783       1 pod.go:341] Pod "namespace-123/pod-123" successfully marked as Failed (OffloadingAborted)
I0930 08:18:23.700397       1 reflector.go:327] ServiceAccount reflection not yet completely initialized for local namespace "namespace-123" (item: "pod-123")
I0930 08:18:24.013970       1 reflector.go:327] ServiceAccount reflection not yet completely initialized for local namespace "namespace-123" (item: "pod-123")
I0930 08:18:25.502650       1 reflector.go:327] Pod reflection not yet completely initialized for local namespace "namespace-123" (item: "pod-123")
I0930 08:18:42.109607       1 secret.go:102] Skipping reflection of remote Secret "namespace-123/pod-123-token" as containing service account tokens
I0930 08:18:44.904709       1 podns.go:195] Deleting remote shadowpod "namespace-123/pod-123", since local pod "namespace-123/pod-123" has been previously rejected
I0930 08:18:44.913469       1 namespaced.go:97] Remote ShadowPod "namespace-123/pod-123" successfully deleted
I0930 08:18:44.949221       1 podns.go:199] Skipping reflection of local pod "namespace-123/pod-123" as previously rejected
I0930 08:18:47.825376       1 podns.go:199] Skipping reflection of local pod "namespace-123/pod-123" as previously rejected
I0930 08:18:48.713175       1 podns.go:199] Skipping reflection of local pod "namespace-123/pod-123" as previously rejected
I0930 08:18:48.727150       1 podns.go:199] Skipping reflection of local pod "namespace-123/pod-123" as previously rejected
I0930 08:18:48.738589       1 podns.go:199] Skipping reflection of local pod "namespace-123/pod-123" as previously rejected

In reflector.go, for some reason, it was not able to find the namespace namespace-123 and started printing "Failed to retrieve":
https://github.com/liqotech/liqo/blob/v0.10.1/pkg/virtualKubelet/reflection/generic/reflector.go#L307
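
To illustrate what those "not yet completely initialized" log lines mean, here is a minimal sketch of that kind of initialization guard. This is purely my own reconstruction of the pattern, not the actual Liqo code; the type and field names (namespacedReflector, ready) are invented.

```go
// Sketch only: a simplified reconstruction of the "not yet initialized" guard,
// not the actual Liqo reflector code. Type and method names are assumptions.
package main

import "log"

// namespacedReflector mimics a per-namespace reflector that must be
// initialized before items can be handled.
type namespacedReflector struct {
	namespace string
	ready     bool
}

// handle skips (and requeues) items while the reflector for the target
// namespace is not yet initialized. My hypothesis is that, right after a
// virtual-kubelet restart, an item can instead fall through to the "rejected"
// path if the namespace lookup fails during this window.
func (r *namespacedReflector) handle(item string) {
	if !r.ready {
		log.Printf("Pod reflection not yet completely initialized for local namespace %q (item: %q)",
			r.namespace, item)
		return // requeue and retry later
	}
	// ... normal reflection logic would run here ...
}

func main() {
	r := &namespacedReflector{namespace: "namespace-123"}
	r.handle("pod-123") // logs the "not yet completely initialized" message
}
```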

Because of that, pod.go at https://github.com/liqotech/liqo/blob/v0.10.1/pkg/virtualKubelet/reflection/workload/pod.go#L334 marked the local pod as Failed (OffloadingAborted).
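
For reference, a rough sketch of what marking a local pod as Failed with reason OffloadingAborted could look like with client-go. This is my illustration of the effect described above, not the Liqo implementation; the function name and the status message are invented.

```go
// Sketch only: an illustration, using client-go, of marking a local pod as
// Failed with reason OffloadingAborted; not the actual Liqo code.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func markPodFailed(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
	pod = pod.DeepCopy()
	pod.Status.Phase = corev1.PodFailed
	pod.Status.Reason = "OffloadingAborted" // reason seen in the logs above
	pod.Status.Message = "pod offloading aborted" // hypothetical message

	// Once the phase is Failed, the pod is treated as "rejected" and, per the
	// flow described in this issue, the corresponding ShadowPod gets deleted.
	_, err := c.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	return err
}
```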

Since the local pod was marked as Failed, podns.go at https://github.com/liqotech/liqo/blob/v0.10.1/pkg/virtualKubelet/reflection/workload/podns.go#L191 deleted the ShadowPod, which in turn deleted the pod in the remote cluster.
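
Below is my reconstruction of that deletion step, written against a controller-runtime client with an unstructured object. The ShadowPod group/version/kind, the namespace mapping, and the helper name are assumptions on my part, not taken from the Liqo sources.

```go
// Sketch only: my reconstruction of "delete the ShadowPod when the local pod
// has been rejected". The ShadowPod GVK and the 1:1 name mapping are
// assumptions, not verified against the Liqo code base.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureShadowPodAbsent deletes the ShadowPod corresponding to a local pod
// previously marked as Failed/rejected. Deleting the ShadowPod is what
// ultimately removes the pod running in the remote cluster.
func ensureShadowPodAbsent(ctx context.Context, cl client.Client, localPod *corev1.Pod, remoteNamespace string) error {
	shadow := &unstructured.Unstructured{}
	shadow.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "virtualkubelet.liqo.io", // assumed GVK for the ShadowPod CRD
		Version: "v1alpha1",
		Kind:    "ShadowPod",
	})
	shadow.SetName(localPod.Name)
	shadow.SetNamespace(remoteNamespace)

	if err := cl.Delete(ctx, shadow); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return nil
}
```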

The above flow happens only when the Virtual Kubelet pod restarts. We confirmed this by comparing the pod restart time with the pod deletion time (see the sketch below).
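
As a small aid for that correlation, here is a sketch that prints the Virtual Kubelet container's last restart window so it can be matched against the deletion timestamps in the logs. The namespace ("liqo") and the label selector are assumptions; adjust them to your deployment.

```go
// Sketch only: print the virtual-kubelet container's last restart window to
// correlate it with the remote-pod deletion timestamps. Namespace and label
// selector below are assumptions, not Liqo defaults I have verified.
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func printVirtualKubeletRestarts(ctx context.Context, c kubernetes.Interface) error {
	pods, err := c.CoreV1().Pods("liqo").List(ctx, metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=virtual-kubelet", // assumed label
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if term := cs.LastTerminationState.Terminated; term != nil {
				fmt.Printf("%s/%s: previous run ended at %s (restarts: %d)\n",
					pod.Name, cs.Name, term.FinishedAt, cs.RestartCount)
			}
			if run := cs.State.Running; run != nil {
				fmt.Printf("%s/%s: current run started at %s\n",
					pod.Name, cs.Name, run.StartedAt)
			}
		}
	}
	return nil
}
```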

What you expected to happen:

No remote pod managed by Liqo should get deleted.

How to reproduce it (as minimally and precisely as possible):

It's difficult to reproduce.

Anything else we need to know?:

Note: we have ~200 namespaces offloaded to the remote cluster and ~1500 pods reflected to the remote cluster. I'm not sure whether the scale of namespaces and pods we have is contributing to the problem.

Environment:

  • Liqo version: v0.10.1
  • Liqoctl version: v0.10.1
  • Kubernetes version (use kubectl version): 1.27
  • Cloud provider or hardware configuration: Kubeadm
  • Node image:
  • Network plugin and version:
  • Install tools:
  • Others:
@Sharathmk99 added the kind/bug label on Sep 30, 2024
@aleoli
Member

aleoli commented Oct 7, 2024

Hi @Sharathmk99!

Thanks for reporting it! We will try to reproduce it; if you have a way or scenario that makes it easy to reproduce, please share it with us.

@Sharathmk99
Contributor Author

Thank you @aleoli. I'm not able to reproduce it in a smaller cluster.

In our production cluster, it's happening very frequently. If required, we can quickly connect over a call and debug on the cluster directly.

Thanks.
