Falco restarts periodically #2476
While posting this to Slack, I thought I should add additional info here. This is reproducible on all 3 of my clusters (test, dev, and prod). All 3 are in AKS running Ubuntu 18.04 and Kubernetes 1.24.

I don't know exactly when this started happening; we've been doing a major rearchitecture of the environment and a Kubernetes cluster upgrade for the last several months. The last time I looked closely at Falco, we were on v0.31.1 in September 2022, and it was not having these issues. It was running out of memory (due to improperly tuned memory limits), but that was also on the old cluster architecture (AKS K8s 1.22 with an entirely different application namespace configuration).

We completed the rearchitecture and cluster upgrade recently, and when I got some time to start fixing those issues I discovered that Falco is no longer running out of memory but is instead exiting. However, a number of things changed between when we noticed the out-of-memory restarts and when we discovered that Falco now exits with exit code 1: we upgraded Falco to 0.33.0 in October, 0.34.0 in February, and 0.34.1 in March; upgraded the cluster from K8s 1.22 to 1.24; and disabled automounting the service account in favor of the Bound Service Account Token Volume Projection feature of K8s 1.22 and 1.24.

The prod cluster has falco deployed from flat kubernetes manifests which were themselves created by running

I would like to provide the config details privately to facilitate debugging the issue. Please let me know if this is possible somehow?
I forgot to mention that the dev and test clusters currently don't run any custom rules, but the prod cluster does (and its rule set is based on an old rule set). So one element that can potentially be ruled out is the rules: since the test and dev clusters only run the rules from the helm chart, it is safe to assume that neither the old rules on the prod cluster nor the chart rules on the dev and test clusters are causing this.
cc @alacuku (since it potentially relates to the charts). Is there a chance you could collect the complete stderr/stdout of one Falco pod at termination time? That would give us more complete information.
Yes, I'll try to collect it today |
Closing in favor of #2485 |
Describe the bug
Falco is restarting periodically.
We've been running Falco on AKS since sometime in mid 2021, and we keep it up to date with the latest releases. For a long time, it ran without issues. I'm not certain exactly when this started happening but it has been maybe 6 months since I looked closely at the falco deployment for issues.
I have our falco setup to make use of the `terminationMessagePolicy` field on the kubernetes daemonset manifest, so the outputs below come from `kubectl describe` on the pod. The `Message:` field contains the last several lines of stdout from falco before it exited.

How to reproduce it
Install falco from the helm chart
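The `terminationMessagePolicy` setup described in the bug report can be sketched as a daemonset fragment like the following. The field names are standard Kubernetes API; the metadata, labels, and image tag are illustrative assumptions, not taken from the reporter's actual manifests:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: falco
spec:
  selector:
    matchLabels:
      app: falco
  template:
    metadata:
      labels:
        app: falco
    spec:
      containers:
        - name: falco
          image: falcosecurity/falco:0.34.1  # version mentioned in the report
          # With FallbackToLogsOnError, the kubelet copies the tail of the
          # container's log into the pod status when the container exits with
          # an error, so `kubectl describe pod` shows the last lines of stdout
          # in the Message: field after a crash.
          terminationMessagePolicy: FallbackToLogsOnError
```

With this policy set, the last output lines appear under the terminated container state in `kubectl describe pod`, which is where the exit-code snippets below were captured.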
Expected behaviour
Falco should run without issues
Screenshots
Environment
Additional context
When this occurs, in most cases, we can see that falco exited with exit code 1, which appears similar to what happens when a service fails its health probe and is restarted by the kubelet:
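The exit code and final message can also be pulled programmatically from the pod status. Below is a minimal sketch (not from the original report) that reads the `lastState.terminated` block from a pod object shaped like `kubectl get pod -o json` output; the `last_termination` helper name and the sample data are illustrative:

```python
import json


def last_termination(pod: dict, container: str = "falco"):
    """Return (exit_code, reason, message) for a container's last termination,
    or None if the container has no recorded termination state."""
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("name") != container:
            continue
        term = status.get("lastState", {}).get("terminated")
        if term:
            return term.get("exitCode"), term.get("reason"), term.get("message", "")
    return None


# Example with a minimal pod status, shaped like `kubectl get pod -o json` output:
sample = {
    "status": {
        "containerStatuses": [
            {
                "name": "falco",
                "lastState": {
                    "terminated": {
                        "exitCode": 1,
                        "reason": "Error",
                        "message": "last lines of falco stdout",
                    }
                },
            }
        ]
    }
}
print(last_termination(sample))  # -> (1, 'Error', 'last lines of falco stdout')
```

In practice you would feed it real output, e.g. `kubectl get pod <falco-pod> -o json | python3 last_exit.py`. An exit code of 1 here is Falco exiting on its own; 139 (128 + 11) indicates termination by SIGSEGV.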
However, tonight I found one instance where falco exited with code 139 (SIGSEGV):