Falco restarts periodically #2476

Closed
tspearconquest opened this issue Apr 6, 2023 · 5 comments

@tspearconquest
Contributor

tspearconquest commented Apr 6, 2023

Describe the bug
Falco is restarting periodically.

We've been running Falco on AKS since sometime in mid-2021, and we keep it up to date with the latest releases. For a long time, it ran without issues. I'm not certain exactly when this started happening, but it has been maybe 6 months since I last looked closely at the Falco deployment for issues.

I have set up our Falco DaemonSet to use the terminationMessagePolicy field in the Kubernetes manifest, so the outputs below come from kubectl describe on the pod. The Message: field contains the last several lines of Falco's stdout before it exited.
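
For reference, the relevant part of the container spec looks roughly like this (a trimmed-down sketch of the idea, not our exact manifest):

    # excerpt from the falco daemonset pod spec (illustrative)
    containers:
      - name: falco
        # use the tail of the container log as the termination message
        # when the container exits with an error
        terminationMessagePolicy: FallbackToLogsOnError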

How to reproduce it
Install falco from the helm chart
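
For example, something along these lines (the namespace is arbitrary, and this isn't our exact invocation):

    helm repo add falcosecurity https://falcosecurity.github.io/charts
    helm repo update
    helm install falco falcosecurity/falco --namespace falco --create-namespace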

Expected behaviour
Falco should run without issues

Screenshots

Environment

  • Falco version: 0.34.1
  • System info:
  • Cloud provider or hardware configuration: AKS Kubernetes 1.24
  • OS: Ubuntu 18.04
  • Kernel: 5.4.x
  • Installation method: Kubernetes

Additional context

In most cases when this occurs, we can see that Falco exited with exit code 1, which looks similar to what happens when a container fails its health probe and is restarted by the kubelet:

    State:       Running
      Started:   Wed, 05 Apr 2023 21:40:52 -0500
    Last State:  Terminated
      Reason:    Error
      Message:   _fields": {"container.id":"d9ea88071e64","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680748851073837812,"fd.name":"10.200.44.249:44562->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-xwzvb","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":41403}}
{"hostname":"redacted","output":"2023-04-06T02:40:51.101170239+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk pid=41403 k8s.ns=falco k8s.pod=falco-xwzvb container=d9ea88071e64 image=registry.gitlab.com/my-repo/falco/falco:v0.34.1-36 connection=10.200.44.249:44568->10.200.128.1:443)","priority":"Notice","rule":"Contact K8S API Server From Container","source":"syscall","tags":["T1565","container","k8s","mitre_discovery","network"],"time":"2023-04-06T02:40:51.101170239Z", "output_fields": {"container.id":"d9ea88071e64","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680748851101170239,"fd.name":"10.200.44.249:44568->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-xwzvb","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":41403}}
Events detected: 17882
Rule counts by severity:
   NOTICE: 17874
   INFO: 8
Triggered rules by rule name:
   Launch Privileged Container: 8
   Non sudo setuid: 12584
   Contact K8S API Server From Container: 5290

      Exit Code:    1
      Started:      Wed, 05 Apr 2023 20:42:40 -0500
      Finished:     Wed, 05 Apr 2023 21:40:50 -0500
    Ready:          True
    Restart Count:  3

However, tonight I found one instance where Falco exited with code 139 (SIGSEGV):

    State:       Running
      Started:   Wed, 05 Apr 2023 21:48:23 -0500
    Last State:  Terminated
      Reason:    Error
      Message:   69937+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk pid=2231265 k8s.ns=falco k8s.pod=falco-wbrmv container=abe83f8f99d8 image=registry.gitlab.com/my-repo/falco/falco:v0.34.1-36 connection=10.200.45.242:57688->10.200.128.1:443)","priority":"Notice","rule":"Contact K8S API Server From Container","source":"syscall","tags":["T1565","container","k8s","mitre_discovery","network"],"time":"2023-04-06T02:48:19.186769937Z", "output_fields": {"container.id":"abe83f8f99d8","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680749299186769937,"fd.name":"10.200.45.242:57688->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-wbrmv","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":2231265}}
2023-04-06T02:48:21+0000: An error occurred in an event source, forcing termination...
2023-04-06T02:48:22+0000: Shutting down gRPC server. Waiting until external connections are closed by clients
2023-04-06T02:48:22+0000: Waiting for the gRPC threads to complete
2023-04-06T02:48:22+0000: grpc: assertion failed: grpc_server_request_registered_call( server_->server(), registered_method, &call_, &context_->deadline_, context_->client_metadata_.arr(), payload, call_cq_->cq(), notification_cq->cq(), this) == GRPC_CALL_OK
{"hostname":"redacted","output":"2023-04-06T02:48:21.180655889+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-n
      Exit Code:    139
      Started:      Wed, 05 Apr 2023 20:19:42 -0500
      Finished:     Wed, 05 Apr 2023 21:48:22 -0500
    Ready:          True
    Restart Count:  3
@tspearconquest
Contributor Author

tspearconquest commented Apr 6, 2023

While posting this to Slack, I thought I should add some additional info here.

This is reproducible on all 3 of my clusters (test, dev, and prod). All 3 are in AKS running Ubuntu 18.04 and Kubernetes 1.24. I don't know exactly when this started happening; we've been doing a major re-architecture of the environment and a Kubernetes cluster upgrade for the last several months. The last time I looked closely at Falco, we were on v0.31.1 in September 2022, and it was not having these issues. It was running out of memory (due to improperly tuned memory limits); however, that was also on an old cluster architecture (AKS K8s 1.22 with an entirely different application namespace configuration).

We completed the re-architecture and cluster upgrade recently, and when I got some time to start fixing those issues I discovered that Falco is no longer running out of memory but is instead exiting. However, there have been a number of changes between when we noticed the out-of-memory restarts and when we discovered what is happening now, where Falco exits with exit code 1.

We've upgraded Falco to 0.33.0 in October, 0.34.0 in February, and 0.34.1 in March; upgraded the cluster from K8s 1.22 to 1.24; and disabled automounting the service account in favor of the Bound Service Account Token Volume Projection feature of K8s 1.22 and 1.24.

The prod cluster has Falco deployed from flat Kubernetes manifests, which were created by running helm template on the chart (before chart version 2.0), adjusting the output, and then deploying it. This was how we got the custom rules onto prod. For the past two weeks I've been working to replace that deployment process with one that uses Helm to install directly to the cluster, and also on a new image build process; so the dev and test clusters have Falco deployed from Helm with the completely redesigned image and the latest chart version, while the prod cluster uses the new image but is still deployed from kube manifests. The restarts have been happening for at least a month, and despite all of the changes from moving to Helm and redesigning our internal image build process, all 3 clusters still reproduce the issue reliably.

I would like to provide the config details privately to facilitate debugging the issue. Please let me know if that is possible somehow.

@tspearconquest
Contributor Author

I forgot to mention that the dev and test clusters currently don't run any custom rules, but the prod cluster does (its rule set is based on an older one). So the rules themselves can probably be ruled out: since the test and dev clusters only run the rules shipped with the Helm chart and still hit the issue, it is unlikely that either the old rules on prod or the default rules on dev and test are causing this.

@jasondellaluce
Contributor

cc @alacuku (since this potentially relates to the charts).

Is there any chance you could collect the complete stderr/stdout of one Falco pod at termination time? That would give more complete information.
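
For example, after a restart the previous container's output can usually be pulled with something like this (pod name is just a placeholder):

    kubectl logs -n falco <falco-pod> --previous --timestamps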

@tspearconquest
Contributor Author

Yes, I'll try to collect it today

@tspearconquest
Contributor Author

Closing in favor of #2485
