Falco restarts periodically #2476

Closed
tspearconquest opened this issue Apr 6, 2023 · 5 comments

@tspearconquest
Contributor

tspearconquest commented Apr 6, 2023

Describe the bug
Falco is restarting periodically.

We've been running Falco on AKS since sometime in mid-2021, and we keep it up to date with the latest releases. For a long time, it ran without issues. I'm not certain exactly when this started happening, but it has been maybe 6 months since I last looked closely at the Falco deployment for issues.

I have set up our Falco DaemonSet to use the terminationMessagePolicy field in the Kubernetes manifest, so the outputs below come from kubectl describe on the pod. The Message: field contains the last several lines of Falco's stdout before it exited.
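
For reference, the relevant part of the container spec looks roughly like this (a trimmed-down sketch of the idea, not our exact manifest):

    # excerpt from the falco daemonset pod spec (illustrative)
    containers:
      - name: falco
        # use the tail of the container log as the termination message
        # when the container exits with an error
        terminationMessagePolicy: FallbackToLogsOnError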

How to reproduce it
Install falco from the helm chart
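
For example, something along these lines (the namespace is arbitrary, and this isn't our exact invocation):

    helm repo add falcosecurity https://falcosecurity.github.io/charts
    helm repo update
    helm install falco falcosecurity/falco --namespace falco --create-namespace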

Expected behaviour
Falco should run without issues

Screenshots

Environment

  • Falco version: 0.34.1
  • System info:
  • Cloud provider or hardware configuration: AKS Kubernetes 1.24
  • OS: Ubuntu 18.04
  • Kernel: 5.4.x
  • Installation method: Kubernetes

Additional context

In most cases when this occurs, we can see that Falco exited with exit code 1, which looks similar to what happens when a container fails its health probe and is restarted by the kubelet:

    State:       Running
      Started:   Wed, 05 Apr 2023 21:40:52 -0500
    Last State:  Terminated
      Reason:    Error
      Message:   _fields": {"container.id":"d9ea88071e64","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680748851073837812,"fd.name":"10.200.44.249:44562->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-xwzvb","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":41403}}
{"hostname":"redacted","output":"2023-04-06T02:40:51.101170239+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk pid=41403 k8s.ns=falco k8s.pod=falco-xwzvb container=d9ea88071e64 image=registry.gitlab.com/my-repo/falco/falco:v0.34.1-36 connection=10.200.44.249:44568->10.200.128.1:443)","priority":"Notice","rule":"Contact K8S API Server From Container","source":"syscall","tags":["T1565","container","k8s","mitre_discovery","network"],"time":"2023-04-06T02:40:51.101170239Z", "output_fields": {"container.id":"d9ea88071e64","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680748851101170239,"fd.name":"10.200.44.249:44568->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-xwzvb","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":41403}}
Events detected: 17882
Rule counts by severity:
   NOTICE: 17874
   INFO: 8
Triggered rules by rule name:
   Launch Privileged Container: 8
   Non sudo setuid: 12584
   Contact K8S API Server From Container: 5290

      Exit Code:    1
      Started:      Wed, 05 Apr 2023 20:42:40 -0500
      Finished:     Wed, 05 Apr 2023 21:40:50 -0500
    Ready:          True
    Restart Count:  3

However, tonight I found one instance where Falco exited with code 139 (SIGSEGV):

    State:       Running
      Started:   Wed, 05 Apr 2023 21:48:23 -0500
    Last State:  Terminated
      Reason:    Error
      Message:   69937+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk pid=2231265 k8s.ns=falco k8s.pod=falco-wbrmv container=abe83f8f99d8 image=registry.gitlab.com/my-repo/falco/falco:v0.34.1-36 connection=10.200.45.242:57688->10.200.128.1:443)","priority":"Notice","rule":"Contact K8S API Server From Container","source":"syscall","tags":["T1565","container","k8s","mitre_discovery","network"],"time":"2023-04-06T02:48:19.186769937Z", "output_fields": {"container.id":"abe83f8f99d8","container.image.repository":"registry.gitlab.com/my-repo/falco/falco","container.image.tag":"v0.34.1-36","evt.time.iso8601":1680749299186769937,"fd.name":"10.200.45.242:57688->10.200.128.1:443","k8s.ns.name":"falco","k8s.pod.name":"falco-wbrmv","proc.cmdline":"falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-node redacted -pk","proc.pid":2231265}}
2023-04-06T02:48:21+0000: An error occurred in an event source, forcing termination...
2023-04-06T02:48:22+0000: Shutting down gRPC server. Waiting until external connections are closed by clients
2023-04-06T02:48:22+0000: Waiting for the gRPC threads to complete
2023-04-06T02:48:22+0000: grpc: assertion failed: grpc_server_request_registered_call( server_->server(), registered_method, &call_, &context_->deadline_, context_->client_metadata_.arr(), payload, call_cq_->cq(), notification_cq->cq(), this) == GRPC_CALL_OK
{"hostname":"redacted","output":"2023-04-06T02:48:21.180655889+0000: Notice Unexpected connection to K8s API Server from container (command=falco /usr/bin/falco --cri /run/containerd/containerd.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.200.128.1 --k8s-n
      Exit Code:    139
      Started:      Wed, 05 Apr 2023 20:19:42 -0500
      Finished:     Wed, 05 Apr 2023 21:48:22 -0500
    Ready:          True
    Restart Count:  3
@tspearconquest
Contributor Author

tspearconquest commented Apr 6, 2023

While posting this to Slack, I thought I should add some additional info here.

This is reproducible on all 3 of my clusters (test, dev, and prod). All 3 are in AKS running Ubuntu 18.04 and Kubernetes 1.24. I don't know exactly when this started happening; we've been doing a major re-architecture of the environment and a Kubernetes cluster upgrade for the last several months. The last time I looked closely at Falco, we were on v0.31.1 in September 2022, and it was not having these issues. It was running out of memory (due to improperly tuned memory limits); however, that was also on an old cluster architecture (AKS K8s 1.22 with an entirely different application namespace configuration).

We completed the re-architecture and cluster upgrade recently, and when I got some time to start fixing those issues I discovered that Falco is no longer running out of memory but is instead exiting. However, there have been a number of changes between when we noticed the out-of-memory restarts and when we discovered what is happening now, where Falco exits with exit code 1.

We've upgraded Falco to 0.33.0 in October, 0.34.0 in February, and 0.34.1 in March; upgraded the cluster from K8s 1.22 to 1.24; and disabled automounting the service account in favor of the Bound Service Account Token Volume Projection feature of K8s 1.22 and 1.24.

The prod cluster has Falco deployed from flat Kubernetes manifests, which were created by running helm template on the chart (before chart version 2.0), adjusting the output, and then deploying it. This was how we got the custom rules onto prod. For the past two weeks I've been working to replace that deployment process with one that uses Helm to install directly to the cluster, and also on a new image build process; so the dev and test clusters have Falco deployed from Helm with the completely redesigned image and the latest chart version, while the prod cluster uses the new image but is still deployed from kube manifests. The restarts have been happening for at least a month, and despite all of the changes from moving to Helm and redesigning our internal image build process, all 3 clusters still reproduce the issue reliably.

I would like to provide the config details privately to facilitate debugging the issue. Please let me know if that is possible somehow.

@tspearconquest
Contributor Author

I forgot to mention that the dev and test clusters currently don't run any custom rules, but the prod cluster does (its rule set is based on an older one). So the rules themselves can probably be ruled out: since the test and dev clusters only run the rules shipped with the Helm chart and still hit the issue, it is unlikely that either the old rules on prod or the default rules on dev and test are causing this.

@jasondellaluce
Contributor

cc @alacuku (since this potentially relates to the charts).

Is there any chance you could collect the complete stderr/stdout of one Falco pod at termination time? That would give more complete information.
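
For example, after a restart the previous container's output can usually be pulled with something like this (pod name is just a placeholder):

    kubectl logs -n falco <falco-pod> --previous --timestamps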

@tspearconquest
Contributor Author

Yes, I'll try to collect it today

@tspearconquest
Contributor Author

Closing in favor of #2485
