Falco crash after few minutes on GKE 1.24 #2694

Closed
jr-instantsystem opened this issue Jul 25, 2023 · 11 comments

@jr-instantsystem

Describe the bug

Hi

We are evaluating Falco on one of our clusters, and we are seeing regular restarts of every Falco container.
For instance, one container restarted ~120 times in a single night.

How to reproduce it

It is deployed as a DaemonSet using Helm chart 3.3.0 (Falco 0.35.1), on a GKE cluster running Kubernetes 1.24.
The Falco config is:

resources:
  requests:
    cpu: 10m
    memory: 1024Mi
  limits:
    cpu: 8
    memory: 1024Mi

driver:
  enabled: true
  kind: ebpf

...

falco:
  log_level: debug

  # increased to prevent the 300k+ syscall drops
  # any lower value results in syscall drops
  syscall_buf_size_preset: 10

  # reduce watched syscalls
  base_syscalls:
    custom_set: [clone, clone3, fork, vfork, execve, execveat, close, socket, bind, getsockopt, setresuid, setsid, setuid, setgid, setpgid, setresgid, setsid, capset, chdir, chroot, fchdir]

  # try to prevent the k8s_replicaset_handler_state and k8s_deployment_handler_state collection errors
  metadata_download:
    chunk_wait_us: 3000
    max_mb: 200
    watch_freq_sec: 5

After a few minutes, the container crashes (exitCode: 1); here is a container log:

2023-07-25T07:01:33.18537424Z stderr F Tue Jul 25 07:01:33 2023: Falco version: 0.35.1 (x86_64)
2023-07-25T07:01:33.185426211Z stderr F Tue Jul 25 07:01:33 2023: CLI args: /usr/bin/falco --cri /run/containerd/containerd.sock --cri /run/crio/crio.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.15.240.1 --k8s-node xxxxxxxxxxx -pk
2023-07-25T07:01:33.185434284Z stderr F Tue Jul 25 07:01:33 2023: Falco initialized with configuration file: /etc/falco/falco.yaml
2023-07-25T07:01:33.185794179Z stderr F Tue Jul 25 07:01:33 2023: Configured rules filenames:
2023-07-25T07:01:33.185806961Z stderr F Tue Jul 25 07:01:33 2023:    /etc/falco/falco_rules.yaml
2023-07-25T07:01:33.185812332Z stderr F Tue Jul 25 07:01:33 2023:    /etc/falco/rules.d
2023-07-25T07:01:33.187130175Z stderr F Tue Jul 25 07:01:33 2023: Loading rules from file /etc/falco/falco_rules.yaml
2023-07-25T07:01:33.502315519Z stderr F Tue Jul 25 07:01:33 2023: Loading rules from file /etc/falco/rules.d/rules-whitelist.yaml
2023-07-25T07:01:33.698625223Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/falco.yaml'
2023-07-25T07:01:33.698687106Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/falco_rules.yaml'
2023-07-25T07:01:33.698695396Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/rules.d/rules-whitelist.yaml'
2023-07-25T07:01:33.698700406Z stderr F Tue Jul 25 07:01:33 2023: Watching directory '/etc/falco/rules.d'
2023-07-25T07:01:33.698725517Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download max size to 200 MB
2023-07-25T07:01:33.698730491Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download chunk wait time to 3000 μs
2023-07-25T07:01:33.698734826Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download watch frequency to 5 seconds
2023-07-25T07:01:33.698855324Z stderr F Tue Jul 25 07:01:33 2023: (31) syscalls in rules: accept, accept4, connect, creat, dup, dup2, dup3, execve, execveat, finit_module, init_module, link, linkat, listen, mkdir, mkdirat, open, openat, openat2, ptrace, rename, renameat, renameat2, rmdir, setuid, socket, symlink, symlinkat, unlink, unlinkat, userfaultfd
2023-07-25T07:01:33.699040536Z stderr F Tue Jul 25 07:01:33 2023: +(20) syscalls added (base_syscalls override): bind, capset, chdir, chroot, clone, clone3, close, execve, execveat, fchdir, fork, getsockopt, setgid, setpgid, setresgid, setresuid, setsid, setuid, socket, vfork
2023-07-25T07:01:33.69917998Z stderr F Tue Jul 25 07:01:33 2023: (48) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, execve, execveat, fchdir, finit_module, fork, getsockopt, init_module, link, linkat, listen, mkdir, mkdirat, open, openat, openat2, procexit, ptrace, rename, renameat, renameat2, rmdir, setgid, setpgid, setresgid, setresuid, setsid, setuid, socket, symlink, symlinkat, unlink, unlinkat, userfaultfd, vfork
2023-07-25T07:01:33.699191005Z stderr F Tue Jul 25 07:01:33 2023: The chosen syscall buffer dimension is: 536870912 bytes (512 MBs)
2023-07-25T07:01:33.699206167Z stderr F Tue Jul 25 07:01:33 2023: Starting health webserver with threadiness 8, listening on port 8765
2023-07-25T07:01:33.699510693Z stderr F Tue Jul 25 07:01:33 2023: Loaded event sources: syscall
2023-07-25T07:01:33.69954521Z stderr F Tue Jul 25 07:01:33 2023: Enabled event sources: syscall
2023-07-25T07:01:33.699558223Z stderr F Tue Jul 25 07:01:33 2023: Opening event source 'syscall'
2023-07-25T07:01:33.699577868Z stderr F Tue Jul 25 07:01:33 2023: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-bpf.o
2023-07-25T07:03:20.748110885Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:03:20.748245086Z stdout F k8s_handler (k8s_deployment_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_deployment_handler_state): Connection closed.
2023-07-25T07:05:01.571335298Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:07:00.450648655Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:08:53.375979945Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:08:54.7734627Z stderr F Tue Jul 25 07:08:54 2023: Closing event source 'syscall'
2023-07-25T07:08:55.0192363Z stdout F Events detected: 0
2023-07-25T07:08:55.019291677Z stdout F Rule counts by severity:
2023-07-25T07:08:55.019298632Z stdout F Triggered rules by rule name:
2023-07-25T07:08:55.192851753Z stderr F Error: Socket handler (k8s_node_handler_event), error 5 (Success) while connecting to 10.15.240.1:443

Expected behaviour

No crash :)

Environment

Helm chart 3.3.0 (Falco 0.35.1) deployed as a DaemonSet, on a GKE cluster running Kubernetes 1.24.

  • Falco version: 0.35.1
  • System info:
    {
    "machine": "x86_64",
    "nodename": "falco-bk264",
    "release": "5.10.162+",
    "sysname": "Linux",
    "version": "Digwatch compiler #1 SMP Sat Mar 11 15:59:33 UTC 2023"
    }
  • Cloud provider or hardware configuration: GKE cluster running Kubernetes 1.24.
  • OS: cos_containerd
  • Kernel: Linux 5.10.162+ #1 SMP Sat Mar 11 15:59:33 UTC 2023 x86_64 GNU/Linux
  • Installation method: Helm chart 3.3.0 (falco 0.35.1)
@jasondellaluce
Contributor

cc @alacuku

@jr-instantsystem changed the title from "Falco crash after few minutes on GKE" to "Falco crash after few minutes on GKE 1.24" on Aug 4, 2023
@jr-instantsystem
Author

We have deployed the exact same setup/config on a 1.27 cluster, and everything seems fine for now.
The other difference between the two clusters is their size: the previous one runs ~1200 pods, the new one ~100, if that helps :)

@alacuku
Member

alacuku commented Aug 4, 2023

Hey @jr-instantsystem, it's a known issue. In large clusters, the k8s client does not perform well. A workaround is to disable the k8s metadata collection and only rely on fields available from the container runtimes. Please have a look here: https://falco.org/docs/reference/rules/supported-fields/#field-class-k8s
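
For reference, a minimal values override for the Helm chart would look roughly like the sketch below. This assumes the kubernetes block sits under collectors in the chart values, as in the excerpt quoted later in this thread; double-check the exact path against your chart version.

collectors:
  kubernetes:
    # Disable the connection to the Kubernetes API server; Falco then falls
    # back to container runtime metadata (pod ID, name, namespace, labels).
    enabled: false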

@alacuku
Member

alacuku commented Aug 4, 2023

Also, here is the tracking issue: #2973

@jr-instantsystem
Author

Hi @alacuku,
We disabled the metadata collection and it looks fine now, thanks!

@Andreagit97
Member

Can we close this, since it is probably a duplicate of #2973?

@jr-instantsystem
Author

Yes, we haven't had any problems since we disabled the metadata collection.

@Andreagit97
Member

Andreagit97 commented Aug 28, 2023

Thanks, we will update you when a new version of the k8s client is out.

@gold-kou

gold-kou commented Oct 5, 2023

@alacuku @jr-instantsystem

"disable the k8s metadata collection"

Can you tell me how I can disable metadata collection?

I tried the --disable-cri-async option, but the error still happens.

@Andreagit97
Member

@gold-kou if you are using the Helm chart as the installation method, you need to set kubernetes.enabled=false; see below 👇

  kubernetes:
    # -- Enable Kubernetes metadata collection via a connection to the Kubernetes API server.
    # When this option is disabled, Falco falls back to the container annotations to grab the metadata.
    # In such a case, only the ID, name, namespace, and labels of the pod will be available.
    enabled: true
    # -- The apiAuth value is to provide the authentication method Falco should use to connect to the Kubernetes API.
    # The argument's documentation from Falco is provided here for reference:
    #
    #  <bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>], --k8s-api-cert <bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>]
    #     Use the provided files names to authenticate user and (optionally) verify the K8S API server identity.
    #     Each entry must specify full (absolute, or relative to the current directory) path to the respective file.
    #     Private key password is optional (needed only if key is password protected).
    #     CA certificate is optional. For all files, only PEM file format is supported.
    #     Specifying CA certificate only is obsoleted - when single entry is provided
    #     for this option, it will be interpreted as the name of a file containing bearer token.
    #     Note that the format of this command-line option prohibits use of files whose names contain
    #     ':' or '#' characters in the file name.
    # -- Provide the authentication method Falco should use to connect to the Kubernetes API.
    apiAuth: /home/andrea/Downloads/falco-0.36.0-x86_64/token
    ## -- Provide the URL Falco should use to connect to the Kubernetes API.
    apiUrl: "https://127.0.0.1:33229"
    # -- If true, only the current node (on which Falco is running) will be considered when requesting metadata of pods
    # to the API server. Disabling this option may have a performance penalty on large clusters.
    enableNodeFilter: true
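
As a usage sketch, the same value can be set on the command line (this assumes a release named falco in the falco namespace and the collectors.kubernetes.enabled values path; adjust to your setup):

helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --reuse-values \
  --set collectors.kubernetes.enabled=false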

@gold-kou

I'm leaving a memo for anybody who doesn't use Helm.

What I did:
I deleted the -k and -K args from the /usr/bin/falco command line.

Reason:
In the Helm chart, kubernetes.enabled is used in the if statement below, so I understand that -K is a key point.
https://github.com/falcosecurity/charts/blob/master/falco/templates/pod-template.tpl#L83-L84

According to the page below, -k is also a key point.
https://github.com/falcosecurity/charts/blob/master/falco/generated/helm-values.md?plain=1#L73
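
For anyone editing the DaemonSet manifest directly, here is a sketch of the resulting container args, based on the CLI shown in the log above with only the -K and -k flags removed (the node name stays redacted as in the log, and the remaining flags may or may not be needed in your setup):

containers:
  - name: falco
    args:
      - /usr/bin/falco
      - --cri
      - /run/containerd/containerd.sock
      - --cri
      - /run/crio/crio.sock
      # -K <token-file> and -k <api-url> removed to drop the K8s API connection
      - --k8s-node
      - xxxxxxxxxxx
      - -pk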
