Falco crash after few minutes on GKE 1.24 #2694

Closed
jr-instantsystem opened this issue Jul 25, 2023 · 11 comments

@jr-instantsystem

Describe the bug

Hi

We are evaluating Falco on one of our clusters, and we are seeing regular restarts of every Falco container.
For instance, one container restarted ~120 times in a single night.

How to reproduce it

It is deployed as a DaemonSet using Helm chart 3.3.0 (Falco 0.35.1), on a GKE cluster running Kubernetes 1.24.
The Falco config is:

resources:
  requests:
    cpu: 10m
    memory: 1024Mi
  limits:
    cpu: 8
    memory: 1024Mi

driver:
  enabled: true
  kind: ebpf

...

falco:
  log_level: debug

  # increased to prevent the 300k+ syscall drops
  # any lower value results in syscall drops
  syscall_buf_size_preset: 10

  # reduce watched syscalls
  base_syscalls:
    custom_set: [clone, clone3, fork, vfork, execve, execveat, close, socket, bind, getsockopt, setresuid, setsid, setuid, setgid, setpgid, setresgid, setsid, capset, chdir, chroot, fchdir]

  # try to prevent the k8s_replicaset_handler_state and k8s_deployment_handler_state collection errors
  metadata_download:
    chunk_wait_us: 3000
    max_mb: 200
    watch_freq_sec: 5

After a few minutes, the container crashes (exitCode: 1); here is a container log:

2023-07-25T07:01:33.18537424Z stderr F Tue Jul 25 07:01:33 2023: Falco version: 0.35.1 (x86_64)
2023-07-25T07:01:33.185426211Z stderr F Tue Jul 25 07:01:33 2023: CLI args: /usr/bin/falco --cri /run/containerd/containerd.sock --cri /run/crio/crio.sock -K /var/run/secrets/kubernetes.io/serviceaccount/token -k https://10.15.240.1 --k8s-node xxxxxxxxxxx -pk
2023-07-25T07:01:33.185434284Z stderr F Tue Jul 25 07:01:33 2023: Falco initialized with configuration file: /etc/falco/falco.yaml
2023-07-25T07:01:33.185794179Z stderr F Tue Jul 25 07:01:33 2023: Configured rules filenames:
2023-07-25T07:01:33.185806961Z stderr F Tue Jul 25 07:01:33 2023:    /etc/falco/falco_rules.yaml
2023-07-25T07:01:33.185812332Z stderr F Tue Jul 25 07:01:33 2023:    /etc/falco/rules.d
2023-07-25T07:01:33.187130175Z stderr F Tue Jul 25 07:01:33 2023: Loading rules from file /etc/falco/falco_rules.yaml
2023-07-25T07:01:33.502315519Z stderr F Tue Jul 25 07:01:33 2023: Loading rules from file /etc/falco/rules.d/rules-whitelist.yaml
2023-07-25T07:01:33.698625223Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/falco.yaml'
2023-07-25T07:01:33.698687106Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/falco_rules.yaml'
2023-07-25T07:01:33.698695396Z stderr F Tue Jul 25 07:01:33 2023: Watching file '/etc/falco/rules.d/rules-whitelist.yaml'
2023-07-25T07:01:33.698700406Z stderr F Tue Jul 25 07:01:33 2023: Watching directory '/etc/falco/rules.d'
2023-07-25T07:01:33.698725517Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download max size to 200 MB
2023-07-25T07:01:33.698730491Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download chunk wait time to 3000 μs
2023-07-25T07:01:33.698734826Z stderr F Tue Jul 25 07:01:33 2023: Setting metadata download watch frequency to 5 seconds
2023-07-25T07:01:33.698855324Z stderr F Tue Jul 25 07:01:33 2023: (31) syscalls in rules: accept, accept4, connect, creat, dup, dup2, dup3, execve, execveat, finit_module, init_module, link, linkat, listen, mkdir, mkdirat, open, openat, openat2, ptrace, rename, renameat, renameat2, rmdir, setuid, socket, symlink, symlinkat, unlink, unlinkat, userfaultfd
2023-07-25T07:01:33.699040536Z stderr F Tue Jul 25 07:01:33 2023: +(20) syscalls added (base_syscalls override): bind, capset, chdir, chroot, clone, clone3, close, execve, execveat, fchdir, fork, getsockopt, setgid, setpgid, setresgid, setresuid, setsid, setuid, socket, vfork
2023-07-25T07:01:33.69917998Z stderr F Tue Jul 25 07:01:33 2023: (48) syscalls selected in total (final set): accept, accept4, bind, capset, chdir, chroot, clone, clone3, close, connect, creat, dup, dup2, dup3, execve, execveat, fchdir, finit_module, fork, getsockopt, init_module, link, linkat, listen, mkdir, mkdirat, open, openat, openat2, procexit, ptrace, rename, renameat, renameat2, rmdir, setgid, setpgid, setresgid, setresuid, setsid, setuid, socket, symlink, symlinkat, unlink, unlinkat, userfaultfd, vfork
2023-07-25T07:01:33.699191005Z stderr F Tue Jul 25 07:01:33 2023: The chosen syscall buffer dimension is: 536870912 bytes (512 MBs)
2023-07-25T07:01:33.699206167Z stderr F Tue Jul 25 07:01:33 2023: Starting health webserver with threadiness 8, listening on port 8765
2023-07-25T07:01:33.699510693Z stderr F Tue Jul 25 07:01:33 2023: Loaded event sources: syscall
2023-07-25T07:01:33.69954521Z stderr F Tue Jul 25 07:01:33 2023: Enabled event sources: syscall
2023-07-25T07:01:33.699558223Z stderr F Tue Jul 25 07:01:33 2023: Opening event source 'syscall'
2023-07-25T07:01:33.699577868Z stderr F Tue Jul 25 07:01:33 2023: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-bpf.o
2023-07-25T07:03:20.748110885Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:03:20.748245086Z stdout F k8s_handler (k8s_deployment_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_deployment_handler_state): Connection closed.
2023-07-25T07:05:01.571335298Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:07:00.450648655Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:08:53.375979945Z stdout F k8s_handler (k8s_replicaset_handler_state::collect_data()[https://10.15.240.1] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
2023-07-25T07:08:54.7734627Z stderr F Tue Jul 25 07:08:54 2023: Closing event source 'syscall'
2023-07-25T07:08:55.0192363Z stdout F Events detected: 0
2023-07-25T07:08:55.019291677Z stdout F Rule counts by severity:
2023-07-25T07:08:55.019298632Z stdout F Triggered rules by rule name:
2023-07-25T07:08:55.192851753Z stderr F Error: Socket handler (k8s_node_handler_event), error 5 (Success) while connecting to 10.15.240.1:443

Expected behaviour

No crash :)

Environment

Helm chart 3.3.0 (Falco 0.35.1) deployed as a DaemonSet, on a GKE cluster running Kubernetes 1.24.

  • Falco version: 0.35.1
  • System info:
    {
    "machine": "x86_64",
    "nodename": "falco-bk264",
    "release": "5.10.162+",
    "sysname": "Linux",
    "version": "Digwatch compiler #1 SMP Sat Mar 11 15:59:33 UTC 2023"
    }
  • Cloud provider or hardware configuration: GKE cluster running Kubernetes 1.24.
  • OS: cos_containerd
  • Kernel: Linux 5.10.162+ #1 SMP Sat Mar 11 15:59:33 UTC 2023 x86_64 GNU/Linux
  • Installation method: Helm chart 3.3.0 (falco 0.35.1)
@jasondellaluce
Contributor

cc @alacuku

@jr-instantsystem changed the title from "Falco crash after few minutes on GKE" to "Falco crash after few minutes on GKE 1.24" on Aug 4, 2023
@jr-instantsystem
Author

We have deployed the exact same setup/config on a 1.27 cluster, and everything seems fine for now.
The other difference between the two clusters is their size: the previous one runs ~1200 pods, the new one ~100, if that helps :)

@alacuku
Member

alacuku commented Aug 4, 2023

Hey @jr-instantsystem, it's a known issue. In large clusters, the k8s client does not perform well. A workaround is to disable the k8s metadata collection and only rely on fields available from the container runtimes. Please have a look here: https://falco.org/docs/reference/rules/supported-fields/#field-class-k8s
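
For reference, a minimal values override for the Helm chart would look roughly like the sketch below. This assumes the kubernetes block sits under collectors in the chart values, as in the excerpt quoted later in this thread; double-check the exact path against your chart version.

collectors:
  kubernetes:
    # Disable the connection to the Kubernetes API server; Falco then falls
    # back to container runtime metadata (pod ID, name, namespace, labels).
    enabled: false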

@alacuku
Member

alacuku commented Aug 4, 2023

Also, here is the tracking issue: #2973

@jr-instantsystem
Author

Hi @alacuku,
We disabled the metadata collection and it looks fine now, thanks!

@Andreagit97
Member

Can we close this, since it is probably a duplicate of #2973?

@jr-instantsystem
Author

Yes, we haven't had any problems since we disabled the metadata collection.

@Andreagit97
Member

Andreagit97 commented Aug 28, 2023

Thanks, we will update you when a new version of the k8s client is out.

@gold-kou

gold-kou commented Oct 5, 2023

@alacuku @jr-instantsystem

"disable the k8s metadata collection"

Can you tell me how I can disable metadata collection?

I tried the --disable-cri-async option, but the error still happens.

@Andreagit97
Member

@gold-kou if you are using the Helm chart as the installation method, you need to set kubernetes.enabled=false; see below 👇

  kubernetes:
    # -- Enable Kubernetes metadata collection via a connection to the Kubernetes API server.
    # When this option is disabled, Falco falls back to the container annotations to grab the metadata.
    # In such a case, only the ID, name, namespace, and labels of the pod will be available.
    enabled: true
    # -- The apiAuth value is to provide the authentication method Falco should use to connect to the Kubernetes API.
    # The argument's documentation from Falco is provided here for reference:
    #
    #  <bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>], --k8s-api-cert <bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>]
    #     Use the provided files names to authenticate user and (optionally) verify the K8S API server identity.
    #     Each entry must specify full (absolute, or relative to the current directory) path to the respective file.
    #     Private key password is optional (needed only if key is password protected).
    #     CA certificate is optional. For all files, only PEM file format is supported.
    #     Specifying CA certificate only is obsoleted - when single entry is provided
    #     for this option, it will be interpreted as the name of a file containing bearer token.
    #     Note that the format of this command-line option prohibits use of files whose names contain
    #     ':' or '#' characters in the file name.
    # -- Provide the authentication method Falco should use to connect to the Kubernetes API.
    apiAuth: /home/andrea/Downloads/falco-0.36.0-x86_64/token
    ## -- Provide the URL Falco should use to connect to the Kubernetes API.
    apiUrl: "https://127.0.0.1:33229"
    # -- If true, only the current node (on which Falco is running) will be considered when requesting metadata of pods
    # to the API server. Disabling this option may have a performance penalty on large clusters.
    enableNodeFilter: true
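
As a usage sketch, the same value can be set on the command line (this assumes a release named falco in the falco namespace and the collectors.kubernetes.enabled values path; adjust to your setup):

helm upgrade falco falcosecurity/falco \
  --namespace falco \
  --reuse-values \
  --set collectors.kubernetes.enabled=false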

@gold-kou

I'm leaving a memo for anybody who doesn't use Helm.

What I did:
I deleted the -k and -K args from the /usr/bin/falco command line.

Reason:
In the Helm chart, kubernetes.enabled is used in the if statement below, so I understand that -K is a key point.
https://github.com/falcosecurity/charts/blob/master/falco/templates/pod-template.tpl#L83-L84

According to the page below, -k is also a key point.
https://github.com/falcosecurity/charts/blob/master/falco/generated/helm-values.md?plain=1#L73
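
For anyone editing the DaemonSet manifest directly, here is a sketch of the resulting container args, based on the CLI shown in the log above with only the -K and -k flags removed (the node name stays redacted as in the log, and the remaining flags may or may not be needed in your setup):

containers:
  - name: falco
    args:
      - /usr/bin/falco
      - --cri
      - /run/containerd/containerd.sock
      - --cri
      - /run/crio/crio.sock
      # -K <token-file> and -k <api-url> removed to drop the K8s API connection
      - --k8s-node
      - xxxxxxxxxxx
      - -pk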
