
Missed events in a short-lived container running as a CronJob #2013

Closed
abroglesc opened this issue May 23, 2022 · 34 comments

Comments

@abroglesc

Describe the bug

Falco didn't capture event(s) in a container running as a CronJob.

How to reproduce it

Create a shell script similar to test.sh:

#!/bin/bash

set -euxo pipefail

touch "/etc/test.txt"

Use a CronJob to execute test.sh:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: falco-test-cronjob
            image: # YOUR IMAGE
            command: ["/bin/bash", "test.sh"]
            volumeMounts:
            - mountPath: /host
              name: host-mount
            securityContext:
              privileged: true
          restartPolicy: OnFailure
          serviceAccountName: random-sa-name
          volumes:
          - name: host-mount
            hostPath:
              path: /
              type: Directory
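
To trigger a run immediately instead of waiting for the schedule, something like the following can be used (the manifest filename is an assumption; the CronJob name matches the manifest above):

# Apply the CronJob manifest, then start a one-off Job from it on demand.
kubectl apply -f test-cronjob.yaml
kubectl create job --from=cronjob/test-cronjob test-cronjob-manual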

Expected behaviour

The rule set in use is correct; this was confirmed by observing the same types of events for containers that run longer than the container used for this test case (running test.sh).

One would expect to observe (at least):

  • creation of a new process
execve("/usr/bin/touch", ["touch", "/etc/test.txt"], 0x7fff176111d8 /* 60 vars */) = 0
  • opening and possibly creation of a file
openat(AT_FDCWD, "/etc/test.txt", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3

Environment

  • System info:
# falco --support | jq .system_info
Mon May 23 16:28:03 2022: Falco version 0.31.1
Mon May 23 16:28:03 2022: Falco initialized with configuration file /etc/falco/falco.yaml
Mon May 23 16:28:03 2022: Loading rules from file /rules/falco_rules.yaml:
{
  "machine": "x86_64",
  "nodename": "redacted",
  "release": "5.4.170+",
  "sysname": "Linux",
  "version": "#1 SMP Wed Mar 23 10:13:41 PDT 2022"
}
  • Cloud provider or hardware configuration:
    GKE

  • OS:

# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux gke-k8s-{redacted} 5.4.170+ #1 SMP Wed Mar 23 10:13:41 PDT 2022 x86_64 GNU/Linux
  • Installation method:
    Kubernetes; part of a DaemonSet running in a non-privileged container.

Additional context

Introducing a delay (e.g., sleep 10) before the touch "/etc/test.txt" call eliminates the issue; a variant of test.sh with that workaround is sketched below.
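
For reference, the workaround variant of test.sh looks like this (only the added sleep differs from the script above; the 10-second value is just the example delay mentioned):

#!/bin/bash

set -euxo pipefail

# Give Falco's asynchronous metadata lookups a head start before the
# monitored syscall happens.
sleep 10

touch "/etc/test.txt"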

@leogr
Member

leogr commented May 24, 2022

Could you try disabling the K8s metadata support by removing -k/-K?
(just for debugging purposes)

Thanks in advance.
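
For context, a minimal sketch of how -k/-K usually appear on the Falco 0.31.x command line inside a DaemonSet (the API server URL and token path are assumptions based on common in-cluster defaults); disabling the metadata support for this test just means dropping both flag pairs:

# Typical in-cluster invocation with K8s metadata support enabled.
falco -c /etc/falco/falco.yaml \
  -k "https://${KUBERNETES_SERVICE_HOST}" \
  -K /var/run/secrets/kubernetes.io/serviceaccount/token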

@cpungasc

Could you try disabling the K8s metadata support by removing -k/-K? (just for debugging purposes)

Thanks in advance.

Thanks for getting back so fast.

I followed your advice and disabled -k and -K. After 3 manual tests, the events were successfully captured.

We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, does it silently drop the events? Can we log the failed state or otherwise be made aware of it, please?

Thank you!

@leogr
Member

leogr commented May 25, 2022

I had a suspicion, and you confirmed it :)
However, I'm not yet sure what the root cause is.

We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, does it silently drop the events? Can we log the failed state or otherwise be made aware of it, please?

Falco should not drop in this case. When some metadata is missing, the expected behavior is to emit the alert without the missing metadata.

The cause of the wrong behavior may be either:

  • Falco is actually dropping the event (because, for some reason, processing that event takes too long) and the syscall drop threshold setting hides the drop notification (could you try with threshold: 0? You should then see all drop notifications)

OR

  • Libsinsp has a bug that skips the event when it is not able to fetch K8s metadata in time.

My guess is the first. Let me know if you can try. Either way, I think we have to address this issue.

/cc @jasondellaluce
/cc @FedeDP
/cc @Andreagit97

@FedeDP
Contributor

FedeDP commented May 25, 2022

Yep, I agree with Leo: I'd try with threshold: 0 to see if the event is actually dropped.
Otherwise, we might have a bug in sinsp!

@cpungasc

cpungasc commented May 25, 2022

I have tried with the following setup/changes: I re-enabled -k/-K and set the syscall_event_drops.threshold value to 0 (as you can see below).

root@gke-k8s-runtime-dev--{redacted}:# grep -i -A 5 -B 5 "threshold" falco.yaml
# token bucket. The rate corresponds to one message every 30 seconds
# with a burst of one message (by default).
#
# The messages are emitted when the percentage of dropped system calls
# with respect the number of events in the last second
# is greater than the given threshold (a double in the range [0, 1]).
#
# For debugging/testing it is possible to simulate the drops using
# the `simulate_drops: true`. In this case the threshold does not apply.

syscall_event_drops:
  threshold: 0
  actions:
    - log
    - alert
  rate: .03333
  max_burst: 10

I left the cluster running for more than 15 minutes (with the CronJob executing every 3 minutes). The last log lines from Falco remained:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Please let me know if I should look somewhere else or apply any other changes. Thank you!

@leogr
Member

leogr commented May 30, 2022

  • Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
    Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
    Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
    Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Hey @cpungasc

If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report 🙏

Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

@FedeDP
Contributor

FedeDP commented May 30, 2022

Hi!
Can you post the rule you are expecting to see triggered?
Thank you for the support!

@cpungasc

  • Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
    Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
    Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
    Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Hey @cpungasc

If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report 🙏

Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

Hey @leogr

Sure, let me know if I can help in any other way.
Regarding the redacted part: we have a fork where we have built some additional component(s) for rules management. The only modification I am aware of is in the JSON formatter for Falco, but it is not something that should cause this issue. Let me know if you have doubts and would like more detail.
Thanks for the question, I should have given context around that.

@cpungasc

Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!

Hi @FedeDP

Yes, here are the relevant rules:

- condition: (evt.type=open or evt.type=openat) and evt.is_open_read=true and fd.typechar='f'
    and fd.num>=0
  macro: open_read

- condition: open_read
  desc: A file was accessed within the container.
  output: '%evt.time,%evt.datetime,%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name,%proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name,%fd.directory,%fd.filename,%fd.containerdirectory'
  priority: INFO
  rule: File read monitor

- rule: Process Creation
  condition: (evt.type = execve or syscall.type = execve or evt.type = fork or evt.type = clone)
  desc: A process was spawned within the container.
  output: "%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name %proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name"
  priority: INFO

Hope this helps.

@FedeDP
Contributor

FedeDP commented May 30, 2022

Hi!
Given that no container/K8s-related data is used in the conditions, the missing metadata is not the cause of the missed events.
Most probably, you are really seeing event drops!

A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.
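
A quick way to check for that behavior (alerts emitted with <NA> fields rather than being dropped) is something like the sketch below; the DaemonSet name and namespace are assumptions:

# Look for alerts whose k8s/container fields were still unresolved when
# the rule fired.
kubectl logs daemonset/falco -n falco | grep -F '<NA>' | head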

In the following output:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

(i.e., the test run with syscall_event_drops.threshold: 0):
what was the output once you Ctrl-C'd Falco?

@cpungasc

Hi! Given that no container/K8s-related data is used in the conditions, the missing metadata is not the cause of the missed events. Most probably, you are really seeing event drops!

A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.

In the following output:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

(i.e., the test run with syscall_event_drops.threshold: 0): what was the output once you Ctrl-C'd Falco?

Ran the test once more. The output was:

Mon May 30 14:44:19 2022: Falco initialized with configuration file /falco.yaml
Mon May 30 14:44:19 2022: Loading rules from file /rules/falco_rules.yaml:
Mon May 30 15:17:08 2022: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 318 occurrences
   - num times actions taken: 75
Events detected: 186143
Rule counts by severity:
   INFO: 186143
Triggered rules by rule name:
   New network connection: 163597
   Process Creation: 22410
   File write monitor: 136

@FedeDP
Contributor

FedeDP commented May 30, 2022

OK, so we got some event drops.
Weird things off the top of my head:

  • it seems weird that we always drop events for the exact timeframe of the cronjob
  • why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...

@leogr any thought?

@leogr
Member

leogr commented Jun 1, 2022

OK, so we got some event drops. Weird things off the top of my head:

  • it seems weird that we always drop events for the exact timeframe of the cronjob
  • why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...

@leogr any thought?

My guess is that the K8s support is either blocking the main thread for a short amount of time, OR pushing the Falco process over 100% CPU usage so that the kernel immediately switches to another process. Both situations can make Falco drop events.

We definitely need to investigate.
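
One rough way to check that hypothesis is to watch Falco's CPU usage around a CronJob run, for example (namespace and label selector are assumptions, and metrics-server must be available):

# Per-container CPU/memory for the Falco pods; run it while the CronJob fires.
kubectl top pod -n falco -l app=falco --containers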

@poiana
Contributor

poiana commented Aug 30, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@leogr
Member

leogr commented Sep 1, 2022

/remove-lifecycle stale
cc @alacuku

@poiana
Contributor

poiana commented Nov 30, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@alacuku
Member

alacuku commented Nov 30, 2022

/remove-lifecycle stale

@leogr
Member

leogr commented Dec 16, 2022

cc @falcosecurity/falco-maintainers any updates?

@FedeDP
Contributor

FedeDP commented Dec 16, 2022

I'd tag @deepskyblue86, who did lots of research and fixes on short-lived containers, right? (Thank you!)

@deepskyblue86
Member

Actually, I haven't experienced any drops so far; I've just been dealing with container creation and container metadata lookup in libs.
I'm not familiar with the k8s handling in Falco.

@poiana
Contributor

poiana commented Mar 16, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@alacuku
Member

alacuku commented Mar 17, 2023

/remove-lifecycle stale

@poiana
Contributor

poiana commented Jun 17, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Contributor

poiana commented Jul 17, 2023

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@poiana
Contributor

poiana commented Aug 16, 2023

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana
Contributor

poiana commented Aug 16, 2023

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana closed this as completed on Aug 16, 2023
@abroglesc
Author

/reopen

@poiana reopened this on Aug 16, 2023
@poiana
Contributor

poiana commented Aug 16, 2023

@abroglesc: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Andreagit97
Member

Working on that here #2973

@Andreagit97 modified the milestones: 0.36.0, TBD on Aug 31, 2023
@poiana
Contributor

poiana commented Sep 30, 2023

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana closed this as completed on Sep 30, 2023
@poiana
Contributor

poiana commented Sep 30, 2023

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Andreagit97
Member

/remove-lifecycle rotten

@Andreagit97
Member

The new k8smeta plugin should solve this issue; give it a try with our dedicated Helm chart and Falco 0.36.2: #2973 (comment)
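
For anyone who wants to try it, a rough sketch of the Helm-based setup follows; the collectors.kubernetes.enabled value is an assumption about the chart's values at the time of writing, so check the chart README for the exact key:

# Add the Falco charts repo and install Falco with the k8smeta collector enabled.
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true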

@Andreagit97
Member

This should be solved by Falco 0.37.0! Feel free to reopen if this is still an issue.
