
Missed events in a short-lived container running as a CronJob #2013

Closed
abroglesc opened this issue May 23, 2022 · 34 comments

Comments

@abroglesc

Describe the bug

Falco didn't capture event(s) in a container running as a CronJob.

How to reproduce it

Create a shell script similar to test.sh:

#!/bin/bash

set -euxo pipefail

touch "/etc/test.txt"

Use a CronJob to execute test.sh:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: falco-test-cronjob
            image: # YOUR IMAGE
            command: ["/bin/bash", "test.sh"]
            volumeMounts:
            - mountPath: /host
              name: host-mount
            securityContext:
              privileged: true
          restartPolicy: OnFailure
          serviceAccountName: random-sa-name
          volumes:
          - name: host-mount
            hostPath:
              path: /
              type: Directory
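
To trigger a run immediately instead of waiting for the schedule, something like the following can be used (the manifest filename is an assumption; the CronJob name matches the manifest above):

# Apply the CronJob manifest, then start a one-off Job from it on demand.
kubectl apply -f test-cronjob.yaml
kubectl create job --from=cronjob/test-cronjob test-cronjob-manual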

Expected behaviour

The rule set in use is correct; this was confirmed by observing the same types of events for containers that run longer than the container used for this test case (running test.sh).

One would expect to observe (at least):

  • creation of a new process
execve("/usr/bin/touch", ["touch", "/etc/test.txt"], 0x7fff176111d8 /* 60 vars */) = 0
  • opening and possibly creation of a file
openat(AT_FDCWD, "/etc/test.txt", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3

Environment

  • System info:
# falco --support | jq .system_info
Mon May 23 16:28:03 2022: Falco version 0.31.1
Mon May 23 16:28:03 2022: Falco initialized with configuration file /etc/falco/falco.yaml
Mon May 23 16:28:03 2022: Loading rules from file /rules/falco_rules.yaml:
{
  "machine": "x86_64",
  "nodename": "redacted",
  "release": "5.4.170+",
  "sysname": "Linux",
  "version": "#1 SMP Wed Mar 23 10:13:41 PDT 2022"
}
  • Cloud provider or hardware configuration:
    GKE

  • OS:

# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux gke-k8s-{redacted} 5.4.170+ #1 SMP Wed Mar 23 10:13:41 PDT 2022 x86_64 GNU/Linux
  • Installation method:
    Kubernetes; part of a DaemonSet running in a non-privileged container.

Additional context

Introducing a delay (e.g., sleep 10) before the touch "/etc/test.txt" call eliminates the issue; a variant of test.sh with that workaround is sketched below.
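
For reference, the workaround variant of test.sh looks like this (only the added sleep differs from the script above; the 10-second value is just the example delay mentioned):

#!/bin/bash

set -euxo pipefail

# Give Falco's asynchronous metadata lookups a head start before the
# monitored syscall happens.
sleep 10

touch "/etc/test.txt"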

@leogr
Member

leogr commented May 24, 2022

Could you try disabling the K8s metadata support by removing -k/-K?
(just for debugging purposes)

Thanks in advance.
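
For context, a minimal sketch of how -k/-K usually appear on the Falco 0.31.x command line inside a DaemonSet (the API server URL and token path are assumptions based on common in-cluster defaults); disabling the metadata support for this test just means dropping both flag pairs:

# Typical in-cluster invocation with K8s metadata support enabled.
falco -c /etc/falco/falco.yaml \
  -k "https://${KUBERNETES_SERVICE_HOST}" \
  -K /var/run/secrets/kubernetes.io/serviceaccount/token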

@cpungasc

Could you try disabling the K8s metadata support by removing -k/-K? (just for debugging purposes)

Thanks in advance.

Thanks for getting back so fast.

I followed your advice and disabled -k and -K. After 3 manual tests, the events were successfully captured.

We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, does it silently drop the events? Can we log the failed state or otherwise be made aware of it, please?

Thank you!

@leogr
Member

leogr commented May 25, 2022

I had a suspicion, and you confirmed it :)
However, I'm not yet sure what the root cause is.

We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, does it silently drop the events? Can we log the failed state or otherwise be made aware of it, please?

Falco should not drop in this case. When some metadata is missing, the expected behavior is to emit the alert without the missing metadata.

The cause of the wrong behavior may be either:

  • Falco is actually dropping the event (because, for some reason, processing that event takes too long) and the syscall drop threshold setting hides the drop notification (could you try with threshold: 0? You should then see all drop notifications)

OR

  • Libsinsp has a bug that skips the event when it is not able to fetch K8s metadata in time.

My guess is the first. Let me know if you can try. Either way, I think we have to address this issue.

/cc @jasondellaluce
/cc @FedeDP
/cc @Andreagit97

@FedeDP
Contributor

FedeDP commented May 25, 2022

Yep, I agree with Leo: I'd try with threshold: 0 to see if the event is actually dropped.
Otherwise, we might have a bug in sinsp!

@cpungasc

cpungasc commented May 25, 2022

I have tried with the following setup/changes: I re-enabled -k/-K and set the syscall_event_drops.threshold value to 0 (as you can see below).

root@gke-k8s-runtime-dev--{redacted}:# grep -i -A 5 -B 5 "threshold" falco.yaml
# token bucket. The rate corresponds to one message every 30 seconds
# with a burst of one message (by default).
#
# The messages are emitted when the percentage of dropped system calls
# with respect the number of events in the last second
# is greater than the given threshold (a double in the range [0, 1]).
#
# For debugging/testing it is possible to simulate the drops using
# the `simulate_drops: true`. In this case the threshold does not apply.

syscall_event_drops:
  threshold: 0
  actions:
    - log
    - alert
  rate: .03333
  max_burst: 10

I left the cluster running for more than 15 minutes (with the CronJob executing every 3 minutes). The last log lines from Falco remained:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Please let me know if I should look somewhere else or apply any other changes. Thank you!

@leogr
Member

leogr commented May 30, 2022

  • Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
    Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
    Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
    Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Hey @cpungasc

If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report 🙏

Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

@FedeDP
Contributor

FedeDP commented May 30, 2022

Hi!
Can you post the rule you are expecting to see triggered?
Thank you for the support!

@cpungasc

  • Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
    Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
    Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
    Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Hey @cpungasc

If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report 🙏

Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

Hey @leogr

Sure, let me know if I can help in any other way.
Regarding the redacted part: we have a fork where we have built some additional component(s) for rules management. The only modification I am aware of is in the JSON formatter for Falco, but it is not something that should cause this issue. Let me know if you have doubts and would like more detail.
Thanks for the question, I should have given context around that.

@cpungasc

Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!

Hi @FedeDP

Yes, here are the relevant rules:

- condition: (evt.type=open or evt.type=openat) and evt.is_open_read=true and fd.typechar='f'
    and fd.num>=0
  macro: open_read

- condition: open_read
  desc: A file was accessed within the container.
  output: '%evt.time,%evt.datetime,%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name,%proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name,%fd.directory,%fd.filename,%fd.containerdirectory'
  priority: INFO
  rule: File read monitor

- rule: Process Creation
  condition: (evt.type = execve or syscall.type = execve or evt.type = fork or evt.type = clone)
  desc: A process was spawned within the container.
  output: "%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name %proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name"
  priority: INFO

Hope this helps.

@FedeDP
Contributor

FedeDP commented May 30, 2022

Hi!
Given that no container/K8s-related data is used in the conditions, the missing metadata is not the cause of the missed events.
Most probably, you are really seeing event drops!

A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.
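
A quick way to check for that behavior (alerts emitted with <NA> fields rather than being dropped) is something like the sketch below; the DaemonSet name and namespace are assumptions:

# Look for alerts whose k8s/container fields were still unresolved when
# the rule fired.
kubectl logs daemonset/falco -n falco | grep -F '<NA>' | head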

In the following output:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

(i.e., the test run with syscall_event_drops.threshold: 0):
what was the output once you Ctrl-C'd Falco?

@cpungasc

Hi! Given that no container/K8s-related data is used in the conditions, the missing metadata is not the cause of the missed events. Most probably, you are really seeing event drops!

A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.

In the following output:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

(i.e., the test run with syscall_event_drops.threshold: 0): what was the output once you Ctrl-C'd Falco?

Ran the test once more. The output was:

Mon May 30 14:44:19 2022: Falco initialized with configuration file /falco.yaml
Mon May 30 14:44:19 2022: Loading rules from file /rules/falco_rules.yaml:
Mon May 30 15:17:08 2022: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 318 occurrences
   - num times actions taken: 75
Events detected: 186143
Rule counts by severity:
   INFO: 186143
Triggered rules by rule name:
   New network connection: 163597
   Process Creation: 22410
   File write monitor: 136

@FedeDP
Contributor

FedeDP commented May 30, 2022

OK, so we got some event drops.
Weird things off the top of my head:

  • it seems weird that we always drop events for the exact timeframe of the cronjob
  • why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...

@leogr any thought?

@leogr
Member

leogr commented Jun 1, 2022

OK, so we got some event drops. Weird things off the top of my head:

  • it seems weird that we always drop events for the exact timeframe of the cronjob
  • why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...

@leogr any thought?

My guess is that the K8s support is either blocking the main thread for a short amount of time, OR pushing the Falco process over 100% CPU usage so that the kernel immediately switches to another process. Both situations can make Falco drop events.

We definitely need to investigate.
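
One rough way to check that hypothesis is to watch Falco's CPU usage around a CronJob run, for example (namespace and label selector are assumptions, and metrics-server must be available):

# Per-container CPU/memory for the Falco pods; run it while the CronJob fires.
kubectl top pod -n falco -l app=falco --containers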

@poiana
Contributor

poiana commented Aug 30, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@leogr
Member

leogr commented Sep 1, 2022

/remove-lifecycle stale
cc @alacuku

@poiana
Contributor

poiana commented Nov 30, 2022

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@alacuku
Member

alacuku commented Nov 30, 2022

/remove-lifecycle stale

@leogr
Member

leogr commented Dec 16, 2022

cc @falcosecurity/falco-maintainers any updates?

@FedeDP
Contributor

FedeDP commented Dec 16, 2022

I'd tag @deepskyblue86, who did lots of research and fixes on short-lived containers, right? (Thank you!)

@deepskyblue86
Member

Actually, I haven't experienced any drops so far; I've just been dealing with container creation and container metadata lookup in libs.
I'm not familiar with the k8s handling in Falco.

@poiana
Contributor

poiana commented Mar 16, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@alacuku
Member

alacuku commented Mar 17, 2023

/remove-lifecycle stale

@poiana
Contributor

poiana commented Jun 17, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

@poiana
Contributor

poiana commented Jul 17, 2023

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

@poiana
Contributor

poiana commented Aug 16, 2023

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana
Contributor

poiana commented Aug 16, 2023

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poiana closed this as completed on Aug 16, 2023
@abroglesc
Author

/reopen

@poiana reopened this on Aug 16, 2023
@poiana
Contributor

poiana commented Aug 16, 2023

@abroglesc: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Andreagit97
Member

Working on that here #2973

@Andreagit97 modified the milestones: 0.36.0, TBD on Aug 31, 2023
@poiana
Contributor

poiana commented Sep 30, 2023

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

@poiana closed this as completed on Sep 30, 2023
@poiana
Contributor

poiana commented Sep 30, 2023

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Andreagit97
Member

/remove-lifecycle rotten

@Andreagit97
Member

The new k8smeta plugin should solve this issue; give it a try with our dedicated Helm chart and Falco 0.36.2: #2973 (comment)
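
For anyone who wants to try it, a rough sketch of the Helm-based setup follows; the collectors.kubernetes.enabled value is an assumption about the chart's values at the time of writing, so check the chart README for the exact key:

# Add the Falco charts repo and install Falco with the k8smeta collector enabled.
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true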

@Andreagit97
Member

This should be solved by Falco 0.37.0! Feel free to reopen if this is still an issue.
