`kubernetes_events` input seems to create high cpu usage #9787

applike-ss · 2025-01-02T10:30:21Z

Bug Report

Describe the bug
I have deployed fluent-bit via a Deployment whose only job is to gather kubernetes_events and output them somewhere.
This fluent-bit seems to have an issue where, after anywhere from a few minutes to multiple hours, the CPU usage goes to 1 (100% of one core).
The deployment only has a request of >1, no limit set, and the node has a lot of spare CPU capacity (32-core system).
My other fluent-bit instances, which gather logs and send them to the same output, do not seem to have this issue.
There are no custom parsers in custom_parsers.conf.
I use the fluent-bit Helm chart with these values:

    kind: Deployment
    autoscaling:
      vpa:
        enabled: true
    config:
      hotReload:
        enabled: true
      inputs: |
        [Input]
            Name    kubernetes_events
            db      /var/sync/db
            kube_retention_time 15m
            Tag     k8s-events
      customParsers: ""
      filters: ""
      outputs: |
        [Output]
            Name    forward
            Match    k8s-events
            Retry_Limit    5
            Host    my-external-fluentd-hostname
            Port    15000
    extraVolumes:
      - name: sync
        persistentVolumeClaim:
          claimName: fluent-bit-k8s-events-sync
    extraVolumeMounts:
      - name: sync
        mountPath: /var/sync
    image:
      tag: 3.2.4
    rbac:
      create: true
      eventsAccess: true
    replicaCount: 1
    serviceMonitor:
      enabled: true
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 1

I can also see on the node that it is fluent-bit itself causing the CPU usage and not the config watcher or hot-reload mechanism:

    # ps aux | grep fluent
    root        3751  0.0  0.0 1226304 2164 ?        Ssl  09:50   0:00 /fluent-bit/bin/fluent-bit-watcher
    root        3778  0.1  0.1 125872 19508 ?        Sl   09:50   0:02 /fluent-bit/bin/fluent-bit --enable-hot-reload -c /fluent-bit/etc/fluent-bit.conf
    root       54668 99.6  0.5 295496 92072 ?        Ssl  10:14   5:25 /fluent-bit/bin/fluent-bit --workdir=/fluent-bit/etc --config=/fluent-bit/etc/conf/fluent-bit.conf

To Reproduce

Run fluent-bit with the given chart values for some days (make sure to create a PVC named fluent-bit-k8s-events-sync first).
Observe the CPU usage.

Expected behavior
CPU usage should correlate with the number of events produced.

Screenshots

Your Environment

Version used: 3.2.4
Configuration: as can be seen above, manually create a pvc with name fluent-bit-k8s-events-sync that can be used to create the db sync
Environment name and version (e.g. Kubernetes? What version?): Kubernetes - AWS EKS v1.31.1-eks-1b3e656
Server type and version: AWS EC2 Instance
Operating System and version: Bottlerocket OS 1.29.0 (aws-k8s-1.31)
Filters and plugins: none

Additional context
It seems that fluent-bit is still processing events and writing them to the output, but I haven't checked whether they are complete.
I see this behavior across all our clusters, except those where the output is running inside the same cluster (the output's hostname is an internal Kubernetes service in this case).

Akila-I · 2025-01-20T01:55:17Z

I have been experiencing the same behaviour. It goes away when you are not using the db for checkpointing. However, it may lead to some data loss as I understand. A proper solution from maintainers for this issue is much appreciated.

applike-ss · 2025-01-20T08:01:58Z

This is indeed a workaround, but not one I would like to rely on. Thanks for sharing it though, I wasn't aware that it was caused by the db.

cm-rudolph · 2025-01-23T12:24:26Z

We are encountering the same issue and analyzed it by looking into the sqlite db. We saw that the entries don't get deleted by the cleanup code.

The CPU usage seems to be high because there is no index on the uid column that gets used, so the duplicate check (run for each processed event) takes really long once the database grows.
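To illustrate why that cost grows with the database, here is a rough sketch in C. It is not the plugin's code and not how SQLite is implemented internally: it only models an unindexed lookup as a plain loop over every stored row. The function `uid_seen` and its parameters are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a uid lookup without an index: every stored
 * row is compared until a match is found. `comparisons` reports how many
 * rows had to be inspected.
 */
static int uid_seen(const char **stored_uids, size_t row_count,
                    const char *uid, size_t *comparisons)
{
    size_t i;

    assert(stored_uids != NULL && uid != NULL && comparisons != NULL);

    *comparisons = 0;
    for (i = 0; i < row_count; i++) {
        (*comparisons)++;
        if (strcmp(stored_uids[i], uid) == 0) {
            return 1; /* duplicate: event was already processed */
        }
    }
    return 0; /* new event: all rows were inspected */
}
```

For a new event, the loop inspects every row, so if old rows are never cleaned up, each incoming event pays for the whole table.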

I guess that the bug is located in the calculation of the retention_time_ago:

fluent-bit/plugins/in_kubernetes_events/kubernetes_events.c

Line 658 in 81f62b9

    retention_time_ago = now - (ctx->retention_time);

The stored timestamp has nanosecond precision, but in line 652 the timestamp gets divided by 1 billion, so the cutoff ends up in seconds. Instead, the retention time should be multiplied by 1 billion in line 658, as it is done here:

fluent-bit/plugins/in_kubernetes_events/kubernetes_events.c

Line 305 in 81f62b9

    outdated = cfl_time_now() - (ctx->retention_time * 1000000000L);
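The suspected unit mismatch can be sketched in isolation. This is a minimal illustration, assuming the database stores nanosecond timestamps and the cleanup deletes rows older than a cutoff. The helper names (`cutoff_in_seconds`, `cutoff_in_nanos`, `is_expired`) are hypothetical and do not appear in the plugin source.

```c
#include <assert.h>
#include <stdint.h>

#define NANOS_PER_SECOND 1000000000ULL

/* Cutoff computed in seconds, as described for the cleanup path. */
static uint64_t cutoff_in_seconds(uint64_t now_ns, uint64_t retention_s)
{
    assert(now_ns / NANOS_PER_SECOND >= retention_s);
    return now_ns / NANOS_PER_SECOND - retention_s;
}

/* Cutoff kept in nanoseconds, matching the unit of the stored values. */
static uint64_t cutoff_in_nanos(uint64_t now_ns, uint64_t retention_s)
{
    assert(now_ns >= retention_s * NANOS_PER_SECOND);
    return now_ns - retention_s * NANOS_PER_SECOND;
}

/* A row is eligible for deletion when its timestamp is below the cutoff. */
static int is_expired(uint64_t stored_ns, uint64_t cutoff)
{
    return stored_ns < cutoff;
}
```

With a present-day clock, a nanosecond timestamp is roughly nine orders of magnitude larger than a cutoff expressed in seconds, so `is_expired` never returns true for the seconds-based cutoff. That would match the observation that entries are not deleted.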

applike-ss · 2025-01-24T07:14:09Z

Nice find @cm-rudolph!

Can you create a PR so the fluent team becomes aware of this issue (and its fix)?

applike-ss added the status: waiting-for-triage label Jan 2, 2025
