kubernetes_events input seems to create high cpu usage #9787

Open
applike-ss opened this issue Jan 2, 2025 · 4 comments
Comments

@applike-ss

applike-ss commented Jan 2, 2025

Bug Report

Describe the bug
I have deployed fluent-bit via a Deployment whose only job is to gather kubernetes_events and output them somewhere.
This fluent-bit seems to have an issue where, for anywhere from a few minutes to several hours at a time, the CPU usage goes to 1 (100% of one core).
The deployment only has a request of >1 and no limit set, and the node has a lot of spare CPU capacity (32-core system).
My other fluent-bit instances, which gather logs and send them to the same output, do not seem to have this issue.
There are no custom parsers in custom_parsers.conf.
I use the fluent-bit helm chart with these values:

    kind: Deployment
    autoscaling:
      vpa:
        enabled: true
    config:
      hotReload:
        enabled: true
      inputs: |
        [Input]
            Name    kubernetes_events
            db      /var/sync/db
            kube_retention_time 15m
            Tag     k8s-events
      customParsers: ""
      filters: ""
      outputs: |
        [Output]
            Name    forward
            Match    k8s-events
            Retry_Limit    5
            Host    my-external-fluentd-hostname
            Port    15000
    extraVolumes:
      - name: sync
        persistentVolumeClaim:
          claimName: fluent-bit-k8s-events-sync
    extraVolumeMounts:
      - name: sync
        mountPath: /var/sync
    image:
      tag: 3.2.4
    rbac:
      create: true
      eventsAccess: true
    replicaCount: 1
    serviceMonitor:
      enabled: true
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 1

I can also see on the node that fluent-bit itself is causing the CPU usage, not the config watcher or the hot-reload mechanism:

# ps aux | grep fluent
root        3751  0.0  0.0 1226304 2164 ?        Ssl  09:50   0:00 /fluent-bit/bin/fluent-bit-watcher
root        3778  0.1  0.1 125872 19508 ?        Sl   09:50   0:02 /fluent-bit/bin/fluent-bit --enable-hot-reload -c /fluent-bit/etc/fluent-bit.conf
root       54668 99.6  0.5 295496 92072 ?        Ssl  10:14   5:25 /fluent-bit/bin/fluent-bit --workdir=/fluent-bit/etc --config=/fluent-bit/etc/conf/fluent-bit.conf

To Reproduce

  • Run fluent-bit with the given chart values for a few days (make sure to create a PVC named fluent-bit-k8s-events-sync first)
  • Observe the CPU usage

Expected behavior
CPU usage should correlate with the number of events produced.

Screenshots
[Screenshot, 2025-01-02 at 11:23:44]

Your Environment

  • Version used: 3.2.4
  • Configuration: as shown above; manually create a PVC named fluent-bit-k8s-events-sync to hold the sync db
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes - AWS EKS v1.31.1-eks-1b3e656
  • Server type and version: AWS EC2 Instance
  • Operating System and version: Bottlerocket OS 1.29.0 (aws-k8s-1.31)
  • Filters and plugins: none

Additional context
It seems that fluent-bit is still processing events and writing them to the output, but I haven't checked whether they are complete.
I see this behavior across all our clusters, except those where the output runs inside the same cluster (in that case the output's hostname is an internal Kubernetes service).

@Akila-I

Akila-I commented Jan 20, 2025

I have been experiencing the same behaviour. It goes away when you do not use the db for checkpointing. However, as I understand it, that may lead to some data loss. A proper fix from the maintainers would be much appreciated.

@applike-ss
Author

I have been experiencing the same behaviour. It goes away when you are not using the db for checkpointing. However, it may lead to some data loss as I understand. A proper solution from maintainers for this issue is much appreciated.

This is indeed a workaround, but not one I would like to rely on. Thanks for sharing it though; I wasn't aware that the issue was caused by the db.

@cm-rudolph

cm-rudolph commented Jan 23, 2025

We are encountering the same issue and analyzed it by looking into the sqlite db. We saw that the entries are never deleted by the cleanup code.

The CPU usage seems to be high because no index on the uid column is used, so the duplicate check (run for each processed event) takes longer and longer as the database grows.

I guess that the bug is located in the calculation of the retention_time_ago:

retention_time_ago = now - (ctx->retention_time);

The stored timestamp has nanosecond precision, but in line 652 the current timestamp gets divided by 1 billion, so the cutoff ends up in seconds. Instead, the retention time should be multiplied by 1 billion in line 658, as it is done here:

outdated = cfl_time_now() - (ctx->retention_time * 1000000000L);
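
To make the unit mismatch concrete, here is a minimal standalone sketch. It does not use the plugin's code: `is_expired`, `cutoff_seconds`, and `cutoff_nanoseconds` are hypothetical helpers, and the "expired means stored timestamp below the cutoff" comparison is an assumption based on the description above.

```c
#include <stdint.h>

#define NS_PER_SEC 1000000000ULL

/* Hypothetical cleanup rule: an entry is expired when its stored
 * creation timestamp (nanoseconds) is below the cutoff. */
static int is_expired(uint64_t created_ns, uint64_t cutoff)
{
    return created_ns < cutoff;
}

/* Cutoff as described for line 658: "now" was already divided down to
 * seconds, so the result is in seconds. */
static uint64_t cutoff_seconds(uint64_t now_ns, uint64_t retention_s)
{
    return (now_ns / NS_PER_SEC) - retention_s;
}

/* Cutoff as described for line 305: everything stays in nanoseconds. */
static uint64_t cutoff_nanoseconds(uint64_t now_ns, uint64_t retention_s)
{
    return now_ns - retention_s * NS_PER_SEC;
}
```

With a 15 minute retention (900 seconds) and an entry stored one hour ago, `is_expired(created_ns, cutoff_seconds(now_ns, 900))` is false, because a nanosecond timestamp is always far larger than a cutoff in seconds, while `is_expired(created_ns, cutoff_nanoseconds(now_ns, 900))` is true. That would match the observation that entries never get deleted.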

@applike-ss
Author

We are encountering the same issue and analyzed it by looking into the sqlite db. We saw that the entries don't get deleted by the cleanup code.

The cpu usage seems to be high as there is no index in the uid column that gets used and the duplicate checking (for each processed event) takes really long when the database grows.

I guess that the bug is located in the calculation of the retention_time_ago:

fluent-bit/plugins/in_kubernetes_events/kubernetes_events.c

Line 658 in 81f62b9

retention_time_ago = now - (ctx->retention_time);
The stored timestamp has nanoseconds precision, but in line 652 the timestamp gets divided by 1 billion. Instead the retention time should be multiplied by 1 billion in line 658, as it is done here:

fluent-bit/plugins/in_kubernetes_events/kubernetes_events.c

Line 305 in 81f62b9

outdated = cfl_time_now() - (ctx->retention_time * 1000000000L);

Nice find, @cm-rudolph!

Can you create a PR so the fluent team becomes aware of this issue (and its fix)?
