possible memory leak: config with hostmetrics, kubeletstats, prometheus receivers + transform/k8sattributes processors #36351
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@gord1anknot can you post pictures of the profile graphs? Something like https://pprof.me/ is an easy way.
Pinging code owners for pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
It seems like the commonality between this issue and #36574 is the prometheus receiver. I don't see anything in the profile pictures that points to OTTL; it seems to point to k8sattributes, but that component is known for using a lot of memory because it has to keep a lot of data in memory.
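For context on why k8sattributes holds so much state: by default the processor watches pod metadata for the whole cluster. Below is a minimal sketch, not taken from the reporter's configuration, of scoping that watch to the local node; it assumes a node-name environment variable such as K8S_NODE_NAME injected via the downward API.

```yaml
processors:
  k8sattributes:
    # Only track pods scheduled on the same node as this collector instance.
    # K8S_NODE_NAME is assumed to be injected via the Kubernetes downward API.
    filter:
      node_from_env_var: K8S_NODE_NAME
```

Whether this applies here depends on the collapsed configuration, which is not reproduced in this thread.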
Pinging code owners for processor/k8sattributes: @dmitryax @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
Can you look at the … The other thing I would change is that your Prometheus SD seems to be watching pods on all nodes. In particular:

```yaml
- action: keep
  regex: ${env:K8S_NODE_NAME}
  source_labels:
    - __meta_kubernetes_pod_node_name
```

That will still watch all pods, and then later filter out pods that are on other nodes when choosing targets. Instead, I would strongly recommend switching to using the …
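The recommendation above is cut off in this thread. One way to achieve node-local discovery (an assumption about the intent, not a quote of the missing text) is to filter at the Kubernetes service-discovery layer with a field selector, so the receiver only watches pods on its own node instead of relabeling them away afterwards:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
              # Watch only pods scheduled on this node instead of the whole cluster.
              # K8S_NODE_NAME is assumed to be injected via the downward API.
              selectors:
                - role: pod
                  field: spec.nodeName=${env:K8S_NODE_NAME}
```

With a field selector like this in place, the `__meta_kubernetes_pod_node_name` keep rule should become redundant.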
Component(s)
processor/transform
What happened?
Description
Hello! My organization has a Helm deployment of the OpenTelemetry Collector, and we are seeing what I would describe as a memory leak with one particular daemonset tasked with ingesting Prometheus, kubelet, and host metrics from its node. We have worked around this issue by periodically restarting this workload.
The memory usage comes on very gradually; it takes about two weeks to build up, at which point CPU usage maxes out from a constant loop of garbage collection. At that point, metrics are refused due to this contention.
On August 2nd, we tried splitting the configuration into two daemonsets to isolate log forwarding from metrics when it reaches this condition. The log forwarding configuration does not have this problem.
We observed this issue both before an upgrade from 0.92.0 to 0.107.0 and after a rollback to 0.92.0, which confirmed that the memory issue was unrelated to the upgrade.

I suspect, but do not know, that this issue comes from our use of the transform processor, which is why I labeled the component that way. The reason I suspect it is that we greatly expanded our usage of that processor around July 13th, and I believe the chart shows the memory issue rising to a problem level faster after that date.
Please see the chart below, going back to May 1st, for a visual on the memory usage of our OpenTelemetry workloads. The cluster-reciever is a singleton pod for k8s cluster metrics and some high-memory scrapes, logs-agent is the split logs configuration, and collector is a gateway; none of these have the issue.

PromQL query for the chart seen below
Steps to Reproduce
We are able to reproduce this issue in lower environments; however, since the issue takes at least 14 days to show up, we cannot iterate very quickly here. Please find a complete configuration for the metrics-agent daemonset below.

Details
I noticed that other memory leak issues usually require the reporter to post a heap pprof, so I added pprof to our lower environments. Please find a heap dump of the oldest pod so instrumented (12 days old); unfortunately, it's not churning garbage collection yet, though it's getting close.
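For anyone trying to reproduce this, a minimal sketch of how the pprof extension mentioned above is typically enabled; the endpoint shown is the extension's default, and the reporter's exact settings are not included in the issue.

```yaml
extensions:
  pprof:
    # Serves the Go net/http/pprof endpoints; 1777 is the extension's default port.
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

Heap and CPU profiles like the attachments below can then be pulled from /debug/pprof/heap and /debug/pprof/profile on that port.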
Unfortunately, I'm running out of time to look at this issue, and I don't have much Go experience, so I can't tell what I'm looking at in the heap dump. As a workaround, we have implemented an automatic restart on Mondays; hoping you can help.
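The issue doesn't show how the weekly restart is implemented. A sketch of one common approach (the names, schedule, and RBAC setup below are all assumptions) is a CronJob that triggers a rollout restart of the daemonset:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-metrics-agent              # hypothetical name
spec:
  schedule: "0 6 * * 1"                    # Mondays at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: metrics-agent-restarter  # needs RBAC to patch the daemonset
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command: ["kubectl", "rollout", "restart", "daemonset/metrics-agent"]
```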
Thank you so very much!
pprof.otelcol-contrib.samples.cpu.003.pb.gz
pprof.otelcol-contrib.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
Expected Result
Garbage collection fully reclaims memory from routine operations
Actual Result
Garbage collection doesn't seem to reclaim some portion of overall memory consumption.
Collector version
v0.92.0
Environment information
Environment
OS: GKE / ContainerOS
Compiler (if manually compiled): using the public Docker image
OpenTelemetry Collector configuration
Log output
Additional context
Although the metrics-agent is configured to receive logs, metrics, and traces over OTLP, it does not do so in practice at this time. None of our services emit OTLP metrics to the metrics-agent, only to the gateway deployment, which does not have this issue. On the metrics-agent, the ports aren't even exposed. It collects metric signals using hostmetrics, kubeletstats, and prometheus only.
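Since the full collector configuration is collapsed above and not reproduced in this thread, the following is an assumed sketch of the pipeline shape this issue describes; every setting shown is a placeholder rather than the reporter's actual value.

```yaml
receivers:
  hostmetrics:
    collection_interval: 60s               # placeholder interval
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}
  kubeletstats:
    auth_type: serviceAccount
    endpoint: ${env:K8S_NODE_NAME}:10250   # scrape the local kubelet
  prometheus:
    config:
      scrape_configs: []                   # pod scrape configs omitted

processors:
  k8sattributes: {}                        # defaults shown; real settings omitted
  transform: {}                            # OTTL statements omitted
  batch: {}

exporters:
  otlp:
    endpoint: gateway-collector:4317       # hypothetical gateway address

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, kubeletstats, prometheus]
      processors: [k8sattributes, transform, batch]
      exporters: [otlp]
```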