memory leak on system with 128 x86_64 cores #36574
Comments
When the transforms are removed (#36351), the leak is drastically reduced but not resolved.
Pinging code owners for processor/transform: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
@jcpunk if possible, can you enable pprofextension and attach the memory profile?
Is there a particular profile you'd like me to extract? I'm not super familiar with Go and pprof, and there seem to be a lot of possible URLs.
You can refer to README and enable the extension. Then follow these steps:
Collect a few heap profiles and attach them.
pprof files attached here
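For readers following along, collecting a few heap profiles could be sketched roughly as below. This is a minimal sketch, not the exact steps from the comment above: it assumes the pprof extension is listening on `localhost:1777` (adjust to whatever `endpoint` your config sets) and that it serves the standard Go `/debug/pprof/heap` path. The helper names `heapURL` and `saveHeapProfile`, the sample count, and the file names are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// heapURL builds the heap-profile URL for a pprof endpoint given as host:port.
// Assumption: the endpoint serves the standard net/http/pprof path.
func heapURL(endpoint string) string {
	return "http://" + strings.TrimSuffix(endpoint, "/") + "/debug/pprof/heap"
}

// saveHeapProfile downloads one heap profile from url and writes it to path.
func saveHeapProfile(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Usage: collect <host:port>, e.g. collect localhost:1777 (assumed endpoint).
	if len(os.Args) < 2 {
		fmt.Println("usage: collect <host:port>")
		return
	}
	url := heapURL(os.Args[1])
	const samples = 3                 // hypothetical: number of profiles to take
	const interval = 10 * time.Minute // hypothetical: spacing between profiles
	for i := 1; i <= samples; i++ {
		path := fmt.Sprintf("heap-%d.pprof", i)
		if err := saveHeapProfile(url, path); err != nil {
			fmt.Fprintln(os.Stderr, "collect failed:", err)
			os.Exit(1)
		}
		fmt.Println("wrote", path)
		if i < samples {
			time.Sleep(interval)
		}
	}
}
```

Spacing the samples out matters here because the growth is slow; the saved files can then be compared with `go tool pprof -base heap-1.pprof heap-3.pprof` to see which allocations grew between the first and last sample.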
Interestingly, I've confirmed the leak persists without the transform in the pipeline. Config:
And some pprof files: pprof-notransform.tar.gz
This is the second issue that mentions OTTL having a memory leak. @jcpunk like in that issue, can you post screenshots of the profile indicating where in the transformprocessor/ottl the issue is happening? It would be really helpful if someone can provide a reproducible test case locally.
Pinging code owners for pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
I misread this comment, which seems to exonerate OTTL.
My guess is that something in the Prometheus components is to blame.
I'll confess I'm not sure how to parse out the attached pprof files. I was able to load them into https://pprof.me/ but I'd think an expert would be better served by the raw files rather than my cropped screenshots.
Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
Pinging code owners for exporter/prometheus: @Aneurysm9 @dashpole @ArthurSens. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.
I've also captured some pprof files of the process as the memory is growing with just the otel-core distribution. A large number of cores seems to be critical to replication.
Describe the bug
I've got an x86_64 system with 128 cores. The otel collector adds about 5 MiB to its working memory every time it scrapes a metrics endpoint. Eventually it hits the memorylimiter, but garbage collection never seems to make real headway and eventually fails to reclaim enough memory.
My identically configured systems with 8 or 16 x86_64 cores do not appear to leak in this manner.
My aarch64 system with a similar config and 64 cores also appears to leak in this manner.
Steps to reproduce
Run the otel-collector on a system with a lot of processing cores
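Since the core count seems to be the distinguishing factor, it may help to record what the Go runtime actually sees on each host when comparing leaking and non-leaking systems. The following is a minimal sketch under that assumption, not part of the collector: `cpuSummary` is a hypothetical helper, and it reports values for its own process, so it would need to run in the same container (with the same CPU limits and affinity) as the collector to be comparable.

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuSummary returns the number of logical CPUs usable by this process and
// the current GOMAXPROCS setting (the limit on OS threads running Go code
// simultaneously).
func cpuSummary() (numCPU, maxProcs int) {
	// GOMAXPROCS(0) queries the current value without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	n, p := cpuSummary()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", n, p)
}
```

If the numbers differ sharply between hosts, one possible experiment is to start the collector with the `GOMAXPROCS` environment variable set to a small value on the 128-core machine and see whether the growth changes. This is only a way to narrow things down, not a confirmed cause or fix.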
What did you expect to see?
Memory usage eventually stabilizes
What did you see instead?
Memory usage grows to fill the space allotted; tested up to 4 GiB (which takes 6 days)
What version did you use?
otelcol-contrib version 0.114.0 (memory code is probably in the base collector)
What config did you use?
Environment
OS: AlmaLinux 9
Platform: podman
Podman Quadlet file: /etc/containers/systemd/otel-collector.container
Additional context
endpoints:
logs