Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otelcol-contrib suddenly creates massive disk i/o #36180

Closed
jcpunk opened this issue Nov 4, 2024 · 5 comments
Closed

otelcol-contrib suddenly creates massive disk i/o #36180

jcpunk opened this issue Nov 4, 2024 · 5 comments
Labels
bug Something isn't working needs triage New item requiring triage

Comments

@jcpunk
Copy link

jcpunk commented Nov 4, 2024

Component(s)

No response

What happened?

Description

The otelcol-contrib binary suddenly is doing a whole lot of disk IO with no real reason/cause

Steps to Reproduce

Unknown, it just sorta happens after the container has been up for a week or so since v0.107.0 (first version I ever used)

Expected Result

Actual Result

Total DISK READ :     272.86 M/s | Total DISK WRITE :     240.21 K/s
Actual DISK READ:     272.39 M/s | Actual DISK WRITE:     272.08 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                              
2564304 be/7 monitor    23.19 M/s    0.00 B/s 88.38 % 84.87 % otelcol-contrib --c~tor/otel-config.yaml
2566875 be/7 monitor    22.57 M/s    0.00 B/s 88.26 % 84.77 % otelcol-contrib --c~tor/otel-config.yaml
2566878 be/7 monitor    23.14 M/s    0.00 B/s 87.83 % 84.57 % otelcol-contrib --c~tor/otel-config.yaml
2569174 be/7 monitor    22.75 M/s    0.00 B/s 87.50 % 84.21 % otelcol-contrib --c~tor/otel-config.yaml
2562628 be/7 monitor    22.76 M/s    0.00 B/s 87.55 % 84.12 % otelcol-contrib --c~tor/otel-config.yaml
2564303 be/7 monitor    22.81 M/s    0.00 B/s 87.38 % 83.85 % otelcol-contrib --c~tor/otel-config.yaml
2559375 be/7 monitor    23.09 M/s    0.00 B/s 87.10 % 83.83 % otelcol-contrib --c~tor/otel-config.yaml
2562623 be/7 monitor    22.39 M/s    0.00 B/s 87.50 % 83.78 % otelcol-contrib --c~tor/otel-config.yaml
2562622 be/7 monitor    22.97 M/s    0.00 B/s 87.07 % 83.72 % otelcol-contrib --c~tor/otel-config.yaml
2566879 be/7 monitor    22.53 M/s    0.00 B/s 86.88 % 83.33 % otelcol-contrib --c~tor/otel-config.yaml
2566874 be/7 monitor    22.86 M/s    0.00 B/s 87.11 % 83.21 % otelcol-contrib --c~tor/otel-config.yaml
2569170 be/7 monitor    21.75 M/s    0.00 B/s 85.48 % 82.03 % otelcol-contrib --c~tor/otel-config.yaml
2569175 be/7 monitor    38.29 K/s    0.00 B/s  0.14 %  0.13 % otelcol-contrib --c~tor/otel-config.yaml
2559388 be/7 monitor     6.96 K/s    0.00 B/s  0.07 %  0.07 % otelcol-contrib --c~tor/otel-config.yaml

Collector version

v0.112.0

Environment information

Environment

OS: Almalinux 9
Podman: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest

OpenTelemetry Collector configuration

---
processors:
  batch: {}
  transform/hostname:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["nodename"], "mu2e-dcs-01.fnal.gov")
      - set(resource.attributes["nodename"], "mu2e-dcs-01.fnal.gov")
exporters:
  prometheus:
    endpoint: "[::]:9299"
    enable_open_metrics: true
    metric_expiration: 2m
service:
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - transform/hostname
      - batch
      exporters:
      - prometheus
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: node-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9100
          labels:
            instance: mu2e-dcs-01.fnal.gov:9100
      - job_name: systemd-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9558
          labels:
            instance: mu2e-dcs-01.fnal.gov:9558

Log output

<30>1 2024-11-04T00:12:42.604170-06:00 mu2e-dcs-01.fnal.gov podman 805977 - -             otel-collector.service               2fbf0ebdf1a2 (systemd-otel-collector)               ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  registry    false
<11>1 2024-11-04T15:24:51.816655-06:00 mu2e-dcs-01.fnal.gov systemd-otel-collector 2559367 - - 2024-11-04T21:24:51.757Z#011warn#011internal/transaction.go:128#011Failed to scrape Prometheus endpoint#011{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1730755481211, "target_labels": "{__name__=\"up\", instance=\"mu2e-dcs-01.fnal.gov:9100\", job=\"node-exporter\"}"}

Additional context

[root@mu2e-dcs-01 ~]# cat /etc/containers/systemd/otel-collector.container

[Service]
TimeoutStartSec=900
TimeoutStopSec=30
CPUWeight=30
MemoryMax=256M
IOSchedulingClass=best-effort
IOSchedulingPriority=7
IOWeight=30
Restart=always

[Container]
AutoUpdate=registry
DropCapability=ALL
User=5219
Group=8247
HostName=%H
LogDriver=journald
NoNewPrivileges=true
Pull=missing
ReadOnly=true
Volume=/etc/otel-collector:/etc/otel-collector:ro,rslave,z
Exec=--config /etc/otel-collector/otel-config.yaml

Image=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest
Network=host
PublishPort=[::]:9299:9299

[Install]
WantedBy=default.target

strace of process for little bit:

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569175] futex(0xc00129d248, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569174] futex(0xc004000f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566878] futex(0xc003f80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566875] futex(0xc003d80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566874] futex(0xc003d00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564304] futex(0xc003f80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564303] futex(0xc003f00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562628] futex(0xc004000148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562625] futex(0xc003e80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562623] futex(0xc003d80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559375] futex(0xc000180148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190216951} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 190, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=189637237} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190156915} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190100964} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190091615} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190280753} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190496487} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190536362} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190177556} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190050015} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190192720} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>

strace a little later

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559375] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] tgkill(1, 24, SIGURG <unfinished ...>
[pid 2559375] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2562625] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559375] <... rt_sigreturn resumed>) = 825556758272
[pid 2559372] getpid( <unfinished ...>
[pid 2562625] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2562625] <... rt_sigreturn resumed>) = 825193762800
[pid 2559372] tgkill(1, 52, SIGURG)     = 0
[pid 2569175] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] getpid( <unfinished ...>
[pid 2569175] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2569175] <... rt_sigreturn resumed>) = 825300078528
[pid 2559372] tgkill(1, 33, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 34, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2564304] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2564304] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 27, SIGURG <unfinished ...>
[pid 2564304] <... rt_sigreturn resumed>) = 139719454742824
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 37, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2566875] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2566875] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 36, SIGURG <unfinished ...>
[pid 2566875] <... rt_sigreturn resumed>) = 139719506348232
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 17, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 22, SIGURG)     = 0
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 6, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc003e00148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2562624] <... futex resumed>)      = 0
[pid 2562624] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2562624] epoll_pwait(4,  <unfinished ...>
[pid 2564303] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2564303] rt_sigreturn({mask=[]})   = 139719544750728
[pid 2566874] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2566874] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2562623] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559388] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] <... rt_sigreturn resumed>) = 826248941504
[pid 2559388] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2566874] <... rt_sigreturn resumed>) = 825087975040
[pid 2559388] <... rt_sigreturn resumed>) = 826291204000
[pid 2562628] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562628] rt_sigreturn({mask=[]})   = 827566583712
[pid 2569170] futex(0xc003d81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 24, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 52, SIGURG)     = 0

lsof for dedicated user:

[root@mu2e-dcs-01 ~]# lsof -u monitor
COMMAND       PID    USER   FD      TYPE    DEVICE  SIZE/OFF      NODE NAME
otelcol-c 2559369 monitor  cwd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  rtd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  txt       REG     0,117 263930008 100672926 /otelcol-contrib
otelcol-c 2559369 monitor  mem       REG     253,2           100672926 /otelcol-contrib (stat: No such file or directory)
otelcol-c 2559369 monitor    0u      CHR       1,3       0t0         5 /dev/null
otelcol-c 2559369 monitor    1w     FIFO      0,13       0t0 108577923 pipe
otelcol-c 2559369 monitor    2w     FIFO      0,13       0t0 108577924 pipe
otelcol-c 2559369 monitor    3u     IPv4 108567751       0t0       TCP localhost:ddi-tcp-1 (LISTEN)
otelcol-c 2559369 monitor    4u  a_inode      0,14         0      2079 [eventpoll:3,5,6,7,8]
otelcol-c 2559369 monitor    5u  a_inode      0,14         0      2079 [eventfd:584]
otelcol-c 2559369 monitor    6u     IPv6 145586142       0t0       TCP localhost:56780->localhost:jetdirect (ESTABLISHED)
otelcol-c 2559369 monitor    7u     IPv6 108595225       0t0       TCP *:9299 (LISTEN)
otelcol-c 2559369 monitor    8u     IPv6 108595350       0t0       TCP localhost:33310->localhost:9558 (ESTABLISHED)

podman ps:

[root@mu2e-dcs-01 ~]# podman ps
CONTAINER ID  IMAGE                                                                                           COMMAND               CREATED     STATUS      PORTS       NAMES
353d29fa2f5d  quay.io/prometheuscommunity/systemd-exporter:latest                                             --systemd.collect...  3 days ago  Up 3 days               systemd-prometheus-systemd-exporter
77768f62f72e  quay.io/prometheus/node-exporter:latest                                                         --path.procfs=/ho...  3 days ago  Up 3 days               systemd-prometheus-node-exporter
2fbf0ebdf1a2  ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  --config /etc/ote...  3 days ago  Up 3 days               systemd-otel-collector
@jcpunk jcpunk added bug Something isn't working needs triage New item requiring triage labels Nov 4, 2024
@atoulme
Copy link
Contributor

atoulme commented Nov 10, 2024

Do you use swap on your machine?

@jcpunk
Copy link
Author

jcpunk commented Nov 10, 2024

The system does have a linux swap partition, but I'm not showing any memory pressure or pages swapped out.

@daniel-hoefer
Copy link

@jcpunk According to the iotop usage help a SWAPIN of 88% means that 88% of time was spent on swapping in memory pages from disk. If not from memory pressure then what other explanation could there be for swapped out memory pages?

@jcpunk
Copy link
Author

jcpunk commented Nov 12, 2024

I wonder if it hit the pod memory limit and started swapping rather than oom kill? I'll add the memory_limiter to my config to see if that makes a difference. Not that I can trigger this bug on purpose...

@jcpunk
Copy link
Author

jcpunk commented Nov 26, 2024

I think this might be related to the garbage collection errors noted in : #36351

That would explain why the memory grew enough to hit the limit.

@jcpunk jcpunk closed this as completed Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage New item requiring triage
Projects
None yet
Development

No branches or pull requests

3 participants