`otelcol-contrib` suddenly creates massive disk i/o #36180

jcpunk · 2024-11-04T22:04:07Z

Component(s)

No response

What happened?

Description

The otelcol-contrib binary suddenly is doing a whole lot of disk IO with no real reason/cause

Steps to Reproduce

Unknown, it just sorta happens after the container has been up for a week or so since v0.107.0 (first version I ever used)

Expected Result

Actual Result

Total DISK READ :     272.86 M/s | Total DISK WRITE :     240.21 K/s
Actual DISK READ:     272.39 M/s | Actual DISK WRITE:     272.08 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                              
2564304 be/7 monitor    23.19 M/s    0.00 B/s 88.38 % 84.87 % otelcol-contrib --c~tor/otel-config.yaml
2566875 be/7 monitor    22.57 M/s    0.00 B/s 88.26 % 84.77 % otelcol-contrib --c~tor/otel-config.yaml
2566878 be/7 monitor    23.14 M/s    0.00 B/s 87.83 % 84.57 % otelcol-contrib --c~tor/otel-config.yaml
2569174 be/7 monitor    22.75 M/s    0.00 B/s 87.50 % 84.21 % otelcol-contrib --c~tor/otel-config.yaml
2562628 be/7 monitor    22.76 M/s    0.00 B/s 87.55 % 84.12 % otelcol-contrib --c~tor/otel-config.yaml
2564303 be/7 monitor    22.81 M/s    0.00 B/s 87.38 % 83.85 % otelcol-contrib --c~tor/otel-config.yaml
2559375 be/7 monitor    23.09 M/s    0.00 B/s 87.10 % 83.83 % otelcol-contrib --c~tor/otel-config.yaml
2562623 be/7 monitor    22.39 M/s    0.00 B/s 87.50 % 83.78 % otelcol-contrib --c~tor/otel-config.yaml
2562622 be/7 monitor    22.97 M/s    0.00 B/s 87.07 % 83.72 % otelcol-contrib --c~tor/otel-config.yaml
2566879 be/7 monitor    22.53 M/s    0.00 B/s 86.88 % 83.33 % otelcol-contrib --c~tor/otel-config.yaml
2566874 be/7 monitor    22.86 M/s    0.00 B/s 87.11 % 83.21 % otelcol-contrib --c~tor/otel-config.yaml
2569170 be/7 monitor    21.75 M/s    0.00 B/s 85.48 % 82.03 % otelcol-contrib --c~tor/otel-config.yaml
2569175 be/7 monitor    38.29 K/s    0.00 B/s  0.14 %  0.13 % otelcol-contrib --c~tor/otel-config.yaml
2559388 be/7 monitor     6.96 K/s    0.00 B/s  0.07 %  0.07 % otelcol-contrib --c~tor/otel-config.yaml

Collector version

v0.112.0

Environment information

Environment

OS: Almalinux 9
Podman: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest

OpenTelemetry Collector configuration

---
processors:
  batch: {}
  transform/hostname:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["nodename"], "mu2e-dcs-01.fnal.gov")
      - set(resource.attributes["nodename"], "mu2e-dcs-01.fnal.gov")
exporters:
  prometheus:
    endpoint: "[::]:9299"
    enable_open_metrics: true
    metric_expiration: 2m
service:
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - transform/hostname
      - batch
      exporters:
      - prometheus
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: node-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9100
          labels:
            instance: mu2e-dcs-01.fnal.gov:9100
      - job_name: systemd-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9558
          labels:
            instance: mu2e-dcs-01.fnal.gov:9558

Log output

<30>1 2024-11-04T00:12:42.604170-06:00 mu2e-dcs-01.fnal.gov podman 805977 - -             otel-collector.service               2fbf0ebdf1a2 (systemd-otel-collector)               ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  registry    false
<11>1 2024-11-04T15:24:51.816655-06:00 mu2e-dcs-01.fnal.gov systemd-otel-collector 2559367 - - 2024-11-04T21:24:51.757Z#011warn#011internal/transaction.go:128#011Failed to scrape Prometheus endpoint#011{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1730755481211, "target_labels": "{__name__=\"up\", instance=\"mu2e-dcs-01.fnal.gov:9100\", job=\"node-exporter\"}"}

Additional context

[root@mu2e-dcs-01 ~]# cat /etc/containers/systemd/otel-collector.container

[Service]
TimeoutStartSec=900
TimeoutStopSec=30
CPUWeight=30
MemoryMax=256M
IOSchedulingClass=best-effort
IOSchedulingPriority=7
IOWeight=30
Restart=always

[Container]
AutoUpdate=registry
DropCapability=ALL
User=5219
Group=8247
HostName=%H
LogDriver=journald
NoNewPrivileges=true
Pull=missing
ReadOnly=true
Volume=/etc/otel-collector:/etc/otel-collector:ro,rslave,z
Exec=--config /etc/otel-collector/otel-config.yaml

Image=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest
Network=host
PublishPort=[::]:9299:9299

[Install]
WantedBy=default.target

strace of process for little bit:

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569175] futex(0xc00129d248, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569174] futex(0xc004000f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566878] futex(0xc003f80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566875] futex(0xc003d80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566874] futex(0xc003d00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564304] futex(0xc003f80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564303] futex(0xc003f00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562628] futex(0xc004000148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562625] futex(0xc003e80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562623] futex(0xc003d80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559375] futex(0xc000180148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190216951} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 190, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=189637237} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190156915} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190100964} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190091615} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190280753} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190496487} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190536362} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190177556} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190050015} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190192720} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>

strace a little later

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559375] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] tgkill(1, 24, SIGURG <unfinished ...>
[pid 2559375] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2562625] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559375] <... rt_sigreturn resumed>) = 825556758272
[pid 2559372] getpid( <unfinished ...>
[pid 2562625] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2562625] <... rt_sigreturn resumed>) = 825193762800
[pid 2559372] tgkill(1, 52, SIGURG)     = 0
[pid 2569175] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] getpid( <unfinished ...>
[pid 2569175] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2569175] <... rt_sigreturn resumed>) = 825300078528
[pid 2559372] tgkill(1, 33, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 34, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2564304] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2564304] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 27, SIGURG <unfinished ...>
[pid 2564304] <... rt_sigreturn resumed>) = 139719454742824
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 37, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2566875] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2566875] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 36, SIGURG <unfinished ...>
[pid 2566875] <... rt_sigreturn resumed>) = 139719506348232
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 17, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 22, SIGURG)     = 0
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 6, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc003e00148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2562624] <... futex resumed>)      = 0
[pid 2562624] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2562624] epoll_pwait(4,  <unfinished ...>
[pid 2564303] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2564303] rt_sigreturn({mask=[]})   = 139719544750728
[pid 2566874] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2566874] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2562623] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559388] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] <... rt_sigreturn resumed>) = 826248941504
[pid 2559388] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2566874] <... rt_sigreturn resumed>) = 825087975040
[pid 2559388] <... rt_sigreturn resumed>) = 826291204000
[pid 2562628] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562628] rt_sigreturn({mask=[]})   = 827566583712
[pid 2569170] futex(0xc003d81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 24, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 52, SIGURG)     = 0

lsof for dedicated user:

[root@mu2e-dcs-01 ~]# lsof -u monitor
COMMAND       PID    USER   FD      TYPE    DEVICE  SIZE/OFF      NODE NAME
otelcol-c 2559369 monitor  cwd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  rtd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  txt       REG     0,117 263930008 100672926 /otelcol-contrib
otelcol-c 2559369 monitor  mem       REG     253,2           100672926 /otelcol-contrib (stat: No such file or directory)
otelcol-c 2559369 monitor    0u      CHR       1,3       0t0         5 /dev/null
otelcol-c 2559369 monitor    1w     FIFO      0,13       0t0 108577923 pipe
otelcol-c 2559369 monitor    2w     FIFO      0,13       0t0 108577924 pipe
otelcol-c 2559369 monitor    3u     IPv4 108567751       0t0       TCP localhost:ddi-tcp-1 (LISTEN)
otelcol-c 2559369 monitor    4u  a_inode      0,14         0      2079 [eventpoll:3,5,6,7,8]
otelcol-c 2559369 monitor    5u  a_inode      0,14         0      2079 [eventfd:584]
otelcol-c 2559369 monitor    6u     IPv6 145586142       0t0       TCP localhost:56780->localhost:jetdirect (ESTABLISHED)
otelcol-c 2559369 monitor    7u     IPv6 108595225       0t0       TCP *:9299 (LISTEN)
otelcol-c 2559369 monitor    8u     IPv6 108595350       0t0       TCP localhost:33310->localhost:9558 (ESTABLISHED)

podman ps:

[root@mu2e-dcs-01 ~]# podman ps
CONTAINER ID  IMAGE                                                                                           COMMAND               CREATED     STATUS      PORTS       NAMES
353d29fa2f5d  quay.io/prometheuscommunity/systemd-exporter:latest                                             --systemd.collect...  3 days ago  Up 3 days               systemd-prometheus-systemd-exporter
77768f62f72e  quay.io/prometheus/node-exporter:latest                                                         --path.procfs=/ho...  3 days ago  Up 3 days               systemd-prometheus-node-exporter
2fbf0ebdf1a2  ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  --config /etc/ote...  3 days ago  Up 3 days               systemd-otel-collector

The text was updated successfully, but these errors were encountered:

atoulme · 2024-11-10T05:05:38Z

Do you use swap on your machine?

jcpunk · 2024-11-10T05:14:28Z

The system does have a linux swap partition, but I'm not showing any memory pressure or pages swapped out.

daniel-hoefer · 2024-11-12T16:45:42Z

@jcpunk According to the iotop usage help a SWAPIN of 88% means that 88% of time was spent on swapping in memory pages from disk. If not from memory pressure then what other explanation could there be for swapped out memory pages?

jcpunk · 2024-11-12T17:10:26Z

I wonder if it hit the pod memory limit and started swapping rather than oom kill? I'll add the memory_limiter to my config to see if that makes a difference. Not that I can trigger this bug on purpose...

jcpunk · 2024-11-26T21:01:38Z

I think this might be related to the garbage collection errors noted in : #36351

That would explain why the memory grew enough to hit the limit.

jcpunk added bug Something isn't working needs triage New item requiring triage labels Nov 4, 2024

github-actions bot mentioned this issue Nov 5, 2024

Weekly Report: 2024-10-29 - 2024-11-05 #36187

Closed

github-actions bot mentioned this issue Nov 12, 2024

Weekly Report: 2024-11-05 - 2024-11-12 #36302

Closed

github-actions bot mentioned this issue Nov 19, 2024

Weekly Report: 2024-11-12 - 2024-11-19 #36426

Closed

github-actions bot mentioned this issue Nov 26, 2024

Weekly Report: 2024-11-19 - 2024-11-26 #36533

Closed

jcpunk closed this as completed Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`otelcol-contrib` suddenly creates massive disk i/o #36180

`otelcol-contrib` suddenly creates massive disk i/o #36180

jcpunk commented Nov 4, 2024

atoulme commented Nov 10, 2024

jcpunk commented Nov 10, 2024

daniel-hoefer commented Nov 12, 2024

jcpunk commented Nov 12, 2024 •

edited

Loading

jcpunk commented Nov 26, 2024

otelcol-contrib suddenly creates massive disk i/o #36180

otelcol-contrib suddenly creates massive disk i/o #36180

Comments

jcpunk commented Nov 4, 2024

Component(s)

What happened?

Description

Steps to Reproduce

Expected Result

Actual Result

Collector version

Environment information

Environment

OpenTelemetry Collector configuration

Log output

Additional context

atoulme commented Nov 10, 2024

jcpunk commented Nov 10, 2024

daniel-hoefer commented Nov 12, 2024

jcpunk commented Nov 12, 2024 • edited Loading

jcpunk commented Nov 26, 2024

`otelcol-contrib` suddenly creates massive disk i/o #36180

`otelcol-contrib` suddenly creates massive disk i/o #36180

jcpunk commented Nov 12, 2024 •

edited

Loading