
Sidecars not terminated in a timely manner at scale when main container completes #10612

Open

bencompton opened this issue Feb 28, 2023 · 13 comments
Labels
area/controller (Controller issues, panics) · P2 (Important. All bugs with >=3 thumbs up that aren't P0 or P1, plus: Any other bugs deemed important) · solution/workaround (There's a workaround, might not be great, but exists) · type/bug

Comments

@bencompton

bencompton commented Feb 28, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

What happened:

When running 10 instances of a workflow that spins up 200 parallel pods with sidecars (2000 total pods), some of the pods don't complete until several minutes after the main container completes (delays of 30+ minutes have been witnessed). Instead, the main and wait containers complete, but the sidecars continue running afterwards.

Expectation:

The sidecars should be terminated (or at least receive a terminate signal) within seconds after the main container completes.

Version

3.4.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-test-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: sidecar-test
            template: sidecar-test
            withSequence:
              count: "200"
    - name: sidecar-test
      container:
        image: debian:bullseye-slim
        resources:
          requests:
            memory: 500Mi
            cpu: 100m
        command: [bash, -c]
        args:
          - |
            sleep 10
      sidecars:
        - name: sidecar-1
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done
        - name: sidecar-2
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done
        - name: sidecar-3
          image: debian:bullseye-slim
          resources:
            requests:
              memory: 500Mi
              cpu: 100m
          command: [bash, -c]
          args:
            - |
              while :
              do
                sleep 5
              done

Logs from the workflow controller

workflow-controller.log

Logs from your workflow's wait container

time="2023-02-27T18:00:21 UTC" level=info msg="Starting deadline monitor"
time="2023-02-27T18:01:22 UTC" level=info msg="Main container completed" error="<nil>"
time="2023-02-27T18:01:22 UTC" level=info msg="Deadline monitor stopped"
time="2023-02-27T18:01:22 UTC" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-02-27T18:01:22 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-02-27T18:01:22 UTC" level=info msg="No output parameters"
time="2023-02-27T18:01:22 UTC" level=info msg="No output artifacts"
time="2023-02-27T18:01:22 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: sidecar-test-h5f6f/sidecar-test-h5f6f-sidecar-test-1545037156/main.log"
time="2023-02-27T18:01:22 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-02-27T18:01:22 UTC" level=info msg="Saving file to s3" bucket=ci-platform-us01-ci-platform-prod-artifacts endpoint=s3.amazonaws.com key=sidecar-test-h5f6f/sidecar-test-h5f6f-sidecar-test-1545037156/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="Save artifact" artifactName=main-logs duration=95.712696ms error="<nil>" key=sidecar-test-h5f6f/sidecar-test-h5f6f-sidecar-test-1545037156/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-02-27T18:01:22 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-02-27T18:01:22 UTC" level=info msg="Create workflowtaskresults 201"
time="2023-02-27T18:01:22 UTC" level=info msg="Alloc=11112 TotalAlloc=18408 Sys=35282 NumGC=4 Goroutines=11"

Additional context

Environment: AWS EKS, running Karpenter with c6i.32xlarge instances

When the issue occurs, pods look like this:

...
sidecar-test-njbkt-sidecar-test-1683720170                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-1749094478                        3/5     NotReady      0               16m
sidecar-test-njbkt-sidecar-test-2256319574                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-3413532318                        3/5     NotReady      0               15m
sidecar-test-njbkt-sidecar-test-616410606                         3/5     NotReady      0               16m
sidecar-test-njbkt-sidecar-test-839388366                         3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-1116727110                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-1866103240                        3/5     NotReady      0               16m
sidecar-test-r9wpq-sidecar-test-1911252918                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-2550845942                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-269508782                         3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-2869810302                        3/5     NotReady      0               15m
sidecar-test-r9wpq-sidecar-test-3276810206                        3/5     NotReady      0               16m
...
Containers:
  wait:
...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Feb 2023 08:47:04 -0700
      Finished:     Tue, 28 Feb 2023 08:47:15 -0700
    Ready:          False
...
  main:
...   
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 28 Feb 2023 08:47:04 -0700
      Finished:     Tue, 28 Feb 2023 08:47:14 -0700
    Ready:          False
...
  sidecar-1:
...   
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:05 -0700
    Ready:          True
    Restart Count:  0
...
  sidecar-2:
...   
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:06 -0700
    Ready:          True
...
  sidecar-3:
...      
    State:          Running
      Started:      Tue, 28 Feb 2023 08:47:07 -0700
    Ready:          True

describe-pod.log

Notes

  • This issue can be reproduced by running a single instance of the workflow. I tested a single instance in a completely separate, smaller cluster and noted that some pods have a duration of < 1m while others run over 3m, with the main container usually completing within seconds and the sidecars still running for minutes afterwards.

  • Just had a single instance of this workflow take 21m when running in the original, larger cluster. These absurdly long runtimes seem to occur after running 10 concurrent instances (runtime is usually ~1m). The pods were all running within 30s, and the main containers were completing quickly.

@bencompton
Author

I discovered that having the main container write signal files for the sidecars with SIGTERM messages is a way to work around this issue. I updated the main container with the following code to mimic what the wait container does:

sleep 10

echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-1/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-2/signal
echo MTU= | base64 -d > /var/run/argo/ctr/sidecar-3/signal
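# Each write above puts "15" (SIGTERM; "MTU=" is its base64 encoding) into a
# sidecar's emissary signal file, mimicking what the wait container does.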

With this workaround, 10 instances of the example workflow are finishing within a few minutes instead of 20m, with most pods having a duration of < 20s and a few having a duration slightly over 1m.

It would appear that without this hack, something is preventing the wait container from writing the signal files in a timely manner after the main container completes.
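For templates with several sidecars, the same hack can be written as a short loop. This is only a sketch based on the workaround above; it assumes the emissary signal directories under /var/run/argo/ctr/ exist for each sidecar, and the names are simply the ones from the example workflow:

sleep 10

# Send SIGTERM (signal 15, base64 "MTU=") to every sidecar defined in this
# template as soon as the main command finishes.
for sidecar in sidecar-1 sidecar-2 sidecar-3; do
  echo MTU= | base64 -d > "/var/run/argo/ctr/${sidecar}/signal"
done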

@caelan-io
Member

Thanks a lot for posting this issue w/ logs, a repro workflow, and even proposing a workaround for the time being. ❤️

We're doing some digging of our own into workflow hangs that might be related to this (see: #10491). We'll post an update on here if we run into a cause or fix.

@caelan-io
Member

Also, FYI for SIGTERM-related issues:
#10518
PR: #10523

@bencompton
Author

Thanks for the info @caelan-io! Hmm, I wonder if #10523 fixes this issue.

@caelan-io
Member

If you weren't having this issue until 3.4.5, then that will likely fix it. If you're able to test that PR or master with your repro workflow, please let us know the results.

@JPZ13
Member

JPZ13 commented Mar 16, 2023

Hey @bencompton - have you had a chance to test out if #10523 fixes this issue? If it does, we'll go ahead and close it and see when we can get another patch release out

@JPZ13 JPZ13 added the problem/more information needed Not enough information has been provided to diagnose this issue. label Mar 16, 2023
@bencompton
Author

My team just updated to 3.4.7 and I re-tested. Unfortunately, I'm still seeing the same issue with the sidecars not terminating in a timely manner. In my team's workflows, I saw pods whose main containers completed after 20m but whose sidecars never stopped, so the pods kept running until hitting our 1h deadline. When re-testing with the minimal reproduction above, I see the same results as before:

…
sidecar-test-vsf8r-sidecar-test-3583951812                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3593789946                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-359583538                        3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-364746458                        3/5     NotReady    0          12m
sidecar-test-vsf8r-sidecar-test-3674744714                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3714523468                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3745816844                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3748422672                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3749200898                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3758682656                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3773040474                       3/5     NotReady    0          12m
sidecar-test-vsf8r-sidecar-test-3804027916                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3866083034                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-386948370                        3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3873634500                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3876696458                       3/5     NotReady    0          13m
sidecar-test-vsf8r-sidecar-test-3880561570                       3/5     NotReady    0          14m
sidecar-test-vsf8r-sidecar-test-3926852610                       3/5     NotReady    0          12m
…
  main:
    Container ID:  containerd://2f38531a9cd7e0d9213240be0899692cef963d1c7ced83cc926e7baff156f4fb
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      sleep 10
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 03 May 2023 11:38:35 -0600
      Finished:     Wed, 03 May 2023 11:38:45 -0600
    Ready:          False
…
  sidecar-1:
    Container ID:  containerd://afedb72e64ef9ec369e3bf2a89382216e7b83d030bb28310b2d9c41096220a4f
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done
      
    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True
…
  sidecar-2:
    Container ID:  containerd://38d014aef53551e1e1d58b310e0640783ab42347282d789a517a68d7fa292ac7
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done
      
    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True
…
  sidecar-3:
    Container ID:  containerd://924cd7afe76ec251b3100862b9b5a8e85d822636c86ab1a1d24e234f402e1577
    Image:         debian:bullseye-slim
    Image ID:      docker.io/library/debian@sha256:f4da3f9b18fc242b739807a0fb3e77747f644f2fb3f67f4403fafce2286b431a
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      bash
      -c
    Args:
      while :
      do
        sleep 5
      done
      
    State:          Running
      Started:      Wed, 03 May 2023 11:38:35 -0600
    Ready:          True

FYI: @JPZ13 @caelan-io @jmeridth

@McPonolith

McPonolith commented May 30, 2023

Also running into this issue on 3.3.10, 3.4.0, 3.4.7, and 3.4.8. (I've run into multiple permission-related issues when installing from fresh, so I wonder if this is a similar issue.)

@bencompton's workaround works, but it's obviously not ideal (thanks a lot though!), and it lends credence to this being a permission issue. Are any of the maintainers running into this as well? It's very easily reproducible for me.

I've noticed the pods are being created outside of the argo namespace. This caused an issue for me when setting up an artifact repo: the documentation had me create credentials in the argo namespace, but they weren't accessible. There have also been a few other similar issues. (I'm new to K8s, so this may be perfectly normal and the documentation simply wrong.) Though that wouldn't explain why it works outside of DAG/Steps.

This works, as there's no DAG or Steps

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-
spec:
  entrypoint: sidecar-example
  templates:
  - name: sidecar-example
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["
        apk update &&
        apk add curl &&
        until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
        for i in $(seq 1 20); 
          do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
          sleep .5 ; 
        done
      "]
    sidecars:
    - name: influxdb
      image: influxdb:1.2
      command: [influxd]

The following does not work

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
  - name: workflow
    dag:
      tasks:
      - name: sidecar-test
        inline:
          container:
            image: alpine:latest
            command: [sh, -c]
            args: ["
              apk update &&
              apk add curl &&
              until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
              for i in $(seq 1 20); 
                do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
                sleep .5 ; 
              done
            "]
          sidecars:
            - name: influxdb
              image: influxdb:1.2
              command: [influxd]

This does work, due to the workaround

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-example-
spec:
  entrypoint: workflow
  templates:
  - name: workflow
    dag:
      tasks:
      - name: sidecar-test
        inline:
          container:
            image: alpine:latest
            command: [sh, -c]
            args: ["
              apk update &&
              apk add curl &&
              until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
              for i in $(seq 1 20); 
                do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ; 
                sleep .5 ; 
              done
              && echo MTU= | base64 -d > /var/run/argo/ctr/influxdb/signal # WITHOUT THIS COMMAND, THE SIDECAR REMAINS RUNNING
            "]
          sidecars:
            - name: influxdb
              image: influxdb:1.2
              command: [influxd]

@caelan-io caelan-io added solution/workaround There's a workaround, might not be great, but exists P3 Low priority labels Jun 1, 2023
@caelan-io
Member

Thank you for posting updates on this issue and confirming the workaround works @bencompton @McPonolith

We have several other bug fixes ahead of this in the priority queue. If anyone has further suggestions for a solution, please comment and/or take this on in the meantime.

@juliev0 juliev0 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Jun 8, 2023
@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@agilgur5 agilgur5 removed the problem/more information needed Not enough information has been provided to diagnose this issue. label Oct 27, 2023
@terrytangyuan
Member

terrytangyuan commented Feb 18, 2024

Not stale. This is a real bug. Contributions welcomed!

@agilgur5 agilgur5 added the area/controller Controller issues, panics label May 11, 2024
@henrywangx
Contributor

henrywangx commented Sep 7, 2024

We are facing this issue as well.
It looks like the change that moved "Kill Sidecar" from a distributed approach to a centralized one may make it easier to hit this issue at scale:

https://github.com/argoproj/argo-workflows/pull/8475/files

@gmweaver

gmweaver commented Sep 15, 2024

I believe I have also encountered this issue with an auto-injected custom istio sidecar we use, call it custom-istio.

When I try the suggested solution:

sleep 10
echo MTU= | base64 -d > /var/run/argo/ctr/custom-istio/signal

I find that the /var/run/argo/ctr/custom-istio/ folder does not exist. Is this expected?

EDIT: Ah, injected means Argo isn't aware of the sidecar (it isn't explicit in the YAML), which means this workaround most likely will not work.
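If it helps to check which containers the executor actually manages, listing the signal directories from inside the main container is a quick test (a sketch, assuming the /var/run/argo/ctr/ volume is mounted into the main container, as it is for the workaround above):

ls /var/run/argo/ctr/   # expect one sub-directory per container argoexec wraps; an injected sidecar like custom-istio likely won't appear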
