Sidecars not terminated in a timely manner at scale when main container completes #10612
I discovered that having the main container write signal files for the sidecars with SIGTERM messages is a way to work around this issue. I updated the main container with the following code to mimic what the wait container does:
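(The snippet itself isn't preserved in this scrape; a minimal sketch of what it might look like, based on the workaround command shown later in this thread — MTU= is the base64 encoding of "15", i.e. SIGTERM, and influxdb is the sidecar name from the example workflow, so the path would need adjusting for other sidecars:)

# appended to the end of the main container's script, mimicking the wait container:
# write SIGTERM (15) into the emissary's per-sidecar signal file
echo MTU= | base64 -d > /var/run/argo/ctr/influxdb/signal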
With this workaround, 10 instances of the example workflow are finishing within a few minutes instead of 20m, with most pods having a duration of < 20s and a few having a duration slightly over 1m. It would appear that without this hack, something is preventing the wait container from writing the signal files in a timely manner after the main container completes. |
Thanks a lot for posting this issue w/ logs, a repro workflow, and even proposing a workaround for the time being. ❤️ We're doing some digging of our own into workflow hanging that might be related to this (see: #10491). We'll post an update on here if we run into a cause or fix. |
Thanks for the info @caelan-io! Hmm, I wonder if #10523 fixes this issue. |
If you weren't having this issue until 3.4.5, then that will likely fix it. If you're able to test that PR or master with your repro workflow, please let us know the results. |
Hey @bencompton - have you had a chance to test out if #10523 fixes this issue? If it does, we'll go ahead and close it and see when we can get another patch release out |
My team just updated to 3.4.7 and I re-tested. Unfortunately, I’m still seeing the same issue with the sidecars not terminating in a timely manner. In my team’s workflows, I saw pods with the main containers completing after 20m and continuing until hitting our 1h deadline while the sidecars failed to stop. When re-testing with the minimal reproduction above, I see the same results as before:
FYI: @JPZ13 @caelan-io @jmeridth |
Also running into this issue on 3.3.10, 3.4.0, 3.4.7, and 3.4.8. (I've run into multiple permission-related issues when installing from fresh, so I wonder if this is a similar issue.) @bencompton's workaround works, but is obviously not ideal (thanks a lot though!), and it lends credence to this being a permission issue. Are any of the maintainers running into this issue as well? It's very easily reproducible for me. I've noticed the pods are being created outside of the argo namespace. This created an issue for me when setting up an artifact repo, as the documentation created credentials in the argo namespace that weren't accessible, and there have been a few other similar issues. (I'm new to K8s, so this may be perfectly normal and the documentation simply wrong.) Though that wouldn't explain why it works outside of DAG/Steps.

This works, as there's no DAG or Steps:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: sidecar-
spec:
entrypoint: sidecar-example
templates:
- name: sidecar-example
container:
image: alpine:latest
command: [sh, -c]
args: ["
apk update &&
apk add curl &&
until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
for i in $(seq 1 20);
do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
sleep .5 ;
done
"]
sidecars:
- name: influxdb
image: influxdb:1.2
      command: [influxd]

The following does not work:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: workflow-example-
spec:
entrypoint: workflow
templates:
- name: workflow
dag:
tasks:
- name: sidecar-test
inline:
container:
image: alpine:latest
command: [sh, -c]
args: ["
apk update &&
apk add curl &&
until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
for i in $(seq 1 20);
do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
sleep .5 ;
done
"]
sidecars:
- name: influxdb
image: influxdb:1.2
command: [influxd]
This does work, due to the workaround:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: workflow-example-
spec:
entrypoint: workflow
templates:
- name: workflow
dag:
tasks:
- name: sidecar-test
inline:
container:
image: alpine:latest
command: [sh, -c]
args: ["
apk update &&
apk add curl &&
until curl -XPOST 'http://127.0.0.1:8086/query' --data-urlencode 'q=CREATE DATABASE mydb' ; do sleep .5; done &&
for i in $(seq 1 20);
do curl -XPOST 'http://127.0.0.1:8086/write?db=mydb' -d \"cpu,host=server01,region=uswest load=$i\" ;
sleep .5 ;
done
&& echo MTU= | base64 -d > /var/run/argo/ctr/influxdb/signal # WITHOUT THIS COMMAND, THE SIDECAR REMAINS RUNNING
"]
sidecars:
- name: influxdb
image: influxdb:1.2
command: [influxd]
|
Thank you for posting updates on this issue and confirming the workaround works, @bencompton @McPonolith. We have several other bug fixes ahead of this in the priority queue. If anyone has further suggestions for a solution, please comment and/or take this on in the meantime. |
Not stale. This is a real bug. Contributions welcomed! |
We face this issue also. |
I believe I have also encountered this issue with an auto-injected custom istio sidecar we use. When I try the suggested solution:
I get:

EDIT: Ah, injected == Argo not aware of the sidecar / not explicit in the YAML, which means that this workaround most likely will not work |
What happened/what you expected to happen?
What happened:
When running 10 instances of a workflow that spins up 200 parallel pods with sidecars (2000 total pods), some of the pods don't complete until several minutes after the main container completes (witnessed up to 30+ minute delay). Instead, the main and wait containers complete, but the sidecars continue running afterwards.
Expectation:
The sidecars should be terminated (or at least receive a terminate signal) within seconds after the main container completes.
Version
3.4.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
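(The reproduction workflow isn't included in this scrape. Below is a hedged sketch of the shape described in this report — many parallel pods, each with a main container that finishes quickly and an influxdb sidecar — not the reporter's original workflow; the names, images, and counts are illustrative:)

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-scale-repro-
spec:
  entrypoint: fan-out
  parallelism: 200
  templates:
  - name: fan-out
    steps:
    - - name: sidecar-test
        template: main-with-sidecar
        withSequence:
          count: "200"
  - name: main-with-sidecar
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10"]  # main completes in seconds; the sidecar should be signalled shortly after
    sidecars:
    - name: influxdb
      image: influxdb:1.2
      command: [influxd]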
Logs from the workflow controller
workflow-controller.log
Logs from in your workflow's wait container
Additional context
Environment: AWS EKS, running Karpenter with c6i.32xlarge instances.

When the issue occurs, pods look like this:
describe-pod.log
Notes
Can reproduce this issue by running a single instance of the workflow. I tested a single instance in a completely separate, smaller cluster and noted that some pods have a duration of < 1m while others run over 3m, with the main container usually completing within seconds and the sidecars still running for minutes afterwards.
Just had a single instance of this workflow take 21m when running in the original, larger cluster. These absurdly long runtimes seem to occur after running 10 concurrent instances (runtime is usually ~1m). The pods were all running within 30s, and the main containers were completing quickly.