signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing #13627
Comments
I explicitly left the warning in the logs, as I wonder if this could be the reason we're hitting this, so resolving it might solve our problem. But as I'm not sure, I still wanted to file the issue in case there is somehow a race between the controller and the wait container.
This might be related to #13454.
Looking at #13496 (comment), it seems that this must be related to the missing RBAC config according to Alex Peelman, but @jswxstw seems to have some doubts...
signaled container: wait error: unable to upgrade connection: container not found. node Succeeded but wf not progressing
The last timestamp in the wait container logs is:
On the controller side we see:
In the host kubelet log, I see the container died at that moment (afaiu).
So for some reason the container was not reachable at the moment the controller wanted to reach it. But I repeat, that is why I opened the issue: if this happens, a workflow can get stuck, but it might also just be because of the missing RBAC configuration.
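For reference, if the warning really does point at missing executor RBAC, the minimal permissions Argo documents for the executor (3.4+) are create/patch on workflowtaskresults. A minimal sketch, assuming a hypothetical namespace and service account name for the workflow pods:

```yaml
# Minimal executor RBAC sketch; namespace and service account names are
# hypothetical placeholders, not taken from this issue.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor
  namespace: my-namespace
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: executor-binding
  namespace: my-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: executor
subjects:
  - kind: ServiceAccount
    name: my-workflow-sa   # service account referenced by the workflow spec
    namespace: my-namespace
```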
The most common reason for a stuck workflow is an abnormal exit of the wait container, which is fixed by #13537 and #13491.
I don't think your issue is caused by missing RBAC configuration, since the issue should be consistently reproducible if it were.
This log shows the wait container is missing, so I want to confirm whether the controller can observe that the wait container exited abnormally, which may cause #13491 to not work.
Unfortunately, I don't have the status of the workflow, only the controller and wait log messages plus the kubelet output of the host running the pod. Nor do I have further info on the reason for the final removal of the pod (wf node). I only see that the pod (and so presumably the wait container) was still alive at the moment the controller hit that "unable to upgrade connection: container not found", for unknown reasons at the moment (API communication/network glitch/...). But the fact that it results in the workflow staying in the Running state is not ok imho.
I recommend upgrading to a newer version of Argo that contains the fixes above, which can solve most stuck-workflow issues.
You really should fix your RBAC. @heyleke I'm intrigued by your setup.
Hi @Joibel, this is on a private kubespray-provisioned cluster with K8s v1.23.7, running on top of kernel 5.4 with containerd://1.6.4 as the container runtime.
The 3.6.0-rc1 issue is potentially related to #13012, as that is an executor request through SPDY. It didn't make it into 3.5.x as it requires newer k8s, which might be vaguely related to:
IIRC the oldest k8s that 3.5 was tested against was 1.24, but you can double-check the history on that one.
That's why you have staging, dev, and local 😉
We have staging and dev, but it is such a rare issue that it has not been seen in those environments.
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened? What did you expect to happen?
The main container of a pod finished, the wait container uploaded the artifact, and the workflow controller moved the node to 'Succeeded', but the workflow didn't continue. Probably due to the wait container not being reachable.
The problem is hard to reproduce, but we keep logs of all workflows in Loki/Promtail, so those are available.
Reproduction was not done with ':latest' as that is not possible in the production environment at the moment.
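For illustration only (this is not the actual production workflow, which cannot be shared): a minimal sketch of the pattern described above, i.e. a template whose output artifact is uploaded by the wait container after the main container exits. Names and image are hypothetical.

```yaml
# Illustrative sketch only, not the real workflow from this report.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-upload-example-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c]
        args: ["echo hello > /tmp/out.txt"]   # main container finishes quickly
      outputs:
        artifacts:
          - name: out
            path: /tmp/out.txt                # uploaded by the wait container
```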
Version(s)
v3.5.10
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container