Getting ResourceExhausted EOF when running argo wait for too long #12075
Comments
I believe this error is from the underlying gRPC library, but not really sure what's causing the error right now
What version did you update from? That would be very helpful in narrowing down a potential regression.
Could you paste an example workflow?
Any logs from argo server?
Before, we were on
Unfortunately I can't, but like I said, the workflow itself runs successfully (
I checked the logs but found nothing suspicious! They just showed the fetch calls to the server every few seconds to get the latest status, and then suddenly it fails :(
Facing a similar issue on version 3.4.9
This might not be something Argo can fix; might be a limitation of gRPC. Did some searching and something related I found was this Grafana forum post about increasing the gRPC message size. You can do that by setting
We started encountering this issue after upgrading Argo Workflows from 3.4.10 to 3.5.7 and argo cli from 3.3.8 to 3.5.7.
We were not encountering this issue at all for combination: Argo Workflows 3.4.10 + argo cli 3.3.8.
Your argo server may well be crashing on 3.5.7, see #13166. See if 3.5.8 helps with that.
@Joibel We're aware of the issue you mentioned, we had crashes initially after upgrading to 3.5.7 but solved it by increasing CPU and MEM limits (3 server replicas, went from 300m/256Mi to 2000m/1Gi). We haven't observed any crashes recently, however the issue with Though if you suspect 3.5.8 may help or if it would help you narrow down the issue, then we can give it a try and let you know.
If you're running the 3.5.7 server you should upgrade to 3.5.8; 3.5.7 has a memory corruption bug. What happens beyond crashing is uncertain: it could be fine, or it could be causing issues.
For the record, we have upgraded Argo CLI and server to 3.5.8 and are still observing this issue.
Pre-requisites
:latest
What happened/what you expected to happen?
Dear all,
We are running some rather big (> 50 min) workflows in our CI pipelines on GitHub, testing out some of our Argo-based ML training.
Only after upgrading Argo to v3.4.10 lately have we gotten a ResourceExhausted EOF error when waiting for the jobs to succeed:
level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
We figured that at some point after 50 mins the argo server would respond badly to our argo wait, which should essentially wait indefinitely until the workflow either succeeds or fails. We looked for the specific error response but didn't find anything. Finally we ended up running the argo wait in a try-except loop, which would end with the CI job receiving the timeout EOF error twice and only the third argo wait finally reaching the end of the job with the awaited status succeeded.
CC / @Depaccu
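The try-except loop mentioned in the report is not shown in the thread. A minimal sketch of what such a retry wrapper could look like in Python follows. The function name `wait_for_workflow`, its parameters, and the idea of matching on the error text are assumptions made for illustration, not the reporter's actual code; it only assumes that `argo wait <name>` exits non-zero and prints the error quoted above when the connection drops.

```python
import subprocess
import time

# Error text quoted in the report; matching on it is an assumption of this sketch.
TRANSIENT_MARKER = "code = ResourceExhausted desc = EOF"


def wait_for_workflow(name, max_attempts=5, delay_seconds=5.0, runner=subprocess.run):
    """Re-run `argo wait` while it fails with the ResourceExhausted EOF error.

    Returns the number of attempts used. `runner` is a hypothetical injection
    point so the loop can be exercised without a cluster.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(["argo", "wait", name], capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        output = (result.stdout or "") + (result.stderr or "")
        if TRANSIENT_MARKER not in output:
            # Any other failure is surfaced immediately instead of being retried.
            raise RuntimeError(f"argo wait failed: {output.strip()}")
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"argo wait still hit the EOF error after {max_attempts} attempts")
```

Calling `wait_for_workflow("my-workflow")` in the CI step would then mirror the behaviour described above: two failed waits followed by a successful third one return `3`. This only works around the symptom; it does not address why the server closes the stream.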
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
None
Logs from the workflow controller
Logs from your workflow's wait container