Getting ResourceExhausted EOF when running argo wait for too long #12075

Open
2 of 3 tasks
gutzbenj opened this issue Oct 24, 2023 · 15 comments
Labels
area/server · P3 (Low priority) · solution/workaround (There's a workaround, might not be great, but exists) · type/bug · type/regression (Regression from previous behavior, a specific type of bug)

Comments

@gutzbenj

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Dear all,

We are running some rather long (> 50 min) workflows in our GitHub CI pipelines to test some of our Argo-based ML training.

Only after recently upgrading Argo to v3.4.10 have we started getting a ResourceExhausted EOF error while waiting for the jobs to succeed: level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"

We figured out that at some point after roughly 50 minutes the Argo server responds with this error to our argo wait, which should essentially wait indefinitely until the workflow either succeeds or fails. We looked for the specific error response but didn't find anything. In the end we wrapped argo wait in a retry loop like this:

# submit argo workflow
workflow_name=$(
  argo submit \
    --from "workflows/$WORKFLOW_NAME" \
    -n namespaceAbc \
    -o name
)

# retry argo wait until the workflow completes or retries are exhausted
retries=3
for ((i=1; i<=retries; i++)); do
  echo "Waiting for workflow to complete (Attempt $i)..."
  # Run the argo wait command and capture the exit status
  exit_status=0
  argo wait "$workflow_name" -n namespaceAbc || exit_status=1
  # Check if the workflow was successful (exit status 0) or if we've reached the maximum retries
  if [ $exit_status -eq 0 ]; then
    echo "Workflow completed successfully."
    break  # Exit the loop
  elif [ "$i" -eq $retries ]; then
    echo "Maximum retries reached. Workflow did not complete successfully."
    exit 1  # Exit with an error code
  else
    echo "Workflow did not complete successfully. Retrying in 30 seconds..."
    sleep 30  # Wait for a period before the next retry
  fi
done

# sleep for 30 seconds to be sure that metadata is updated
echo "Sleeping for 30 seconds to be sure that metadata is updated"
sleep 30

# get workflow status from json
workflow_status=$(
  argo get "$workflow_name" \
    -n namespaceAbc \
    -o json | jq -r '.status.phase'
)

# if workflow status is not Succeeded, exit with 1
if [ "$workflow_status" != "Succeeded" ]; then
  echo "Workflow $workflow_name failed with status $workflow_status"
  exit 1
fi

With this setup, the CI job would typically receive the timeout EOF error twice, and only the third argo wait would finally run to the end of the job with the awaited status Succeeded.

CC @Depaccu

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

None

Logs from the workflow controller

None

Logs from your workflow's wait container

None
gutzbenj added the type/bug and type/regression (Regression from previous behavior, a specific type of bug) labels on Oct 24, 2023
agilgur5 added the solution/workaround (There's a workaround, might not be great, but exists), P3 (Low priority), and area/server labels on Oct 25, 2023
agilgur5 commented Oct 25, 2023

We looked for the specific error response but didn't find anything.

I believe this error comes from the underlying gRPC library, but I'm not really sure what's causing it right now.

Only after upgrading argo to v3.4.10

What version did you update from? That would be very helpful in narrowing down a potential regression

@terrytangyuan (Member)

Could you paste an example workflow?

@terrytangyuan (Member)

Any logs from argo server?

agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label on Oct 28, 2023

gutzbenj commented Nov 6, 2023

What version did you update from? That would be very helpful in narrowing down a potential regression

Before that we were on 3.4.4.


gutzbenj commented Nov 6, 2023

Could you paste an example workflow?

Unfortunately I can't, but like I said, the workflow itself runs successfully (Succeeded) and I think it's related to the messaging service of the server.


gutzbenj commented Nov 6, 2023

Any logs from argo server?

I checked the logs but found nothing suspicious! They just show the fetch calls to the server every few seconds to get the latest status, and then suddenly it fails :(

github-actions bot added the problem/stale (This has not had a response in some time) label on Jan 10, 2024
agilgur5 removed the problem/more information needed label on Jan 13, 2024
agilgur5 changed the title from "Getting an ResourceExhausted EOF when running argo wait for too long" to "Getting ResourceExhausted EOF when running argo wait for too long" on Feb 5, 2024
@shindeshubham10

Facing a similar issue on version 3.4.9.
Tried the latest version 3.5.4, but the issue still exists.

@agilgur5

This might not be something Argo can fix; it might be a limitation of gRPC. I did some searching, and something related I found was this Grafana forum post about increasing the gRPC message size. You can do that by setting GRPC_MESSAGE_SIZE on the Server.
You could also try configuring your client to use HTTP/1.
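
For example, something along these lines might help (untested sketch; it assumes argo-server runs as the argo-server Deployment in the argo namespace, the 200 MiB value is just an illustration, and you should check argo --help for the exact HTTP/1 flag or env var name on your CLI version):

# raise the server's gRPC max message size (value in bytes; 200 MiB as an example)
kubectl -n argo set env deployment/argo-server GRPC_MESSAGE_SIZE=209715200

# or have the CLI talk to the server over HTTP/1 instead of gRPC
ARGO_HTTP1=true argo wait "$workflow_name" -n namespaceAbc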

tomix86 commented Jun 21, 2024

We started encountering this issue after upgrading Argo Workflows from 3.4.10 to 3.5.7 and the argo CLI from 3.3.8 to 3.5.7.
It reproduces rather frequently for any workflow running > 1 h, and the error appears roughly every 30-50 minutes. We have worked around it by implementing a loop similar to the one mentioned in the issue description. For example:

time="2024-06-18T07:06:59 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
time="2024-06-18T07:43:08 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
time="2024-06-18T08:26:06 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
<workflow> Succeeded at 2024-06-18 08:56:07 +0000 UTC

We were not encountering this issue at all with the combination Argo Workflows 3.4.10 + argo CLI 3.3.8.

Joibel (Member) commented Jun 21, 2024

Your argo server may well be crashing on 3.5.7, see #13166. See if 3.5.8 helps with that.

tomix86 commented Jun 21, 2024

@Joibel We're aware of the issue you mentioned. We had crashes initially after upgrading to 3.5.7, but solved them by increasing the CPU and memory limits (3 server replicas, going from 300m/256Mi to 2000m/1Gi). We haven't observed any crashes recently; however, the issue with argo wait keeps occurring regularly.
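
For reference, one way to apply such limits, assuming the server runs as the argo-server Deployment in the argo namespace, is something like:

# bump the server's resource limits (deployment name and namespace assumed)
kubectl -n argo set resources deployment/argo-server --limits=cpu=2000m,memory=1Gi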

Though if you suspect 3.5.8 may help or if it would help you narrow down the issue, then we can give it a try and let you know.

Joibel (Member) commented Jun 21, 2024

If you're running the 3.5.7 server you should upgrade to 3.5.8; 3.5.7 has a memory corruption bug. What happens beyond crashing is uncertain: it could be fine, or it could be causing issues.

tomix86 commented Jun 27, 2024

For the record, we have upgraded Argo CLI and server to 3.5.8 and are still observing this issue.

agilgur5 removed the problem/stale label on Jul 20, 2024