Getting ResourceExhausted EOF when running argo wait for too long #12075

Open
2 of 3 tasks
gutzbenj opened this issue Oct 24, 2023 · 15 comments
Labels
area/server · P3 (Low priority) · solution/workaround (There's a workaround, might not be great, but exists) · type/bug · type/regression (Regression from previous behavior, a specific type of bug)

Comments

@gutzbenj

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Dear all,

We are running some rather long (> 50 min) workflows in our GitHub CI pipelines to test some of our Argo-based ML training.

Only after recently upgrading Argo to v3.4.10 have we started getting a ResourceExhausted EOF error while waiting for the jobs to succeed: level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"

We figured out that at some point after roughly 50 minutes the Argo server responds with this error to our argo wait, which should essentially wait indefinitely until the workflow either succeeds or fails. We looked for the specific error response but didn't find anything. In the end we wrapped argo wait in a retry loop like this:

# submit argo workflow
workflow_name=$(
  argo submit \
    --from "workflows/$WORKFLOW_NAME" \
    -n namespaceAbc \
    -o name
)

# retry argo wait until the workflow completes or retries are exhausted
retries=3
for ((i=1; i<=retries; i++)); do
  echo "Waiting for workflow to complete (Attempt $i)..."
  # Run the argo wait command and capture the exit status
  exit_status=0
  argo wait "$workflow_name" -n namespaceAbc || exit_status=1
  # Check if the workflow was successful (exit status 0) or if we've reached the maximum retries
  if [ $exit_status -eq 0 ]; then
    echo "Workflow completed successfully."
    break  # Exit the loop
  elif [ "$i" -eq $retries ]; then
    echo "Maximum retries reached. Workflow did not complete successfully."
    exit 1  # Exit with an error code
  else
    echo "Workflow did not complete successfully. Retrying in 30 seconds..."
    sleep 30  # Wait for a period before the next retry
  fi
done

# sleep for 30 seconds to be sure that metadata is updated
echo "Sleeping for 30 seconds to be sure that metadata is updated"
sleep 30

# get workflow status from json
workflow_status=$(
  argo get "$workflow_name" \
    -n namespaceAbc \
    -o json | jq -r '.status.phase'
)

# if workflow status is not Succeeded, exit with 1
if [ "$workflow_status" != "Succeeded" ]; then
  echo "Workflow $workflow_name failed with status $workflow_status"
  exit 1
fi

With this setup, the CI job would typically receive the timeout EOF error twice, and only the third argo wait would finally run to the end of the job with the awaited status Succeeded.

CC @Depaccu

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

None

Logs from the workflow controller

None

Logs from your workflow's wait container

None
gutzbenj added the type/bug and type/regression (Regression from previous behavior, a specific type of bug) labels on Oct 24, 2023
agilgur5 added the solution/workaround (There's a workaround, might not be great, but exists), P3 (Low priority), and area/server labels on Oct 25, 2023
agilgur5 commented Oct 25, 2023

We looked for the specific error response but didn't find anything.

I believe this error comes from the underlying gRPC library, but I'm not really sure what's causing it right now.

Only after upgrading argo to v3.4.10

What version did you update from? That would be very helpful in narrowing down a potential regression

@terrytangyuan (Member)

Could you paste an example workflow?

@terrytangyuan (Member)

Any logs from argo server?

agilgur5 added the problem/more information needed (Not enough information has been provided to diagnose this issue) label on Oct 28, 2023

gutzbenj commented Nov 6, 2023

What version did you update from? That would be very helpful in narrowing down a potential regression

Before that we were on 3.4.4.


gutzbenj commented Nov 6, 2023

Could you paste an example workflow?

Unfortunately I can't, but like I said, the workflow itself runs successfully (Succeeded) and I think it's related to the messaging service of the server.


gutzbenj commented Nov 6, 2023

Any logs from argo server?

I checked the logs but found nothing suspicious! They just show the fetch calls to the server every few seconds to get the latest status, and then suddenly it fails :(

github-actions bot added the problem/stale (This has not had a response in some time) label on Jan 10, 2024
agilgur5 removed the problem/more information needed label on Jan 13, 2024
agilgur5 changed the title from "Getting an ResourceExhausted EOF when running argo wait for too long" to "Getting ResourceExhausted EOF when running argo wait for too long" on Feb 5, 2024
@shindeshubham10

Facing a similar issue on version 3.4.9.
Tried the latest version 3.5.4, but the issue still exists.

@agilgur5

This might not be something Argo can fix; it might be a limitation of gRPC. I did some searching, and something related I found was this Grafana forum post about increasing the gRPC message size. You can do that by setting GRPC_MESSAGE_SIZE on the Server.
You could also try configuring your client to use HTTP/1.
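
For example, something along these lines might help (untested sketch; it assumes argo-server runs as the argo-server Deployment in the argo namespace, the 200 MiB value is just an illustration, and you should check argo --help for the exact HTTP/1 flag or env var name on your CLI version):

# raise the server's gRPC max message size (value in bytes; 200 MiB as an example)
kubectl -n argo set env deployment/argo-server GRPC_MESSAGE_SIZE=209715200

# or have the CLI talk to the server over HTTP/1 instead of gRPC
ARGO_HTTP1=true argo wait "$workflow_name" -n namespaceAbc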

tomix86 commented Jun 21, 2024

We started encountering this issue after upgrading Argo Workflows from 3.4.10 to 3.5.7 and the argo CLI from 3.3.8 to 3.5.7.
It reproduces rather frequently for any workflow running > 1 h, and the error appears roughly every 30-50 minutes. We have worked around it by implementing a loop similar to the one mentioned in the issue description. For example:

time="2024-06-18T07:06:59 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
time="2024-06-18T07:43:08 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
time="2024-06-18T08:26:06 UTC" level=fatal msg="rpc error: code = ResourceExhausted desc = EOF"
<workflow> Succeeded at 2024-06-18 08:56:07 +0000 UTC

We were not encountering this issue at all with the combination Argo Workflows 3.4.10 + argo CLI 3.3.8.

Joibel (Member) commented Jun 21, 2024

Your argo server may well be crashing on 3.5.7, see #13166. See if 3.5.8 helps with that.

tomix86 commented Jun 21, 2024

@Joibel We're aware of the issue you mentioned. We had crashes initially after upgrading to 3.5.7, but solved them by increasing the CPU and memory limits (3 server replicas, going from 300m/256Mi to 2000m/1Gi). We haven't observed any crashes recently; however, the issue with argo wait keeps occurring regularly.
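
For reference, one way to apply such limits, assuming the server runs as the argo-server Deployment in the argo namespace, is something like:

# bump the server's resource limits (deployment name and namespace assumed)
kubectl -n argo set resources deployment/argo-server --limits=cpu=2000m,memory=1Gi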

Though if you suspect 3.5.8 may help or if it would help you narrow down the issue, then we can give it a try and let you know.

Joibel (Member) commented Jun 21, 2024

If you're running the 3.5.7 server you should upgrade to 3.5.8; 3.5.7 has a memory corruption bug. What happens beyond crashing is uncertain: it could be fine, or it could be causing issues.

tomix86 commented Jun 27, 2024

For the record, we have upgraded Argo CLI and server to 3.5.8 and are still observing this issue.

agilgur5 removed the problem/stale label on Jul 20, 2024