During upgrade: Maximum recursion depth exceeded error when not using recursion #12162

Closed · 2 of 3 tasks
tooptoop4 opened this issue Nov 7, 2023 · 36 comments · Fixed by #13504

Labels: area/controller (Controller issues, panics) · P2 (Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important) · solution/workaround (There's a workaround, might not be great, but exists) · type/bug · type/regression (Regression from previous behavior (a specific type of bug))

Comments

@tooptoop4
Contributor

tooptoop4 commented Nov 7, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

never saw this error in 3.4.11

but after upgrading to 3.5.1
got error in entry template execution: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth on 2 different small (1 step) workflows

Version

3.5.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@agilgur5

agilgur5 commented Nov 7, 2023

never saw this error in 3.4.11

Correct, that's a new feature in 3.5, not a bug. Implemented by #11646.
It's mentioned in the 3.5 blog post as well (search "recursion").

on 2 different small (1 step) workflows

You'll need to provide an example workflow, which is missing from your issue. I can't reproduce this, and we've had many contributors and companies using 3.5+.

In general, more details are almost always better than fewer details, which is why they are asked for in issue templates.
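
For context on what this new limit guards against: it trips on genuinely recursive templates, i.e. a template that invokes itself until a condition stops it. Below is a minimal illustrative sketch, loosely based on the documented coin-flip recursion pattern; the names and image are illustrative and not taken from this issue.

# recursion-example.yaml (illustrative only)
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: recursion-example-
spec:
  entrypoint: flip-until-heads
  templates:
  - name: flip-until-heads
    steps:
    - - name: flip
        template: flip-coin
    - - name: recurse
        template: flip-until-heads        # the template calls itself, growing the stack depth
        when: "{{steps.flip.outputs.result}} == tails"
  - name: flip-coin
    script:
      image: python:alpine3.9
      command: [python]
      source: |
        import random
        print("heads" if random.randint(0, 1) == 0 else "tails")

A single-step workflow with no such self-reference, like the ones reported here, should never approach the limit (the logs later in this thread show the error firing at currentStackDepth: 99).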

@agilgur5 agilgur5 added the area/controller (Controller issues, panics), type/support (User support issue - likely not a bug), and problem/more information needed (Not enough information has been provided to diagnose this issue.) labels, and removed the type/bug label Nov 7, 2023
@tico24
Member

tico24 commented Nov 13, 2023

There does appear to be an issue with this specifically in 3.5.1 (not in 3.5.0). Our CI is failing on it, and definitely uses fewer than 100 recursions.

I'm working on something small and reproducible.

@agilgur5 agilgur5 added the type/bug, P2 (Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important), type/regression (Regression from previous behavior (a specific type of bug)), and problem/more information needed (Not enough information has been provided to diagnose this issue.) labels, and removed the problem/more information needed and type/support (User support issue - likely not a bug) labels Nov 13, 2023
@Joibel Joibel self-assigned this Nov 13, 2023
@Joibel
Member

Joibel commented Nov 13, 2023

I'm looking into it, workflows that reproduce it would be helpful though.

@Joibel
Member

Joibel commented Nov 13, 2023

Working with @tico24 here, we've not managed to reproduce this on the same installation with the same workflow. He'd disabled the feature with DISABLE_MAX_RECURSION, but re-enabling it and re-running the same workloads doesn't cause the error.

@tico24 had updated from 3.5.0 to 3.5.1 this morning and it started happening almost immediately. He's left it as-is so it 'should' fail again.

I really need a runnable workflow @tooptoop4. Does restarting the controller alleviate the issue?
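
For reference, DISABLE_MAX_RECURSION is an environment variable on the workflow controller. A minimal sketch of how it could be set, assuming the default install layout used elsewhere in this thread (namespace argo, Deployment workflow-controller); this is only an illustrative fragment, not a full manifest:

# workflow-controller Deployment fragment (illustrative only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        env:
        - name: DISABLE_MAX_RECURSION   # turns off the max recursion depth check
          value: "true"

As noted later in the thread, disabling the check only trades the error for a stack overflow panic once the underlying bug is hit, so it is a stopgap rather than a fix.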

@terrytangyuan
Member

terrytangyuan commented Nov 13, 2023

Should we disable this feature by default instead?

@agilgur5

It's a security, HA, multi-tenancy, etc. feature, so I do think it should be on by default. But we should of course fix any bugs there may be.

@Joibel
Member

Joibel commented Nov 14, 2023

I agree with @agilgur5: it would be bad to try to hide the bugs this way; we should either have it as a working feature or get rid of it. It's still working fine for @tico24 at the moment.

@guruprasathTT

We tried to upgrade to v3.5.2 and faced this error in the controller logs. We are thinking of setting DISABLE_MAX_RECURSION and then upgrading to v3.5.2. Any suggestions or anything to be considered around this would be helpful. Thanks!

@agilgur5

@guruprasathTT can you please provide a Workflow that reproduces this? We asked for one above a few times, and without that we wouldn't be able to find any possible bugs.

@guruprasathTT

guruprasathTT commented Dec 21, 2023

@agilgur5
Hi, this doesn't happen with a particular WF or with Workflows using a particular WorkflowTemplate. Immediately after the upgrade, the workflow controller throws this error message for the running workflows (multiple workflows), but after that we don't see this error in the dev env for some time.

@Joibel
Member

Joibel commented Dec 21, 2023

This now seems to be the pattern across several upgrades to 3.5: in-flight upgrades seem to cause this to happen, but afterwards it is fine.

@roelarents
Contributor

roelarents commented Jan 3, 2024

I think the problem is that it's not actually recursion that is counted, but rather any "open" call to executeTemplate. So a workflow that doesn't recurse (much) but does fan out into a lot of parallel steps can also hit this limit. (At least this seems to be the case for us; we've disabled the feature for now.)

edit: not true, fan out only increases concurrency by 1


@github-actions github-actions bot added the problem/stale (This has not had a response in some time) label Jan 18, 2024
@biochimia

I don't have a reproducible test case either, but we've been seeing the Maximum recursion depth exceeded error randomly show up in various workflows after recently upgrading to 3.5.2.

I'm not aware that we make use of recursion in the failing workflows. Our workflows are dynamic but have fixed numbers of nodes. Usually they fan out to a fixed number of tasks based on configuration. Restarting the workflows seems to "fix" the issue, even though the number of steps remains the same.
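
For concreteness, the kind of non-recursive fan-out being described looks roughly like the sketch below: a fixed withItems loop over a DAG task, with no template referring back to itself. All names, the item list, and the image are made up for illustration.

# fanout-example.yaml (illustrative only)
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fanout-example-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: train
        template: run-train
        withItems: ["0", "1", "2", "3"]   # fixed fan-out; items run in parallel rather than nesting
        arguments:
          parameters:
          - name: shard
            value: "{{item}}"
  - name: run-train
    retryStrategy:
      limit: 1
    inputs:
      parameters:
      - name: shard
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo training shard {{inputs.parameters.shard}}"]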

@Joibel
Member

Joibel commented Jan 24, 2024

@biochimia, can you see if it reoccurs after a workflow-controller restart? All cases I've had reported have been on a fresh upgrade only, but a controller restart has made them go away.

@Joibel
Member

Joibel commented Jan 24, 2024

@roelarents - I didn't see your message before today. I'll try to reproduce that.

@roelarents
Contributor

Oh yeah, I have to retract that statement; it's not true. We tried to reproduce it too, and a fan-out only increases the depth by 1. Sorry. Still trying to find out what does cause it. So far I've only tried to sneak in an extra log line for the concurrency, but I haven't had time to look into this any further (we just disabled the max recursion check).

@github-actions github-actions bot removed the problem/stale (This has not had a response in some time) and problem/more information needed (Not enough information has been provided to diagnose this issue.) labels Jan 26, 2024
@agilgur5 agilgur5 changed the title from "error in entry template execution: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth" to "Maximum recursion depth exceeded error when not using recursion" Feb 5, 2024
@agilgur5 agilgur5 added the solution/workaround (There's a workaround, might not be great, but exists) label Feb 5, 2024
@ekesken

ekesken commented May 17, 2024

I don't recommend that anybody set DISABLE_MAX_RECURSION=true, because that will end up with stack overflow panics once you start hitting the problem in a workflow, and the Argo workflow controller will keep restarting until you delete the problematic workflow.

We were still facing this issue in Argo 3.5.2. To debug the problem, I baked an image with the log change mentioned above and promoted its level from debug to info. Then, in a cluster where we were having this issue, I switched to this image, setting DISABLE_MAX_RECURSION=false at the same time, and got the following logs:

# while it was `DISABLE_MAX_RECURSION=true`
2024-05-14T00:04:35.20947308Z,"time=""2024-05-14T00:04:35.209Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252261778Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252807048Z,"time=""2024-05-14T00:04:35.252Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252924701Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.252990251Z,"time=""2024-05-14T00:04:35.252Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253160097Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253304318Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253395075Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253462751Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T00:04:35.253543399Z,"time=""2024-05-14T00:04:35.253Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 00:04:35.253496244 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating the same pattern until we set `DISABLE_MAX_RECURSION` from true to false at 2024-05-14T22:12:57, then:
2024-05-14T22:12:57.868454091Z,"time=""2024-05-14T22:12:57.864Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.905766307Z,"time=""2024-05-14T22:12:57.905Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.90807108Z,"time=""2024-05-14T22:12:57.906Z"" level=info msg=""Pod failed: Error (exit code 1)"" displayName=""run-train(0)"" namespace=default pod=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 templateName=run-train workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912209745Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912220182Z,"time=""2024-05-14T22:12:57.912Z"" level=warning msg=""workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912294118Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node changed"" namespace=default new.message=""Error (exit code 1)"" new.phase=Failed new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 old.message= old.phase=Running old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912446978Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Workflow pod is missing"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" nodePhase=Running recentlyStarted=false workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912465037Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912484951Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: pod deleted"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912544446Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 finished: 2024-05-14 22:12:57.912498416 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912617332Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.912878499Z,"time=""2024-05-14T22:12:57.912Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1: template: *v1alpha1.DAGTask (exit-handler-1), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644, currentStackDepth: 1"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.913926903Z,"time=""2024-05-14T22:12:57.913Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:57.914737228Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:57.914752884Z,"time=""2024-05-14T22:12:57.914Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:57.920401028Z,"time=""2024-05-14T22:12:57.920Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(0...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.921250032Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(1...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922065485Z,"time=""2024-05-14T22:12:57.921Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(2...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.922882744Z,"time=""2024-05-14T22:12:57.922Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:57.926746957Z,"time=""2024-05-14T22:12:57.923Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.46761812Z,"time=""2024-05-14T22:12:58.453Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472900557Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472931908Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.472935313Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.47293766Z,"time=""2024-05-14T22:12:58.471Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-562365516 finished: 2024-05-14 22:12:58.471853015 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473096379Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473106159Z,"time=""2024-05-14T22:12:58.471Z"" level=error msg=""node is already fulfilled"" fromPhase=Failed namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" toPhase=Error workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473112838Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 phase Failed -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473115521Z,"time=""2024-05-14T22:12:58.472Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.473162192Z,"time=""2024-05-14T22:12:58.472Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:58.517350212Z,"time=""2024-05-14T22:12:58.517Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(3...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517624319Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517658723Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.517662915Z,"time=""2024-05-14T22:12:58.517Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2035740812 finished: 2024-05-14 22:12:58.517618731 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.52270363Z,"time=""2024-05-14T22:12:58.522Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...): template: *v1alpha1.DAGTask (for-loop-5), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619, currentStackDepth: 2"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.523771669Z,"time=""2024-05-14T22:12:58.523Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 3"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.864898413Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:58.865015582Z,"time=""2024-05-14T22:12:58.864Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 4"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.418885046Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Failed pod rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-794976774) creation: already exists"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.41905412Z,"time=""2024-05-14T22:12:59.418Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 5"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671258908Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train: template: *v1alpha1.DAGTask (run-train), boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327, currentStackDepth: 99"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.671361012Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671370981Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671377684Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671380425Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083 finished: 2024-05-14 22:12:59.671315687 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671445639Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671451251Z,"time=""2024-05-14T22:12:59.671Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-794976774 message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.671485433Z,"time=""2024-05-14T22:12:59.671Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
# ... repeating ~100 times
2024-05-14T22:12:59.693476433Z,"time=""2024-05-14T22:12:59.678Z"" level=error msg=""Mark error node"" error=""Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth"" namespace=default nodeName=""rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)"" workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693480732Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-794976774]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693483374Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.69348586Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327 finished: 2024-05-14 22:12:59.678392409 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693487846Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693489799Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2704183877 finished: 2024-05-14 22:12:59.678511785 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693492096Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175, taskName list-all-trainings"""
2024-05-14T22:12:59.693494139Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693496191Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789, taskName run-finalize"""
2024-05-14T22:12:59.693499875Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-3044613789 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693502123Z,"time=""2024-05-14T22:12:59.678Z"" level=warning msg=""was unable to obtain the node for rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345, taskName run-mlflow-log"""
2024-05-14T22:12:59.693504133Z,"time=""2024-05-14T22:12:59.678Z"" level=info msg=""Skipped node rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345 initialized Omitted (message: omitted: depends condition not met)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693507768Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693509795Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693511871Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644-2150580619 finished: 2024-05-14 22:12:59.679122842 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693584338Z,"time=""2024-05-14T22:12:59.679Z"" level=info msg=""Outbound nodes of rr-ml-k-module-xxx-74gmh-1-1379393644 set to [rr-ml-k-module-xxx-74gmh-1-1379393644-2682863345]"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693586783Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 phase Running -> Error"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693589171Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""node rr-ml-k-module-xxx-74gmh-1-1379393644 finished: 2024-05-14 22:12:59.689573024 +0000 UTC"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693591599Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693593733Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693595694Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693598475Z,"time=""2024-05-14T22:12:59.689Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693600994Z,"time=""2024-05-14T22:12:59.689Z"" level=warning msg=""Node was nil, will be initialized as type Skipped"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693604096Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693606295Z,"time=""2024-05-14T22:12:59.691Z"" level=info msg=""Retry node rr-ml-k-module-xxx-74gmh-1-1379393644-1646786213 initialized Running"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693608247Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""was unable to obtain node for , letting display name to be nodeName"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:12:59.693610282Z,"time=""2024-05-14T22:12:59.692Z"" level=info msg=""Pod node rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 initialized Pending"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.291666651Z,"time=""2024-05-14T22:13:00.291Z"" level=info msg=""Created pod: rr-ml-k-module-xxx-74gmh-1-1379393644.onExit(0) (rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756)"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.347960727Z,"time=""2024-05-14T22:13:00.347Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131354266 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.351118572Z,"time=""2024-05-14T22:13:00Z"" level=info msg=""Processing object (rr-ml-k-module-xxx-74gmh-1-1379393644): object has no owner."" Workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:00.358677615Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-3334131527/labelPodCompleted"
2024-05-14T22:13:00.358858972Z,"time=""2024-05-14T22:13:00.358Z"" level=info msg=""cleaning up pod"" action=labelPodCompleted key=default/rr-ml-k-module-xxx-74gmh-1-1379393644-794976774/labelPodCompleted"
2024-05-14T22:13:06.379389483Z,"time=""2024-05-14T22:13:06.377Z"" level=info msg=""Processing workflow"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441437157Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Task-result reconciliation"" namespace=default numObjs=0 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441726468Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""node changed"" namespace=default new.message=""Unschedulable: 0/33 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/role: mem-8xlarge}, 11 node(s) had untolerated taint {example.com/dedicated: cluster-management}, 12 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {example.com/dedicated: ml-mgmt}, 3 node(s) had untolerated taint {example.com/dedicated: airflow-jobs}, 4 node(s) had untolerated taint {example.com/dedicated: bd-mgmt}. preemption: 0/33 nodes are available: 33 Preemption is not helpful for scheduling.."" new.phase=Pending new.progress=0/1 nodeID=rr-ml-k-module-xxx-74gmh-1-1379393644-1649482756 old.message= old.phase=Pending old.progress=0/1 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.441856674Z,"time=""2024-05-14T22:13:06.441Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644: template: *v1alpha1.WorkflowStep (ml-k-module-xxx-ads-categorize-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44220717Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""TaskSet Reconciliation"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.44228502Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=reconcileAgentPod namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442342693Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Running OnExit handler: exit-handlers-train"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.442414045Z,"time=""2024-05-14T22:13:06.442Z"" level=info msg=""Evaluating node rr-ml-k-module-xxx-74gmh-1-1379393644.onExit: template: *v1alpha1.WorkflowStep (exit-handlers-train), boundaryID: , currentStackDepth: 0"" namespace=default workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"
2024-05-14T22:13:06.557979115Z,"time=""2024-05-14T22:13:06.557Z"" level=info msg=""Workflow update successful"" namespace=default phase=Running resourceVersion=2131355212 workflow=rr-ml-k-module-xxx-74gmh-1-1379393644"

In status.nodes, this was the state for one of the failing retry steps:

status:
  # ...
  nodes:
    # ...
    rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      displayName: run-train
      finishedAt: "2024-05-14T22:12:59Z"
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Retry
    rr-ml-k-module-xxx-74gmh-1-1379393644-794976774:
      boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
      children:
      - rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175
      displayName: run-train(0)
      finishedAt: "2024-05-14T22:12:57Z"
      hostNodeName: ip-10-0-30-148.eu-west-1.compute.internal
      id: rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
      inputs:
        parameters:
        # ...
      message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
      name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)
      phase: Error
      progress: 0/1
      startedAt: "2024-05-03T05:30:54Z"
      templateName: run-train
      templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
      type: Pod

I noticed that the retried Pod step was missing the following expected part:

      nodeFlag:
        retried: true
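
For clarity, that fragment would be expected on the Pod child shown above, i.e. the run-train(0) node of type Pod, roughly like this (abridged sketch; only the relevant fields are repeated):

    rr-ml-k-module-xxx-74gmh-1-1379393644-794976774:
      displayName: run-train(0)
      type: Pod
      nodeFlag:
        retried: true   # missing from the captured status above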

To reproduce the problem, I prepared a minimal workflow like this:

# reproduce-max-recursion-depth-exceeded.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: reproduce-max-recursion-depth-exceeded
spec:
  entrypoint: sleep-long-and-exit-peacefully-on-sigint
  templates:
  - name: sleep-long-and-exit-peacefully-on-sigint
    container:
      args:
      - |
        trap 'echo exiting 0; exit 0' INT
        sleep 3600
      command:
      - ash
      - -c
      image: alpine:latest
      name: sleep-and-exit
    retryStrategy:
      limit: 1

then:

kubectl apply -f reproduce-max-recursion-depth-exceeded.yaml
kubectl edit workflow reproduce-max-recursion-depth-exceeded # in status, delete the `nodeFlag` field
kubectl exec -it "$(kubectl get pods | grep reproduce-max-recursion-depth-exceeded | grep Running | awk '{print $1}')" -- ash -c 'kill -INT 1'

and voila:

[screenshot: the workflow failing with Maximum recursion depth exceeded]

Unfortunately I haven't yet found a natural way to reproduce the same issue without manually changing the status field, but I believe this is a case that is hit when it coincides with an Argo workflow controller restart.

@agilgur5 agilgur5 changed the title from "Maximum recursion depth exceeded error when not using recursion" to "During upgrade: Maximum recursion depth exceeded error when not using recursion" May 17, 2024
@ekesken

ekesken commented May 18, 2024

@agilgur5 you've added the During upgrade: prefix to the issue title, but for us it was a continuous issue, not something that happened only during the upgrade.

When we first faced these errors during the upgrade, we read the comments in this issue and applied DISABLE_MAX_RECURSION=true as recommended here, but that ended up causing more obscure issues: we began to observe workflows stuck in a retry step while the children were completed, like the following one:

[screenshot: a workflow stuck in a retry step while its children have completed]

To overcome those Deadline exceeded issues, we tried increasing MAX_OPERATION_TIME from 30s to 120s, but then we started observing restarts due to stack overflow panics in the controllers whenever this recursion bug was hit again.

The recursion occurs in this call here, because it returns here in the processNodeRetries call.

I believe this is more a During restart bug than a During upgrade one.

@agilgur5

I believe this is more a During restart bug than a During upgrade one.

As far as I can tell, per your report and others, it only occurred during the restart for the upgrade, and not future restarts.

but then we started observing restarts due to stack overflow panics in controllers whenever this recursion bug is hit again.

You didn't provide your Workflow, but that sounds like you did have a recursive template? Or are you saying there's some incorrect recursion in the Controller code, not your template?
The latter could be related to this issue, although I didn't really understand how the rest would be related to it, since you don't get the error in the title if you disable the check via the environment variable.

@ekesken

ekesken commented May 18, 2024

it only occurred during the restart for the upgrade, and not future restarts.

For us, that's not true; it was a recurring issue, not just in the first restart after the upgrade.

You didn't provide your Workflow, but that sounds like you did have a recursive template?

Unfortunately I can't share the workflow as-is; it's a big production workflow generated by Kubeflow with sensitive parameter values inside. But I can assure you that we don't have anything recursive in our template, and you can check the simplified workflow manifest I shared above, where I can reproduce the recursion problem.

Or are you saying there's some incorrect recursion in the Controller code, not your template?

Yes, exactly. This is the line where the recursion problem occurs, and it seems to be caused by the lack of a nodeFlag.retried: true mark in the Pod node status, but I'm still not sure how that happens.

What I can say is that we were hitting this issue in clusters where the Argo controller kept restarting continuously for different reasons. After stabilising the situation in our clusters, the restarts stopped and we stopped observing stuck-retry (Deadline exceeded) issues (or Max recursion depth exceeded issues if you have DISABLE_MAX_RECURSION=false).

The latter could be related to this issue, although I didn't really understand how the rest would be related to this issue? Since you don't get the error in the title if you disable the environment variable.

Maybe this recursion problem was there in previous versions as well, but we were seeing it as a stuck retry step with Deadline exceeded messages; the deadline was being exceeded because the executeTemplate method was calling itself forever. In the new versions of Argo, the error message we see just got converted into Max recursion depth exceeded instead.

And people probably don't complain openly when they see Deadline exceeded messages in a retry step, but they do get alerted when they see Max recursion depth exceeded messages while there is nothing recursive in the template.

In short, this is how things evolved on our side over time:

  • Users started complaining about Max recursion depth exceeded errors when the bug was hit.
  • We reacted with the DISABLE_MAX_RECURSION=true configuration (see the env sketch after this list).
  • Users started complaining about stuck retries with Deadline exceeded messages when the bug was hit.
  • We reacted with the MAX_OPERATION_TIME=120s configuration.
  • Our e2e tests started failing with timeouts due to stack overflow panics in the Argo workflow controllers when the bug was hit.
  • We reacted by removing both configurations to re-enable the recursion safety check.
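
The two mitigations from that timeline, expressed as workflow-controller environment variables (illustrative fragment only; per the last step above, both were removed again):

# controller env fragment (illustrative only)
env:
- name: DISABLE_MAX_RECURSION   # turns off the recursion depth check
  value: "true"
- name: MAX_OPERATION_TIME      # raised from the 30s default, per the comment above
  value: "120s"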

@tooptoop4
Contributor Author

This may be relevant: I think the recursion depth is being calculated wrong. I have a fan-out node, nodeA, with 8 nodes under it, so: node1 depends on nodeA, node2 depends on nodeA, node3 depends on nodeA, node4 depends on nodeA, node5 depends on nodeA, node6 depends on nodeA, node7 depends on nodeA, node8 depends on nodeA. Nodes 1-8 don't depend on each other. In the controller logs I see this:

"log":"time=\"2024-06-14T00:37:07.013Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.014Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.016Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.017Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.019Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.020Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.021Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:07.024Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.022Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.023Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.025Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.026Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.027Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.029Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.030Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.031Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:17.033Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.072Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.073Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.075Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.076Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.077Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.079Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.080Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.082Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:37:27.083Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.680Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.682Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.683Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.685Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.687Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:09.690Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.686Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.688Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.689Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.691Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.692Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.693Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.695Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:38:19.696Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.888Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.889Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.890Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.892Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.893Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.894Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:02.896Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.900Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.902Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.903Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.904Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.906Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.907Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:12.908Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.096Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.098Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.100Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.101Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:39:54.104Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.114Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.115Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.116Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.118Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.119Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:04.120Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.306Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.307Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.309Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.310Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:44.312Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.332Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.335Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.336Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:40:54.338Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.646Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.649Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.652Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:41:47.655Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.913Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.914Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:41.916Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.930Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.931Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:42:51.933Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.102Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:33.103Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.126Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:43:43.127Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:22.333Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"
"log":"time=\"2024-06-14T00:44:32.355Z\" level=info msg=\"template (node redactwf-3732665551) active children parallelism exceeded 1\" namespace=ns workflow=redactwf"

which I believe comes from:

	woc.currentStackDepth++
	defer func() { woc.currentStackDepth-- }()

	// Check if we exceeded template or workflow parallelism and immediately return if we did
	if err := woc.checkParallelism(processedTmpl, node, opts.boundaryID); err != nil {
		return node, err
	}

func (woc *wfOperationCtx) checkParallelism(tmpl *wfv1.Template, node *wfv1.NodeStatus, boundaryID string) error {
	if woc.execWf.Spec.Parallelism != nil && woc.activePods >= *woc.execWf.Spec.Parallelism {
		woc.log.Infof("workflow active pod spec parallelism reached %d/%d", woc.activePods, *woc.execWf.Spec.Parallelism)
		return ErrParallelismReached
	}
	// If we are a DAG or Steps template, check if we have active pods or unsuccessful children
	if node != nil && (tmpl.GetType() == wfv1.TemplateTypeDAG || tmpl.GetType() == wfv1.TemplateTypeSteps) {
		// Check failFast
		if tmpl.IsFailFast() && woc.getUnsuccessfulChildren(node.ID) > 0 {
			woc.markNodePhase(node.Name, wfv1.NodeFailed, "template has failed or errored children and failFast enabled")
			return ErrParallelismReached
		}
		// Check parallelism
		if tmpl.HasParallelism() && woc.getActivePods(node.ID) >= *tmpl.Parallelism {
			woc.log.Infof("template (node %s) active children parallelism exceeded %d", node.ID, *tmpl.Parallelism)
			return ErrParallelismReached
		}
	}
	// if we are about to execute a pod, make sure our parent hasn't reached it's limit
	if boundaryID != "" && (node == nil || (node.Phase != wfv1.NodePending && node.Phase != wfv1.NodeRunning)) {
		boundaryNode, err := woc.wf.Status.Nodes.Get(boundaryID)
		if err != nil {
			return err
		}
		boundaryTemplate, templateStored, err := woc.GetTemplateByBoundaryID(boundaryID)
		if err != nil {
			return err
		}
		// A new template was stored during resolution, persist it
		if templateStored {
			woc.updated = true
		}
		// Check failFast
		if boundaryTemplate.IsFailFast() && woc.getUnsuccessfulChildren(boundaryID) > 0 {
			woc.markNodePhase(boundaryNode.Name, wfv1.NodeFailed, "template has failed or errored children and failFast enabled")
			return ErrParallelismReached
		}
		// Check parallelism
		if boundaryTemplate.HasParallelism() && woc.getActiveChildren(boundaryID) >= *boundaryTemplate.Parallelism {
			woc.log.Infof("template (node %s) active children parallelism exceeded %d", boundaryID, *boundaryTemplate.Parallelism)
			return ErrParallelismReached
		}
	}
	return nil
}
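
For reference, the depth guard that produces the "Maximum recursion depth exceeded" error works roughly like the minimal sketch below (the names and signatures here are assumptions for illustration, not the actual controller code): every call into template execution bumps a counter, and once the counter crosses the limit the operator returns the error instead of recursing further.

package main

import (
	"errors"
	"fmt"
)

// Illustrative sketch only: maxStackDepth and executeTemplate are assumed names.
const maxStackDepth = 100

var errMaxDepthExceeded = errors.New("Maximum recursion depth exceeded")

type operationCtx struct {
	currentStackDepth int
}

// executeTemplate bumps the depth counter on entry and refuses to recurse past the limit.
func (oc *operationCtx) executeTemplate(remaining int) error {
	if oc.currentStackDepth >= maxStackDepth {
		return errMaxDepthExceeded
	}
	oc.currentStackDepth++
	defer func() { oc.currentStackDepth-- }()

	// ...resolve the template, run checkParallelism, then recurse into child templates...
	if remaining > 0 {
		return oc.executeTemplate(remaining - 1)
	}
	return nil
}

func main() {
	oc := &operationCtx{}
	fmt.Println(oc.executeTemplate(150)) // deeper than the limit -> error
	fmt.Println(oc.executeTemplate(10))  // within the limit -> <nil>
}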

@isubasinghe
Copy link
Member

Please note that DISABLE_MAX_RECURSION=true is not a safe solution for this problem.
I ran into a stack overflow when I came across this; I have also made some progress on RCA-ing it.

I am able to reproduce this issue. It is now on my todo list.

@isubasinghe isubasinghe self-assigned this Aug 22, 2024
@isubasinghe
Copy link
Member

I was able to RCA this down to the exact line of code where it happens; not sure how to fix it yet unfortunately, will follow up.

Looking at the git blame right now, this seems to have been a bug since approximately 2018, but that doesn't make sense given that reports only started coming in after 3.5.0, as @tico24 states.

I can reproduce this, so I will git bisect through the different versions and see if I can still reproduce it.

@isubasinghe
Copy link
Member

isubasinghe commented Aug 22, 2024

Just to confirm, are people only getting this issue with steps? If anyone is running into this and not using steps, could you please comment?

My assumption right now is that this is a bug in the steps.go logic introduced after 3.5.0 and not in the retry logic.

@tico24
Copy link
Member

tico24 commented Aug 22, 2024

We don't use steps at Pipekit mate :)

@Joibel
Copy link
Member

Joibel commented Aug 22, 2024

3.5 is where the recursion check was introduced. Prior to that infinite recursion would have had other symptoms.

@Joibel
Copy link
Member

Joibel commented Aug 22, 2024

3.5.1 introduced the new nodeFlag field in the node status block in #11839.

@isubasinghe has proved the recursion is happening in retry nodes. I suspect the 3.5.1+ code is making incorrect assumptions on retry nodes that cause this recursion, because they don't have the Retried nodeFlag set. Workflows started in 3.5.0 or 3.4 and continued in 3.5.1+ with retry nodes are probably exhibiting this issue.
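
For illustration, an abridged, hand-written status fragment (not captured from a real cluster) of the kind of difference being described: a retry child recorded before 3.5.1 has no nodeFlag at all, while one created by 3.5.1+ carries it:

status:
  nodes:
    my-wf-1234:        # retry child created on 3.4 / 3.5.0 - no nodeFlag field
      id: my-wf-1234
      type: Pod
      phase: Succeeded
    my-wf-5678:        # retry child created on 3.5.1+ - flagged as retried
      id: my-wf-5678
      type: Pod
      phase: Succeeded
      nodeFlag:
        retried: true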

@tooptoop4
Copy link
Contributor Author

tooptoop4 commented Aug 22, 2024 via email

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 28, 2024
Joibel pushed a commit that referenced this issue Sep 10, 2024
@heyleke
Copy link
Contributor

heyleke commented Sep 11, 2024

Shouldn't this be mentioned in the upgrading guide as a "breaking change" until 3.5.11 is released? We were able to migrate to 3.5.10 with existing workflows running after setting DISABLE_MAX_RECURSION=true (although it's not safe, as the logs fill up with repeating messages).
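
For reference, the workaround is an environment variable on the workflow controller; a minimal way to set it, assuming the default deployment and container names from the upstream install manifests, is a patch like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: DISABLE_MAX_RECURSION
              value: "true"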

@tooptoop4
Copy link
Contributor Author

Not sure why the title was changed; it affects new workflows post-upgrade, unrelated to 'during upgrade', so this is still an issue.

@Joibel
Copy link
Member

Joibel commented Sep 16, 2024

This problem is about workflows running during an upgrade.

If you have non-recursive workflows started on 3.5.1 or later that erroneously produce this error message, please can you show me one that we can use to reproduce it?

@agilgur5
Copy link

Shouldn't this be mentioned in the upgrading guide as "breaking change" until 3.5.11 is released?

The recursion check was only released in 3.5, and Argo does not follow SemVer so minors can have breakage.
This issue is a bug with that check that affects a small percentage of Workflows (only 4 upvotes, very hard to reproduce). A bug is not a breaking change; that's exactly what patch releases are for.

@heyleke
Copy link
Contributor

heyleke commented Sep 17, 2024

Shouldn't this be mentioned in the upgrading guide as "breaking change" until 3.5.11 is released?

The recursion check was only released in 3.5, and Argo does not follow SemVer so minors can have breakage. This issue is a bug with that check that affects a small percentage of Workflows (only 4 upvotes, very hard to reproduce). A bug is not a breaking change; that's exactly what patch releases are for.

I follow the not-following-SemVer reasoning, and I agree it's a bug (I did say "until 3.5.11 is released"). But I contest that it is very hard to reproduce: if you have any 'old' workflow running (which, in a 24/7 CI use case in production, is hard to prevent without a maintenance window), you will hit this bug unless you set DISABLE_MAX_RECURSION=true. It is not limited to 3.5.1 or later.

We've upgraded now, but I believe that with this workflow running on e.g. a 3.4.8 controller, and then upgrading to 3.5.10, you'll hit the issue:
(disclaimer: this is a test workflow we use to exercise the hostNodeName feature; I adapted it and added a longer sleep)

# This example demonstrates the ability to pass and return
# parameters and artifacts into and out of nested workflows
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nested-workflow-
  name: nested-wf-test
spec:
  entrypoint: nested-workflow-example
  templates:
  - name: nested-workflow-example
    steps:
    - - name: runtb
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: ""

    - - name: checkinput
        template: checkinput
        when: "{{steps.runtb.status}} != Succeeded"
        arguments:
          parameters:
            - { name: excluded_node, value: "{{steps.runtb.hostNodeName}}" }

    - - name: retryrun
        template: sleepabit
        continueOn:
          failed: true
        arguments:
          parameters:
          - name: excluded_node
            value: "{{steps.checkinput.outputs.result}}"

  - name: checkinput
    inputs:
      parameters:
      - name: excluded_node
    script:
      image: python:alpine3.7
      command: [ python ]
      source: |
        if ("steps.runtb.outputs.parameters.nodename" in "{{inputs.parameters.excluded_node}}"):
          print("blablabla")
        else:
          print("{{inputs.parameters.excluded_node}}")


  # container template which consumes an input parameter and artifact
  - name: sleepabit
    inputs:
      parameters:
      - name: excluded_node
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
              - '{{inputs.parameters.excluded_node}}'
    timeout: "120s"
    container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["echo $NODE_NAME > nodname.txt && sleep 90 && false"]
      env:
       - name: NODE_NAME
         valueFrom:
           fieldRef:
             fieldPath: spec.nodeName
    outputs:
      parameters:
      - name: nodename
        globalName: nodename
        default: "none"
        valueFrom:
          path: nodename.txt
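
To reproduce with this workflow, the sequence is roughly the following (illustrative commands; they assume the argo CLI, an install in the argo namespace, and the upstream controller image - adjust names and tags to your setup):

# On a 3.4.x controller, start the workflow and let it reach the long sleep
argo submit -n argo nested-wf-test.yaml

# While the first step is still running, upgrade only the controller
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:v3.5.10

# Watch for the error once the new controller reconciles the in-flight workflow
argo watch -n argo nested-wf-test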

@isubasinghe
Copy link
Member

isubasinghe commented Sep 17, 2024

We're upgraded now, but I believe with this wf running on e.g. 3.4.8 controller, and then upgrade to 3.5.10, you'll hit the issue:

Yeah, it's this upgrade that causes an infinite recursion in the controller; this will be fixed in the next 3.5 release.

Joibel pushed a commit that referenced this issue Sep 20, 2024
@ekesken
Copy link

ekesken commented Oct 31, 2024

Unfortunately I couldn't find a natural way yet to repeat the same issue without manually changing the status field

Eventually we found the issue causing this problem in our side, just wanted to share it for others who might be still hitting it using an old Argo version, finding themselves in this thread.

We have an in-house kfp-operator which deletes the ownerReferences of workflow manifests to detach them from the parent ScheduledWorkflow, so that ongoing workflow runs are not interrupted by a delete or update (delete+create) operation.

We were using the Update call of the controller-runtime client like this:

workflow.SetOwnerReferences(nil)
controllerRuntimeClient.Update(ctx, workflow)

That ended up writing nodeFlag: {} fields into the existing status.nodes, which in turn produced the Maximum recursion depth exceeded errors on the Argo side.

To fix it, we just started using the Patch call:

patch := client.RawPatch(types.MergePatchType, []byte(`{"metadata": {"ownerReferences": []}}`))
controllerRuntimeClient.Patch(ctx, workflowInstance, patch)

I hope this is useful for people who are hitting this issue for a similar reason.
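
For completeness, a slightly fuller sketch of the Patch-based approach (assuming the controller-runtime client and the upstream Argo Workflows API types; this is not taken from the actual kfp-operator):

package workflowutil

import (
	"context"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// detachOwnerReferences removes the ownerReferences with a merge patch, so only
// metadata is sent to the API server and the stored status (including any
// nodeFlag fields the controller wrote) is left untouched.
func detachOwnerReferences(ctx context.Context, c client.Client, wf *wfv1.Workflow) error {
	patch := client.RawPatch(types.MergePatchType, []byte(`{"metadata": {"ownerReferences": []}}`))
	return c.Patch(ctx, wf, patch)
}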
