During upgrade: Maximum recursion depth exceeded error when not using recursion #12162
Comments
Correct, that's a new feature in 3.5, not a bug. Implemented by #11646.
You'll need to provide an example workflow, which is missing from your issue. I can't reproduce this, and we've had many contributors and companies using 3.5+. In general, more details are almost always better than fewer details, which is why they are asked for in issue templates. |
There does appear to be an issue with this specifically in 3.5.1 (not in 3.5.0). Our CI is failing on it, and definitely uses fewer than 100 recursions. I'm working on something small and reproducible. |
I'm looking into it, workflows that reproduce it would be helpful though. |
Working with @tico24 here, we've not managed to reproduce this on the same installation with the same workflow. He'd disabled the feature as a workaround. @tico24 had updated to 3.5.1 from 3.5.0 this morning and it started happening almost immediately. He's left it so it 'should' fail again. I really need a runnable workflow @tooptoop4. Does restarting the controller alleviate the issue? |
Should we disable this feature by default instead? |
It's a security, HA, multi-tenancy, etc feature, so I do think it should be on by default. But we should ofc fix any bugs there may be |
We tried to upgrade to |
@guruprasathTT can you please provide a Workflow that reproduces this? We asked for one above a few times and without that, we wouldn't be able to find any possible bugs |
@agilgur5 |
This seems to be the case now with several upgrades to 3.5: in-flight upgrades seem to cause this to happen, but afterwards it is fine. |
I think the problem is that it's not actually recursion that is counted, but rather any "open" call. Edit: not true, a fan-out only increases concurrency by 1 |
I don't have a reproducible test case either, but we've been seeing the error too. I'm not aware that we make use of recursion in the failing workflows. Our workflows are dynamic but have fixed numbers of nodes. Usually they fan out to a fixed number of tasks based on configuration. Restarting the workflows seems to "fix" the issue, even though the number of steps remains the same. |
@biochimia, can you see if it reoccurs after a workflow-controller restart? All cases I've had reported have been on a fresh upgrade only, but a controller restart has made them go away. |
@roelarents - I didn't see your message before today. I'll try to reproduce that. |
Oh, yeah, I have to retract that statement. It's not true. We tried to reproduce it too, and a fan-out does only increase it by 1. Sorry. Still trying to find out what does cause it. I tried to sneak in an extra log line for the concurrency so far, but I haven't had time to look at this any further (just disabled the max recursion check). |
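For reference, the kind of plain fan-out we were testing with looks roughly like the sketch below. The names, image, and item count are illustrative and not taken from anyone's actual workflow; the point is that a withItems loop spawns its parallel children under a single parent step, which matches the observation above that a fan-out only increases the depth by 1.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fanout-example-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: fan-out                  # one step, expanded into 8 parallel children
            template: work
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems: [1, 2, 3, 4, 5, 6, 7, 8]
    - name: work
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["echo processing {{inputs.parameters.item}}"]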
I don't recommend anybody disabling the check permanently. We were still facing this issue in Argo 3.5.2. To debug the problem, I baked an image with the log change mentioned above, promoted its level from debug to info, and switched to that image in a cluster where we were having this issue. In the workflow status:
# ...
nodes:
# ...
rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083:
boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
children:
- rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
displayName: run-train
finishedAt: "2024-05-14T22:12:59Z"
id: rr-ml-k-module-xxx-74gmh-1-1379393644-4042243083
inputs:
parameters:
# ...
message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train
phase: Error
progress: 0/1
startedAt: "2024-05-03T05:30:54Z"
templateName: run-train
templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
type: Retry
rr-ml-k-module-xxx-74gmh-1-1379393644-794976774:
boundaryID: rr-ml-k-module-xxx-74gmh-1-1379393644-2396377327
children:
- rr-ml-k-module-xxx-74gmh-1-1379393644-2946699175
displayName: run-train(0)
finishedAt: "2024-05-14T22:12:57Z"
hostNodeName: ip-10-0-30-148.eu-west-1.compute.internal
id: rr-ml-k-module-xxx-74gmh-1-1379393644-794976774
inputs:
parameters:
# ...
message: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
name: rr-ml-k-module-xxx-74gmh-1-1379393644.exit-handler-1.for-loop-5(4...<LONG PARAM LIST>...).run-train(0)
phase: Error
progress: 0/1
startedAt: "2024-05-03T05:30:54Z"
templateName: run-train
templateScope: local/rr-ml-k-module-xxx-74gmh-1-1379393644
type: Pod

I noticed that we don't have the following expected part in the retried Pod step:

nodeFlag:
  retried: true

To repeat the problem, I prepared a minimal workflow like this:

# reproduce-max-recursion-depth-exceeded.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: reproduce-max-recursion-depth-exceeded
spec:
entrypoint: sleep-long-and-exit-peacefully-on-sigint
templates:
- name: sleep-long-and-exit-peacefully-on-sigint
container:
args:
- |
trap 'echo exiting 0; exit 0' INT
sleep 3600
command:
- ash
- -c
image: alpine:latest
name: sleep-and-exit
retryStrategy:
limit: 1

then:

kubectl apply -f reproduce-max-recursion-depth-exceeded.yaml
kubectl edit workflow reproduce-max-recursion-depth-exceeded # in status, delete the `nodeFlag` field
kubectl exec -it "$(kubectl get pods | grep reproduce-max-recursion-depth-exceeded | grep Running | awk '{print $1}')" -- ash -c 'kill -INT 1'

and voilà, the retry step errors with Maximum recursion depth exceeded.

Unfortunately I couldn't find a natural way yet to repeat the same issue without manually changing the status field, but I believe this is a case that is hit during a coinciding Argo workflow controller restart. |
@agilgur5, when we first faced these errors during the upgrade, we read the comments in this issue and applied the suggested workaround to overcome them. The recursion occurs in this call here, because it returns here in the processNodeRetries call. I believe this is more a controller issue than a problem with our templates. |
As far as I can tell, per your report and others, it only occurred during the restart for the upgrade, and not future restarts.
You didn't provide your Workflow, but that sounds like you did have a recursive template? Or are you saying there's some incorrect recursion in the Controller code, not your template? |
for us, that's not true, it was a repeating issue, not just in the first restart after the upgrade.
unfortunately I can't share the workflow as it is (it's a big production workflow generated by Kubeflow with sensitive parameter values inside), but I can assure you that we don't have anything recursive in our template, and you can check the simplified workflow manifest I shared above where I can repeat the recursion problem.
yes, exactly, and this is the line where the recursion problem occurs; it seems to be caused by the lack of the nodeFlag on retried nodes. What I can say is that we were hitting this issue in clusters where the Argo controller kept restarting continuously for different reasons; after stabilising the situation in our clusters, the restarts stopped and we stopped observing stuck retry steps. |
maybe this recursion problem was there in previous versions as well, but we were seeing it as a stuck retry step with Deadline exceeded messages, and the deadline was getting exceeded because the executeTemplate method was starting to call itself forever. In the new versions of Argo, the error message that we see just got converted into Max recursion depth exceeded instead, and people probably do not complain as openly when they see Deadline exceeded. In short, that's how things evolved on our side over time:
|
this may be relevant: i think the recursion is being calculated wrong. i have a fan-out node like nodeA with 8 nodes under it. so: node1 depends on nodeA, node2 depends on nodeA, node3 depends on nodeA, node4 depends on nodeA, node5 depends on nodeA, node6 depends on nodeA, node7 depends on nodeA, node8 depends on nodeA. nodes 1-8 don't depend on each other. in controller logs i see this:
which i believe comes from: argo-workflows/workflow/controller/operator.go Lines 1890 to 1891 in 0ca0c0f
argo-workflows/workflow/controller/operator.go Lines 1960 to 1963 in 0ca0c0f
argo-workflows/workflow/controller/operator.go Lines 2750 to 2800 in 0ca0c0f
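For reference, a DAG of the shape described above (eight tasks that all depend on a single fan-out task and not on each other) could be written roughly as follows. The names and templates are illustrative, not the reporter's actual workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-fanout-example-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: nodeA
            template: work
          - name: node1
            template: work
            depends: nodeA
          - name: node2
            template: work
            depends: nodeA
          # node3 through node8 follow the same pattern, each depending only on nodeA
    - name: work
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["echo done"]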
|
Please note I am able to reproduce this issue. This is now on my todo list. |
Was able to RCA this to the exact line of code where it happens; not sure how to fix it yet unfortunately, will follow up. Looking at the git blame right now, this seems to have been a bug from approximately 2018, but that doesn't make sense given that reports are coming in after 3.5.0, as @tico24 states. I can reproduce this and will git bisect through the different versions to see if I can still reproduce it. |
Just to confirm, are people only getting this issue with steps? If anyone is running into this and not using steps, could you please comment? My assumption right now is that this is a bug in the steps.go logic introduced after 3.5.0 and not in the retry logic. |
We don't use steps at Pipekit mate :) |
3.5 is where the recursion check was introduced. Prior to that infinite recursion would have had other symptoms. |
3.5.1 introduced a new nodeFlag to the node status block in #11839. @isubasinghe has proved the recursion is happening in retry nodes. I suspect the 3.5.1+ code is making incorrect assumptions on retry nodes that cause this recursion, because they don't have the Retried nodeFlag set. Workflows started in 3.5.0 or 3.4 and continued in 3.5.1+ with retry nodes are probably exhibiting this issue. |
Could #13211 (comment) be related?
|
Fixes #12162 (#13504) Signed-off-by: isubasinghe <[email protected]>
Shouldn't this be mentioned in the upgrading guide as a "breaking change" until 3.5.11 is released? We were able to migrate to 3.5.10 with existing workflows running after using DISABLE_MAX_RECURSION=true (although it's not safe, as the logs fill with repeating messages). |
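For anyone reaching for that workaround: DISABLE_MAX_RECURSION is an environment variable on the workflow controller. Below is a minimal sketch of the relevant fragment of the controller Deployment spec, trimmed to the env change and assuming the usual workflow-controller Deployment in the argo namespace (adjust names for your install, and note this only hides the symptom rather than fixing the bug):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: DISABLE_MAX_RECURSION   # turns off the maximum recursion depth check
              value: "true"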
not sure why the title was changed; it affects new workflows post-upgrade, unrelated to 'during upgrade', so this is still an issue |
This problem is about workflows running during an upgrade. If you have non-recursive workflows that you can start on 3.5.1 or later that erroneously produce this error message, please can you show me one that we can use to reproduce it? |
The recursion check was only released in 3.5, and Argo does not follow SemVer so minors can have breakage. |
I follow the not-following-SemVer reasoning, and I also accept that it's a bug (as I mentioned, until 3.5.11 is released). But I contest that it is a very hard to reproduce issue: if you have any 'old' workflow running (which, in a 24/7 CI use case in production, is hard to prevent without a maintenance window), you will be facing this bug unless you set DISABLE_MAX_RECURSION=true. It is not only for workflows started on 3.5.1 or later. We're upgraded now, but I believe that with this workflow running on e.g. a 3.4.8 controller, and then upgrading to 3.5.10, you'll hit the issue:
|
yeah, it's this upgrade that causes an infinite recursion in the controller; this will be fixed in the next 3.5 release. |
Eventually we found the issue causing this problem on our side; just wanted to share it for others who might still be hitting it on an old Argo version and finding themselves in this thread. We have an in-house kfp-operator which deletes the ownerReferences on Workflows. We were using:

workflow.SetOwnerReferences(nil)
controllerRuntimeClient.Update(ctx, workflow)

That update was ending up causing this problem. To fix it, we just started using a patch:

patch := client.RawPatch(types.MergePatchType, []byte(`{"metadata": {"ownerReferences": []}}`))
controllerRuntimeClient.Patch(ctx, workflowInstance, patch)

I hope it would be useful for people who were hitting this issue for a similar reason. |
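A plausible explanation for why the patch helps: the merge patch only touches metadata.ownerReferences, while an Update submits the client's whole, possibly stale, copy of the Workflow and can silently revert fields the controller has written in the meantime, such as the nodeFlag discussed above, which recreates exactly the missing-flag condition that triggers the recursion error.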
What happened/what you expected to happen?
never saw this error in 3.4.11
but after upgrading to 3.5.1
got
error in entry template execution: Maximum recursion depth exceeded. See https://argoproj.github.io/argo-workflows/scaling/#maximum-recursion-depth
on 2 different small (1 step) workflows

Version
3.5.1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
n/a
Logs from the workflow controller
Logs from in your workflow's wait container