refactor: change the logic of deleting pods during retry. Fixes: #12538 #12734
base: main
Conversation
Force-pushed from 65e080f to 57c1676
workflow/util/util.go (Outdated)
@@ -827,6 +827,29 @@ func resetConnectedParentGroupNodes(oldWF *wfv1.Workflow, newWF *wfv1.Workflow,
	return newWF, resetParentGroupNodes
}

// RetryWorkflow retries a workflow by setting spec.retry; the controller will then retry it.
func RetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string, parameters []string) (*wfv1.Workflow, error) {
We could use a better name here; retryWorkflow can be confusing. The first name that comes to mind is markWorkflowForRetry, though there may be a better term.
Changed, Thanks!
workflow/util/util.go (Outdated)
	Parameters: parameters,
}

delete(wf.Labels, common.LabelKeyCompleted)
Deleting the label looks like formulation of the workflow; I think we can let formulateRetryWorkflow handle this instead, so that the server only:
- checks the client request for errors
- and updates the workflow for retry
If we delete it here, reconciliation may not be triggered. Do you think changing the reconciliationNeeded function is better than deleting the label here? If so, I will modify reconciliationNeeded:
func reconciliationNeeded(wf metav1.Object) bool {
return wf.GetLabels()[common.LabelKeyCompleted] != "true" || slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC)
}
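For context, a minimal sketch of that change (the retry label key here is hypothetical, not an existing Argo label):

func reconciliationNeeded(wf metav1.Object) bool {
	if wf.GetLabels()[common.LabelKeyCompleted] != "true" {
		return true
	}
	if slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC) {
		return true
	}
	// hypothetical label the server would set when marking a workflow for retry
	return wf.GetLabels()["workflows.argoproj.io/workflow-needs-retry"] == "true"
}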
workflow/util/util.go (Outdated)
switch wf.Status.Phase {
case wfv1.WorkflowFailed, wfv1.WorkflowError:
case wfv1.WorkflowSucceeded:
	if !(restartSuccessful && len(nodeFieldSelector) > 0) {
		return nil, errors.Errorf(errors.CodeBadRequest, "To retry a succeeded workflow, set the options restartSuccessful and nodeFieldSelector")
	}
default:
	return nil, errors.Errorf(errors.CodeBadRequest, "Cannot retry a workflow in phase %s", wf.Status.Phase)
}
I think we can remove this from formulateRetryWorkflow, since it is already handled.
Yes, removed.
@@ -286,19 +285,11 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
	_, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Get(ctx, wf.Name, metav1.GetOptions{})
	if apierr.IsNotFound(err) {

-	wf, podsToDelete, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
+	wf, _, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
This should use the new util function, right? Just like the workflow_server, so that the server only "labels" the workflow for retry and the actual formulation and processing is done by the controller.
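A rough sketch of what that could look like, assuming the markWorkflowForRetry helper discussed above (names illustrative, not the final API):

wf, err := util.MarkWorkflowForRetry(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
if err != nil {
	return nil, err
}
// recreate or update the workflow as before; the controller then picks up the
// retry marker and performs the actual formulation and pod deletion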
workflow/controller/operator.go (Outdated)
	return false
}
if woc.IsRetried() {
	if woc.controller.podCleanupQueue.Len() == 0 {
What is the purpose of "if woc.controller.podCleanupQueue.Len() == 0" here? Could the server set retry to true and then formulateRetryWorkflow set retry to false, so that we only check whether retry exists and is true?
I just want to confirm that all the pods have been cleaned up. If an "already exists" error is not reported when creating a new pod at this point, this check can be removed.
Do you mean something like: make sure all pods are deleted before processing the retried workflow?
Yes, this is what I want.
I think maybe we should check the node statuses instead. Other workflows would also queue items onto the podCleanupQueue, which could cause issues.
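One possible shape for such a check, sketched under the assumption that we verify directly against the cluster rather than the shared queue (the helper name is made up):

func (woc *wfOperationCtx) allRetryPodsDeleted(ctx context.Context) (bool, error) {
	pods, err := woc.controller.kubeclientset.CoreV1().Pods(woc.wf.Namespace).List(ctx, metav1.ListOptions{
		LabelSelector: common.LabelKeyWorkflow + "=" + woc.wf.Name,
	})
	if err != nil {
		return false, err
	}
	// only proceed with the retry once no pods belonging to this workflow remain
	return len(pods.Items) == 0, nil
}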
Thank you for your reply. I would like to confirm one question: do you think it is reasonable to pass the retry parameters through the spec? This increases the amount of code changes, but at the time I thought it was not appropriate to pass them via a label alone, so I put them in the spec, following the example of suspend.
Force-pushed from 2883edc to bbb619a
Adding a spec field is just one way; other good approaches are also possible, for example passing a label to trigger the retry. If you think it's better to pass a label, I will change it.
Hi @agilgur5, can you give me your vote so that I can better resolve these comments? Thanks.
Hey sorry, I've had a bit too much on my plate recently. (can you tell by how late I've been up in EDT? 🙃)
So I mentioned using a label in #12538 as we pick up on various labels already in the Controller, so that would be consistent. It would also make it possible for a user to just add a label and not need to go through the Server or CLI (such as in #12027 (comment)), which would also make for a consistent UX.
In #12624 (review), I also suggested against spec changes as those are, well, part of the spec, and so become (much) harder to change.
In particular, per #6490 (comment), we may want to create (optional?) CRs for imperative actions like retry and resubmit. Per #6490 (comment), I also discussed that in SIG Security during two meetings, and they were supportive of that and apparently Argo CD had been thinking of implementing something similar to that.
With that in mind, I wouldn't want us to implement a spec field that will change and that we'd then have trouble removing due to breaking spec changes.
Both annotation and spec make sense in their own way:
- annotation: directives from the end-user to the implementations to modify behavior or engage non-standard features.
- spec: a complete description of the desired state, including configuration settings provided by the user.
I would vote for spec, thinking of it as: when we retry, we update the desired state of the workflow, and then the controller works towards that desired state.
Hi @agilgur5, what are your thoughts on this?
Thanks for summarizing @tczhao.
So a retry is an imperative action, not a declarative state. So "desired state" is not really accurate to describe a retry -- it doesn't describe state at all, it describes an action. It also only affects the status and not the actual Workflow spec.
"non-standard feature" could be considered as a roughly accurate description of how retries currently work, as they are not supported via k8s directly, only via the Argo API. And as they are imperative, that is also non-standard in k8s.
A "directive from the end-user [...] to modify behavior" could be considered indicative of an action and it does modify behavior; it tells the Workflow to reprocess some nodes.
Perhaps an example would be good -- you wouldn't specify a retry within your spec when you submit your Workflow initially, as it's something you do manually after the fact. Whereas you do specify a retryStrategy beforehand for automatic retries. Also, retryConfig needing to be disambiguated even more from retryStrategy would, I think, add confusion (it's a bit confusing even now, but at least it is relatively self-evident that one is manual whereas the other is automatic).
They are not entirely mutually exclusive definitions though, there is certainly some overlap.
implementation details
When I first looked at this implementation (it's been a month since then), my bigger question was actually, "can this even be properly implemented in a label or annotation?" as it's a primitive string -- with a limited length no less -- and not a more complex datatype. parameters in particular seem a bit non-trivial to marshal.
Although retrying with new parameters actually seems to break the declarative state in general if I'm not mistaken -- and so may be an anti-pattern in and of itself. For instance, you could have some nodes with one set of parameters and other nodes with a different set, while there is only one set of parameters in the actual Workflow spec. EDIT: This appears to have been mentioned in #9141 (comment) as well.
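To illustrate the marshalling concern, a hypothetical sketch (the annotation key is made up; annotation values are plain strings, so structured data would have to be JSON-encoded and would still be subject to metadata size limits):

func setRetryParametersAnnotation(wf *wfv1.Workflow, parameters []string) error {
	data, err := json.Marshal(parameters)
	if err != nil {
		return err
	}
	if wf.Annotations == nil {
		wf.Annotations = map[string]string{}
	}
	// hypothetical key, not an existing Argo annotation
	wf.Annotations["workflows.argoproj.io/retry-parameters"] = string(data)
	return nil
}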
Force-pushed from 2c0587d to d6ac764
@agilgur5 Thank you for your reply. I have pushed a modified version. Can you help me take a look?
		assert.Contains(t, output, "hello world")
	}
}).
Wait(3*time.Second).
What would be the potential issue without the wait here?
case labelBatchDeletePodsCompleted:
	// Reaching here means all pods that needed to be deleted for the retry operation have been deleted.
	workflowName := podName
	err := wfc.labelWorkflowRetried(ctx, namespace, workflowName)
	if err != nil {
		return err
	}
I think there's a problem here: the podCleanupQueue is handled by multiple workers, so there's no guarantee that all pods have been cleaned up when a worker sees labelBatchDeletePodsCompleted. We may end up in a situation where a retried workflow starts while some pods have yet to be cleaned up.
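One way to avoid that ordering assumption, sketched with hypothetical names: count outstanding deletions per workflow and only label the workflow when the last deletion completes, regardless of which worker handles it.

type retryPodTracker struct {
	mu        sync.Mutex
	remaining map[string]int // workflow key -> pods still awaiting deletion
}

func (t *retryPodTracker) add(wfKey string, n int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.remaining[wfKey] += n
}

// podDeleted records one completed deletion and reports whether it was the last
func (t *retryPodTracker) podDeleted(wfKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.remaining[wfKey]--
	if t.remaining[wfKey] <= 0 {
		delete(t.remaining, wfKey)
		return true
	}
	return false
}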
This PR needs some updates.
Fixes: #12538
Refactor the logic of deleting pods during retry, to speed up retrying a workflow.
Motivation
Speed up retrying a workflow, so that a large archived workflow (more than 8000 pods) can be successfully retried within 1 minute.
Modifications
As Anton and Joibel suggested, I moved the delete logic to the controller.
Added a retry field to the workflow spec (see the sketch below). Adding a spec field is just one way; other good approaches are also possible, for example passing a label to trigger the retry. If you think it's better to pass a label, I will change it.
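For reference, the rough shape of the proposed field, mirroring how suspend is expressed in the spec (treat the exact field names as illustrative of this PR's approach rather than a final API):

// added to WorkflowSpec in pkg/apis/workflow/v1alpha1
type RetrySpec struct {
	RestartSuccessful bool     `json:"restartSuccessful,omitempty"`
	NodeFieldSelector string   `json:"nodeFieldSelector,omitempty"`
	Parameters        []string `json:"parameters,omitempty"`
}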
Verification
E2E and unit tests.
References:
#12624
#12419