
refactor: change the logic of delete pod during retry. Fixes: #12538 #12734

Open · wants to merge 26 commits into main from fix/RefactorRetryDeleteLogic
Conversation

@shuangkun (Member) commented Mar 4, 2024

Fixes: #12538
Refactor the logic of deleting pods during retry to speed up retrying a workflow.

Motivation

Speed up retrying a workflow, so that a large archived workflow (more than 8000 pods) can be successfully retried within 1 minute.

Modifications

As Anton and Joibel suggested, I moved the delete logic to the controller.
Added a retry field to the workflow spec.
Adding a spec field is just one approach; other approaches are also possible, for example passing a label to trigger the retry. If you think passing a label is better, I will change it. A rough sketch of the spec idea is shown below.
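
A minimal sketch of what such a field could look like, modeled on the existing suspend field. The names RetrySpec, Retry, and the individual fields here are assumptions for illustration, not necessarily what this PR adds:

```go
// Hypothetical shape of a manual-retry request carried on the Workflow spec,
// so that the controller (rather than the server) deletes pods and resets nodes.
package v1alpha1sketch

// RetrySpec describes a user-requested manual retry.
type RetrySpec struct {
	// RestartSuccessful also reruns nodes that succeeded, filtered by NodeFieldSelector.
	RestartSuccessful bool `json:"restartSuccessful,omitempty"`
	// NodeFieldSelector limits which nodes are reset, e.g. "name=my-step".
	NodeFieldSelector string `json:"nodeFieldSelector,omitempty"`
	// Parameters optionally override workflow parameters for the retry.
	Parameters []string `json:"parameters,omitempty"`
}

// WorkflowSpecFragment shows where such a field could sit, next to Suspend.
type WorkflowSpecFragment struct {
	Suspend *bool      `json:"suspend,omitempty"`
	Retry   *RetrySpec `json:"retry,omitempty"`
}
```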

Verification

e2e and unit tests

reference:
#12624
#12419

@shuangkun shuangkun marked this pull request as draft March 4, 2024 09:10
@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch from 65e080f to 57c1676 on March 4, 2024 10:37
@shuangkun shuangkun marked this pull request as ready for review March 4, 2024 11:06
@shuangkun shuangkun added the area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries) label Mar 5, 2024
@shuangkun shuangkun changed the title from "feat: refactor the logic of delete pod during retry" to "feat: refactor the logic of delete pod during retry. Fixes: ##12538" Mar 9, 2024
@shuangkun shuangkun changed the title from "feat: refactor the logic of delete pod during retry. Fixes: ##12538" to "feat: refactor the logic of delete pod during retry. Fixes: #12538" Mar 9, 2024
@tczhao tczhao self-assigned this Mar 20, 2024
@@ -827,6 +827,29 @@ func resetConnectedParentGroupNodes(oldWF *wfv1.Workflow, newWF *wfv1.Workflow,
return newWF, resetParentGroupNodes
}

// RetryWorkflow retries a workflow by setting spec.retry; the controller will then perform the retry.
func RetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSuccessful bool, nodeFieldSelector string, parameters []string) (*wfv1.Workflow, error) {
Member:
We could use a better name here; retryWorkflow can be confusing.
The first word that comes to mind is markWorkflowForRetry, but there must be a better term.

@shuangkun (Member Author):
Changed, Thanks!

Parameters: parameters,
}

delete(wf.Labels, common.LabelKeyCompleted)
Member:
Deleting the label looks like formulation of the workflow; I think we can let formulateRetryWorkflow handle this instead, so that the server only:

  • checks the client request for errors
  • and updates the workflow for retry

@shuangkun (Member Author) commented Mar 30, 2024:
If the label is deleted there instead, reconciliation may not be triggered. Do you think changing the "reconciliationNeeded" function is better than deleting the label here? If so, I will modify reconciliationNeeded. A sketch of that alternative follows the snippet below.

func reconciliationNeeded(wf metav1.Object) bool {
	return wf.GetLabels()[common.LabelKeyCompleted] != "true" || slices.Contains(wf.GetFinalizers(), common.FinalizerArtifactGC)
}
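
A minimal sketch of the alternative being discussed, assuming the retry request is signalled by a label; the retry label key is a placeholder and the constant values are approximated to keep the example self-contained:

```go
// If reconciliationNeeded also fires when a retry has been requested, the
// server would not need to delete the completed label itself.
package sketch

import (
	"slices"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	labelKeyCompleted      = "workflows.argoproj.io/completed"   // approximate
	finalizerArtifactGC    = "workflows.argoproj.io/artifact-gc" // approximate
	labelKeyRetryRequested = "example.io/retry-requested"        // placeholder signal
)

func reconciliationNeeded(wf metav1.Object) bool {
	return wf.GetLabels()[labelKeyCompleted] != "true" ||
		wf.GetLabels()[labelKeyRetryRequested] == "true" ||
		slices.Contains(wf.GetFinalizers(), finalizerArtifactGC)
}
```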

Comment on lines 832 to 840
switch wf.Status.Phase {
case wfv1.WorkflowFailed, wfv1.WorkflowError:
case wfv1.WorkflowSucceeded:
if !(restartSuccessful && len(nodeFieldSelector) > 0) {
return nil, errors.Errorf(errors.CodeBadRequest, "To retry a succeeded workflow, set the options restartSuccessful and nodeFieldSelector")
}
default:
return nil, errors.Errorf(errors.CodeBadRequest, "Cannot retry a workflow in phase %s", wf.Status.Phase)
}
Member:
I think we can remove this from formulateRetryWorkflow since it is already handled

@shuangkun (Member Author):
Yes, removed.

@@ -286,19 +285,11 @@ func (w *archivedWorkflowServer) RetryArchivedWorkflow(ctx context.Context, req
_, err = wfClient.ArgoprojV1alpha1().Workflows(req.Namespace).Get(ctx, wf.Name, metav1.GetOptions{})
if apierr.IsNotFound(err) {

wf, podsToDelete, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
wf, _, err := util.FormulateRetryWorkflow(ctx, wf, req.RestartSuccessful, req.NodeFieldSelector, req.Parameters)
Member:
This should use the new util function, right? Just like the workflow_server, so that the server only "labels" the workflow for retry and the actual formulation and processing is done by the controller (a rough sketch of that shape is below).
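
A minimal sketch of that server-side shape, assuming the server merely records the retry request and persists the workflow. The workflowUpdater interface, the placeholder label key, and retryWorkflowOnServer are illustrative stand-ins, not the actual client or util function:

```go
// The server validates the request, records the retry, and persists the
// workflow; pod deletion and node resetting are left entirely to the controller.
package sketch

import "context"

type workflow struct {
	Name   string
	Labels map[string]string
}

type workflowUpdater interface {
	Update(ctx context.Context, wf *workflow) (*workflow, error)
}

func retryWorkflowOnServer(ctx context.Context, client workflowUpdater, wf *workflow) (*workflow, error) {
	if wf.Labels == nil {
		wf.Labels = map[string]string{}
	}
	// Only mark the workflow for retry; the controller does the heavy lifting.
	wf.Labels["example.io/retry-requested"] = "true" // placeholder signal
	return client.Update(ctx, wf)
}
```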

return false
}
if woc.IsRetried() {
if woc.controller.podCleanupQueue.Len() == 0 {
@tczhao (Member) commented Mar 25, 2024:
What is the purpose of if woc.controller.podCleanupQueue.Len() == 0 { here?
Could the server set retry to true and then formulateRetryWorkflow set it to false, so that we only check whether retry exists and is true?

@shuangkun (Member Author):

I just want to confirm that all the pods have been cleaned up. If an "already exists" error is not reported when creating a new pod at this point, this check can be removed.

Member:

Do you mean something like: make sure all pods are deleted before processing the retried workflow?

@shuangkun (Member Author):

Yes, this is what I want.

Member:

I think maybe we should check the node statuses instead. Other workflows also queue items onto the podCleanupQueue, which would cause issues. A sketch of a node-status-based check is below.
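
A minimal sketch of such a node-status-based check, using simplified placeholder types rather than the actual Argo structures; the idea is to look only at this workflow's own Pod nodes instead of the shared cleanup queue length:

```go
package sketch

type nodeStatus struct {
	Type    string // e.g. "Pod", "Steps", "DAG"
	PodName string // name of the pod backing a Pod node, if any
}

// allRetryPodsDeleted reports whether none of the workflow's own Pod nodes
// still have a live pod, using a caller-supplied existence check (for example
// a pod informer lookup).
func allRetryPodsDeleted(nodes map[string]nodeStatus, podExists func(name string) bool) bool {
	for _, n := range nodes {
		if n.Type != "Pod" {
			continue
		}
		if n.PodName != "" && podExists(n.PodName) {
			return false
		}
	}
	return true
}
```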

@shuangkun (Member Author) commented:
Thank you for your reply. I would like to confirm one question: do you think it is reasonable to pass the retry parameters through the spec? This increases the amount of code changes, but at the time I thought it was not appropriate to pass them with just a label, so I put them in the spec, following the example of suspend.

@tczhao (Member) commented Mar 25, 2024:

Thank you for your reply. I would like to confirm one question: do you think it is reasonable to pass the retry parameters through the spec? This increases the amount of code changes, but at the time I thought it was not appropriate to pass them with just a label, so I put them in the spec, following the example of suspend.

Here's some context:

  • spec and status
  • annotation
  • label

Both annotation and spec make sense in their own way:

  • annotation: "Directives from the end-user to the implementations to modify behavior or engage non-standard features."
  • spec: "The spec is a complete description of the desired state, including configuration settings provided by the user."

I would vote for spec, thinking of it this way: when we retry, we update the desired state of the workflow, and the controller then works towards that desired state.

Hi @agilgur5, what are your thoughts on this?

@shuangkun
Copy link
Member Author

Hi @agilgur5, can you give me your vote so that I can better resolve these comments? Thanks.

@shuangkun shuangkun changed the title from "feat: refactor the logic of delete pod during retry. Fixes: #12538" to "refactor: change the logic of delete pod during retry. Fixes: #12538" Mar 30, 2024
@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch 2 times, most recently from 2883edc to bbb619a on March 30, 2024 06:47
@agilgur5 left a comment:

Adding a spec field is just one approach; other approaches are also possible, for example passing a label to trigger the retry. If you think passing a label is better, I will change it.

Hi @agilgur5, can you give me your vote so that I can better resolve these comments? Thanks.

Hey sorry, I've had a bit too much on my plate recently. (can you tell by how late I've been up in EDT? 🙃)

So I mentioned using a label in #12538 as we pick up on various labels already in the Controller, so that would be consistent. It would also make it possible for a user to just add a label and not need to go through the Server or CLI (such as in #12027 (comment)), which would also make for a consistent UX.
In #12624 (review), I also suggested against spec changes as those are, well, part of the spec, and so become (much) harder to change.

In particular, per #6490 (comment), we may want to create (optional?) CRs for imperative actions like retry and resubmit. Per #6490 (comment), I also discussed that in SIG Security during two meetings, and they were supportive of that and apparently Argo CD had been thinking of implementing something similar to that.

With that in mind, I wouldn't want us to implement a spec field that will change and that we will then have trouble removing due to breaking spec changes.

Both annotation and spec make sense in their own way:

  • annotation: "Directives from the end-user to the implementations to modify behavior or engage non-standard features."
  • spec: "The spec is a complete description of the desired state, including configuration settings provided by the user."

I would vote for spec, thinking of it this way: when we retry, we update the desired state of the workflow, and the controller then works towards that desired state.

Hi @agilgur5, what are your thoughts on this?

Thanks for summarizing @tczhao.
A retry is an imperative action, not a declarative state, so "desired state" is not really accurate to describe a retry -- it doesn't describe state at all, it describes an action. It also only affects the status and not the actual Workflow spec.
"Non-standard feature" could be considered a roughly accurate description of how retries currently work, as they are not supported via k8s directly, only via the Argo API. And as they are imperative, that is also non-standard in k8s.
A "directive from the end-user [...] to modify behavior" could be considered indicative of an action, and it does modify behavior; it tells the Workflow to reprocess some nodes.

Perhaps an example would be good -- you wouldn't specify a retry within your spec when you submit your Workflow initially, as it's something you do manually after the fact, whereas you do specify a retryStrategy beforehand for automatic retries. Also, needing to disambiguate retryConfig even further from retryStrategy would, I think, add confusion (it's a bit confusing even now, but at least it is relatively self-evident that one is manual whereas the other is automatic).

They are not entirely mutually exclusive definitions though, there is certainly some overlap.

implementation details

When I first looked at this implementation (it's been a month since then), my bigger question was actually, "can this even be properly implemented in a label or annotation?" as it's a primitive string -- with a limited length no less -- and not a more complex datatype.
parameters in particular seem a bit non-trivial to marshal.

Although retrying with new parameters actually seems to break the declarative state in general if I'm not mistaken -- and so may be an anti-pattern in and of itself. For instance, you could have some nodes with one set of parameters and other nodes with a different set, while there is only one set of parameters in the actual Workflow spec. EDIT: This appears to have been mentioned in #9141 (comment) as well
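
To illustrate the marshalling concern: retry parameters would have to be serialized into a single string value to fit in an annotation (a label value is even more restrictive, limited to 63 characters). A minimal sketch, assuming a hypothetical placeholder annotation key:

```go
// Round-trip retry parameters through a single annotation value as JSON.
package sketch

import (
	"encoding/json"
	"fmt"
)

func annotateRetryParameters(annotations map[string]string, parameters []string) error {
	raw, err := json.Marshal(parameters)
	if err != nil {
		return fmt.Errorf("marshal retry parameters: %w", err)
	}
	annotations["example.io/retry-parameters"] = string(raw) // placeholder key
	return nil
}

func readRetryParameters(annotations map[string]string) ([]string, error) {
	raw, ok := annotations["example.io/retry-parameters"] // placeholder key
	if !ok {
		return nil, nil
	}
	var parameters []string
	if err := json.Unmarshal([]byte(raw), &parameters); err != nil {
		return nil, fmt.Errorf("unmarshal retry parameters: %w", err)
	}
	return parameters, nil
}
```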

@agilgur5 agilgur5 requested review from Joibel and isubasinghe April 9, 2024 06:00
@agilgur5 agilgur5 added the area/controller and area/server labels Apr 9, 2024
shuangkun added 10 commits April 9, 2024 20:53
@shuangkun shuangkun force-pushed the fix/RefactorRetryDeleteLogic branch from 2c0587d to d6ac764 on April 9, 2024 12:59
shuangkun and others added 9 commits April 9, 2024 21:08
@shuangkun (Member Author) commented:

@agilgur5 Thank you for your reply. I have pushed an updated version. Can you help me take a look?

@shuangkun shuangkun added the prioritized-review (For members of the Sustainability Effort) label Apr 16, 2024
assert.Contains(t, output, "hello world")
}
}).
Wait(3*time.Second).
Member:

What would the potential issue be without the wait here?

Comment on lines +593 to +599
case labelBatchDeletePodsCompleted:
// Reaching here means that all pods that needed to be deleted for the retry operation have been deleted.
workflowName := podName
err := wfc.labelWorkflowRetried(ctx, namespace, workflowName)
if err != nil {
return err
}
Member:

I think there's a problem here: the podCleanupQueue is handled by multiple workers, so there's no guarantee that all pods have been cleaned up by the time a worker sees labelBatchDeletePodsCompleted.

We may end up in a situation where a retried workflow starts while it still has pods that have yet to be cleaned up. One way to guard against this is sketched below.
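
A minimal sketch of one way to close that race: before labelling the workflow as retried, verify that no pods for that workflow remain, and requeue otherwise. The workflow label key is approximated, and real code would likely consult the controller's pod informer rather than issuing a direct API list:

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// workflowPodsGone reports whether every pod belonging to the workflow has
// actually been deleted; if any remain, the caller should requeue the
// completion marker and check again later.
func workflowPodsGone(ctx context.Context, kube kubernetes.Interface, namespace, workflowName string) (bool, error) {
	// "workflows.argoproj.io/workflow" approximates the label the controller
	// puts on the pods it creates for a workflow.
	selector := fmt.Sprintf("workflows.argoproj.io/workflow=%s", workflowName)
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return false, err
	}
	return len(pods.Items) == 0, nil
}
```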

@tooptoop4

This comment was marked as spam.

@shuangkun (Member Author) commented:
🦗

This PR needs some updates

Labels

  • area/controller: Controller issues, panics
  • area/retry-manual: Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
  • area/server
  • prioritized-review: For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move retry Pod deletions out of Server and into Controller for proper separation of duties
5 participants