Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct retry logic #13734

Merged
merged 13 commits into from
Nov 21, 2024
Merged

fix: correct retry logic #13734

merged 13 commits into from
Nov 21, 2024

Conversation

isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 10, 2024

Fixes #12543 and other retry related issues.

Motivation

The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.

In addition to this the current logic does not take care of various edge cases, for example container sets.

It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.

Modifications

This is a complete rewrite of the FormulateRetryWorkflow function.

Verification

Extensive testing performed manually to verify that behaviour is as expected.

@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Oct 10, 2024
@Joibel
Copy link
Member

Joibel commented Oct 17, 2024

This might supersede #12105

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
@isubasinghe isubasinghe marked this pull request as ready for review October 21, 2024 07:12
Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should actually test for individual node status as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

workflow/util/util_test.go Outdated Show resolved Hide resolved
`

func TestStepsRetryWorkflow(t *testing.T) {
assert := assert.New(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this for assert and require here and elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I was waiting for an opinion on #13734 (comment). I plan to use these when I fill in the tests.

I was unsure if it was worth the effort of these tests. I wrote these to test edge conditions and make sure they worked as expected. They will prevent regressions though, so I will add them.

@@ -1026,51 +1025,6 @@ func TestRetryExitHandler(t *testing.T) {
func TestFormulateRetryWorkflow(t *testing.T) {
ctx := context.Background()
wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns")
createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()}
finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)}
t.Run("Steps", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this entire test feels wrong, why can't it be checked still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really testable easily, we cannot establish a dag for these status fields. The root node is not present.
This will error out complaining that the root node couldn't be found.

Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

}

func isGroupNode(node wfv1.NodeStatus) bool {
return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps
func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rename to resetUntil

reset instead of consume, as you call the resetFunc. Or call that consumeFunc

till has more meanings, until is much clearer if English isn't your first language.

workflow/util/util.go Outdated Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
workflow/util/util.go Show resolved Hide resolved
default:
// Do not allow retry of workflows with pods in Running/Pending phase
return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase)
deleteNodes, err := getNodeIDsToResetNoChildren(restartSuccessful, nodeFieldSelector, wf.Status.Nodes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would I have to set the --restart-successful in order to identify the nodes that need to be deleted, even if I only wish to retry a specific failed node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this wouldn't be the case

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

workflow/util/util.go Show resolved Hide resolved
addToDelete(curr.n.ID, true)
children := getChildren(curr)
for childID := range children {
addToDelete(childID, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I follow and agree now. Thanks.

@isubasinghe isubasinghe merged commit 497f338 into argoproj:main Nov 21, 2024
28 checks passed
@Joibel Joibel deleted the fix-retry-logic branch November 29, 2024 10:17
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 5, 2024
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying specific failed node does not work
4 participants