Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes #13533 #13798

Merged
merged 5 commits into from
Oct 28, 2024

Conversation

isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 23, 2024

Fixes #12993 and Fixes #13533

Motivation

The previous fix relied upon a Message field. There is no guarantee that this Message is always given to us.
We now directly check if a pod exists.

Modifications

Check if pod exists.

Verification

Unable to verify with certainty due to being a rare edge case.

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Oct 23, 2024
@agilgur5 agilgur5 changed the title fix: mark taskresult complete when failed or error. Fixes ##12993 fix: mark taskresult complete when failed or error. Fixes #12993, Fixes #13533 Oct 23, 2024
@isubasinghe
Copy link
Member Author

/retest

1 similar comment
@isubasinghe
Copy link
Member Author

/retest

@isubasinghe
Copy link
Member Author

isubasinghe commented Oct 24, 2024

Seems to be failing to pull argocli:latest since the imagePullPolicy is Never.
I guess this image doesn't exist on the cluster as well.

@agilgur5 agilgur5 requested a review from jswxstw October 24, 2024 21:33
@isubasinghe isubasinghe force-pushed the ignore-taskresults-when-failed branch from 8c06c57 to 3cb3151 Compare October 25, 2024 00:57
@isubasinghe isubasinghe force-pushed the ignore-taskresults-when-failed branch from 3cb3151 to 49ff7a4 Compare October 25, 2024 00:59
@isubasinghe
Copy link
Member Author

isubasinghe commented Oct 25, 2024

This won't quite work. The previous comment ^ is incorrect. The code fails tests due to a race condition.
Turns out we can use a mounted volume as side channel effectively.

@isubasinghe
Copy link
Member Author

Made a change that checks if the pod exists instead of the node Message.

@isubasinghe isubasinghe reopened this Oct 25, 2024
Copy link
Member

@jswxstw jswxstw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

By the way, I think IsPodDeleted can be removed since it is useless now.

Signed-off-by: isubasinghe <[email protected]>
@isubasinghe
Copy link
Member Author

/retest

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Oct 25, 2024
agilgur5
agilgur5 previously approved these changes Oct 25, 2024
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense: if the Task Result exists, that means the Pod at some point existed, and if it no longer exists, then that means it was deleted.

The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?

@agilgur5 agilgur5 requested a review from Joibel October 25, 2024 15:06
@Joibel
Copy link
Member

Joibel commented Oct 25, 2024

The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?

I have a worry around this too. I'd like there to be some timeout between when we've noticed that a pod has disappeared so that a delayed but completed WorkflowTaskResult can arrive and be actioned.

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event, and something like POD_ABSENT_TIMEOUT (or similar) would give us that window.

@agilgur5
Copy link

agilgur5 commented Oct 25, 2024

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event

Ah I actually said the inverse race, which is less likely; this variant is more likely and possible too.

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking on the basis that @agilgur5 and I both think this needs a time window. #13798 (comment)

@agilgur5 agilgur5 dismissed their stale review October 25, 2024 15:34

See above

@agilgur5
Copy link

I approved on the basis that this would catch more races than the code before this, but it creates a few too 😅
A more holistic fix would be great either way for sure

@isubasinghe isubasinghe requested review from Joibel and jswxstw October 28, 2024 12:14
workflow/controller/taskresult.go Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics
Projects
None yet
4 participants