fix(containerSet): mark container deleted when pod deleted. Fixes: #12210 #12756
Conversation
// delete pod
time.Sleep(10 * time.Second)
Be mindful of how long a test takes; with so many tests in place, the total time to run tests can accumulate very quickly.

Use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")
and set the grace period to 0 on metav1.DeleteOptions{} in the deletePods function, so that we can get rid of the time.Sleep here.
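For illustration, a minimal sketch of a test helper that force-deletes pods with a zero grace period. The deletePods name comes from this thread, but the signature and client wiring below are assumptions for the sake of the example, not the actual helper in the repo:

```go
package controller_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePods deletes every pod in the namespace with gracePeriodSeconds=0,
// so the test does not have to sleep through the default 30s termination
// grace period before the controller can observe the pods are gone.
func deletePods(ctx context.Context, t *testing.T, kubeClient kubernetes.Interface, namespace string) {
	zero := int64(0)
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	require.NoError(t, err)
	for _, pod := range pods.Items {
		err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &zero, // force immediate deletion
		})
		require.NoError(t, err)
	}
}
```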
Yeah, I'll add it. Thanks!
> use
> _ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

This sets it for the whole process though, no? Not just this single test?
It's only for this single test, and I remove it after the test.
That's not quite sufficient. Env vars are set for an entire process -- meaning it affects any tests that are run in parallel and use the same env var. I wasn't sure if Go had any special handling for this, given that I've seen the same pattern in other tests. According to Go's own docs (for the t.Setenv function), it does not have any special handling, so this would indeed affect any parallel tests.

While this is done in other tests in this codebase (50 occurrences), which probably need a larger refactoring, this env var is a bit more global in its effects.

This is one of the reasons why tests shouldn't rely on globals.
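For reference, a minimal sketch of the difference being discussed; the test name and body below are made up for illustration, not the regression test in this PR. t.Setenv restores the previous value when the test finishes, but it still mutates process-wide state, and Go refuses to combine it with t.Parallel (it panics), precisely because parallel tests would observe each other's values:

```go
package controller_test

import (
	"os"
	"testing"
)

// Hypothetical test: only illustrates the env-var mechanics discussed above.
func TestPodDeletedMarksContainersErrored(t *testing.T) {
	// t.Setenv sets the variable for the whole process, but registers a
	// cleanup that restores the previous value when this test finishes.
	t.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

	// Calling t.Parallel() here would panic: Go disallows Setenv in parallel
	// tests because the value is process-global.

	// A bare os.Setenv, by contrast, leaks into every other test in the
	// binary unless it is manually unset afterwards.
	if got := os.Getenv("RECENTLY_STARTED_POD_DURATION"); got != "0" {
		t.Fatalf("expected env var to be set, got %q", got)
	}
}
```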
Removed it. I tested it, and a 5s sleep is not enough; it needs 10s.
Hmm, not needing to sleep in the tests is good to have though... @tczhao what do you think?

Perhaps we leave one in and then refactor in a separate PR to avoid globals? (Note that refactoring the globals might be substantially easier said than done.)
> removed it, and i test sleep 5s is not enough. need 10s

The 10s is because RECENTLY_STARTED_POD_DURATION defaults to 10s; podReconciliation will not update the workflow status until RECENTLY_STARTED_POD_DURATION has elapsed.
https://github.com/argoproj/argo-workflows/blob/v3.5.5/workflow/controller/operator.go#L1209
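Roughly, the gate behind that link behaves like the sketch below. This is a simplified illustration based on the description above, not the actual operator.go code: the controller treats a pod as "recently started" for the configured window and holds off on reacting to a missing pod until the window has elapsed.

```go
package controller

import (
	"os"
	"time"
)

// recentlyStartedPodDuration reads RECENTLY_STARTED_POD_DURATION, falling
// back to 10s -- which is why the test needed a 10s sleep with the default.
func recentlyStartedPodDuration() time.Duration {
	if v := os.Getenv("RECENTLY_STARTED_POD_DURATION"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 10 * time.Second
}

// recentlyStarted is an illustrative stand-in for the check in
// podReconciliation: while a pod is within the window, the controller does
// not treat its absence as an error, so the workflow status is not updated.
func recentlyStarted(startedAt time.Time) bool {
	return time.Since(startedAt) <= recentlyStartedPodDuration()
}
```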
> perhaps we leave one in and then refactor in a separate PR to avoid globals?

Sounds good to me.
Let's make sure to add a // TODO: comment here to modify this in the future. You can reference this thread in the comment: #12756 (comment).
This would be a really easy follow-up to miss (and may not happen soon due to the possible complexity of the refactor), so being explicit is good (and TODOs can be searched for in the codebase easily).
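For example, the marker next to the sleep could look something like this (the exact wording and link are just a suggestion):

```go
// TODO: remove this sleep once the RECENTLY_STARTED_POD_DURATION global can
// be overridden safely in tests; see the discussion in
// https://github.com/argoproj/argo-workflows/pull/12756 for context.
time.Sleep(10 * time.Second)
```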
To make sure I understand this correctly -- the Pod and container were correctly stopped, but the container's status was still incorrectly marked as running? If so, we should retitle this and clarify in the issue -- the container was stopped and is not running, it's just incorrectly marked as running.
Thanks for tracking this down, fixing it, and adding a regression test!
Thank you! And @tczhao, thank you for your suggestions!
…2210 (#12756) Signed-off-by: shuangkun <[email protected]> (cherry picked from commit cfe2bb7)
Backported cleanly to
Fixes #12210
Motivation
Avoid the incorrect judgment that the DAG is still running, which causes the workflow to hang. The root cause is that when the pod is cleaned up, its container node is still displayed as running, which causes the DAG to be judged as running in assessDAGPhase.
Modifications
Set the container node to error instead of running when the pod was deleted.
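In spirit, the fix does something like the sketch below when reconciliation discovers that the pod backing a containerSet has been deleted. This is an illustration of the idea under assumed, simplified type names (the node struct, phase constants, and helper are not the real Argo Workflows types), not the actual diff: any container child of the deleted pod's node that is still marked running is moved to an error phase, so assessDAGPhase no longer sees a running leaf and the DAG can finish.

```go
package controller

// Simplified stand-ins for the workflow node model; the real types live in
// the Argo Workflows API package and are richer than this.
type NodePhase string

const (
	NodeRunning NodePhase = "Running"
	NodeError   NodePhase = "Error"
)

type Node struct {
	Name     string
	Type     string // e.g. "Pod" or "Container"
	Phase    NodePhase
	Message  string
	Children []*Node
}

// markContainersDeleted marks any container children of a deleted pod's node
// as errored instead of leaving them "running" forever.
func markContainersDeleted(podNode *Node) {
	for _, child := range podNode.Children {
		if child.Type == "Container" && child.Phase == NodeRunning {
			child.Phase = NodeError
			child.Message = "pod deleted" // record why the node stopped running
		}
	}
}
```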
Verification
Tested locally and with e2e tests.
Before the fix, the pod is deleted but the container is still shown as running:

After the fix: