fix(containerSet): mark container deleted when pod deleted. Fixes: #12210 #12756
Conversation
// delete pod
time.Sleep(10 * time.Second)
Be mindful of how long a test takes; with so many tests in place, the total time to run tests can accumulate very quickly.

Use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")
and set the grace period to 0 on metav1.DeleteOptions{} in the deletePods function, so that we can get rid of the time.Sleep here.
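For illustration, a minimal sketch of a test helper that force-deletes pods with a zero grace period. The deletePods name comes from this thread, but the signature and client wiring below are assumptions for the sake of the example, not the actual helper in the repo:

```go
package controller_test

import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePods deletes every pod in the namespace with gracePeriodSeconds=0,
// so the test does not have to sleep through the default 30s termination
// grace period before the controller can observe the pods are gone.
func deletePods(ctx context.Context, t *testing.T, kubeClient kubernetes.Interface, namespace string) {
	zero := int64(0)
	pods, err := kubeClient.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	require.NoError(t, err)
	for _, pod := range pods.Items {
		err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &zero, // force immediate deletion
		})
		require.NoError(t, err)
	}
}
```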
Yeah, I'll add it. Thanks!
> use
> _ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

This sets it for the whole process though, no? Not just this single test?
It's only for this single test, and I remove it after the test.
That's not quite sufficient. Env vars are set for an entire process -- meaning it affects any tests that are run in parallel and use the same env var. I wasn't sure if Go had any special handling for this, given that I've seen the same pattern in other tests. According to Go's own docs (for the t.Setenv function), it does not have any special handling, so this would indeed affect any parallel tests.

While this is done in other tests in this codebase (50 occurrences), which probably need a larger refactoring, this env var is a bit more global in its effects.

This is one of the reasons why tests shouldn't rely on globals.
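For reference, a minimal sketch of the difference being discussed; the test name and body below are made up for illustration, not the regression test in this PR. t.Setenv restores the previous value when the test finishes, but it still mutates process-wide state, and Go refuses to combine it with t.Parallel (it panics), precisely because parallel tests would observe each other's values:

```go
package controller_test

import (
	"os"
	"testing"
)

// Hypothetical test: only illustrates the env-var mechanics discussed above.
func TestPodDeletedMarksContainersErrored(t *testing.T) {
	// t.Setenv sets the variable for the whole process, but registers a
	// cleanup that restores the previous value when this test finishes.
	t.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

	// Calling t.Parallel() here would panic: Go disallows Setenv in parallel
	// tests because the value is process-global.

	// A bare os.Setenv, by contrast, leaks into every other test in the
	// binary unless it is manually unset afterwards.
	if got := os.Getenv("RECENTLY_STARTED_POD_DURATION"); got != "0" {
		t.Fatalf("expected env var to be set, got %q", got)
	}
}
```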
Removed it. I tested it, and a 5s sleep is not enough; it needs 10s.
Hmm, not needing to sleep in the tests is good to have though... @tczhao what do you think?

Perhaps we leave one in and then refactor in a separate PR to avoid globals? (Note that refactoring the globals might be substantially easier said than done.)
> removed it, and i test sleep 5s is not enough. need 10s

The 10s is because RECENTLY_STARTED_POD_DURATION defaults to 10s; podReconciliation will not update the workflow status until RECENTLY_STARTED_POD_DURATION has elapsed.
https://github.com/argoproj/argo-workflows/blob/v3.5.5/workflow/controller/operator.go#L1209
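Roughly, the gate behind that link behaves like the sketch below. This is a simplified illustration based on the description above, not the actual operator.go code: the controller treats a pod as "recently started" for the configured window and holds off on reacting to a missing pod until the window has elapsed.

```go
package controller

import (
	"os"
	"time"
)

// recentlyStartedPodDuration reads RECENTLY_STARTED_POD_DURATION, falling
// back to 10s -- which is why the test needed a 10s sleep with the default.
func recentlyStartedPodDuration() time.Duration {
	if v := os.Getenv("RECENTLY_STARTED_POD_DURATION"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 10 * time.Second
}

// recentlyStarted is an illustrative stand-in for the check in
// podReconciliation: while a pod is within the window, the controller does
// not treat its absence as an error, so the workflow status is not updated.
func recentlyStarted(startedAt time.Time) bool {
	return time.Since(startedAt) <= recentlyStartedPodDuration()
}
```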
> perhaps we leave one in and then refactor in a separate PR to avoid globals?

Sounds good to me.
Let's make sure to add a // TODO: comment here to modify this in the future. You can reference this thread in the comment: #12756 (comment).
This would be a really easy follow-up to miss (and may not happen soon due to the possible complexity of the refactor), so being explicit is good (and TODOs can be searched for in the codebase easily).
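For example, the marker next to the sleep could look something like this (the exact wording and link are just a suggestion):

```go
// TODO: remove this sleep once the RECENTLY_STARTED_POD_DURATION global can
// be overridden safely in tests; see the discussion in
// https://github.com/argoproj/argo-workflows/pull/12756 for context.
time.Sleep(10 * time.Second)
```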
To make sure I understand this correctly -- the Pod and container were correctly stopped, but the container's status was still incorrectly marked as running? If so, we should retitle this and clarify in the issue -- the container was stopped and is not running, it's just incorrectly marked as running.
Thanks for tracking this down, fixing it, and adding a regression test!
Thank you! And @tczhao, thank you for your suggestions!
…2210 (#12756) Signed-off-by: shuangkun <[email protected]> (cherry picked from commit cfe2bb7)
Backported cleanly to
Fixes #12210
Motivation
Avoid the incorrect judgment that the DAG is still running, which causes the workflow to hang. The root cause is that when the pod is cleaned up, its container node is still displayed as running, which causes the DAG to be judged as running in assessDAGPhase.
Modifications
Set the container node to error instead of running when the pod was deleted.
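In spirit, the fix does something like the sketch below when reconciliation discovers that the pod backing a containerSet has been deleted. This is an illustration of the idea under assumed, simplified type names (the node struct, phase constants, and helper are not the real Argo Workflows types), not the actual diff: any container child of the deleted pod's node that is still marked running is moved to an error phase, so assessDAGPhase no longer sees a running leaf and the DAG can finish.

```go
package controller

// Simplified stand-ins for the workflow node model; the real types live in
// the Argo Workflows API package and are richer than this.
type NodePhase string

const (
	NodeRunning NodePhase = "Running"
	NodeError   NodePhase = "Error"
)

type Node struct {
	Name     string
	Type     string // e.g. "Pod" or "Container"
	Phase    NodePhase
	Message  string
	Children []*Node
}

// markContainersDeleted marks any container children of a deleted pod's node
// as errored instead of leaving them "running" forever.
func markContainersDeleted(podNode *Node) {
	for _, child := range podNode.Children {
		if child.Type == "Container" && child.Phase == NodeRunning {
			child.Phase = NodeError
			child.Message = "pod deleted" // record why the node stopped running
		}
	}
}
```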
Verification
Tested locally and with e2e tests.
Before the fix, the pod is deleted but the container is still shown as running:

After the fix: