Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: workflows that are retrying should not be deleted (Fixes #12636) #12905

Merged
merged 1 commit into from
Apr 13, 2024

Conversation

siwet
Copy link
Contributor

@siwet siwet commented Apr 6, 2024

Fixes #12636

Motivation

try to fix #12636

Modifications

modify gc_controller.go to ignore deletion for workflow not completed due to a retry operation

Verification

test workflow file fix-12636.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fix-12636-
spec:
  entrypoint: main
  ttlStrategy:
    secondsAfterFailure: 30
  arguments:
    parameters:
    - name: fail
      value: 'true'
  templates:
  - name: main
    dag:
      tasks:
      - name: A
        template: first-fail
        arguments:
          parameters:
          - name: fail
            value: "{{workflow.parameters.fail}}"
      - name: B
        depends: "A"
        template: echo

  - name: first-fail
    inputs:
      parameters:
        - name: fail
    container:
      image: alpine:latest
      imagePullPolicy: IfNotPresent
      command: [sh, -c]
      args: 
      - |
        set -e
        if [ "{{inputs.parameters.fail}}" == "true" ]; then
          echo "Failing"
          exit 1
        fi
        echo "Succeeded"
  - name: echo
    container:
      image: alpine:latest
      imagePullPolicy: IfNotPresent
      command: [sh, -c]
      args:
      - |
        set -e
        echo "sleep 120s"
        sleep 120
        echo "finished"
  • Submit test workflow: argo submit fix-12636.yaml, assuming name is fix-12636-dvw4d
  • Node A is expected to fail.
  • Retry workflow immediately and pass parameters to ensure node A succeeds.: argo retry fix-12636-dvw4d -p fail=false
  • Workflows can run successfully without being mistakenly deleted (in the unpatched version, workflows could be erroneously deleted during execution).

@agilgur5 agilgur5 marked this pull request as draft April 6, 2024 18:48
@blkperl blkperl added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Apr 8, 2024
@siwet siwet changed the title fix: workflows that have not completed due to a retry operation should not be deleted (WIP Fixes #12636) fix: workflows that have not completed due to a retry operation should not be deleted (Fixes #12636) Apr 8, 2024
@siwet siwet marked this pull request as ready for review April 8, 2024 16:37
@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Apr 9, 2024
Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment on the issue in #12636 (comment). also left one in-line here

workflow/gccontroller/gc_controller.go Outdated Show resolved Hide resolved
@agilgur5 agilgur5 changed the title fix: workflows that have not completed due to a retry operation should not be deleted (Fixes #12636) fix: workflows that are retrying should not be deleted (Fixes #12636) Apr 9, 2024
@siwet
Copy link
Contributor Author

siwet commented Apr 11, 2024

@agilgur5 Thanks for the suggestion. After testing, the Done function (as mentioned in #12636 (comment)) cannot remove elements from the queue.
The latest workflow is changed to obtain from the infromer cache (per #12905 (comment)) to avoid network overhead, need to review again

@agilgur5
Copy link

agilgur5 commented Apr 11, 2024

@agilgur5 Thanks for the suggestion. After testing, the Done function (as mentioned in #12636 (comment)) cannot remove elements from the queue.

Can you elaborate on this? Done is the only way to remove elements from the queue as far as I know and is already used to remove workflows from the queue after deletion

@siwet siwet closed this Apr 12, 2024
@siwet siwet reopened this Apr 12, 2024
@siwet
Copy link
Contributor Author

siwet commented Apr 13, 2024

Can you elaborate on this? Done is the only way to remove elements from the queue as far as I know

Certainly. I'm using this unit test to verify that Done cannot remove an element from the workflow_ttl_queue, preventing the WF from being removed.

func TestWorkQueueDoneFun(t *testing.T) {
	workqueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "workflow_ttl_queue")

	workqueue.Add("test1")
	workqueue.Done("test1")
	assert.Equal(t, 2, workqueue.Len())
	e, _ := workqueue.Get()
	assert.Equal(t, "test1", e)
	e, _ = workqueue.Get()
	assert.Equal(t, "test1", e)
	assert.Equal(t, 0, workqueue.Len())

	workqueue.AddAfter("test2", time.Second*1)
	assert.Equal(t, 0, workqueue.Len())
	workqueue.Done("test2")
	time.Sleep(time.Second * 2)
	assert.Equal(t, 1, workqueue.Len())
	e, _ = workqueue.Get()
	assert.Equal(t, "test2", e)
	assert.Equal(t, 0, workqueue.Len())
}

and is already used to remove workflows from the queue after deletion

According to the Done source code, seems to be used to mark that an element has been processed to prevent an element from being processed concurrently.

func TestWorQueueDoneFun2(t *testing.T) {
	workqueue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "workflow_ttl_queue")

	// Workqueue internally implemented using three fields: queue,dirty,processing
	// simple test
	workqueue.Add("test1") // queue=[test1],  dirty=[test1], processing=[]
	assert.Equal(t, 1, workqueue.Len())
	e, _ := workqueue.Get() // queue=[],  dirty=[], processing=[test1]
	assert.Equal(t, "test1", e)
	workqueue.Done("test1") // queue=[],  dirty=[], processing=[]
	assert.Equal(t, 0, workqueue.Len())

	// concurrently test 
	workqueue.Add("test2") // queue=[test2],  dirty=[test2], processing=[]
	e, _ = workqueue.Get() // queue=[],  dirty=[], processing=[test2]
	go func() {
		time.Sleep(time.Second * 3)
		workqueue.Done("test2") // queue=[test2],  dirty=[test2], processing=[]
	}()
	workqueue.Add("test2") // queue=[],  dirty=[test2], processing=[test2]
	last := time.Now()
	e, _ = workqueue.Get() // will wait until the above Done function is called
	                                        // queue=[],  dirty=[], processing=[test2]
	assert.Equal(t, true, time.Now().Sub(last) >= time.Second*2)
	assert.Equal(t, "test2", e)
	workqueue.Done("test2")     // queue=[],  dirty=[], processing=[]
}

And since workflow_ttl_queue is a delaying queue, the element should actually be removed from the waitingForQueue. however the Done function will not handle waitingForQueue.

Maybe the explanation is not very clear, waiting for your suggestions

@siwet siwet requested a review from agilgur5 April 13, 2024 18:37
@agilgur5
Copy link

ah I see, Done deletes it from the processing set, but not the queue array, which, in the public API, can only be removed sequentially via Get.

Thanks for the explanation! That's a bummer that we can't delete it directly

And since workflow_ttl_queue is a delaying queue, the element should actually be removed from the waitingForQueue. however the Done function will not handle waitingForQueue.

Hmm I had been looking at the RateLimitingQueue code, which is built on top of the DelayingQueue, and there Forget says to call Done.

In any case, since we can't remove from the queue, then this is pretty much the best we can do without using another data structure. Let me do a final look-over then

Copy link

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the informer store variant, which seems to be the best we can do without changing up the whole queue data structure.

Thanks for the fix!

@agilgur5 agilgur5 merged commit 2095621 into argoproj:main Apr 13, 2024
51 checks passed
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 13, 2024
@agilgur5
Copy link

agilgur5 commented Apr 18, 2024

Hmm I was wondering how to test this as it only occurs during a race condition and the success condition is that it's not deleted during a retry...
I was thinking of potentially creating a new E2E test suite dedicated to races with wacky conditions but I'm actually not sure of a good way to test this even if that existed 🤔 We'd have to induce a deletion during the retry... maybe a retry of a long step (e.g. 30s) and then a queue deletion is manually called in the middle of that?

@siwet
Copy link
Contributor Author

siwet commented Apr 19, 2024

I was thinking of potentially creating a new E2E test suite dedicated to races with wacky conditions ...

hi, @agilgur5 I tested it manually, using the method described in Part Verification of #12905 (comment), and indeed it takes tens of seconds to complete. May I ask if you need me to contribute this E2E test? If So, I might need some guidance, such as which directory to put in, and previous references.

@agilgur5
Copy link

agilgur5 commented Apr 19, 2024

If you can get it to work in the E2E tests that'd be great as a regression test! test/e2e/cli_test.go has several existing retry tests, all named TestRetry* or TestWorkflowRetry*, can put it with those and use those as examples 🙂

agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
…#12905)

Signed-off-by: Shiwei Tang <[email protected]>
Co-authored-by: Anton Gilgur <[email protected]>
(cherry picked from commit 2095621)
@agilgur5
Copy link

Backported cleanly into release-3.5 as ab7bee7

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 6, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry failed workflow with ttl deleted after initial secondsAfterFailure while still running
4 participants