fix: workflows that are retrying should not be deleted (Fixes #12636) #12905
I left a comment on the issue in #12636 (comment). I also left one in-line here.
…j#12636) Co-authored-by: Anton Gilgur <[email protected]> Signed-off-by: Shiwei Tang <[email protected]>
@agilgur5 Thanks for the suggestion. After testing, the Done function (as mentioned in #12636 (comment)) cannot remove elements from the queue.
Can you elaborate on this?
Certainly. I'm using this unit test to verify that:

import (
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"k8s.io/client-go/util/workqueue"
)

func TestWorkQueueDoneFun(t *testing.T) {
	q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "workflow_ttl_queue")
	q.Add("test1")
	// Done on an item that was never fetched with Get does not remove it;
	// the item is still dirty, so it is queued a second time.
	q.Done("test1")
	assert.Equal(t, 2, q.Len())
	e, _ := q.Get()
	assert.Equal(t, "test1", e)
	e, _ = q.Get()
	assert.Equal(t, "test1", e)
	assert.Equal(t, 0, q.Len())
	// Done does not cancel an item scheduled with AddAfter either.
	q.AddAfter("test2", time.Second*1)
	assert.Equal(t, 0, q.Len())
	q.Done("test2")
	time.Sleep(time.Second * 2)
	assert.Equal(t, 1, q.Len())
	e, _ = q.Get()
	assert.Equal(t, "test2", e)
	assert.Equal(t, 0, q.Len())
}
According to the …

func TestWorkQueueDoneFun2(t *testing.T) {
	q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "workflow_ttl_queue")
	// The workqueue is implemented internally using three fields: queue, dirty, processing

	// simple test
	q.Add("test1") // queue=[test1], dirty=[test1], processing=[]
	assert.Equal(t, 1, q.Len())
	e, _ := q.Get() // queue=[], dirty=[], processing=[test1]
	assert.Equal(t, "test1", e)
	q.Done("test1") // queue=[], dirty=[], processing=[]
	assert.Equal(t, 0, q.Len())

	// concurrent test
	q.Add("test2") // queue=[test2], dirty=[test2], processing=[]
	e, _ = q.Get() // queue=[], dirty=[], processing=[test2]
	go func() {
		time.Sleep(time.Second * 3)
		q.Done("test2") // queue=[test2], dirty=[test2], processing=[]
	}()
	q.Add("test2") // queue=[], dirty=[test2], processing=[test2]
	last := time.Now()
	e, _ = q.Get() // blocks until the Done call in the goroutine above runs
	// queue=[], dirty=[], processing=[test2]
	assert.True(t, time.Since(last) >= time.Second*2)
	assert.Equal(t, "test2", e)
	q.Done("test2") // queue=[], dirty=[], processing=[]
}

And since …
Maybe the explanation is not very clear; I'm waiting for your suggestions.
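For readers following the two tests above, the queue/dirty/processing bookkeeping can be pictured with a small toy model. This is only a sketch under stated assumptions: `toyQueue` and its methods are hypothetical names, there is no locking or rate limiting, and `Get` returns immediately instead of blocking. It is not the client-go implementation, just an illustration of why `Done` re-queues a dirty item rather than removing it.

```go
package main

import "fmt"

// toyQueue is a hypothetical, single-goroutine model of the three fields
// mentioned in the test comments: queue, dirty and processing.
// It is NOT the client-go workqueue.
type toyQueue struct {
	queue      []string
	dirty      map[string]bool
	processing map[string]bool
}

func newToyQueue() *toyQueue {
	return &toyQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add marks the item dirty and enqueues it, unless it is already dirty
// or is currently being processed (in which case Done re-queues it later).
func (q *toyQueue) Add(item string) {
	if q.dirty[item] {
		return
	}
	q.dirty[item] = true
	if q.processing[item] {
		return
	}
	q.queue = append(q.queue, item)
}

// Get pops the head of the queue and moves it from dirty to processing.
// Unlike the real queue, it does not block when the queue is empty.
func (q *toyQueue) Get() (string, bool) {
	if len(q.queue) == 0 {
		return "", false
	}
	item := q.queue[0]
	q.queue = q.queue[1:]
	q.processing[item] = true
	delete(q.dirty, item)
	return item, true
}

// Done clears the processing mark and re-queues the item if it is dirty.
// Nothing in here removes an entry from queue.
func (q *toyQueue) Done(item string) {
	delete(q.processing, item)
	if q.dirty[item] {
		q.queue = append(q.queue, item)
	}
}

// Len reports how many entries are waiting in queue.
func (q *toyQueue) Len() int { return len(q.queue) }

func main() {
	// Scenario from the first test: Done without a prior Get.
	q := newToyQueue()
	q.Add("test1")
	q.Done("test1")
	fmt.Println(q.Len()) // prints 2

	// Scenario from the second test: Add while the item is being processed.
	q = newToyQueue()
	q.Add("test2")
	q.Get()
	q.Add("test2")       // only marked dirty, not queued yet
	fmt.Println(q.Len()) // prints 0
	q.Done("test2")      // dirty, so it is queued again
	fmt.Println(q.Len()) // prints 1
}
```

In this model the only way an entry leaves `queue` is through `Get`, which matches the conclusion in the thread that a pending key cannot be withdrawn once it has been added.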
Ah I see, thanks for the explanation! That's a bummer that we can't …
Hmm, I had been looking at the … In any case, since we can't remove from the queue, this is pretty much the best we can do without using another data structure. Let me do a final look-over then.
LGTM with the informer store variant, which seems to be the best we can do without changing up the whole queue data structure.
Thanks for the fix!
Hmm, I was wondering how to test this, as it only occurs during a race condition and the success condition is that it's not deleted during a retry...
Hi @agilgur5, I tested it manually using the method described in the Verification part of #12905 (comment), and it does take tens of seconds to complete. Do you need me to contribute this as an E2E test? If so, I might need some guidance, such as which directory to put it in and previous tests to use as references.
If you can get it to work in the E2E tests that'd be great as a regression test!
…#12905) Signed-off-by: Shiwei Tang <[email protected]> Co-authored-by: Anton Gilgur <[email protected]> (cherry picked from commit 2095621)
Backported cleanly into …
…j#12636) (argoproj#12905) Signed-off-by: Shiwei Tang <[email protected]> Co-authored-by: Anton Gilgur <[email protected]>
Fixes #12636
Motivation
Try to fix #12636.
Modifications
Modify gc_controller.go to skip deletion of a workflow that is no longer completed because of a retry operation.
Verification
test workflow file: fix-12636.yaml
argo submit fix-12636.yaml, assuming the resulting name is fix-12636-dvw4d
argo retry fix-12636-dvw4d -p fail=false
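The change described under Modifications (the "informer store variant" mentioned in the review) can be pictured roughly as follows: when a key comes off the TTL queue, look up the latest known state of the workflow and skip the deletion if it is no longer completed. This is a minimal sketch, not the code in gc_controller.go; `workflow`, `store` and `shouldDelete` are hypothetical stand-ins, and a plain map plays the role of the informer store.

```go
package main

import "fmt"

// workflow is a hypothetical stand-in for the real Workflow object.
type workflow struct {
	Name      string
	Completed bool // assumed to become false again once a retry restarts the workflow
}

// store is a hypothetical stand-in for the informer store, keyed by "namespace/name".
type store map[string]workflow

// shouldDelete re-checks the latest known state of a workflow at the moment
// its key is taken off the queue, instead of trusting the state it had
// when the key was enqueued.
func shouldDelete(s store, key string) bool {
	wf, exists := s[key]
	if !exists {
		return false // already gone, nothing to delete
	}
	return wf.Completed // a workflow that is being retried is left alone
}

func main() {
	s := store{
		"argo/finished": {Name: "finished", Completed: true},
		"argo/retrying": {Name: "retrying", Completed: false},
	}
	for _, key := range []string{"argo/finished", "argo/retrying", "argo/missing"} {
		fmt.Printf("%s delete=%v\n", key, shouldDelete(s, key))
	}
}
```

Because the key cannot be withdrawn from the queue after a retry, the check has to happen at deletion time, which is the trade-off the thread settles on.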