-
-
Notifications
You must be signed in to change notification settings - Fork 936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQS: when redelivering a message apply visibility timeout based on wait_time_seconds (long polling time) #2049
base: main
Are you sure you want to change the base?
Conversation
Issue also reported here: celery/celery#7283 back in 2022. |
Do you need anything else to merge this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also consider updating related documents for this?
I haven’t fully reviewed this PR yet, but I offer an alternative approach at celery/celery#9213. @pbudzon check it out, it might solve your original problem (still using 10s). |
@Nusnus I'll see if we can upgrade internally to celery 5.5 and if this new approach solves the problem. The background for this specific issue, as far as I understood, was caused by a race condition inside kombu where ReceiveMessage calls are made by using an async library and happen "in the background" of everything else - and they are not properly (or quickly enough?) terminated when shutdown happens, so part of the code restoring messages to the queue on shutdown can happen while there's an active ReceiveMessage call still "running" which then immediately fetches the restored message (removing it from the queue or hiding it again, depending on That is my understanding based partially on what I saw in the code and on this: #1819 So, with the new shutdowns, I imagine this would still be an issue with the hard shutdown and cold shutdown. Others - I cannot say without testing. That would mainly depend on whether the async ReceiveMessage is killed fast enough, but that's controlled by kombu not celery, so I'm not sure if it's even possible to solve this from celery's side (but I don't know all the source code, so again, I don't know for sure). What would you say is the best approach to terminating celery when we have time constraints (like in containers services like ECS, etc, where we have 1-2 minutes to exit cleanly) and long-running tasks - so we are fairly sure the tasks won't be able to finish within the time limit anyway, but we want to have absolute certainty they will end up back in the queue? |
🔥
Basically yeah, the worker terminates too fast. It is very confusing during debugging because the debugger keeps the worker process alive, but I am able to replicate it using the smoke tests so I could test automatically.
The approach with the soft shutdown is to allow the user to define a timer before everything collapses. 10-30 seconds should be enough. It means it will give a “limited warm shutdown” of 10-30s (for example) so the tasks will run a bit more. Another The hard shutdown is never used directly. It is the result of multiple SIGQUIT signals (or SIGINT or SIGTERM if REMAP_SIGTERM is configured). But yes, it will do the same as now, forcing termination of the worker process immediately. |
Celery v5.5.0b3 released. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there @pbudzon, have you tried v5.5 beta?
Do you still need the changes of this PR or does the soft shutdown fix your use-case?
Thank!
@Nusnus apologies, we did not have a chance to test the 5.5 yet. I'm hoping we'll be able to do that in the coming weeks - I'd like to keep this PR open until we can confirm this issue has been resolved, please. |
Fix for: #1835
Partially also: #1819
Possibly related: celery/celery#8583
Issue
when not using short polling (
wait_time_seconds > 0
) andtask_acks_late=True
it is possible for message to become stuck with full visibility timeout when celery is shut down. I was able to replicate that consistently with cold shutdowns.Similar issue happens with warm shutdowns (see related discussion/bugs above). And with
task_acks_late=False
I believe this causes the message to become lost completely.To reproduce
SIGQUIT
to celery master process.Restoring X unacknowledged message(s)
.Note: this is a race condition, so it doesn't happen every time, but in my tests it happened MOST of the time (like 80%).
What happens
I believe the internal details are described in #1819.
On AWS side, you can see in CloudTrail logs the
ChangeMessageVisibility
call appearing correctly, but right after it there is (at least one, sometimes more, depending on yourconcurrency
) aReceiveMessage
call which fetches the message again from the queue, causing its visibility timeout to be set back to non-zero (whatever your settings are).This fix
This change changes the visibility timeout of the redelivered message from 0 (zero) to
wait_time_seconds
value. While this doesn't solve the underlying issue, it prevents the message from being re-fetched by the still-runningReceiveMessage
call(s).For most scenarios this should create a considerable improvement, because:
task_acks_late=True
, the message will be hidden forwait_time_seconds
(default 10 seconds), instead of visibility timeout (default 30 minutes). I have tested that change in this scenario and it is working as expected.task_acks_late=False
, this should prevent the message from being lost completely. I did not test that scenario, however.