Gracefully handle NameError while workers are restarting #1546
I'm always very conscious when making changes to jobs precisely because of this. Adding a job, removing one, or changing arguments can all lead to trouble when the server and worker don't agree on what is what. I usually make multiple deploys to sidestep this (even with Sidekiq, which I know would handle these errors for me; it just seems safer). That said, I think some kind of implicit error handling for this does make sense. Sidekiq, for example, has a retry mechanism where it retries a number of times over a month or so. For errors happening before the job even has a chance to start, I think that makes sense.
I'm open to it. Just thinking about the situations I encounter:
I feel like I'd want to scope this particular issue to "deploy rollout" related problems, and do our best not to mask the other problems (or at least not mask them for too long). That makes me think the config would be something like "config.graceful_job_rollout_period = 15.minutes", which would have the effect of:
(I jammed ArgumentErrors into this, but feel free to exclude them if you think that's fundamentally different.) Or, a totally different thought: should we rethink "retry_on_unhandled_error"? It could take a proc with the error as an argument, and the proc could return the number of seconds to wait before retrying. (Heaven knows what happens if the proc itself raises an exception.) I don't love this idea because I frequently feel like people interpret "advanced usage" as something to aspire to, and I'd rather build explicit features. But I want to open the door in case there are lots of other use cases that would benefit from a more generic change.
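A minimal sketch of what that proc-based variant could look like. This is purely hypothetical: the proc-accepting form and the "return seconds, or nil for no retry" contract are assumptions taken from the comment above, not existing GoodJob behaviour.

```ruby
# config/initializers/good_job.rb
# Hypothetical sketch only: today retry_on_unhandled_error is an on/off
# setting, so a proc-accepting variant like this does not actually exist.
Rails.application.configure do
  config.good_job.retry_on_unhandled_error = ->(error) do
    case error
    when NameError, ArgumentError
      30  # seconds to wait before retrying rollout-related failures
    else
      nil # nil: don't retry, keep today's behaviour
    end
  end
end
```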
For
Wouldn't that just reimplement what you can do with
More generally, this would help with maybe 90% of the problems. If a job is deleted but still enqueued, no matter the retries, it will always fail. Same if an argument is removed or added without the job being prepared for it. It helps with newly enqueued jobs, but really you need to be mindful of already-enqueued jobs and probably make these changes across multiple deploys.
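Assuming the dropped reference above is Active Job's retry_on, this is roughly the per-error retry behaviour that already exists on a job class (the job and error names here are illustrative):

```ruby
class SyncAccountJob < ApplicationJob
  # wait: accepts a duration or a proc of the current execution count,
  # so per-error backoff is already expressible at the job-class level.
  retry_on Timeout::Error, wait: 5.seconds, attempts: 5
  retry_on ActiveRecord::Deadlocked, wait: ->(executions) { (executions**2).seconds }, attempts: 3

  def perform(account_id)
    # ...
  end
end
```

The limitation, as the issue description below points out, is that these declarations live on the job class itself and only apply once that class has been deserialized.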
Thanks for those comments! That's helpful, and it reminds me why these things are slightly different:
So I guess I'll reduce the scope back down to: it would be nice to handle NameError specifically from job deserialization. I think it makes sense for Active Job to raise a special error, though maybe slightly different from DeserializationError, to distinguish between:
I agree that this should just be about NameError.
Yeah, I reckon in an ideal world Active Job would have some kind of story around how this is handled, because it's going to crop up for any queue adapter. Maybe it would honour the retry semantics set on
That's a good point. Although it would still be somewhat difficult to distinguish between the error happening inside your job and before it executes. For now, I opened rails/rails#53770
Here's a deployment scenario we encountered: while workers are restarting during a deploy, they pick up jobs they cannot deserialize and raise NameError.
We cannot configure retries for this error via Active Job, because that requires the job constant to be deserialized in order to instantiate the job instance. (The error occurs here in Active Job.)
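Roughly, the failing step looks like this; a simplified paraphrase of Active Job's deserialization, not the exact Rails source:

```ruby
# Simplified paraphrase: the job class constant is resolved before any job
# instance exists, so retry_on/rescue_from handlers on the class never run.
def self.deserialize(job_data)
  job = job_data["job_class"].constantize.new # NameError is raised here if the class is gone
  job.deserialize(job_data)
  job
end
```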
It would be nice if Good Job could gracefully handle this somehow. I could turn on config.retry_on_unhandled_error, but I don't want to because of the potential downsides of that. Perhaps this config option could be extended to allow error types to be specified:
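For example, something along these lines; a hypothetical sketch of the proposal, not an option GoodJob currently supports:

```ruby
# config/initializers/good_job.rb
# Hypothetical: allow retry_on_unhandled_error to name specific error classes
# (here just NameError) instead of being a blanket on/off switch.
Rails.application.configure do
  config.good_job.retry_on_unhandled_error = [NameError]
end
```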
However, I feel like this would still only be useful if there were some kind of exponential backoff built in. (Or at least a retry interval at the bare minimum, so we don't DoS ourselves.)
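For what it's worth, the kind of backoff being asked for is cheap to express; an illustrative calculation only, not part of any existing API:

```ruby
# Illustrative backoff: exponential growth, capped, with a little jitter so
# retried NameError jobs don't all hammer the queue at the same moment.
def backoff_seconds(executions)
  [2**executions, 10 * 60].min + rand(0..5)
end
```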