"Discarded" covers several different error-handling scenarios and they should be disambiguated #890
Comments
I think what you're describing is the intended behavior, which is what I understand "Discarded" to mean: a job that has errored, but not been retried, is discarded. Active Job is vague on this so far as I can tell, though it does have instrumentation for "retry_stopped" which is different from the instrumentation for "discard". Can you tell me a bit more about how you think a job being "discarded" is different from retries being exhausted? You mention you want to mark the job as "succeeded" but maybe there's a third state to expose? (Are you using the Dashboard?) The other thing I could suggest, more manually, is for you not to use
I think from a practical perspective, what I want is:
I think based on my reading of the documentation, it doesn't really make sense to consider this a job that has errored. I'm passing a block to handle the error, right? I guess my original post wasn't totally clear on this, but I specifically mean when a block is passed to
That might be right. I'm not sure what the third state would be, but I do have other cases where I want to basically "ignore" certain errors that occur. We handle this right now by rescuing those errors inside of |
I find that compelling 👍🏻 In that case, how would you want the data about the job to show up in the Dashboard? Should that final execution of the job have an error attached to it? Or to ask a different way, how would you represent the difference between "this job finished without error" and "this job finished with error, but that's fine"? The easiest thing here would be to add a new configuration that is like

But what I'm angling at is that I don't think that will be sufficient, because I imagine you'll still want to know what the final error was to differentiate between "succeeded succeeded" and "errored but didn't explicitly discard". So I'm wondering if I should add another column that's like "error reason" that could store some different options of:
I'm ok adding another tab to the dashboard too, but just making sure I understand it all 😄 |
Speaking just for ourselves, we don't really care about seeing this in the dashboard (just like we don't care about seeing rescued errors in the dashboard if our I think another tab on the dashboard is overkill, but I think showing the error when you inspect a job on the Succeeded tab would probably be good enough? |
Thanks for the back and forth. I guess just to really try to scope this down, is the core of what you're asking for: don't report an exception to the error handler if a rescue_from block is used? It doesn't sound like you're using the named status of a job ("success" or "discarded"), so tell me if I'm wrong, it's just the behavior that's problematic. So would the option of
I spent a little more time tracing this through GoodJob. I'm surprised by this:
It looks like that should be the case already if an exception is handled by Active Job (e.g. rescue_from with a block). It's only Here's where the result object is populated: good_job/app/models/good_job/execution.rb Lines 352 to 355 in c7abca3
And here is where that's sent to the error callback: good_job/lib/good_job/scheduler.rb Lines 200 to 201 in c7abca3
So I think half of what you'd like should be the case... so that's troubling if that's not working properly 🤔 |
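To make the handled/unhandled distinction in this comment concrete, here is a minimal plain-Ruby sketch. It is not GoodJob's code: `ExecutionOutcome`, `execute`, and `report` are hypothetical names, and the only behavior taken from the comment is that an error handled by a rescue block should not reach the error callback.

```ruby
# Hypothetical result object; the field names are assumptions for illustration.
ExecutionOutcome = Struct.new(:value, :handled_error, :unhandled_error, keyword_init: true)

# Runs the block. `handlers` maps an error class to a callable, loosely
# standing in for a rescue_from-style block (an assumption of this sketch).
def execute(handlers: {})
  ExecutionOutcome.new(value: yield)
rescue StandardError => e
  handler = handlers.find { |klass, _| e.is_a?(klass) }&.last
  return ExecutionOutcome.new(unhandled_error: e) unless handler

  handler.call(e)
  ExecutionOutcome.new(handled_error: e)
end

# Only unhandled errors are forwarded to the error callback.
def report(outcome, on_error:)
  on_error.call(outcome.unhandled_error) if outcome.unhandled_error
end

reported = []
notify = ->(error) { reported << error.class.name }

handled = execute(handlers: { ArgumentError => ->(_e) {} }) { raise ArgumentError, "ignored" }
unhandled = execute { raise KeyError, "surprise" }

report(handled, on_error: notify)
report(unhandled, on_error: notify)
```

In this sketch the handled error is still recorded on the outcome, so it could be shown when inspecting the job, but only the `KeyError` is passed to the callback.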
Hmm, you're right that it's not hitting
I'm not sure I'm totally following the question here, but maybe this will help. The biggest thing for us is that we have an engineer triage all the jobs on the "discarded" tab each morning to understand if we need to manually retry them, destroy them, put a code fix in for them, etc. So these jobs showing up there is disruptive to our workflow since we know we never want to retry them. Right now we destroy them, but really I'd prefer they don't show up in the discarded tab in the first place. Does that make sense?
@TAGraves that's really helpful context! (Also, I realized you also opened #844 which is related, though different) You are making me believe more strongly that I should record some context about the error that is stored on the job/execution record. That's the first step to being able to answer the question "what should we do with this job that has an error on it", and then secondarily we can figure out the workflow/dashboard/etc. |
@TAGraves fyi, I started tracking the "error_event" in the database now (#995). I'm surfacing it slightly in the Dashboard, but there aren't currently any filters for drilling down into it. I wasn't quite sure if I should add new tabs for each item, or make a single tab like "Failed" that could then be further drilled down into. I'd love your feedback. The available values are: good_job/app/models/concerns/good_job/error_events.rb Lines 17 to 24 in 079db52
We retry everything except specific errors that are specified by the
That being said, I think #995 is a good start! |
@shouichi thank you for the feedback 💖 I'm hopeful that we can get to that level of granularity. Do you think the error_event types introduced in #995 cover all of the scenarios you need? I wanted to get some feedback on those before updating the Dashboard. I know many folks have written code around the existing terminology and want to get the deprecation cycle right when I fully switch over to the more granular failure types. In your scenario, I believe that you would end up with jobs in |
I quickly read #995 and the following is my understanding.
Is this understanding correct? If so, #995 should cover our scenario. Thank you! Not directly related to this issue but I personally don't need to see |
We have a job configured like:
When the retries are exhausted, the block we pass to retry_on is called, but SomeError is still propagated up to GoodJob, causing the job to get discarded with SomeError. We'd instead like to treat the job as successful in this case. The docs seem to suggest this should already be happening by default, but it isn't for us.
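The expected behavior can be illustrated with a small plain-Ruby simulation. This is only a sketch of the contract described in the Active Job documentation for `retry_on` with a block; it is not the job from this report and not Active Job's implementation. `run_with_retries` and its `on_exhausted:` parameter are hypothetical names.

```ruby
# Stand-in for the error class named in the report.
class SomeError < StandardError; end

# Hypothetical helper, NOT Active Job's implementation. It mimics the documented
# contract of `retry_on SomeError, attempts: n do |job, error| ... end`:
# once attempts run out, the block (if given) is called instead of re-raising.
def run_with_retries(attempts:, on_exhausted: nil)
  tries = 0
  begin
    tries += 1
    yield tries
    :succeeded
  rescue SomeError => e
    retry if tries < attempts
    raise unless on_exhausted # no block: the error bubbles up to the backend

    on_exhausted.call(e)
    :handled # block given: the error is swallowed here
  end
end

exhausted_messages = []
outcome = run_with_retries(attempts: 3, on_exhausted: ->(e) { exhausted_messages << e.message }) do |n|
  raise SomeError, "attempt #{n} failed"
end
```

Under that contract, the queue backend would never see `SomeError` when a block is supplied, which is why the job ending up discarded is surprising.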