
"Discarded" covers several different error-handling scenarios and they should be disambiguated #890

Open
TAGraves opened this issue Mar 14, 2023 · 12 comments


@TAGraves
Contributor

We have a job configured like:

    retry_on SomeError, wait: 5.seconds, attempts: 4 do |job, _exception|
      Rails.logger.warn("SomeError still occurred after retries were exhausted")
      # do something to handle this
    end

When the retries are exhausted, the block we pass to retry_on is called, but SomeError is still propagated up to GoodJob, causing the job to get discarded with SomeError. We'd instead like to treat the job as successful in this case.

The docs seem to suggest this should already be happening by default, but it isn't for us.

When using retry_on with a limited number of retries, the final exception will not be rescued and will raise to GoodJob's error handler. To avoid this, pass a block to retry_on to handle the final exception instead of raising it to GoodJob.

@bensheldon
Owner

bensheldon commented Mar 15, 2023

I think what you're describing is the intended behavior, and it matches how I understand "Discarded": a job that has errored, but not been retried, is discarded.

Active Job is vague on this, as far as I can tell, though it does have instrumentation for "retry_stopped" that is separate from the instrumentation for "discard".

Can you tell me a bit more about how you think a job being "discarded" differs from its retries being exhausted? You mention you want to mark the job as "succeeded", but maybe there's a third state to expose? (Are you using the Dashboard?)

The other, more manual thing I could suggest is not to use rescue_from and instead use around_perform, manually count executions, and call retry_job.
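Roughly, a sketch of that manual approach (the job class name, MAX_ATTEMPTS constant, and wait time below are illustrative assumptions, not something GoodJob provides):

    class ManualRetryJob < ApplicationJob
      MAX_ATTEMPTS = 4 # illustrative; not a GoodJob or Active Job constant

      around_perform do |job, block|
        block.call
      rescue SomeError
        if job.executions < MAX_ATTEMPTS
          # re-enqueue the job ourselves instead of letting the error propagate
          job.retry_job(wait: 5.seconds)
        else
          Rails.logger.warn("SomeError still occurred after retries were exhausted")
          # handle the final failure here; not re-raising lets the job finish cleanly
        end
      end

      def perform
        # ...
      end
    end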

@TAGraves
Contributor Author

I think from a practical perspective, what I want is:

  1. The job should show up under the "Succeeded" tab rather than the "Discarded" tab on the dashboard (and shouldn't be retriable anymore)
  2. good_job.on_thread_error shouldn't be called.

a job that has errored, but not been retried, is discarded

I think based on my reading of the documentation, it doesn't really make sense to consider this a job that has errored. I'm passing a block to handle the error, right?

I guess my original post wasn't totally clear on this, but I specifically mean when a block is passed to retry_on. It definitely seems like the correct behavior for the job to be discarded when retries are exhausted and a block is not passed to retry_on. But if I pass a block, I'm handling the error in the block and I want to treat it as rescued and handled. (I can always re-raise the error if I do want to explicitly give it back to GoodJob!)
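To illustrate (reusing the SomeError example from the original post; this is how I read the docs, not verified behavior): the block handles the final exception, and re-raising inside it would hand the error back to GoodJob when a discard is actually wanted:

    retry_on SomeError, wait: 5.seconds, attempts: 4 do |job, exception|
      Rails.logger.warn("SomeError exhausted retries for #{job.class}")
      # handle it here and let the job finish...
      # raise exception # ...or re-raise to explicitly give it back to GoodJob
    end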

maybe there's a third state to expose

That might be right. I'm not sure what the third state would be, but I do have other cases where I want to basically "ignore" certain errors that occur. We handle this right now by rescuing those errors inside of perform, but then the job shows up as "succeeded" in the dashboard, when what I might really want is "failed, but not retriable, not shown on the "Discarded" tab, and not reported by my error reporter to my monitoring tools."

@bensheldon
Owner

if I pass a block [to rescue_from], I'm handling the error in the block and I want to treat it as rescued and handled. (I can always re-raise the error if I do want to explicitly give it back to GoodJob!)

I find that compelling 👍🏻

In that case, how would you want the data about the job to show up in the Dashboard? Should that final execution of the job have an error attached to it? Or to ask a different way, how would you represent the difference between "this job finished without error" and "this job finished with error, but that's fine"?

The easiest thing here would be to add a new configuration option like retry_stopped_block_is_discard = true (default); you could set it to false, and then the final execution of the job wouldn't record the error on the Execution record. (Currently, discarded == (error IS NOT NULL AND retried_execution_id IS NULL).)
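For reference, that definition expressed as an Active Record query; this is illustration only, with the model and column names taken from the parenthetical above rather than checked against GoodJob's actual schema:

    # Illustration only: "discarded" per the definition above.
    discarded_executions = GoodJob::Execution
      .where.not(error: nil)
      .where(retried_execution_id: nil)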

But what I'm angling at is that I don't think that will be sufficient, because I imagine you'll still want to know what the final error was, to differentiate between "succeeded succeeded" and "errored but didn't explicitly discard". So I'm wondering if I should add another column, something like "error reason", that could store one of a few options:

  • unhandled
  • retry
  • retry_stopped
  • discarded

I'm OK with adding another tab to the dashboard too, but I'm just making sure I understand it all 😄

@TAGraves
Contributor Author

Or to ask a different way, how would you represent the difference between "this job finished without error" and "this job finished with error, but that's fine"?

Speaking just for ourselves, we don't really care about seeing this in the dashboard (just like we don't care about seeing rescued errors in the dashboard if our perform method has a rescue block at the end). We typically write a warning or error to the log if we do care about it and then use our APM and logging tools to find cases where jobs had errors that we didn't want to cause discards. So for us, the configuration option you proposed would be sufficient.

I think another tab on the dashboard is overkill, but showing the error when you inspect a job on the Succeeded tab would probably be good enough?

@bensheldon
Owner

Thanks for the back and forth. Just to really try to scope this down, is the core of what you're asking for: don't report an exception to the error handler if a rescue_from block is used?

It doesn't sound like you're using the named status of a job ("success" or "discarded"); tell me if I'm wrong, but it's just the behavior that's problematic. So would an option like retry_stopped_reports_error = true/false be enough?

@bensheldon
Owner

I spent a little more time tracing this through GoodJob. I'm surprised by this:

good_job.on_thread_error shouldn't be called.

It looks like that should already be the case if an exception is handled by Active Job (e.g. rescue_from with a block). on_thread_error is only called for unhandled errors, i.e. exceptions that propagate up from the job, and a rescue_from block should stop that.
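For example, a minimal sketch of what "handled by Active Job" means here (the job and error class names are illustrative):

    class SomeJob < ApplicationJob
      # The exception never propagates out of the job, so GoodJob should record
      # it as a handled_error and on_thread_error should not be called.
      rescue_from(SomeError) do |exception|
        Rails.logger.warn("Handled #{exception.class}: #{exception.message}")
      end

      def perform
        # ...
      end
    end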

Here's where the result object is populated:

  ExecutionResult.new(value: value, handled_error: handled_error, retried: current_thread.error_on_retry.present?)
rescue StandardError => e
  instrument_payload[:unhandled_error] = e
  ExecutionResult.new(value: nil, unhandled_error: e)

And here is where that's sent to the error callback:

error = thread_error || (output.is_a?(GoodJob::ExecutionResult) ? output.unhandled_error : nil)
GoodJob._on_thread_error(error) if error

So I think half of what you'd like should already be the case... which is troubling if it's not working properly 🤔

@TAGraves
Contributor Author

Hmm, you're right that it's not hitting on_thread_error 🤔. I'll have to look deeper to understand where our error-reporting code is reporting these errors to our monitoring tools. Sorry about the misleading point there!

It doesn't sound like you're using the named status of a job ("success" or "discarded"); tell me if I'm wrong, but it's just the behavior that's problematic.

I'm not sure I'm totally following the question here, but maybe this will help. The biggest thing for us is that we have an engineer triage all the jobs on the "Discarded" tab each morning to figure out whether we need to manually retry them, destroy them, put a code fix in for them, etc. So these jobs showing up there is disruptive to our workflow, since we know we never want to retry them. Right now we destroy them, but I'd really prefer they not show up in the Discarded tab in the first place. Does that make sense?

@bensheldon
Owner

@TAGraves that's really helpful context! (Also, I realized you opened #844, which is related, though different.)

You are making me believe more strongly that I should record some context about the error on the job/execution record. That's the first step to being able to answer the question "what should we do with this job that has an error on it?", and then secondarily we can figure out the workflow/dashboard/etc.

@bensheldon bensheldon changed the title How can I avoid a job being discarded when retries are exhausted? "Discarded" covers several different error-handling scenarios and they should be disambiguated Mar 21, 2023
@bensheldon
Owner

bensheldon commented Jul 6, 2023

@TAGraves fyi, I started tracking the "error_event" in the database now (#995). I'm surfacing it slightly in the Dashboard, but there aren't currently any filters for drilling down into it. I wasn't quite sure whether I should add a new tab for each item, or make a single tab like "Failed" that could then be drilled down into further. I'd love your feedback.

[Screenshot: the error_event surfaced in the Dashboard]

The available values are:

enum error_event: {
  ERROR_EVENT_INTERRUPTED => 0,
  ERROR_EVENT_UNHANDLED => 1,
  ERROR_EVENT_HANDLED => 2,
  ERROR_EVENT_RETRIED => 3,
  ERROR_EVENT_RETRY_STOPPED => 4,
  ERROR_EVENT_DISCARDED => 5,
}.freeze, _prefix: :error_event

@shouichi
Contributor

We retry everything except specific errors that are specified by the discard_on macro. The problem we have is that there seems to be no way to distinguish between unhandled errors (where retrying was exhausted) and handled errors (specified by discard_on). We want to ignore handled errors and only want to know about unhandled ones.

class ApplicationJob < ActiveJob::Base
  # We basically retry everything.
  retry_on StandardError, wait: :exponentially_longer, attempts: 13

  # We know it's safe to discard and don't want to see them but
  # jobs with this error appear in the discarded jobs dashboard.
  discard_on ActiveJob::DeserializationError
end

That being said, I think #995 is a good start!

@bensheldon
Owner

@shouichi thank you for the feedback 💖 I'm hopeful that we can get to that level of granularity.

Do you think the error_event types introduced in #995 cover all of the scenarios you need? I wanted to get some feedback on those before updating the Dashboard. I know many folks have written code around the existing terminology and want to get the deprecation cycle right when I fully switch over to the more granular failure types.

In your scenario, I believe that you would end up with jobs in ERROR_EVENT_DISCARDED, ERROR_EVENT_RETRIED, and ERROR_EVENT_RETRY_STOPPED.

@shouichi
Contributor

I quickly read #995 and the following is my understanding.

  • ERROR_EVENT_INTERRUPTED: Interrupted by OS signal or web UI.
  • ERROR_EVENT_UNHANDLED: Exception was unhandled and bubbled up to GoodJob.
  • ERROR_EVENT_HANDLED: ???
  • ERROR_EVENT_RETRIED: retry_on handled the exception and the job was retried.
  • ERROR_EVENT_RETRY_STOPPED: retry_on tried to handle the exception but retry attempts were exhausted and the error was bubbled up to GoodJob.
  • ERROR_EVENT_DISCARDED: discard_on handled the exception and the job was discarded.

Is this understanding correct? If so, #995 should cover our scenario. Thank you!

Not directly related to this issue, but I personally don't need to see ERROR_EVENT_DISCARDED jobs in the dashboard, because I know I want to discard them with discard_on. They can be destroyed automatically.
