
"Discarded" covers several different error-handling scenarios and they should be disambiguated #890

Open
TAGraves opened this issue Mar 14, 2023 · 12 comments


@TAGraves
Contributor

We have a job configured like:

    retry_on SomeError, wait: 5.seconds, attempts: 4 do |job, _exception|
      Rails.logger.warn("SomeError still occurred after retries were exhausted")
      # do something to handle this
    end

When the retries are exhausted, the block we pass to retry_on is called, but SomeError is still propagated up to GoodJob, causing the job to get discarded with SomeError. We'd instead like to treat the job as successful in this case.

The docs seem to suggest this should already be happening by default, but it isn't for us.

When using retry_on with a limited number of retries, the final exception will not be rescued and will raise to GoodJob's error handler. To avoid this, pass a block to retry_on to handle the final exception instead of raising it to GoodJob.

@bensheldon
Owner

bensheldon commented Mar 15, 2023

I think what you're describing is the intended behavior, and it matches how I understand "Discarded": a job that has errored, but not been retried, is discarded.

Active Job is vague on this, as far as I can tell, though it does have instrumentation for "retry_stopped" that is separate from the instrumentation for "discard".

Can you tell me a bit more about how you think a job being "discarded" differs from its retries being exhausted? You mention you want to mark the job as "succeeded", but maybe there's a third state to expose? (Are you using the Dashboard?)

The other, more manual thing I could suggest is not to use rescue_from and instead use around_perform, manually count executions, and call retry_job.
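Roughly, a sketch of that manual approach (the job class name, MAX_ATTEMPTS constant, and wait time below are illustrative assumptions, not something GoodJob provides):

    class ManualRetryJob < ApplicationJob
      MAX_ATTEMPTS = 4 # illustrative; not a GoodJob or Active Job constant

      around_perform do |job, block|
        block.call
      rescue SomeError
        if job.executions < MAX_ATTEMPTS
          # re-enqueue the job ourselves instead of letting the error propagate
          job.retry_job(wait: 5.seconds)
        else
          Rails.logger.warn("SomeError still occurred after retries were exhausted")
          # handle the final failure here; not re-raising lets the job finish cleanly
        end
      end

      def perform
        # ...
      end
    end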

@TAGraves
Contributor Author

I think from a practical perspective, what I want is:

  1. The job should show up under the "Succeeded" tab rather than the "Discarded" tab on the dashboard (and shouldn't be retriable anymore)
  2. good_job.on_thread_error shouldn't be called.

a job that has errored, but not been retried, is discarded

I think based on my reading of the documentation, it doesn't really make sense to consider this a job that has errored. I'm passing a block to handle the error, right?

I guess my original post wasn't totally clear on this, but I specifically mean when a block is passed to retry_on. It definitely seems like the correct behavior for the job to be discarded when retries are exhausted and a block is not passed to retry_on. But if I pass a block, I'm handling the error in the block and I want to treat it as rescued and handled. (I can always re-raise the error if I do want to explicitly give it back to GoodJob!)
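To illustrate (reusing the SomeError example from the original post; this is how I read the docs, not verified behavior): the block handles the final exception, and re-raising inside it would hand the error back to GoodJob when a discard is actually wanted:

    retry_on SomeError, wait: 5.seconds, attempts: 4 do |job, exception|
      Rails.logger.warn("SomeError exhausted retries for #{job.class}")
      # handle it here and let the job finish...
      # raise exception # ...or re-raise to explicitly give it back to GoodJob
    end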

maybe there's a third state to expose

That might be right. I'm not sure what the third state would be, but I do have other cases where I want to basically "ignore" certain errors that occur. We handle this right now by rescuing those errors inside of perform, but then the job shows up as "succeeded" in the dashboard, when what I might really want is "failed, but not retriable, not shown on the "Discarded" tab, and not reported by my error reporter to my monitoring tools."

@bensheldon
Owner

if I pass a block [to rescue_from], I'm handling the error in the block and I want to treat it as rescued and handled. (I can always re-raise the error if I do want to explicitly give it back to GoodJob!)

I find that compelling 👍🏻

In that case, how would you want the data about the job to show up in the Dashboard? Should that final execution of the job have an error attached to it? Or to ask a different way, how would you represent the difference between "this job finished without error" and "this job finished with error, but that's fine"?

The easiest thing here would be to add a new configuration option like retry_stopped_block_is_discard = true (default); you could set it to false, and then the final execution of the job wouldn't record the error on the Execution record. (Currently, discarded == (error IS NOT NULL AND retried_execution_id IS NULL).)
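For reference, that definition expressed as an Active Record query; this is illustration only, with the model and column names taken from the parenthetical above rather than checked against GoodJob's actual schema:

    # Illustration only: "discarded" per the definition above.
    discarded_executions = GoodJob::Execution
      .where.not(error: nil)
      .where(retried_execution_id: nil)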

But what I'm angling at is that I don't think that will be sufficient, because I imagine you'll still want to know what the final error was, to differentiate between "succeeded succeeded" and "errored but didn't explicitly discard". So I'm wondering if I should add another column, something like "error reason", that could store one of a few options:

  • unhandled
  • retry
  • retry_stopped
  • discarded

I'm OK with adding another tab to the dashboard too, but I'm just making sure I understand it all 😄

@TAGraves
Contributor Author

Or to ask a different way, how would you represent the difference between "this job finished without error" and "this job finished with error, but that's fine"?

Speaking just for ourselves, we don't really care about seeing this in the dashboard (just like we don't care about seeing rescued errors in the dashboard if our perform method has a rescue block at the end). We typically write a warning or error to the log if we do care about it and then use our APM and logging tools to find cases where jobs had errors that we didn't want to cause discards. So for us, the configuration option you proposed would be sufficient.

I think another tab on the dashboard is overkill, but showing the error when you inspect a job on the Succeeded tab would probably be good enough?

@bensheldon
Owner

Thanks for the back and forth. Just to really try to scope this down, is the core of what you're asking for: don't report an exception to the error handler if a rescue_from block is used?

It doesn't sound like you're using the named status of a job ("success" or "discarded"); tell me if I'm wrong, but it's just the behavior that's problematic. So would an option like retry_stopped_reports_error = true/false be enough?

@bensheldon
Owner

I spent a little more time tracing this through GoodJob. I'm surprised by this:

good_job.on_thread_error shouldn't be called.

It looks like that should already be the case if an exception is handled by Active Job (e.g. rescue_from with a block). on_thread_error is only called for unhandled errors, i.e. exceptions that propagate up from the job, and a rescue_from block should stop that.
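For example, a minimal sketch of what "handled by Active Job" means here (the job and error class names are illustrative):

    class SomeJob < ApplicationJob
      # The exception never propagates out of the job, so GoodJob should record
      # it as a handled_error and on_thread_error should not be called.
      rescue_from(SomeError) do |exception|
        Rails.logger.warn("Handled #{exception.class}: #{exception.message}")
      end

      def perform
        # ...
      end
    end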

Here's where the result object is populated:

  ExecutionResult.new(value: value, handled_error: handled_error, retried: current_thread.error_on_retry.present?)
rescue StandardError => e
  instrument_payload[:unhandled_error] = e
  ExecutionResult.new(value: nil, unhandled_error: e)

And here is where that's sent to the error callback:

error = thread_error || (output.is_a?(GoodJob::ExecutionResult) ? output.unhandled_error : nil)
GoodJob._on_thread_error(error) if error

So I think half of what you'd like should already be the case... which is troubling if it's not working properly 🤔

@TAGraves
Contributor Author

Hmm, you're right that it's not hitting on_thread_error 🤔. I'll have to look deeper to understand where our error-reporting code is reporting these errors to our monitoring tools. Sorry about the misleading point there!

It doesn't sound like you're using the named status of a job ("success" or "discarded"); tell me if I'm wrong, but it's just the behavior that's problematic.

I'm not sure I'm totally following the question here, but maybe this will help. The biggest thing for us is that we have an engineer triage all the jobs on the "Discarded" tab each morning to figure out whether we need to manually retry them, destroy them, put a code fix in for them, etc. So these jobs showing up there is disruptive to our workflow, since we know we never want to retry them. Right now we destroy them, but I'd really prefer they not show up in the Discarded tab in the first place. Does that make sense?

@bensheldon
Owner

@TAGraves that's really helpful context! (Also, I realized you opened #844, which is related, though different.)

You are making me believe more strongly that I should record some context about the error on the job/execution record. That's the first step to being able to answer the question "what should we do with this job that has an error on it?", and then secondarily we can figure out the workflow/dashboard/etc.

@bensheldon bensheldon changed the title How can I avoid a job being discarded when retries are exhausted? "Discarded" covers several different error-handling scenarios and they should be disambiguated Mar 21, 2023
@bensheldon
Owner

bensheldon commented Jul 6, 2023

@TAGraves fyi, I started tracking the "error_event" in the database now (#995). I'm surfacing it slightly in the Dashboard, but there aren't currently any filters for drilling down into it. I wasn't quite sure whether I should add a new tab for each item, or make a single tab like "Failed" that could then be drilled down into further. I'd love your feedback.

[Screenshot: the error_event surfaced in the Dashboard]

The available values are:

enum error_event: {
  ERROR_EVENT_INTERRUPTED => 0,
  ERROR_EVENT_UNHANDLED => 1,
  ERROR_EVENT_HANDLED => 2,
  ERROR_EVENT_RETRIED => 3,
  ERROR_EVENT_RETRY_STOPPED => 4,
  ERROR_EVENT_DISCARDED => 5,
}.freeze, _prefix: :error_event

@shouichi
Contributor

We retry everything except specific errors that are specified by the discard_on macro. The problem we have is that there seems to be no way to distinguish between unhandled errors (where retrying was exhausted) and handled errors (specified by discard_on). We want to ignore handled errors and only want to know about unhandled ones.

class ApplicationJob < ActiveJob::Base
  # We basically retry everything.
  retry_on StandardError, wait: :exponentially_longer, attempts: 13

  # We know it's safe to discard and don't want to see them but
  # jobs with this error appear in the discarded jobs dashboard.
  discard_on ActiveJob::DeserializationError
end

That being said, I think #995 is a good start!

@bensheldon
Owner

@shouichi thank you for the feedback 💖 I'm hopeful that we can get to that level of granularity.

Do you think the error_event types introduced in #995 cover all of the scenarios you need? I wanted to get some feedback on those before updating the Dashboard. I know many folks have written code around the existing terminology and want to get the deprecation cycle right when I fully switch over to the more granular failure types.

In your scenario, I believe that you would end up with jobs in ERROR_EVENT_DISCARDED, ERROR_EVENT_RETRIED, and ERROR_EVENT_RETRY_STOPPED.

@shouichi
Contributor

I quickly read #995 and the following is my understanding.

  • ERROR_EVENT_INTERRUPTED: Interrupted by OS signal or web UI.
  • ERROR_EVENT_UNHANDLED: Exception was unhandled and bubbled up to GoodJob.
  • ERROR_EVENT_HANDLED: ???
  • ERROR_EVENT_RETRIED: retry_on handled the exception and the job was retried.
  • ERROR_EVENT_RETRY_STOPPED: retry_on tried to handle the exception but retry attempts were exhausted and the error was bubbled up to GoodJob.
  • ERROR_EVENT_DISCARDED: discard_on handled the exception and the job was discarded.

Is this understanding correct? If so, #995 should cover our scenario. Thank you!

Not directly related to this issue, but I personally don't need to see ERROR_EVENT_DISCARDED jobs in the dashboard, because I know I want to discard them with discard_on. They can be destroyed automatically.
