
Add new option to server to report crashes/exceptions #377

Closed
beligante opened this issue Jul 7, 2024 · 7 comments

Comments

@beligante
Contributor

beligante commented Jul 7, 2024

Is your feature request related to a problem? Please describe.

Hello Team!

Currently at my work, the team faces a challenge when it comes to reporting crashes to tools such as Sentry or Bugsnag. Internally we use a fork of this project where we added some implementation details to report crashes from inside Cowboy.Handler.

So, I thought of a proposal where we could introduce something generic and flexible enough that anyone can use it for their own purposes.

My idea here is to introduce an option on the gRPC server startup where you can pass a callback to be invoked whenever an exception occurs (or whatever the maintainers consider an exception worth reporting). The callback receives the exception, and internally any implementation can be used.

Describe the solution you'd like

The new option would be passed at server startup, e.g.:

# bugsnag (no sanitization)
Supervisor.init(endpoint: MyEndpoint, port: 1234, start_server: true, crash_report: &Bugsnag.report/1)

# sentry
Supervisor.init(endpoint: MyEndpoint, port: 1234, start_server: true, crash_report: &Sentry.capture_exception/1)

# own impl
Supervisor.init(endpoint: MyEndpoint, port: 1234, start_server: true, crash_report: &MyOwnReport.capture/1)
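
To make the intent concrete, here's a rough sketch of how the server internals could invoke such a callback; the option name crash_report matches the examples above, but maybe_report_crash and its placement are assumptions for illustration, not existing grpc-elixir code:

# Hypothetical helper inside the server/adapter code path. If a
# :crash_report callback was configured, invoke it with the exception;
# otherwise do nothing. Reporting failures are swallowed so a broken
# reporter can never take the handler down with it.
defp maybe_report_crash(opts, exception) do
  case Keyword.get(opts, :crash_report) do
    report when is_function(report, 1) ->
      try do
        report.(exception)
      catch
        _kind, _reason -> :ok
      end

    _ ->
      :ok
  end
end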

Describe alternatives you've considered

We could alternatively add these report calls at the service-level implementation, such as:

defmodule FeatureServer do
  use GRPC.Server, service: Routeguide.RouteGuide.Service

  def get_feature(point, _stream) do
    # something that can generate exceptions
  catch
    kind, e -> Bugsnag.report(e, __STACKTRACE__)  
  end
end

Though, adding the report at that level has its limitations. For example:

  1. We can only capture exceptions raised by the function used to resolve the request (which is usually what we're interested in), but at this level we don't have access to the request ( :cowboy.req ) to add another layer of information about that execution
  2. We can't capture exceptions related to issues at the gRPC connection level that can at some point terminate the connection (such as the client canceling the request or closing the connection abruptly)

Additional context
I made a WIP PR to better illustrate the idea; there's some polishing required, but I think I can add more depth to my request. It's on my fork for now. A good part of the work there was inspired by @darrenclark

beligante#3

@beligante
Contributor Author

Another interesting option was presented to me today: Logger metadata, specifically the crash_reason metadata.

The change in this case would be that, instead of invoking a callback, we would just produce an error log, passing the exception as part of the metadata. E.g.:

def info({:EXIT, pid, {:handle_error, error}} = err, req, state = %{pid: pid}) do
    Logger.warning("3. #{inspect(state)} #{inspect(err)}")

    # Reply to the client with a generic error, keeping the original
    # `error` (with its kind/stack) around for reporting purposes.
    rpc_error = %RPCError{status: GRPC.Status.unknown(), message: "Internal Server Error"}
    req = send_error(req, rpc_error, state, :error)

    exception = HandlerException.new(req, error.kind, error.stack)

    Logger.error("Unexpected error happened", crash_reason: {exception, error.stack})

    {:stop, req, state}
end

Each issue reporter can implement its own logger backend and consume this crash_reason to report the issues. Sentry already does that.
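
For instance, a minimal sketch of a custom :logger handler that picks crash_reason out of the metadata and forwards it (MyReporter is a hypothetical stand-in for Sentry, Bugsnag, or anything else):

defmodule MyCrashReasonHandler do
  # :logger handler callback, invoked for every log event once attached.
  # Only events carrying a crash_reason tuple in their metadata are forwarded.
  def log(%{meta: %{crash_reason: {exception, stacktrace}}}, _config) do
    MyReporter.report(exception, stacktrace)
  end

  def log(_event, _config), do: :ok
end

# Attached somewhere during application start:
# :logger.add_handler(:crash_reporter, MyCrashReasonHandler, %{})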

Let me know if one of these captures your interest. I can work on them ASAP.

@sleipnir
Collaborator

sleipnir commented Jul 8, 2024

Hello friend.

I have a few questions:

1. Can we measure the impact on the library's overall performance when adding these calls? I know that if we've already failed, this would matter the least, but even so, I still think it's relevant to reflect on the possible impact these calls could have.
2. Could we suffer some kind of cascading error if some of these external services are down?

@sleipnir
Collaborator

sleipnir commented Jul 8, 2024

Another interesting option was presented to me today: Logger metadata, specifically the crash_reason metadata.

The change in this case would be that, instead of invoking a callback, we would just produce an error log, passing the exception as part of the metadata. E.g.:

def info({:EXIT, pid, {:handle_error, error}} = err, req, state = %{pid: pid}) do
    Logger.warning("3. #{inspect(state)} #{inspect(err)}")

    # Reply to the client with a generic error, keeping the original
    # `error` (with its kind/stack) around for reporting purposes.
    rpc_error = %RPCError{status: GRPC.Status.unknown(), message: "Internal Server Error"}
    req = send_error(req, rpc_error, state, :error)

    exception = HandlerException.new(req, error.kind, error.stack)

    Logger.error("Unexpected error happened", crash_reason: {exception, error.stack})

    {:stop, req, state}
end

Each issue reporter can implement its own logger backend and consume this crash_reason to report the issues. Sentry already does that.

Let me know if one of these captures your interest. I can work on them ASAP.

I find this solution cleaner and more interesting.

@beligante
Contributor Author

Hello friend.

I have a few questions:

1. Can we measure the impact on the library's overall performance when adding these calls? I know that if we've already failed, this would matter the least, but even so, I still think it's relevant to reflect on the possible impact these calls could have.
2. Could we suffer some kind of cascading error if some of these external services are down?

  1. I don't know exactly how to measure that, but if we follow the crash_reason path, I believe the impact is minimal. It will all vary based on the Logger backend you apply to capture those exceptions, so it's pretty much in the hands of whoever uses the lib to decide how much burden to add here.
  • To add to this, we're mainly working with already-raised exceptions and we're just adding the :cowboy.req as extra information, which is a pretty simple struct.
  2. It all varies by implementation, but in my experience Sentry and Bugsnag have excellent implementations for handling outages.
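
To illustrate how little the library itself has to do here, this is roughly what the consumer-side wiring looks like for Sentry; the module name is the one I recall from the Sentry SDK docs, so treat it as illustrative:

# config/config.exs of the consuming application (illustrative)
import Config

# Sentry's Logger integration consumes crash_reason from the log metadata.
config :logger, backends: [:console, Sentry.LoggerBackend]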

@sleipnir
Collaborator

Hello friend.
I have a few questions:
1. Can we measure the impact on the library's overall performance when adding these calls? I know that if we've already failed, this would matter the least, but even so, I still think it's relevant to reflect on the possible impact these calls could have.
2. Could we suffer some kind of cascading error if some of these external services are down?

  1. I don't know exactly how to measure that, but if we follow the crash_reason path, I believe the impact is minimal. It will all vary based on the Logger backend you apply to capture those exceptions, so it's pretty much in the hands of whoever uses the lib to decide how much burden to add here.
  • To add to this, we're mainly working with already-raised exceptions and we're just adding the :cowboy.req as extra information, which is a pretty simple struct.
  2. It all varies by implementation, but in my experience Sentry and Bugsnag have excellent implementations for handling outages.

Since you chose to follow the Logger approach, I believe there will be no major impact. My comments were aimed at your first suggestion.

@beligante
Contributor Author

@sleipnir I'm okay with closing this in favor of: #381

@sleipnir
Collaborator

@sleipnir I'm okay with closing this in favor of: #381

Yes, of course.
