Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send gRPC errors properly #1983

Merged
merged 11 commits into from
Jun 30, 2021
Merged

Send gRPC errors properly #1983

merged 11 commits into from
Jun 30, 2021

Conversation

ewolak-sq
Copy link
Collaborator

According to the gRPC spec, all
gRPC responses, including errors, should have an HTTP status of 200. gRPC errors are signaled using
the grpc-status header. This brings Misk in compliance with the spec, translating HTTP errors into
properly formatted gRPC errors.

Fixes #1933

Most WebActionException subclasses use the same value for message and responseBody, so let's make
that something clean and easy to do.

Co-authored-by: Eric Wolak <[email protected]>
@ewolak-sq ewolak-sq force-pushed the jwilson-ewolak.0617.grpc_status branch from 00c2729 to 8f200e2 Compare June 18, 2021 00:27
According to the [gRPC spec](https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md), all
gRPC responses, including errors, should have an HTTP status of 200. gRPC errors are signaled using
the `grpc-status` header. This brings Misk in compliance with the spec, translating HTTP errors into
properly formatted gRPC errors.
@ewolak-sq ewolak-sq force-pushed the jwilson-ewolak.0617.grpc_status branch from 8f200e2 to c7ea3c0 Compare June 18, 2021 00:43
Comment on lines 33 to 34
/** The HTTP response under construction. */
/** Meaningful HTTP status about what actually happened */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove old javadoc comment, and add comment about how this may not be the code actually sent over the network and why this is the case. I see it's commented elsewhere, but adding the comment here too will help reduce confusion

Comment on lines +128 to +129
assertResponseCount(200, 0)
assertResponseCount(500, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is broken; will approve when it passes.

Also, nit: add code comment about why these are the expected metrics

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific about what you mean by "broken" here? The assertion is about what's counted in metrics, which we very much do want to be what's asserted here. See the discussion in #1983 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I saw a test failure here, but it looks like it's working now.

Comment on lines +63 to +69
/**
* Borrow behavior from [GrpcFeatureBinding] to send a gRPC error with an HTTP 200 status code.
* This is weird but it's how gRPC clients work.
*
* One thing to note is for our metrics we want to pretend that the HTTP code is what we sent.
* Otherwise gRPC requests that crashed and yielded an HTTP 200 code will confuse operators.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this will reduce disruption with metrics, but I feel like having two sets of HTTP status codes is confusing in the long run, and it adds a layer of indirection with gRPC statuses. Why not have a metric for different gRPC statuses instead?

To provide a good migration path, we can probably emit both HTTP and gRPC status code metrics, then turn off HTTP status code metrics for gRPC endpoints later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest new metrics that look like these, but no real need to make them identical I think? https://github.com/grpc-ecosystem/java-grpc-prometheus

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(fwiw, this is fine to keep as is in this PR)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gRPC is always 200 OK, even when it’s not okay, and that is awful and a guarantee that there will be confusion.

My preference is that we try to hide the insincere 200 OKs in implementation details. Metrics and alerting and even just developer expectations are pretty invested in 200/400/500 meaning different things. The 200 is a lie, and we want to quietly send it over the Wire without spreading that lie to other systems.

I think trying to get all those downstream tools to have independent HTTP and gRPC modes only spreads the confusion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that gRPC errors being HTTP 200 is confusing. Thinking about it more, my concerns are a) HTTP 4xx/5xx and gRPC have a non-1:1 mapping with different semantics, and b) hiding the HTTP and gRPC status makes it a bit more difficult to trace requests when debugging (e.g. the server said 502 in its logs, but the client got a 200).

I see what you mean with metrics though, and am ok with having a mapping for metrics purposes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should emit some gRPC-specific metrics to improve observability, but I think we can do that as a follow-up PR (#1992). This PR strictly improves the situation from "Misk counts 4xx/5xx metrics but breaks gRPC clients" to "Misk counts 4xx/5xx metrics and doesn't break gRPC clients". The third option, of "Misk fails to count 4xx/5xx metrics for gRPC but clients get the right error responses" seems strictly worse than either of those to me.

Comment on lines +14 to +16
if (request.latitude == -1) {
throw WebActionException(request.longitude ?: 500, "unexpected latitude error!")
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels so hacky lol. How about a separate RPC that always fails?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that the RouteGuide service is borrowed from the gRPC reference implementation, so I'm loathe to extend that with another method just for this sort of test. Given that it's strictly test code, I'm not sure I agree that it's worth doing in a perfect way.

@swankjesse what do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy as is.


/** The HTTP status code actually sent over the network. For gRPC, this is *always* 200, even
* for errors. */
var networkStatusCode: Int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

upstreamResponse.statusCode = value
}

override var networkStatusCode: Int
get() = upstreamResponse.statusCode
set(value) {
upstreamResponse.statusCode = value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there’s a bit of a hazard here where you gotta set this one last. Ugh!

Maybe add a function and make networkStatusCode a val?

fun setStatusCodes(statusCode: Int, networkStatusCode: Int) { ... }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good call

Comment on lines +63 to +69
/**
* Borrow behavior from [GrpcFeatureBinding] to send a gRPC error with an HTTP 200 status code.
* This is weird but it's how gRPC clients work.
*
* One thing to note is for our metrics we want to pretend that the HTTP code is what we sent.
* Otherwise gRPC requests that crashed and yielded an HTTP 200 code will confuse operators.
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gRPC is always 200 OK, even when it’s not okay, and that is awful and a guarantee that there will be confusion.

My preference is that we try to hide the insincere 200 OKs in implementation details. Metrics and alerting and even just developer expectations are pretty invested in 200/400/500 meaning different things. The 200 is a lie, and we want to quietly send it over the Wire without spreading that lie to other systems.

I think trying to get all those downstream tools to have independent HTTP and gRPC modes only spreads the confusion.

@ewolak-sq
Copy link
Collaborator Author

I'm fairly certain that the test failure on GrpcConnectivityTest is a flake. I have it passing and failing locally with the same sets of tests before it, so it's definitely flaking. See #1853.

@ewolak-sq
Copy link
Collaborator Author

@r3mariano @swankjesse 😠 I'm getting a consistent failure with okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL in CI, but it won't fail locally. I think it's just #1853 rearing its head. What should I do to move forward?

* master:
  Move testing code (Fake) to testing jar; clean up deps; remove need for guava (#2005)
  Disable all tests in GrpcConnectivityTest (#2004)
  Make RedisService not internal (#2002)
  Allow user to supply their own moshi setup to the module (#2000)
  misk.feature to wisp.feature (#1987)
  Downgrade gRPC requiring HTTP/2 check to log.warn (#1996)
  Remove unneeded misk-datadog dependencies (#1994)
  Add OPA Policy gradle testing plugin (#1993)
@r3mariano
Copy link
Collaborator

@r3mariano @swankjesse 😠 I'm getting a consistent failure with okhttp3.internal.http2.StreamResetException: stream was reset: CANCEL in CI, but it won't fail locally. I think it's just #1853 rearing its head. What should I do to move forward?

I've disabled the offending test in #2004.

* master:
  Prepare for release 0.19.0. (#2023)
  Uppercase service metadata environment in response (#2022)
  Add a toString() to Feature and Attributes (#2021)
  Use Moshi in Guice context for FakeFeatureFlags (#2020)
  Remove extra feature binds - not required now (#2019)
  Reinstate misk.feature.* (#2018)
  Bind both wisp and misk FeatureFlags and DynamicConfig for tests (#2017)
  Bind both wisp and misk FeatureFlags and DynamicConfig (#2016)
  Add misk.feature.Feature and Attribute shims (#2015)
  Add misk.feature.FeatureFlags and DynamicConfig shims (#2014)
  Ensure service metadata environment is uppercased (#2013)
  WarmupTask, a mechanism to improve request latency (#2012)
  Upgrade Wire to 4.0.0-alpha.5 (#2010)
  Prepare next development version. (#2011)
  Prepare for release 0.18.0.
  Remove OPA Gradle plugin, moved to seperate repository (#2006)
  Add hget, hset, and hgetAll to misk-redis (#2003)
  Remove `ActionException` and `StatusCode` (#2008)
  Always mask 5xx errors in WebActionExceptionMapper.kt (#2007)
We changed behavior in #2007 to mask 5xx error messages, so we can't expect the message of the throwable to show up on the client.
@ewolak-sq
Copy link
Collaborator Author

Ugh. I think all these tests are flaky in different and interesting ways. MiskClientMiskServerTest.serverFailureNotFound() sometimes gets java.io.IOException: gRPC transport failure (HTTP status=200, grpc-status=null, grpc-message=null), which I'm guessing is the same flakiness as in GrpcConnectivityTest. Then the metrics counts seem to have a race condition, so I probably need to await for them.

These assertions seem to be flaky, so let's give them 100ms to converge across threads or whatever.
@ewolak-sq ewolak-sq merged commit 171a7ce into master Jun 30, 2021
@ewolak-sq ewolak-sq deleted the jwilson-ewolak.0617.grpc_status branch June 30, 2021 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle errors with gRPC
4 participants