-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Send gRPC errors properly #1983
Conversation
Most WebActionException subclasses use the same value for message and responseBody, so let's make that something clean and easy to do. Co-authored-by: Eric Wolak <[email protected]>
00c2729
to
8f200e2
Compare
According to the [gRPC spec](https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md), all gRPC responses, including errors, should have an HTTP status of 200. gRPC errors are signaled using the `grpc-status` header. This brings Misk in compliance with the spec, translating HTTP errors into properly formatted gRPC errors.
8f200e2
to
c7ea3c0
Compare
/** The HTTP response under construction. */ | ||
/** Meaningful HTTP status about what actually happened */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove old javadoc comment, and add comment about how this may not be the code actually sent over the network and why this is the case. I see it's commented elsewhere, but adding the comment here too will help reduce confusion
assertResponseCount(200, 0) | ||
assertResponseCount(500, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is broken; will approve when it passes.
Also, nit: add code comment about why these are the expected metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific about what you mean by "broken" here? The assertion is about what's counted in metrics, which we very much do want to be what's asserted here. See the discussion in #1983 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I saw a test failure here, but it looks like it's working now.
/** | ||
* Borrow behavior from [GrpcFeatureBinding] to send a gRPC error with an HTTP 200 status code. | ||
* This is weird but it's how gRPC clients work. | ||
* | ||
* One thing to note is for our metrics we want to pretend that the HTTP code is what we sent. | ||
* Otherwise gRPC requests that crashed and yielded an HTTP 200 code will confuse operators. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this will reduce disruption with metrics, but I feel like having two sets of HTTP status codes is confusing in the long run, and it adds a layer of indirection with gRPC statuses. Why not have a metric for different gRPC statuses instead?
To provide a good migration path, we can probably emit both HTTP and gRPC status code metrics, then turn off HTTP status code metrics for gRPC endpoints later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest new metrics that look like these, but no real need to make them identical I think? https://github.com/grpc-ecosystem/java-grpc-prometheus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(fwiw, this is fine to keep as is in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gRPC is always 200 OK, even when it’s not okay, and that is awful and a guarantee that there will be confusion.
My preference is that we try to hide the insincere 200 OKs in implementation details. Metrics and alerting and even just developer expectations are pretty invested in 200/400/500 meaning different things. The 200 is a lie, and we want to quietly send it over the Wire without spreading that lie to other systems.
I think trying to get all those downstream tools to have independent HTTP and gRPC modes only spreads the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that gRPC errors being HTTP 200 is confusing. Thinking about it more, my concerns are a) HTTP 4xx/5xx and gRPC have a non-1:1 mapping with different semantics, and b) hiding the HTTP and gRPC status makes it a bit more difficult to trace requests when debugging (e.g. the server said 502 in its logs, but the client got a 200).
I see what you mean with metrics though, and am ok with having a mapping for metrics purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should emit some gRPC-specific metrics to improve observability, but I think we can do that as a follow-up PR (#1992). This PR strictly improves the situation from "Misk counts 4xx/5xx metrics but breaks gRPC clients" to "Misk counts 4xx/5xx metrics and doesn't break gRPC clients". The third option, of "Misk fails to count 4xx/5xx metrics for gRPC but clients get the right error responses" seems strictly worse than either of those to me.
if (request.latitude == -1) { | ||
throw WebActionException(request.longitude ?: 500, "unexpected latitude error!") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels so hacky lol. How about a separate RPC that always fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure that the RouteGuide
service is borrowed from the gRPC reference implementation, so I'm loathe to extend that with another method just for this sort of test. Given that it's strictly test code, I'm not sure I agree that it's worth doing in a perfect way.
@swankjesse what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy as is.
|
||
/** The HTTP status code actually sent over the network. For gRPC, this is *always* 200, even | ||
* for errors. */ | ||
var networkStatusCode: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
upstreamResponse.statusCode = value | ||
} | ||
|
||
override var networkStatusCode: Int | ||
get() = upstreamResponse.statusCode | ||
set(value) { | ||
upstreamResponse.statusCode = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there’s a bit of a hazard here where you gotta set this one last. Ugh!
Maybe add a function and make networkStatusCode
a val?
fun setStatusCodes(statusCode: Int, networkStatusCode: Int) { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good call
/** | ||
* Borrow behavior from [GrpcFeatureBinding] to send a gRPC error with an HTTP 200 status code. | ||
* This is weird but it's how gRPC clients work. | ||
* | ||
* One thing to note is for our metrics we want to pretend that the HTTP code is what we sent. | ||
* Otherwise gRPC requests that crashed and yielded an HTTP 200 code will confuse operators. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gRPC is always 200 OK, even when it’s not okay, and that is awful and a guarantee that there will be confusion.
My preference is that we try to hide the insincere 200 OKs in implementation details. Metrics and alerting and even just developer expectations are pretty invested in 200/400/500 meaning different things. The 200 is a lie, and we want to quietly send it over the Wire without spreading that lie to other systems.
I think trying to get all those downstream tools to have independent HTTP and gRPC modes only spreads the confusion.
I'm fairly certain that the test failure on |
@r3mariano @swankjesse 😠 I'm getting a consistent failure with |
* master: Move testing code (Fake) to testing jar; clean up deps; remove need for guava (#2005) Disable all tests in GrpcConnectivityTest (#2004) Make RedisService not internal (#2002) Allow user to supply their own moshi setup to the module (#2000) misk.feature to wisp.feature (#1987) Downgrade gRPC requiring HTTP/2 check to log.warn (#1996) Remove unneeded misk-datadog dependencies (#1994) Add OPA Policy gradle testing plugin (#1993)
I've disabled the offending test in #2004. |
* master: Prepare for release 0.19.0. (#2023) Uppercase service metadata environment in response (#2022) Add a toString() to Feature and Attributes (#2021) Use Moshi in Guice context for FakeFeatureFlags (#2020) Remove extra feature binds - not required now (#2019) Reinstate misk.feature.* (#2018) Bind both wisp and misk FeatureFlags and DynamicConfig for tests (#2017) Bind both wisp and misk FeatureFlags and DynamicConfig (#2016) Add misk.feature.Feature and Attribute shims (#2015) Add misk.feature.FeatureFlags and DynamicConfig shims (#2014) Ensure service metadata environment is uppercased (#2013) WarmupTask, a mechanism to improve request latency (#2012) Upgrade Wire to 4.0.0-alpha.5 (#2010) Prepare next development version. (#2011) Prepare for release 0.18.0. Remove OPA Gradle plugin, moved to seperate repository (#2006) Add hget, hset, and hgetAll to misk-redis (#2003) Remove `ActionException` and `StatusCode` (#2008) Always mask 5xx errors in WebActionExceptionMapper.kt (#2007)
We changed behavior in #2007 to mask 5xx error messages, so we can't expect the message of the throwable to show up on the client.
Ugh. I think all these tests are flaky in different and interesting ways. |
These assertions seem to be flaky, so let's give them 100ms to converge across threads or whatever.
According to the gRPC spec, all
gRPC responses, including errors, should have an HTTP status of 200. gRPC errors are signaled using
the
grpc-status
header. This brings Misk in compliance with the spec, translating HTTP errors intoproperly formatted gRPC errors.
Fixes #1933