-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: return 500 when BackendTLSPolicy translation fails #4363
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4363 +/- ##
==========================================
- Coverage 66.29% 66.28% -0.01%
==========================================
Files 209 209
Lines 31912 31956 +44
==========================================
+ Hits 21157 21183 +26
- Misses 9507 9523 +16
- Partials 1248 1250 +2 ☔ View full report in Codecov by Sentry. |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexwo Thanks for picking up this! This PR looks great overall! I just have a few nits.
@@ -310,7 +286,6 @@ xdsIR: | |||
- ECDSA-SECP256R1-SHA256 | |||
sni: example.com | |||
useSystemTrustStore: true | |||
weight: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, a route will not get translated, if it is a TCP/UDP route & contains invalid configurations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexwo - let's maybe update the PR description and title so that the release notes will reflect the two behaviors:
- 500 for HTTP
- connection reset for TCP/UDP
In terms of UX, users experience a reset in this case right? Just want to make sure that the connections on the client side don't hang and wait for some timeout, so that it's clear to users that the request is explicitly rejected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the route associated with the TLS connection is removed, I would expect ongoing connections that are relying on this route will be dropped. This will most likely result in a TCP reset (RST) or a termination of the TLS session, effectively severing the connection. Maybe we can try to confirm it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Few questions on behavior and scope.
@@ -26,28 +26,30 @@ import ( | |||
|
|||
func (t *Translator) validateBackendRef(backendRefContext BackendRefContext, parentRef *RouteParentContext, route RouteContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the changes made here extending the scope of the PR to reject additional cases like invalid backend config, or is this already handled with a 500 response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these changes capture cases where the backend configuration is invalid.
@@ -121,8 +121,6 @@ xdsIR: | |||
routes: | |||
- destination: | |||
name: tlsroute/default/tlsroute-1/rule/-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the destination exist in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, updated.
message: Route is accepted | ||
reason: Accepted | ||
status: "True" | ||
message: 'backend reference validation failed: backend is not supported for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
/retes |
/retest |
1 similar comment
/retest |
/retest |
/retest |
@@ -121,8 +121,6 @@ xdsIR: | |||
routes: | |||
- destination: | |||
name: tlsroute/default/tlsroute-1/rule/-1 | |||
settings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious what happened here, and why this got deleted ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scenario, the route generally shouldn’t be operational. I’ll look into why the settings are being removed to better understand the problem. For now, the route is left as-is because fully invalidating it causes the conformance test to fail. Perhaps we can ignore the TLS route translation errors in the affected flow and create a separate task to address it.
message: Route is accepted | ||
reason: Accepted | ||
status: "True" | ||
message: 'Backend service validation failed: Service default/service-unknown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this 2 similar conditions here
Backend service validation failed: Service default/service-unknown
&&
Service default/service-unknown not found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to provide a static message explaining why the status is not accepted, such as “invalid configuration detected”? This approach would avoid repeating the message in the accepted status reason. The types, such as “Accepted” and “ResolvedRefs,” are already distinct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its fine for this PR
status: "True" | ||
message: 'Error validating backend port: port number not specified for backend | ||
reference.' | ||
reason: UnsupportedValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment as https://github.com/envoyproxy/gateway/pull/4363/files#r1792663080
internal/gatewayapi/route.go
Outdated
@@ -112,7 +112,7 @@ func (t *Translator) processHTTPRouteParentRefs(httpRoute *HTTPRouteContext, res | |||
// any conditions that come out of it have to go on each RouteParentStatus, | |||
// not on the Route as a whole. | |||
routeRoutes, err := t.processHTTPRouteRules(httpRoute, parentRef, resources) | |||
if err != nil { | |||
if err != nil && !parentRef.HasCondition(httpRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious why we need to change the status generation logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify why the traffic cannot be routed, previously, the route was accepted, and the traffic was considered routable before this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to better understand why do we need an additional check here, isnt if err != nil
enough ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage, it is already possible that a route is not applicable due to other reasons. This check attempts to retain the original reason, as it may be more specific. However, if that is the case, no traffic would be routed regardless to the settings validations errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share an example of what the status looks like with and w/o this change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this does not seem to happen, I did not notice any different translations when this extra condition is not in place. I have removed it.
internal/gatewayapi/route.go
Outdated
@@ -470,7 +479,7 @@ func (t *Translator) processGRPCRouteParentRefs(grpcRoute *GRPCRouteContext, res | |||
// any conditions that come out of it have to go on each RouteParentStatus, | |||
// not on the Route as a whole. | |||
routeRoutes, err := t.processGRPCRouteRules(grpcRoute, parentRef, resources) | |||
if err != nil { | |||
if err != nil && !parentRef.HasCondition(grpcRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal/gatewayapi/route.go
Outdated
ds, err := t.processDestination(backendRef, parentRef, udpRoute, resources) | ||
// skip adding the route and provide the reason via route status. | ||
|
||
if err != nil && !parentRef.HasCondition(udpRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isnt added for HTTPRoute and GRPCRoute but is added for UDP and TCP ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For HTTP and TCP routes, we currently accept the route but return a 500 response. However, for UDP and TCP routes, the suggestion is to reject and not translate it at all incase of errors, and indicate this via the route status as traffic won't be routable at all.
gwapiv1.RouteConditionAccepted, | ||
metav1.ConditionFalse, | ||
"Failed to process the settings associated with the TCP route.", | ||
err.Error(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
@alexwo I dont see the conformance tests passing. I'm worried about the size of this change that not only applies to BTP but every error while translating route rules and backendRefs. For example if one out of the many backendRefs is invalid, only a portion of the requests should receive a 503 |
@arkodg indeed, the impact of the change is large and it will render out many of the partial working, not fully configured routes, I can fully relate to it. We should be covered as long as many of the necessary tests should be in place,what can detect potential issues. The conformance tests are passing, but there was a linting issue previously that prevented them from running. Now, if a part of the backend reference configuration is invalid — for example, if one of the associated configurations cannot be processed — the key question is: should routing to this backend reference still occur in such cases? |
/retest |
According to the spec, the answer is a portion of the requests (based on weights) should continue to be sent to the valid backendRef |
We’re processing all backend routes and addressing only falty configured ones. This should ensures alignment with the spec, as valid backend routes will continue to function, while only those with invalid configurations will be skipped / not translated. |
/retest |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. Please feel free to give a status update now, ping for review, when it's ready. Thank you for your contributions! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@alexwo - can you update the PR?
Signed-off-by: Alexander Volchok <[email protected]>
Signed-off-by: Alexander Volchok <[email protected]>
/retest |
Signed-off-by: Alexander Volchok <[email protected]>
/retest |
2 similar comments
/retest |
/retest |
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #4087