Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: return 500 when BackendTLSPolicy translation fails #4363

Merged
merged 3 commits into from
Dec 8, 2024

Conversation

alexwo
Copy link
Contributor

@alexwo alexwo commented Sep 29, 2024

What type of PR is this?

What this PR does / why we need it:

  • 500 for HTTP
  • connection reset for TCP/UDP

Which issue(s) this PR fixes:
Fixes #4087

@alexwo alexwo requested a review from a team as a code owner September 29, 2024 12:20
@alexwo alexwo changed the title return 500 when backtls policy config fails Return 500 when BackendTLSPolicy translation fails #4087 Sep 29, 2024
@zirain zirain changed the title Return 500 when BackendTLSPolicy translation fails #4087 bugfix: return 500 when BackendTLSPolicy translation fails Sep 29, 2024
@alexwo alexwo changed the title bugfix: return 500 when BackendTLSPolicy translation fails refactor: return 500 when BackendTLSPolicy translation fails Sep 29, 2024
Copy link

codecov bot commented Sep 29, 2024

Codecov Report

Attention: Patch coverage is 81.62162% with 34 lines in your changes missing coverage. Please review.

Project coverage is 66.28%. Comparing base (b9f9a9f) to head (c2c8b4b).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
internal/gatewayapi/route.go 75.00% 16 Missing and 5 partials ⚠️
internal/gatewayapi/filters.go 63.63% 7 Missing and 1 partial ⚠️
internal/gatewayapi/ext_service.go 40.00% 2 Missing and 1 partial ⚠️
internal/gatewayapi/validate.go 95.55% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4363      +/-   ##
==========================================
- Coverage   66.29%   66.28%   -0.01%     
==========================================
  Files         209      209              
  Lines       31912    31956      +44     
==========================================
+ Hits        21157    21183      +26     
- Misses       9507     9523      +16     
- Partials     1248     1250       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexwo
Copy link
Contributor Author

alexwo commented Sep 29, 2024

/retest

Copy link
Member

@zhaohuabing zhaohuabing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexwo Thanks for picking up this! This PR looks great overall! I just have a few nits.

internal/gatewayapi/route.go Show resolved Hide resolved
@@ -310,7 +286,6 @@ xdsIR:
- ECDSA-SECP256R1-SHA256
sni: example.com
useSystemTrustStore: true
weight: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is changed?

Copy link
Contributor Author

@alexwo alexwo Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, a route will not get translated, if it is a TCP/UDP route & contains invalid configurations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexwo - let's maybe update the PR description and title so that the release notes will reflect the two behaviors:

  • 500 for HTTP
  • connection reset for TCP/UDP

In terms of UX, users experience a reset in this case right? Just want to make sure that the connections on the client side don't hang and wait for some timeout, so that it's clear to users that the request is explicitly rejected.

Copy link
Contributor Author

@alexwo alexwo Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the route associated with the TLS connection is removed, I would expect ongoing connections that are relying on this route will be dropped. This will most likely result in a TCP reset (RST) or a termination of the TLS session, effectively severing the connection. Maybe we can try to confirm it.

guydc
guydc previously approved these changes Sep 30, 2024
Copy link
Contributor

@guydc guydc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Few questions on behavior and scope.

@@ -26,28 +26,30 @@ import (

func (t *Translator) validateBackendRef(backendRefContext BackendRefContext, parentRef *RouteParentContext, route RouteContext,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the changes made here extending the scope of the PR to reject additional cases like invalid backend config, or is this already handled with a 500 response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these changes capture cases where the backend configuration is invalid.

@@ -121,8 +121,6 @@ xdsIR:
routes:
- destination:
name: tlsroute/default/tlsroute-1/rule/-1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the destination exist in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, updated.

message: Route is accepted
reason: Accepted
status: "True"
message: 'backend reference validation failed: backend is not supported for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@alexwo
Copy link
Contributor Author

alexwo commented Oct 1, 2024

/retes

@alexwo
Copy link
Contributor Author

alexwo commented Oct 1, 2024

/retest

1 similar comment
@alexwo
Copy link
Contributor Author

alexwo commented Oct 1, 2024

/retest

@alexwo
Copy link
Contributor Author

alexwo commented Oct 2, 2024

/retest

@arkodg arkodg requested review from zhaohuabing and guydc October 8, 2024 05:20
@alexwo
Copy link
Contributor Author

alexwo commented Oct 8, 2024

/retest

@guydc guydc added the release-note Indicates a required release note label Oct 8, 2024
guydc
guydc previously approved these changes Oct 8, 2024
@@ -121,8 +121,6 @@ xdsIR:
routes:
- destination:
name: tlsroute/default/tlsroute-1/rule/-1
settings:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what happened here, and why this got deleted ?

Copy link
Contributor Author

@alexwo alexwo Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this scenario, the route generally shouldn’t be operational. I’ll look into why the settings are being removed to better understand the problem. For now, the route is left as-is because fully invalidating it causes the conformance test to fail. Perhaps we can ignore the TLS route translation errors in the affected flow and create a separate task to address it.

message: Route is accepted
reason: Accepted
status: "True"
message: 'Backend service validation failed: Service default/service-unknown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this 2 similar conditions here

Backend service validation failed: Service default/service-unknown

&&

 Service default/service-unknown not found

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to provide a static message explaining why the status is not accepted, such as “invalid configuration detected”? This approach would avoid repeating the message in the accepted status reason. The types, such as “Accepted” and “ResolvedRefs,” are already distinct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its fine for this PR

status: "True"
message: 'Error validating backend port: port number not specified for backend
reference.'
reason: UnsupportedValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal/gatewayapi/route.go Outdated Show resolved Hide resolved
@@ -112,7 +112,7 @@ func (t *Translator) processHTTPRouteParentRefs(httpRoute *HTTPRouteContext, res
// any conditions that come out of it have to go on each RouteParentStatus,
// not on the Route as a whole.
routeRoutes, err := t.processHTTPRouteRules(httpRoute, parentRef, resources)
if err != nil {
if err != nil && !parentRef.HasCondition(httpRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why we need to change the status generation logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify why the traffic cannot be routed, previously, the route was accepted, and the traffic was considered routable before this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying to better understand why do we need an additional check here, isnt if err != nil enough ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage, it is already possible that a route is not applicable due to other reasons. This check attempts to retain the original reason, as it may be more specific. However, if that is the case, no traffic would be routed regardless to the settings validations errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share an example of what the status looks like with and w/o this change ?

Copy link
Contributor Author

@alexwo alexwo Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this does not seem to happen, I did not notice any different translations when this extra condition is not in place. I have removed it.

@@ -470,7 +479,7 @@ func (t *Translator) processGRPCRouteParentRefs(grpcRoute *GRPCRouteContext, res
// any conditions that come out of it have to go on each RouteParentStatus,
// not on the Route as a whole.
routeRoutes, err := t.processGRPCRouteRules(grpcRoute, parentRef, resources)
if err != nil {
if err != nil && !parentRef.HasCondition(grpcRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ds, err := t.processDestination(backendRef, parentRef, udpRoute, resources)
// skip adding the route and provide the reason via route status.

if err != nil && !parentRef.HasCondition(udpRoute, gwapiv1.RouteConditionAccepted, metav1.ConditionFalse) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isnt added for HTTPRoute and GRPCRoute but is added for UDP and TCP ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For HTTP and TCP routes, we currently accept the route but return a 500 response. However, for UDP and TCP routes, the suggestion is to reject and not translate it at all incase of errors, and indicate this via the route status as traffic won't be routable at all.

gwapiv1.RouteConditionAccepted,
metav1.ConditionFalse,
"Failed to process the settings associated with the TCP route.",
err.Error(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@arkodg
Copy link
Contributor

arkodg commented Oct 17, 2024

@alexwo I dont see the conformance tests passing. I'm worried about the size of this change that not only applies to BTP but every error while translating route rules and backendRefs. For example if one out of the many backendRefs is invalid, only a portion of the requests should receive a 503
https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io%2fv1.HTTPBackendRef

@alexwo
Copy link
Contributor Author

alexwo commented Oct 18, 2024

@alexwo I dont see the conformance tests passing. I'm worried about the size of this change that not only applies to BTP but every error while translating route rules and backendRefs. For example if one out of the many backendRefs is invalid, only a portion of the requests should receive a 503 https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io%2fv1.HTTPBackendRef

@arkodg indeed, the impact of the change is large and it will render out many of the partial working, not fully configured routes, I can fully relate to it.

We should be covered as long as many of the necessary tests should be in place,what can detect potential issues.

The conformance tests are passing, but there was a linting issue previously that prevented them from running.

Now, if a part of the backend reference configuration is invalid — for example, if one of the associated configurations cannot be processed — the key question is: should routing to this backend reference still occur in such cases?

@alexwo
Copy link
Contributor Author

alexwo commented Oct 18, 2024

/retest

@arkodg
Copy link
Contributor

arkodg commented Oct 18, 2024

Now, if a part of the backend reference configuration is invalid — for example, if one of the associated configurations cannot be processed — the key question is: should routing to this backend reference still occur in such cases?

According to the spec, the answer is a portion of the requests (based on weights) should continue to be sent to the valid backendRef

@alexwo
Copy link
Contributor Author

alexwo commented Oct 19, 2024

Now, if a part of the backend reference configuration is invalid — for example, if one of the associated configurations cannot be processed — the key question is: should routing to this backend reference still occur in such cases?

According to the spec, the answer is a portion of the requests (based on weights) should continue to be sent to the valid backendRef

We’re processing all backend routes and addressing only falty configured ones. This should ensures alignment with the spec, as valid backend routes will continue to function, while only those with invalid configurations will be skipped / not translated.
( we process all routes for each backend and go over all routes, even if some one the backend routes did have errors . )

@alexwo
Copy link
Contributor Author

alexwo commented Oct 20, 2024

/retest

Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. Please feel free to give a status update now, ping for review, when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label Nov 19, 2024
arkodg
arkodg previously approved these changes Dec 4, 2024
Copy link
Contributor

@arkodg arkodg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks !

guydc
guydc previously approved these changes Dec 4, 2024
Copy link
Contributor

@guydc guydc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
@alexwo - can you update the PR?

@github-actions github-actions bot removed the stale label Dec 5, 2024
Signed-off-by: Alexander Volchok <[email protected]>
@alexwo alexwo dismissed stale reviews from guydc and arkodg via f71e441 December 7, 2024 20:28
Signed-off-by: Alexander Volchok <[email protected]>
@alexwo
Copy link
Contributor Author

alexwo commented Dec 7, 2024

/retest

Signed-off-by: Alexander Volchok <[email protected]>
@alexwo
Copy link
Contributor Author

alexwo commented Dec 7, 2024

/retest

2 similar comments
@alexwo
Copy link
Contributor Author

alexwo commented Dec 7, 2024

/retest

@alexwo
Copy link
Contributor Author

alexwo commented Dec 8, 2024

/retest

@zirain zirain requested review from guydc and arkodg December 8, 2024 11:47
@arkodg arkodg merged commit 05ee5f4 into envoyproxy:main Dec 8, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Indicates a required release note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return 500 when BackendTLSPolicy translation fails
5 participants