Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add cause to canceled context when Client sending a RST_STREAM due to an error on their end #7778

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

dastrobu
Copy link

@dastrobu dastrobu commented Oct 24, 2024

Related Issue: #7541

Summary:
This PR introduces a feature to add a cause to the context cancellation, for example when a client sends an RST_STREAM frame. This enhancement aims to provide more transparency and ease debugging by distinguishing between different reasons for context cancellation.

Use Case:
When a client resets a connection, the http2_server cancels the current context to interrupt all server internal processing. However, it is challenging to determine the exact reason for the context cancellation, whether it was due to an internal error, a timeout, or a client sending an RST frame. This feature will help in identifying the cause of the context cancellation, making it easier to debug and handle errors appropriately.

Proposed Solution:

  • Replace context.CancelFunc with context.CancelCauseFunc.
  • Adjust all context cancellations to report the cause.
  • For example, when handling an RST_STREAM frame, the context will be canceled with a specific error indicating the receipt of the RST_STREAM frame.

Additional Context:
This change is backward-compatible and does not affect existing application code that checks for ctx.Err(). The new context.Cause(ctx) method will provide additional information when available.

Example:

The PR adds an example which shows how to implement a server with asynchronous processing using goroutines.
The example illustrates some client behaviors (RST frame, close connection, timeout) which are propagated to the worker goroutine through the context.

Here are the logs of the example server (with comments):

starting work on message: "world"
starting work on message: "hello"
'context canceled' with cause 'NO_ERROR', message worker for message "world" stopping. 👈🏻 NO_ERROR indicates the context was caused by HTTP status 0
server: error receiving from stream: rpc error: code = Canceled desc = context canceled 
'context canceled' with cause 'NO_ERROR', message worker for message "hello" stopping. 👈🏻 see above
starting work on message: "world"
starting work on message: "hello"
'context canceled' with cause 'context canceled', message worker for message "hello" stopping. 👈🏻 this case could be made more transparent when setting a cancel cause for a closed connection as well. 
'context canceled' with cause 'context canceled', message worker for message "world" stopping. 👈🏻 see above
server: error receiving from stream: rpc error: code = Canceled desc = context canceled
starting work on message: "world"
starting work on message: "hello"
'context deadline exceeded' with cause 'context deadline exceeded', message worker for message "world" stopping. 👈🏻 this case could be made more transparent by setting a custom cause for the context deadline in `ctx, timoutCancel = context.WithTimeout(ctx, timeout)`
server: error receiving from stream: rpc error: code = DeadlineExceeded desc = context deadline exceeded
'context deadline exceeded' with cause 'context deadline exceeded', message worker for message "hello" stopping. 👈🏻 see above

Open Issues

  • This PR is still a draft and almost no unit tests have been added to production code. This can be done when the reviewers agree that the feature of this PR will be considered to be added to the library.
  • Some uses cases (client closes connection, timeout) do not receive a proper cause in this PR, as it was explicitly requested to set the cause only for the RST frame case yet.
  • So far one error type (Http2CodeError) has been added to illustrate how server implementations could do error matching. The design for error types could be discussed in more detail. I'd see more error cases in the future that could be propagated through the context (e.g. ConnectionClosed by client, GrpcTimeout to distinguish from other timeouts).

Copy link

linux-foundation-easycla bot commented Oct 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 86.36364% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (89737ae) to head (a9a5167).
Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
internal/transport/http2_server.go 70.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7778      +/-   ##
==========================================
- Coverage   81.80%   81.79%   -0.02%     
==========================================
  Files         374      375       +1     
  Lines       37978    37985       +7     
==========================================
  Hits        31068    31068              
- Misses       5605     5608       +3     
- Partials     1305     1309       +4     
Files with missing lines Coverage Δ
internal/transport/context.go 100.00% <100.00%> (ø)
internal/transport/handler_server.go 86.37% <100.00%> (-0.14%) ⬇️
internal/transport/server_stream.go 88.88% <ø> (ø)
internal/transport/http2_server.go 90.59% <70.00%> (+0.18%) ⬆️

... and 29 files with indirect coverage changes

---- 🚨 Try these New Features:

@dastrobu dastrobu force-pushed the cancel-cause branch 3 times, most recently from fedc451 to f1793eb Compare October 24, 2024 22:26
@aranjans aranjans added Type: Feature New features or improvements in behavior Area: Server Includes Server, Streams and Server Options. labels Oct 26, 2024
@purnesh42H
Copy link
Contributor

purnesh42H commented Nov 4, 2024

@dastrobu thanks for the PR. Apologies for getting late to it. These are too many errors to verify at once. Could you just narrow down this PR to one case? Perhaps only to RstStream code from server? Also, please describe in the PR (only for RST case) what is the limitation and how this change improves that.

Also, not sure if we should have a separate function to create context. We need to have this change easy to just do natively. Like adding the cancelCause only when we need to and it should be left on the author of code to decide. Let me know what you think.

If you want, in the description of the related issue we can link the error cases you have identified

@purnesh42H purnesh42H assigned dastrobu and unassigned purnesh42H Nov 4, 2024
@purnesh42H purnesh42H changed the title add cause to canceled context proposal: add cause to canceled context to distinguishing client-initiated RST_STREAM from other cancellation causes Nov 5, 2024
@purnesh42H
Copy link
Contributor

@dastrobu let's narrow down this PR to only "Client sending a RST_STREAM due to an error on their end". We want to see how much change it is and if it has enough benefit to reason about. Thanks

@purnesh42H purnesh42H changed the title proposal: add cause to canceled context to distinguishing client-initiated RST_STREAM from other cancellation causes proposal: add cause to canceled context when Client sending a RST_STREAM due to an error on their end Nov 5, 2024
@purnesh42H
Copy link
Contributor

@dastrobu ping on this if you have bandwidth for making the the change. Thanks

@dastrobu
Copy link
Author

@purnesh42H I will update the PR draft later this week.

@dastrobu dastrobu force-pushed the cancel-cause branch 3 times, most recently from e0753ce to 67fded0 Compare November 14, 2024 15:38
@dastrobu
Copy link
Author

Could you just narrow down this PR to one case? Perhaps only to RstStream code from server?

I narrowed it down and added an example.

Also, please describe in the PR (only for RST case) what is the limitation and how this change improves that.

PR description was updated.

Also, not sure if we should have a separate function to create context. We need to have this change easy to just do natively. Like adding the cancelCause only when we need to and it should be left on the author of code to decide. Let me know what you think.

I extracted the function here to dry out code and make the logic it testable through unit tests (see context_test.go).
The reason we cannot use any context in the two cases is that the cancel function is part of the ServerStream type.

Hence, the cancel function must always be of the same type.
In most go functions the context is cancelled at the end of the function that creates it (through a defer call). This library, however, needs to pass the cancel function around to call it in appropriate places.

We could inline the createContextWithTimeout function, but for the reasons explained above this would lead to code duplication and make it harder to test. I am open for alternative approaches, if you have good suggestion.

If you want, in the description of the related issue we can link the error cases you have identified

I am not sure I understand what error cases you mean here?

@dfawley
Copy link
Member

dfawley commented Nov 14, 2024

Can I request that we first enumerate all the reasons the grpc server might cancel the context?

I can think of:

  1. Unexpected connection loss or a more graceful connection closure (preceded by a GOAWAY)
  2. Client-side cancelation (which is transmitted via a RST_STREAM)
  3. Deadline exceeded

I'm not sure about "internal error" mentioned in the first comment. Can you give an example of this?

After that, what would knowing the difference between these cases affect? Is this purely for monitoring/alerting/metrics?

Closes grpc#7541

� Conflicts:
�	internal/transport/transport.go
@dastrobu dastrobu force-pushed the cancel-cause branch 2 times, most recently from d105c77 to 59a52b7 Compare November 14, 2024 15:55
@dastrobu
Copy link
Author

Can I request that we first enumerate all the reasons the grpc server might cancel the context?

I can think of:

  1. Unexpected connection loss or a more graceful connection closure (preceded by a GOAWAY)
  2. Client-side cancelation (which is transmitted via a RST_STREAM)
  3. Deadline exceeded

I'm not sure about "internal error" mentioned in the first comment. Can you give an example of this?

After that, what would knowing the difference between these cases affect? Is this purely for monitoring/alerting/metrics?

@dfawley thanks for the prompt reply. I will pick that up tomorrow and try to write up the cases in #7541

@dastrobu
Copy link
Author

I have updated the Additional Context section in issue #7541 with the previously mentioned cases. I hope this is helpful. We can continue the discussion here or in the issue.

Copy link

This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added stale and removed stale labels Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Server Includes Server, Streams and Server Options. Status: Requires Reporter Clarification Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants