Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Inconsistent behavior when canceling a retry #2375

Open
kmcclellan opened this issue Nov 11, 2024 · 10 comments
Open

[Bug]: Inconsistent behavior when canceling a retry #2375

kmcclellan opened this issue Nov 11, 2024 · 10 comments
Assignees
Labels
Milestone

Comments

@kmcclellan
Copy link

kmcclellan commented Nov 11, 2024

Describe the bug

If I cancel an execution that has been handled by a retry policy, it may or may not return the last executed outcome.

Expected behavior

There is some question as to what should happen when you trigger cancellation for a resilience strategy that has already invoked the callback but has more work to do. On the one hand, the caller may want to know that execution did not complete per configuration (esp. if the intent is to retry indefinitely). On the other, the invocation may have had side effects, and we do have an outcome that the caller might be interested in.

In my opinion, callbacks passed to resilience pipelines shouldn't make assumptions about how they will be executed and should include a catch to clean up any side-effects. Therefore, a retry strategy that would have attempted additional invocations should acknowledge cancellation by throwing OperationCanceledException regardless of whether it has executed the callback or what delay configuration is in place.

However, if instead we want the retry strategy to return the last executed outcome, it should do so consistently and refrain from acknowledging cancellation if invocation has completed at least once.

Actual behavior

If the cancellation token is in the requested state at the time an invocation of the callback completes, that outcome is returned to the caller. If the token is instead triggered during the subsequent delay between attempts, an OperationCanceledException outcome is returned.

Steps to reproduce

using Polly;

var pipeline = new ResiliencePipelineBuilder<int>()
    .AddRetry(
        new()
        {
            ShouldHandle = x => ValueTask.FromResult(x.Outcome.Result < 10),
            MaxRetryAttempts = 10,
            Delay = TimeSpan.FromSeconds(2),
        })
    .Build();

await TestCancel(TimeSpan.FromSeconds(1));
await TestCancel(TimeSpan.FromSeconds(3));
await TestCancel(TimeSpan.FromSeconds(5));
await TestCancel(TimeSpan.FromSeconds(7));

async Task TestCancel(TimeSpan duration)
{
    try
    {
        using var cancellation = new CancellationTokenSource(duration);
        var n = 0;

        var result = await pipeline.ExecuteAsync(
            async cancellationToken =>
            {
                Console.WriteLine($"Attempt: {n}");
                await Task.Delay(2000, CancellationToken.None);
                return n++;
            },
            cancellation.Token);

        Console.WriteLine($"Execution completed: {result}");
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine("Execution was canceled.");
    }
}
Attempt: 0
Execution completed: 0
Attempt: 0
Execution was canceled.
Attempt: 0
Attempt: 1
Execution completed: 1
Attempt: 0
Attempt: 1
Execution was canceled.

Exception(s) (if any)

No response

Polly version

8.4.2

.NET Version

8.0.403

Anything else?

No response

@kmcclellan kmcclellan added the bug label Nov 11, 2024
@quixoticaxis
Copy link

quixoticaxis commented Nov 12, 2024

I hit this issue today while writing tests for a class that uses Polly.

I believe the policy should always throw OperationCancelledException when being stopped and provide some other means to get the outcome if the user needs it.

For example, this is how I start a background service that uses HttpClient:

public async override Task StartAsync(CancellationToken cancellationToken)
{
    var retryPipeline = new ResiliencePipelineBuilder()
        .AddRetry(
            new RetryStrategyOptions
            {
                ShouldHandle = new PredicateBuilder()
                    .Handle<HttpRequestException>()
                    .Handle<JsonException>()
                    .Handle<OperationCanceledException>(
                        exception => exception.CancellationToken != cancellationToken),
                /* other settings */
            }).Build();
    /* some HttpClient usage managed by the pipeline */

Looks like the code above may throw: HttpRequestException, JsonException, OperationCancelledException because of timeout, and OperationCancelledException because of the service being stopped. Feels counter-intuitive.

@peter-csala
Copy link
Contributor

I'm a bit busy this week but I can jump on this issue next week. @martincostello Can you please assign me to this ticket?

@peter-csala
Copy link
Contributor

peter-csala commented Nov 15, 2024

Your observed behavior is due to the fact that the Task.Delay does not respect the cancellation token.

If it would respect:

await Task.Delay(2000, cancellationToken);

then the outputs will be consistent

Test case Output
await TestCancel(TimeSpan.FromSeconds(1)); Attempt: 0
Execution was canceled.
await TestCancel(TimeSpan.FromSeconds(3)); Attempt: 0
Execution was canceled.
await TestCancel(TimeSpan.FromSeconds(5)); Attempt: 0
Attempt: 1
Execution was canceled.
await TestCancel(TimeSpan.FromSeconds(7)); Attempt: 0
Attempt: 1
Execution was canceled.

I've created a simple dotnet fiddle for this: https://dotnetfiddle.net/fmvUFr

@peter-csala
Copy link
Contributor

I hit this issue today while writing tests for a class that uses Polly.

I believe the policy should always throw OperationCancelledException when being stopped and provide some other means to get the outcome if the user needs it.

For example, this is how I start a background service that uses HttpClient:

...

Looks like the code above may throw: HttpRequestException, JsonException, OperationCancelledException because of timeout, and OperationCancelledException because of the service being stopped. Feels counter-intuitive.

Could you please share with us how do you use the pipeline?

@kmcclellan
Copy link
Author

Your observed behavior is due to the fact that the Task.Delay does not respect the cancellation token.

Obviously canceling the delay will cause the callback to throw. My example simulates the callback completing while the token is in the cancellation requested state. This is quite possible and needs to be handled correctly. Surfacing an outcome that would have been retried is not expected cancellation behavior.

@kmcclellan
Copy link
Author

Instead of:

if (context.CancellationToken.IsCancellationRequested || isLastAttempt || !handle)
{
    return outcome;
}

We want:

if (isLastAttempt || !handle)
{
    return outcome;
}

context.CancellationToken.ThrowIfCancellationRequested();

https://github.com/App-vNext/Polly/blob/7567acc330c9f74d6c7b14897a8ef4d15e9d475a/src/Polly.Core/Retry/RetryResilienceStrategy.cs#L70-73

@quixoticaxis
Copy link

quixoticaxis commented Nov 15, 2024

Could you please share with us how do you use the pipeline?

I'll share the code a bit later, but the issue is, IMHO, perfectly summarized by @kmcclellan .

@quixoticaxis
Copy link

quixoticaxis commented Nov 15, 2024

First of all, another example, as a test:

[Fact]
public async Task RetryShouldThrowOperationCancelledExceptionOnBeingCancelled()
{
    var pipeline = new ResiliencePipelineBuilder<int>()
        .AddRetry(new()
        {
            UseJitter = false,
            MaxRetryAttempts = 1_000,
            Delay = TimeSpan.FromMilliseconds(1),
            ShouldHandle = new PredicateBuilder<int>().Handle<InvalidOperationException>()
        })
        .Build();

    using var stopper = new CancellationTokenSource();

    int counter = 0;

    int? result = null;

    var exception = await Record.ExceptionAsync(
        async () =>
        {
            result = await pipeline.ExecuteAsync(
                async (token) =>
                {
                    await Task.Yield();
                    ++counter;
                    stopper.Cancel(); // cancellation happens

                    if (counter < 500) // nearly at the same time
                    {
                        throw new InvalidOperationException(); // as failure
                    }
                    else
                    {
                        return 1;
                    }
                },
                stopper.Token);
        });

    result.Should().BeNull(); // Passes
    counter.Should().Be(1); // Passes
    exception.Should().NotBeNull().And.BeOfType<OperationCanceledException>(); // Fails! But why does it return an outcome? I have a thousand attempts left
}

@quixoticaxis
Copy link

quixoticaxis commented Nov 15, 2024

Real code:

var retryPipeline = new ResiliencePipelineBuilder()
    .AddRetry(
        new RetryStrategyOptions
        {
            ShouldHandle = new PredicateBuilder()
                .Handle<HttpRequestException>() // HTTP failure
                .Handle<OperationCanceledException>( // HTTP timeout, not a cancellation
                    exception => exception.CancellationToken != cancellationToken),
            BackoffType = DelayBackoffType.Constant,
            UseJitter = true,
            MaxRetryAttempts = configuration.MaxRetryAttempts,
            Delay = configuration.RetryPeriod,
            OnRetry = HandleOnRetryCallback
        }).Build();
// ...       
await retryPipeline.ExecuteAsync(
    async token => await fetcher.UpdateAsync(token), // uses HTTP client and JSON deserialization inside
    cancellationToken);

This code may throw HttpRequestException, JsonException, OperationCanceledException (which is a timeout) on cancellation. Or it may throw OperationCanceledException if it's currently in a "waiting" state.

I use the following fix:

async Task RetryAndFixPollyExceptionsAsync(
    ResiliencePipeline retryPipeline,
    CancellationToken cancellationToken)
{
   try
   {
       await retryPipeline.ExecuteAsync(
           async token => await fetcher.UpdateAsync(token),
           cancellationToken);
   }
   catch (Exception caught) when (cancellationToken.IsCancellationRequested
       && caught is HttpRequestException
           or OperationCanceledException
           or JsonException)
   {
       cancellationToken.ThrowIfCancellationRequested();
   }
}

which is a dirty hack, IMHO.

@peter-csala
Copy link
Contributor

Thank you @kmcclellan and @quixoticaxis for the input. I will start working on this today. I need to double check whether other strategies are impacted as well (like Timeout and Hedging).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants