
TokenCleanupService creates a sub optimal query plan with an expensive keylookup #1304

Closed
pregress opened this issue Jun 20, 2024 · 7 comments


@pregress

pregress commented Jun 20, 2024

Which version of Duende IdentityServer are you using?
7.0.5

Which version of .NET are you using?
.NET 8

Describe the bug

https://github.com/DuendeSoftware/IdentityServer/blob/e9860c6488f90e8fbc11a4452b9dd111dbfae933/src/EntityFramework.Storage/TokenCleanup/TokenCleanupService.cs#L92

This fetches the whole PersistedGrant object just to perform a cleanup, which causes very expensive key lookups.
See execution plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0
This could be fixed by using the existing index IX_PersistedGrants_Expiration correctly: if only Id and Expiration are returned, the query can be served from the index alone.
But this requires a change to the public API of IOperationalStoreNotification, since it works with an IEnumerable<PersistedGrant>.

The fix would be:

var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow)
                .OrderBy(x => x.Expiration)
                .Select(x => new { x.Id, x.Expiration }); // project only the indexed columns to avoid the key lookup
// ..
if (_operationalStoreNotification != null)
{
    await _operationalStoreNotification.PersistedGrantsRemovedAsync(expiredGrants); // <= would need a different type, or a separate implementation for when _operationalStoreNotification is null
}

To Reproduce

  1. Have a lot of PersistedGrants
  2. Run the token cleanup
  3. Check query store.

Expected behavior

Cleanup should not produce an execution plan with a key lookup.

Log output/exception with stacktrace
N/A

Additional context
Adding include columns to the existing index would also solve this, but then the index is effectively a copy of the whole table.

The query plan also contains the OPENJSON issue (DuendeSoftware/IdentityServer#1564), which also skews the estimated number of rows, even though the estimate should always equal the batch size. In our case it's not a big deal, but I can imagine situations where this would go wrong.

@AndersAbel
Member

Thank you for sharing the details of the performance of this query. We rewrote the token cleanup for v7 with the intention of fixing race conditions and also making it more performant.

But we do have some limitations:

But this requires a change to the public API of IOperationalStoreNotification since it works with an IEnumerable<PersistedGrant>

We are very careful about introducing breaking changes. We considered the fix you propose, so that the full entities would not have to be loaded, but decided that it did not justify a breaking change. The fix we considered was to check whether there are any subscribers to the notifications and, if not, simply not load the entities at all.
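
For illustration, a minimal sketch of that idea. This is not the shipped implementation; it assumes the fields of TokenCleanupService (_persistedGrantDbContext, _operationalStoreNotification, _options) and a hypothetical helper method:

// Sketch only: load the full entities only when someone is subscribed to the removal notifications.
private async Task<int> RemoveExpiredBatchAsync(CancellationToken ct)
{
    var expired = _persistedGrantDbContext.PersistedGrants
        .Where(x => x.Expiration < DateTime.UtcNow)
        .OrderBy(x => x.Expiration)
        .Take(_options.TokenCleanupBatchSize);

    if (_operationalStoreNotification == null)
    {
        // No subscribers: only Id and Expiration are needed, so IX_PersistedGrants_Expiration
        // can serve the query without a key lookup.
        var expiredIds = await expired.Select(x => new { x.Id, x.Expiration }).ToArrayAsync(ct);
        // ... delete by Id ...
        return expiredIds.Length;
    }

    // Someone is listening: the full entities are needed for PersistedGrantsRemovedAsync.
    var expiredGrants = await expired.ToArrayAsync(ct);
    // ... delete by Id, then notify ...
    await _operationalStoreNotification.PersistedGrantsRemovedAsync(expiredGrants);
    return expiredGrants.Length;
}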

It is very valuable for us to get the actual performance numbers and the query plan details. The conclusions I draw from this are:

  1. The query utilizes the index as expected, which is a very efficient operation.
  2. A clustered index key lookup is usually a fast operation, so even though it accounts for 99% of the cost in this query, that may simply indicate that the index seek is even faster.

Do you have any actual durations for this, i.e. what is the query duration in milliseconds?

Any optimization of this query also has to take the next query into account. This query brings all the relevant database pages into RAM and the CPU cache. The next query deletes the data, and that operation needs access to the same database pages, so it benefits from those pages already being in RAM/cache.

Regarding the OpenJSON issue: that is really not an IdentityServer issue, it is an incompatibility between EF Core and SQL Server. If it were a major performance issue we would still have to handle it, but our assessment so far has been that, given the way our queries are written, the OpenJSON row-estimate problem should not matter.

If you have a query plan for the deletion query, that would be valuable to see in order to understand how OpenJSON is handled.

@pregress
Author

I agree that the query uses the index and that part is efficient, but the key lookup makes the query plan suboptimal.

Here are the performance metrics for:

Delete expired with subquery

(@__First_Expiration_0 datetime2(7),@__Last_Expiration_1 datetime2(7),@__foundIds_2 nvarchar(4000))DELETE FROM [p]
FROM [PersistedGrants] AS [p]
WHERE [p].[Id] IN (
    SELECT [p0].[Id]
    FROM [PersistedGrants] AS [p0]
    WHERE [p0].[Expiration] < GETUTCDATE() AND [p0].[Expiration] >= @__First_Expiration_0 AND [p0].[Expiration] <= @__Last_Expiration_1 AND [p0].[Id] IN (
        SELECT [f].[value]
        FROM OPENJSON(@__foundIds_2) WITH ([value] bigint '$') AS [f]
    )
)
INTERVAL CPU[%] DATA IO[%] LOG IO[%] DURATION[hh:mm:ss] EXECUTIONS COUNT
5/26 - 5/27 0.06 0.61 0.17 00:09:40.120 444
5/27 - 5/28 0.06 0.55 0.16 00:13:57.300 403
5/28 - 5/29 0.07 0.7 0.16 00:14:22.730 420
5/29 - 5/30 0.08 0.73 0.18 00:14:30.200 455
5/30 - 5/31 0.09 0.98 0.19 00:19:15.630 474
5/31 - 6/1 0.09 1.12 0.19 00:43:22.420 460
6/1 - 6/2 0.08 0.83 0.15 00:21:14.860 374
6/2 - 6/3 0.06 0.82 0.08 00:51:13.220 203
6/3 - 6/4 0.11 1.14 0.27 00:29:57.910 692

Here is the execution plan for the actual delete query: https://www.brentozar.com/pastetheplan/?id=B1SS51dL0
It also contains a table spool, although at only 6% of the cost; I don't think there is anything wrong with the actual delete. Writing is simply expensive on slow Azure storage.
https://erikdarling.com/optimizing-sql-server-query-plans-with-table-spools-in-them/

Select expired records (with the expensive key lookup)

INTERVAL CPU[%] DATA IO[%] LOG IO[%] DURATION[hh:mm:ss] EXECUTIONS COUNT
5/26 - 5/27 0.01 0.26 0 00:02:48.630 444
5/27 - 5/28 0.01 0.22 0 00:02:59.690 404
5/28 - 5/29 0.01 0.14 0 00:01:59.950 420
5/29 - 5/30 0.02 0.33 0 00:03:45.790 456
5/30 - 5/31 0.01 0.19 0 00:02:20.290 475
5/31 - 6/1 0.01 0.23 0 00:04:29.770 461
6/1 - 6/2 0.01 0.23 0 00:02:54.680 376
6/2 - 6/3 0 0.05 0 00:01:36.050 205
6/3 - 6/4 0.02 0.48 0 00:06:05.070 692
6/4 - 6/5 0.02 0.34 0 00:03:40.900 400

Query plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0

We are running on Azure SQL, so every unnecessary page retrieval (e.g. a key lookup for data that is not used by X% of the implementations) is always bad for performance. We temporarily scaled up our Azure SQL instance to mitigate this issue.

@pregress pregress changed the title TokenCleanupService doesn't fully utilize the existing database indexes. TokenCleanupService creates a sub optimal query plan with an expensive keylookup Jun 25, 2024
@AndersAbel
Member

Thanks for sharing the numbers. I did some calculations to understand how much of the total DB time is caused by the select. Taking the first line as an example: the delete is 580 seconds and the select is 168 seconds, so the total is 748 seconds and the select makes up 22% of that day's DB time.

If I get this right, over the whole period the select only makes up about 11% of the total DB time. I don't think it's worth spending more time optimizing this at the EF abstraction layer.

For high volume/performance deployments we recommend tuning the database with the features available in the DB engine used, and possibly replacing or reworking the queries to be optimal for the specific database.

If you are using SQL Server, I think you could possibly get performance gains by changing the indexes:

  1. Change the clustered index to be the expiry time.
  2. Create a new unique index on the key, which includes all columns.

This would make the delete operations efficient, as they would be a range scan at the beginning of the clustered index. The read-by-key operations could then be completed from the new index alone, without any lookup on the clustered index. The drawback would be increased storage requirements and probably slower writes/inserts. As with any DB tuning, you would have to test and measure the performance in your specific environment. A rough sketch of how these index changes could be expressed follows below.
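
As a sketch only (not something IdentityServer ships): assuming the default Duende PersistedGrant entity and that you can hook into model configuration in your own migrations context, the two index changes could be expressed with the EF Core SQL Server provider roughly like this. In practice a hand-written migration or plain T-SQL may be the simpler route.

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    base.OnModelCreating(modelBuilder);

    modelBuilder.Entity<PersistedGrant>(grant =>
    {
        // Make the primary key non-clustered so Expiration can become the clustered index.
        grant.HasKey(x => x.Id).IsClustered(false);

        // 1. Cluster on the expiry time: expired rows form a contiguous range at the
        //    beginning of the clustered index, so the batched deletes become a range scan.
        grant.HasIndex(x => x.Expiration).IsClustered();

        // 2. Unique index on the key that includes all other columns, so read-by-key
        //    operations are served from this index without touching the clustered index.
        grant.HasIndex(x => x.Key)
            .IsUnique()
            .IncludeProperties(x => new
            {
                x.Type, x.SubjectId, x.SessionId, x.ClientId, x.Description,
                x.CreationTime, x.Expiration, x.ConsumedTime, x.Data
            });
    });
}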

@AndersAbel
Member

I noticed now that the execution plans look like the estimated execution plans. Could you please share the actual execution plans, as those would be needed to find out whether the OpenJSON problem causes a performance issue here.

@pregress
Author

pregress commented Jul 3, 2024

We will implement our own TokenCleanupService that utilizes the existing indexes.

Regarding the actual execution plan, we don't have the actual plan in query store.

@pregress pregress closed this as completed Jul 3, 2024
@pregress
Author

pregress commented Jul 5, 2024

For those who need the implementation:

// Drop-in replacement for the default TokenCleanupService: removes expired grants, consumed grants
// and device flow codes in batches with ExecuteDeleteAsync, without materializing the entities first.
public class FastTokenCleanUpService : TokenCleanupService
{
    private readonly OperationalStoreOptions _options;
    private readonly IPersistedGrantDbContext _persistedGrantDbContext;
    private readonly ILogger<TokenCleanupService> _logger;

    public FastTokenCleanUpService(OperationalStoreOptions options,
        IPersistedGrantDbContext persistedGrantDbContext,
        ILogger<TokenCleanupService> logger,
        IOperationalStoreNotification operationalStoreNotification = null) : base(options, persistedGrantDbContext, logger, operationalStoreNotification)
    {
        _options = options;
        _persistedGrantDbContext = persistedGrantDbContext;
        _logger = logger;

        if (operationalStoreNotification != null)
        {
            throw new NotSupportedException(
                "FastTokenCleanUpService should not be used in combination with operationalStoreNotification, fall back on the default identity behavior");
        }
    }

    protected override async Task RemoveExpiredPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} expired grants", found);
        }
    }

    protected override async Task RemoveConsumedPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        var delay = TimeSpan.FromSeconds(_options.ConsumedTokenCleanupDelay);
        var consumedTimeThreshold = DateTime.UtcNow.Subtract(delay);

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.ConsumedTime < consumedTimeThreshold);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} consumed grants", found);
        }
    }

    protected override async Task RemoveDeviceCodesAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.DeviceFlowCodes
                .Where(x => x.Expiration < DateTime.UtcNow);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {deviceCodeCount} device flow codes", found);
        }
    }
}


// program.cs OR startup.cs
// after AddOperationalStore

    var serviceDescriptor = builder.Services.Single(descriptor => descriptor.ServiceType == typeof(ITokenCleanupService));
    builder.Services.Remove(serviceDescriptor);
    builder.Services.AddTransient<ITokenCleanupService, FastTokenCleanUpService>();

@AndersAbel
Member

It looks like you have found a solution that you are happy with, and that is enough to close the issue. If anyone else runs into similar issues, please open a new issue; with more profiling data we would be happy to investigate how our implementation can be improved.
