
Fix race in EvictableCache#removeDangling #23401

Conversation

@oskar-szwajkowski (Contributor) commented Sep 13, 2024

Description

Previously, dataCache.asMap().containsKey() reported false while a cache entry was being loaded for a key, causing the token to be removed. Then, after the value was loaded, the token no longer existed, so the next thread checking the cache had to regenerate the token itself, making the cache less useful for long-running load operations.
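
For context, here is a minimal sketch of the racy pattern; it is an illustration only, not the actual EvictableCache code. The dataCache name comes from the description above, while the tokens field and the Token shape are assumptions.

import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import com.google.common.cache.Cache;

// Illustrative sketch only: a token map in front of a Guava cache keyed by Token.
class RacyCacheSketch<K, V>
{
    private final ConcurrentHashMap<K, Token<K>> tokens = new ConcurrentHashMap<>();
    private final Cache<Token<K>, V> dataCache;

    RacyCacheSketch(Cache<Token<K>, V> dataCache)
    {
        this.dataCache = dataCache;
    }

    V get(K key, Callable<V> loader)
            throws ExecutionException
    {
        Token<K> token = tokens.computeIfAbsent(key, Token::new);
        try {
            return dataCache.get(token, loader);
        }
        finally {
            // Every caller runs this, including callers that returned a cached value
            // while another thread's load for the same token was still completing.
            removeDangling(token);
        }
    }

    private void removeDangling(Token<K> token)
    {
        // Race: while a load for this token is still in flight, containsKey() is false,
        // so the token is removed; subsequent callers then regenerate a new token and
        // cannot reuse the value loaded under the old one.
        if (!dataCache.asMap().containsKey(token)) {
            tokens.remove(token.getKey(), token);
        }
    }

    // Tokens use identity equality, so a regenerated token is a different cache key.
    static final class Token<K>
    {
        private final K key;

        Token(K key)
        {
            this.key = key;
        }

        K getKey()
        {
            return key;
        }
    }
}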

Additional context and related issues

Closes #23384

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Fixed a race in EvictableCache that manifested as eager cache entry invalidation.
It could happen whenever thread B was waiting for key 1, which was locked by thread A computing the cache value for key 1. It sometimes led to invalidation of key 1, preventing subsequent threads accessing this key from reusing the cached value.

@@ -585,4 +588,26 @@ public void testPutOnNonEmptyCacheImplementation()
.isInstanceOf(UnsupportedOperationException.class)
.hasMessage("The operation is not supported, as in inherently races with cache invalidation");
}

@RepeatedTest(1000)
Contributor (review comment):

omg

@oskar-szwajkowski (Contributor Author):

Just concurrency-issue things. I think it's worth keeping as is, since the race was easier to reproduce locally with it, and the test itself takes 1-2 ms to execute.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from f1e104e to 59f566b on September 13, 2024 14:32
try (ExecutorService executor = Executors.newFixedThreadPool(2)) {
    Runnable longRunningCacheLoad = () -> {
        try {
            cache.get("key", () -> "value");
Contributor (review comment):

Is it possible to make this always fail (or fail more frequently) by using a synchronization primitive within the value loader, to ensure that both loaders run at the same time?

@oskar-szwajkowski (Contributor Author):

good call, will try that in a second

@oskar-szwajkowski (Contributor Author):

I cannot synchronize anywhere inside cache.get(), as that locks the key; the actual race happens after the cache result is returned, in the finally block. I am looking into whether I can make some test-only abstraction for that.

@martint requested a review from findepi on September 13, 2024 16:04
@oskar-szwajkowski (Contributor Author) commented Sep 13, 2024

So I think I get what is happening, but I can't find a way to reliably test it with a single test run.

Here is my logging output after a test failure (on the original code, before the changes from this PR were applied), running with a 10-thread executor and a 4 s wait inside the cache loader:

2024-09-13T10:41:46.446-0600	INFO	pool-2-thread-1	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-5	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-7	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-2	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-8	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-9	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-6	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-3	stdout	Inside executor#submit
2024-09-13T10:41:46.451-0600	INFO	pool-2-thread-4	stdout	Inside executor#submit
2024-09-13T10:41:46.453-0600	INFO	pool-2-thread-4	stdout	Inside cache#get, Sleeping for 4000ms
2024-09-13T10:41:50.456-0600	INFO	pool-2-thread-4	stdout	Sleep finish
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-5	stdout	Inside EvictableCache#removeDangling if statement, Removing key: key
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-5	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-6	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-4	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-1	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-3	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-2	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-8	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-9	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-7	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.463-0600	INFO	ForkJoinPool-1-worker-1	stdout	null

Only pool-2-thread-4 actually gets to compute the value, which means the cache is locking on "key" correctly. But after it computes, the other threads call cache.get(), get their value returned immediately (from the cache), and reach the finally block before pool-2-thread-4 does (pool-2-thread-5 logs removeDangling first).

I think what happens is that when a thread which has not written the cache value calls dataCache.asMap().containsKey(), it doesn't see the value written by the other thread (the value wasn't safely published).

Synchronizing on the token makes safe publication possible, and the value is visible to all threads upon entering the synchronized block.
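
Continuing the sketch from the PR description, the idea reads roughly like this (a hedged illustration, not the exact diff in this PR; it assumes the loading path also synchronizes on the same token after writing the entry, and reuses the assumed tokens/dataCache names):

private void removeDangling(Token<K> token)
{
    synchronized (token) {
        // A value written by a loader that released this monitor before we acquired it
        // is guaranteed to be visible here, so the token of a just-published entry is
        // no longer removed.
        if (!dataCache.asMap().containsKey(token)) {
            tokens.remove(token.getKey(), token);
        }
    }
}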

@wendigo (Contributor) commented Sep 13, 2024

@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

@wendigo (Contributor) commented Sep 13, 2024

https://bugs.openjdk.org/browse/JDK-8305999 (-XX:LockingMode=2)
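
For reference, one possible way to pass that flag to the test JVM; the module path, test class, and Maven properties below are assumptions about the build setup, not taken from this PR:

# hypothetical invocation; the argLine override may need to be merged with the project's own JVM args
./mvnw test -pl lib/trino-cache -Dtest=TestEvictableCache -DargLine="-XX:LockingMode=2"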

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

Oh crap
(screenshot attached)

But without this flag it also failed for me once

(screenshot attached)

So I guess there is still some race, but it's very hard to reproduce; it's better than it was. Here is the result with the old code:
(screenshot attached)

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 59f566b to 16c2399 on September 13, 2024 19:47
@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

Okay, I have synchronized a few places which looked to me like they could contribute to race conditions (and they are pretty short code paths that only read/write local maps, so they shouldn't contribute much performance loss).

After running over a million of those tests (both with and without -XX:LockingMode=2), there was not a single failure.

I can't come up with a single test that reproduces this issue though, but this code should be much more thread-safe than the previous one, so I'd say merging it won't hurt even if there is no perfect test case yet.

@wendigo (Contributor) commented Sep 13, 2024

@oskar-szwajkowski JMH?

@findepi (Member) commented Sep 14, 2024

are build failures related?

@oskar-szwajkowski (Contributor Author), quoting the above:
are build failures related?

Unlikely; it was all green before I added synchronization around ongoing loadings.

This is likely a flake, but I cannot retry it; let me know if I should rebase and push.

@findepi (Member) commented Sep 14, 2024

This is likely a flake,

If we just ignore them, they will come for us tomorrow.
Please triage, create the necessary issues, and see whether you can fix them.

but I cannot retry it, let me know if I should rebase and push

Please add an empty commit.
This strategy is a bit expensive (it re-runs the whole build), but (0) it's self-service and doesn't require a maintainer, (1) it doesn't include a rebase, so it doesn't mess with the review process, and (2) it makes it very clear there are flakes, since you can easily compare build results between code-equivalent commits. 0 + 1 + 2 => pure awesomeness.
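
For example (the commit message text is arbitrary):

git commit --allow-empty -m "Empty commit to re-trigger CI"
git push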

@oskar-szwajkowski (Contributor Author) commented Sep 14, 2024

The Raptor test flakiness has already been reported: #23399.
As for the single test in test-hive, it seems like an IO issue when reading a file; I will check whether it has happened before.

Update: the trino-hive test is also already reported: #22861.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch 5 times, most recently from a7240c3 to 2c77163, on September 15, 2024 12:26
@wendigo (Contributor) commented Sep 16, 2024

@oskar-szwajkowski can you write a JMH benchmark that will check the performance of the additional synchronization?

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you write a JMH benchmark that will check the performance of the additional synchronization?

yes, will do and post results here and in the commit

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 2c77163 to 321e44a on September 16, 2024 11:16
@oskar-szwajkowski (Contributor Author):
@wendigo I added the JMH benchmark I used, and also uploaded the result with the new code in test resources, for easier comparison.

Here are the results with the older code:
EvictableCacheBenchmark-result-2024-09-16T12:16:22.517751.json

As you can see, only when cache loading takes no time is there a penalty of a few microseconds on each cache get.

But as soon as the cache loading time is measurable, a cache miss contributes greatly to execution time; with the new synchronization, cache retrieval is constant with respect to cache load time.
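
For reference, the rough shape of such a benchmark (a sketch only, not the exact benchmark added in this PR; the EvictableCacheBuilder usage, package names, class name, and parameter values are assumptions):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import io.trino.cache.EvictableCacheBuilder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Sketch of the benchmark shape: several threads hit the same key while the loader
// simulates a configurable load time, so extra loads caused by a dropped token show
// up directly in the average per-get time.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BenchmarkEvictableCacheSketch
{
    @Param({"0", "1", "10"})
    public long loadTimeMillis;

    private Cache<String, String> cache;

    @Setup
    public void setup()
    {
        cache = EvictableCacheBuilder.newBuilder()
                .maximumSize(1_000)
                .build();
    }

    @Benchmark
    @Threads(8)
    public String get()
            throws Exception
    {
        return cache.get("key", () -> {
            if (loadTimeMillis > 0) {
                Thread.sleep(loadTimeMillis);
            }
            return "value";
        });
    }
}

The point of the shared key plus a sleeping loader is that, with the old racy code, a dropped token forces repeated loads and the per-get average grows with load time, while with the fix the value is loaded once and the average stays flat.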

@wendigo (Contributor) commented Sep 16, 2024

Can you visualize before/after using https://jmh.morethan.io/ ?

@oskar-szwajkowski (Contributor Author):
Before: (screenshot attached)

After: (screenshot attached)

Comparison: (screenshots attached)

As you can see, there is a slight decline (or no change) when cache loading takes 0 seconds, but as soon as cache loading takes any time, the synchronized version is better.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch 2 times, most recently from 95dcbc0 to 23a137c, on September 16, 2024 13:54
@wendigo (Contributor) commented Sep 16, 2024

@oskar-szwajkowski can you also post benchmarks for comparison of JDK 22 vs 23?

@wendigo force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 23a137c to 27c0598 on September 16, 2024 21:01
@wendigo (Contributor) commented Sep 16, 2024

I've renamed the benchmark to match the other benchmarks (BenchmarkX instead of XBenchmark).

Commit messages:

Previously, dataCache.asMap().containsKey() reported false while a cache entry was being loaded for a key, removing the token. Then, after the value was loaded, the token didn't exist, so the next thread checking the cache had to regenerate the token itself, making the cache less useful for long loading operations.

Delete the ongoingLoads map in favor of a variable inside the Token instance.
Add test JMH dependencies and benchmark results.
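
A hedged sketch of that last idea, continuing the earlier illustration (the merged code may differ; the counter and method names are assumptions): instead of a separate ongoingLoads map, the Token itself tracks whether a load is in flight, and the dangling check can consult it under the token's monitor.

// Sketch only: per-token load bookkeeping replacing a shared ongoingLoads map.
static final class Token<K>
{
    private final K key;
    private int ongoingLoads; // guarded by synchronized (this)

    Token(K key)
    {
        this.key = key;
    }

    K getKey()
    {
        return key;
    }

    synchronized void loadStarted()
    {
        ongoingLoads++;
    }

    synchronized void loadFinished()
    {
        ongoingLoads--;
    }

    synchronized boolean isLoadOngoing()
    {
        return ongoingLoads > 0;
    }
}

With that, removeDangling can skip tokens whose load is still in flight, in addition to the synchronization sketched earlier.
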
@wendigo force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 27c0598 to 72e3055 on September 17, 2024 06:20
@wendigo merged commit 6ac6f9c into trinodb:master on Sep 17, 2024 (2 of 12 checks passed).
The github-actions bot added this to the 458 milestone on Sep 17, 2024.
@wendigo (Contributor) commented Sep 17, 2024

I hope that fixes it @jkylling

@oskar-szwajkowski (Contributor Author) commented Sep 17, 2024

Here is a comparison between 22.0.1 Temurin and 23.ea.29-open:

Most scenarios are unchanged. In the no-load-time scenarios, JDK 23 is better, which might be because of the new locking mechanism that plays a role in such tight-timing scenarios (it is faster, but only by around 50 microseconds per operation). Some scenarios look like they declined by up to 17%, but this could just be noise in the data, as most of them are unchanged.

(screenshots attached)

@mosabua (Member) commented Sep 17, 2024

I feel like we need an RN (release notes) entry for this. Also, @oskar-szwajkowski, are you on the Trino Slack?

@oskar-szwajkowski (Contributor Author), quoting the above:
i feel like we need an RN entry for this.. also @oskar-szwajkowski are you on trino slack?

Just joined. I'll update the description to include the release note, unless you want me to put it somewhere else.

@wendigo (Contributor) commented Sep 17, 2024

@oskar-szwajkowski already done

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski already done

I have updated the description; if you did as well, I might have overridden your changes (not sure if there is versioning on GitHub's side that would reject my update).

anyway, thanks

Successfully merging this pull request may close: Concurrent access issue in EvictableCache (#23384)