-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race in EvictableCache#removeDangling #23401
Fix race in EvictableCache#removeDangling #23401
Conversation
@@ -585,4 +588,26 @@ public void testPutOnNonEmptyCacheImplementation() | |||
.isInstanceOf(UnsupportedOperationException.class) | |||
.hasMessage("The operation is not supported, as in inherently races with cache invalidation"); | |||
} | |||
|
|||
@RepeatedTest(1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
omg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just concurrency issues things, I think it worth keeping like that as is was easier to reproduce locally with it, and test itself takes 1-2ms to execute
f1e104e
to
59f566b
Compare
try (ExecutorService executor = Executors.newFixedThreadPool(2)) { | ||
Runnable longRunningCacheLoad = () -> { | ||
try { | ||
cache.get("key", () -> "value"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to make this always fail (or more frequently?) by using a synchronization primitive within the value loader to ensure that the loaders both run at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, will try that in a second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i cannot sync anywhere in cache.get()
, as that locks key, and actual race happen when cache result is returned, in finally
, looking if I can make some test-only abstraction for that
So I think I get what is happening, but I can't find a way to reliably test it with single test run Here is my logging output after test failure (on original code before changes from this PR were applied) when I run with executor with
only I think what happens is that if thread which have not written cache value, get to call Synchronizing on |
@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated. |
https://bugs.openjdk.org/browse/JDK-8305999 ( |
But without this flag it also failed for me once ![]() So I guess there is still some race but its very hard to reproduce, its better than it was (here is result with old code) |
59f566b
to
16c2399
Compare
Okay I have synchronized few places which looked to me like could contribute to race conditions (and are pretty short code paths that read/write only local maps so shouldn't contribute that much to performance loss) After running over milion of those tests (both with and without I can't come up with single test that reproduces this issue tho, but this code should be much more thread-safe than previous one so I'd say merging it won't hurt even if there is no perfect test case yet |
@oskar-szwajkowski JMH? |
are build failures related? |
Unlikely, it was all green before I added synchronize around ongoing loadings This is likely flake, but I cannot retry it, let me know if I should rebase and push |
lib/trino-cache/src/test/java/io/trino/cache/TestEvictableCache.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/main/java/io/trino/cache/EvictableCache.java
Outdated
Show resolved
Hide resolved
If we just ignore them, they will come for us tomorrow.
please add an empty commit. |
a7240c3
to
2c77163
Compare
@oskar-szwajkowski can you write a JMH benchmark that will check the performance of the additional synchronization? |
yes, will do and post results here and in the commit |
2c77163
to
321e44a
Compare
@wendigo added JMH benchmark I used, and also uploaded result with new code in test resources, for easier comparison Here I put results with older code: As you can see, only if cache loading takes 0 time, there is penalty of few microseconds of each cache get but as soon as cache loading time is measurable, then cache miss contributes greatly to execution time, with new synchronization cache retrieval is constant with cache load time |
Can you visualize before/after using https://jmh.morethan.io/ ? |
lib/trino-cache/src/test/resources/benchmarks/EvictableCacheBenchmark-result.json
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/test/java/io/trino/cache/TestEvictableCache.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/test/java/io/trino/cache/EvictableCacheBenchmark.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/test/java/io/trino/cache/EvictableCacheBenchmark.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/main/java/io/trino/cache/EvictableCache.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/main/java/io/trino/cache/EvictableCache.java
Outdated
Show resolved
Hide resolved
lib/trino-cache/src/main/java/io/trino/cache/EvictableCache.java
Outdated
Show resolved
Hide resolved
95dcbc0
to
23a137c
Compare
@oskar-szwajkowski can you also post benchmarks for comparison of JDK 22 vs 23? |
23a137c
to
27c0598
Compare
I've renamed benchmark to match other benchmarks (BenchmarkX instead of XBenchmark) |
Previously, dataCache.asMap().containsKey() reported false when cache entry was loaded for key, removing the token. Then after value was loaded, token didn't exist so next thread checking the cache had to regenerate the token itself, making cache less usable for long loading operations Delete ongoingLoads map in favor of variable inside Token instance
Add test JMH dependencies and benchmark results
27c0598
to
72e3055
Compare
I hope that fixes it @jkylling |
i feel like we need an RN entry for this.. also @oskar-szwajkowski are you on trino slack? |
just joined, I'll update description to include RN, unless you want me to put them somewhere else |
@oskar-szwajkowski already done |
I have updated description, if you did as well I might have overriden your changes (not sure if there is versioning on github's side that would not accept my update) anyway, thanks |
Description
Previously, dataCache.asMap().containsKey() reported false when cache entry was loaded for key, removing the token. Then after value was loaded, token didn't exist so next thread checking the cache had to regenerate the token itself, making cache less usable for long loading operations
Additional context and related issues
Closes #23384
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Fixed race in
EvictableCache
that manifested itself with eager cache entry invalidation.It could happen whenever
Thread B
was waiting for lockedKey 1
byThread A
, to compute cache value forKey 1
. It sometimes led to invalidation ofKey 1
, preventing following threads accessing this key reusing cached value.