
Fix race in EvictableCache#removeDangling #23401

Conversation

@oskar-szwajkowski (Contributor) commented Sep 13, 2024

Description

Previously, dataCache.asMap().containsKey() reported false while a cache entry was being loaded for a key, causing the token to be removed. Then, after the value was loaded, the token no longer existed, so the next thread checking the cache had to regenerate the token itself, making the cache less useful for long-running load operations.
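
For context, here is a minimal sketch of the racy pattern; it is an illustration only, not the actual EvictableCache code. The dataCache name comes from the description above, while the tokens field and the Token shape are assumptions.

import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import com.google.common.cache.Cache;

// Illustrative sketch only: a token map in front of a Guava cache keyed by Token.
class RacyCacheSketch<K, V>
{
    private final ConcurrentHashMap<K, Token<K>> tokens = new ConcurrentHashMap<>();
    private final Cache<Token<K>, V> dataCache;

    RacyCacheSketch(Cache<Token<K>, V> dataCache)
    {
        this.dataCache = dataCache;
    }

    V get(K key, Callable<V> loader)
            throws ExecutionException
    {
        Token<K> token = tokens.computeIfAbsent(key, Token::new);
        try {
            return dataCache.get(token, loader);
        }
        finally {
            // Every caller runs this, including callers that returned a cached value
            // while another thread's load for the same token was still completing.
            removeDangling(token);
        }
    }

    private void removeDangling(Token<K> token)
    {
        // Race: while a load for this token is still in flight, containsKey() is false,
        // so the token is removed; subsequent callers then regenerate a new token and
        // cannot reuse the value loaded under the old one.
        if (!dataCache.asMap().containsKey(token)) {
            tokens.remove(token.getKey(), token);
        }
    }

    // Tokens use identity equality, so a regenerated token is a different cache key.
    static final class Token<K>
    {
        private final K key;

        Token(K key)
        {
            this.key = key;
        }

        K getKey()
        {
            return key;
        }
    }
}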

Additional context and related issues

Closes #23384

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Fixed a race in EvictableCache that manifested as eager cache entry invalidation.
It could happen whenever thread B was waiting for key 1, which was locked by thread A computing the cache value for key 1. It sometimes led to invalidation of key 1, preventing subsequent threads accessing this key from reusing the cached value.

@@ -585,4 +588,26 @@ public void testPutOnNonEmptyCacheImplementation()
.isInstanceOf(UnsupportedOperationException.class)
.hasMessage("The operation is not supported, as in inherently races with cache invalidation");
}

@RepeatedTest(1000)
Contributor (review comment):

omg

@oskar-szwajkowski (Contributor Author):

Just concurrency-issue things. I think it's worth keeping as is, since the race was easier to reproduce locally with it, and the test itself takes 1-2 ms to execute.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from f1e104e to 59f566b on September 13, 2024 14:32
try (ExecutorService executor = Executors.newFixedThreadPool(2)) {
    Runnable longRunningCacheLoad = () -> {
        try {
            cache.get("key", () -> "value");
Contributor (review comment):

Is it possible to make this always fail (or fail more frequently) by using a synchronization primitive within the value loader, to ensure that both loaders run at the same time?

@oskar-szwajkowski (Contributor Author):

good call, will try that in a second

@oskar-szwajkowski (Contributor Author):

I cannot synchronize anywhere inside cache.get(), as that locks the key; the actual race happens after the cache result is returned, in the finally block. I am looking into whether I can make some test-only abstraction for that.

@martint requested a review from findepi on September 13, 2024 16:04
@oskar-szwajkowski (Contributor Author) commented Sep 13, 2024

So I think I get what is happening, but I can't find a way to reliably test it with a single test run.

Here is my logging output after a test failure (on the original code, before the changes from this PR were applied), running with a 10-thread executor and a 4 s wait inside the cache loader:

2024-09-13T10:41:46.446-0600	INFO	pool-2-thread-1	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-5	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-7	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-2	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-8	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-9	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-6	stdout	Inside executor#submit
2024-09-13T10:41:46.450-0600	INFO	pool-2-thread-3	stdout	Inside executor#submit
2024-09-13T10:41:46.451-0600	INFO	pool-2-thread-4	stdout	Inside executor#submit
2024-09-13T10:41:46.453-0600	INFO	pool-2-thread-4	stdout	Inside cache#get, Sleeping for 4000ms
2024-09-13T10:41:50.456-0600	INFO	pool-2-thread-4	stdout	Sleep finish
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-5	stdout	Inside EvictableCache#removeDangling if statement, Removing key: key
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-5	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.460-0600	INFO	pool-2-thread-6	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-4	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-1	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-3	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.461-0600	INFO	pool-2-thread-2	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-8	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-9	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.462-0600	INFO	pool-2-thread-7	stdout	Cache loaded value: [value, thread: pool-2-thread-4]
2024-09-13T10:41:50.463-0600	INFO	ForkJoinPool-1-worker-1	stdout	null

Only pool-2-thread-4 actually gets to compute the value, which means the cache is locking on "key" correctly. But after it computes, the other threads call cache.get(), get their value returned immediately (from the cache), and reach the finally block before pool-2-thread-4 does (pool-2-thread-5 logs removeDangling first).

I think what happens is that when a thread which has not written the cache value calls dataCache.asMap().containsKey(), it doesn't see the value written by the other thread (the value wasn't safely published).

Synchronizing on the token makes safe publication possible, and the value is visible to all threads upon entering the synchronized block.
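
Continuing the sketch from the PR description, the idea reads roughly like this (a hedged illustration, not the exact diff in this PR; it assumes the loading path also synchronizes on the same token after writing the entry, and reuses the assumed tokens/dataCache names):

private void removeDangling(Token<K> token)
{
    synchronized (token) {
        // A value written by a loader that released this monitor before we acquired it
        // is guaranteed to be visible here, so the token of a just-published entry is
        // no longer removed.
        if (!dataCache.asMap().containsKey(token)) {
            tokens.remove(token.getKey(), token);
        }
    }
}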

@wendigo (Contributor) commented Sep 13, 2024

@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

@wendigo (Contributor) commented Sep 13, 2024

https://bugs.openjdk.org/browse/JDK-8305999 (-XX:LockingMode=2)
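
For reference, one possible way to pass that flag to the test JVM; the module path, test class, and Maven properties below are assumptions about the build setup, not taken from this PR:

# hypothetical invocation; the argLine override may need to be merged with the project's own JVM args
./mvnw test -pl lib/trino-cache -Dtest=TestEvictableCache -DargLine="-XX:LockingMode=2"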

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

Oh crap
(screenshot attached)

But without this flag it also failed for me once

(screenshot attached)

So I guess there is still some race, but it's very hard to reproduce; it's better than it was. Here is the result with the old code:
(screenshot attached)

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 59f566b to 16c2399 on September 13, 2024 19:47
@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you test that with a lightweight locking scheme? This will be a default in JDK 23 and I don't want to revisit this PR once the JDK is updated.

Okay, I have synchronized a few places which looked to me like they could contribute to race conditions (and they are pretty short code paths that only read/write local maps, so they shouldn't contribute much performance loss).

After running over a million of those tests (both with and without -XX:LockingMode=2), there was not a single failure.

I can't come up with a single test that reproduces this issue though, but this code should be much more thread-safe than the previous one, so I'd say merging it won't hurt even if there is no perfect test case yet.

@wendigo (Contributor) commented Sep 13, 2024

@oskar-szwajkowski JMH?

@findepi (Member) commented Sep 14, 2024

are build failures related?

@oskar-szwajkowski (Contributor Author), quoting the above:
are build failures related?

Unlikely; it was all green before I added synchronization around ongoing loadings.

This is likely a flake, but I cannot retry it; let me know if I should rebase and push.

@findepi (Member) commented Sep 14, 2024

This is likely a flake,

If we just ignore them, they will come for us tomorrow.
Please triage, create the necessary issues, and see whether you can fix them.

but I cannot retry it, let me know if I should rebase and push

Please add an empty commit.
This strategy is a bit expensive (it re-runs the whole build), but (0) it's self-service and doesn't require a maintainer, (1) it doesn't include a rebase, so it doesn't mess with the review process, and (2) it makes it very clear there are flakes, since you can easily compare build results between code-equivalent commits. 0 + 1 + 2 => pure awesomeness.
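
For example (the commit message text is arbitrary):

git commit --allow-empty -m "Empty commit to re-trigger CI"
git push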

@oskar-szwajkowski (Contributor Author) commented Sep 14, 2024

The Raptor test flakiness has already been reported: #23399.
As for the single test in test-hive, it seems like an IO issue when reading a file; I will check whether it has happened before.

Update: the trino-hive test is also already reported: #22861.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch 5 times, most recently from a7240c3 to 2c77163, on September 15, 2024 12:26
@wendigo (Contributor) commented Sep 16, 2024

@oskar-szwajkowski can you write a JMH benchmark that will check the performance of the additional synchronization?

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski can you write a JMH benchmark that will check the performance of the additional synchronization?

yes, will do and post results here and in the commit

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 2c77163 to 321e44a on September 16, 2024 11:16
@oskar-szwajkowski (Contributor Author):
@wendigo I added the JMH benchmark I used, and also uploaded the result with the new code in test resources, for easier comparison.

Here are the results with the older code:
EvictableCacheBenchmark-result-2024-09-16T12:16:22.517751.json

As you can see, only when cache loading takes no time is there a penalty of a few microseconds on each cache get.

But as soon as the cache loading time is measurable, a cache miss contributes greatly to execution time; with the new synchronization, cache retrieval is constant with respect to cache load time.
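
For reference, the rough shape of such a benchmark (a sketch only, not the exact benchmark added in this PR; the EvictableCacheBuilder usage, package names, class name, and parameter values are assumptions):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import io.trino.cache.EvictableCacheBuilder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Sketch of the benchmark shape: several threads hit the same key while the loader
// simulates a configurable load time, so extra loads caused by a dropped token show
// up directly in the average per-get time.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BenchmarkEvictableCacheSketch
{
    @Param({"0", "1", "10"})
    public long loadTimeMillis;

    private Cache<String, String> cache;

    @Setup
    public void setup()
    {
        cache = EvictableCacheBuilder.newBuilder()
                .maximumSize(1_000)
                .build();
    }

    @Benchmark
    @Threads(8)
    public String get()
            throws Exception
    {
        return cache.get("key", () -> {
            if (loadTimeMillis > 0) {
                Thread.sleep(loadTimeMillis);
            }
            return "value";
        });
    }
}

The point of the shared key plus a sleeping loader is that, with the old racy code, a dropped token forces repeated loads and the per-get average grows with load time, while with the fix the value is loaded once and the average stays flat.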

@wendigo (Contributor) commented Sep 16, 2024

Can you visualize before/after using https://jmh.morethan.io/ ?

@oskar-szwajkowski (Contributor Author):
Before: (screenshot attached)

After: (screenshot attached)

Comparison: (screenshots attached)

As you can see, there is a slight decline (or no change) when cache loading takes 0 seconds, but as soon as cache loading takes any time, the synchronized version is better.

@oskar-szwajkowski force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch 2 times, most recently from 95dcbc0 to 23a137c, on September 16, 2024 13:54
@wendigo (Contributor) commented Sep 16, 2024

@oskar-szwajkowski can you also post benchmarks for comparison of JDK 22 vs 23?

@wendigo force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 23a137c to 27c0598 on September 16, 2024 21:01
@wendigo (Contributor) commented Sep 16, 2024

I've renamed the benchmark to match the other benchmarks (BenchmarkX instead of XBenchmark).

Commit messages:

Previously, dataCache.asMap().containsKey() reported false while a cache entry was being loaded for a key, removing the token. Then, after the value was loaded, the token didn't exist, so the next thread checking the cache had to regenerate the token itself, making the cache less useful for long loading operations.

Delete the ongoingLoads map in favor of a variable inside the Token instance.
Add test JMH dependencies and benchmark results.
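
A hedged sketch of that last idea, continuing the earlier illustration (the merged code may differ; the counter and method names are assumptions): instead of a separate ongoingLoads map, the Token itself tracks whether a load is in flight, and the dangling check can consult it under the token's monitor.

// Sketch only: per-token load bookkeeping replacing a shared ongoingLoads map.
static final class Token<K>
{
    private final K key;
    private int ongoingLoads; // guarded by synchronized (this)

    Token(K key)
    {
        this.key = key;
    }

    K getKey()
    {
        return key;
    }

    synchronized void loadStarted()
    {
        ongoingLoads++;
    }

    synchronized void loadFinished()
    {
        ongoingLoads--;
    }

    synchronized boolean isLoadOngoing()
    {
        return ongoingLoads > 0;
    }
}

With that, removeDangling can skip tokens whose load is still in flight, in addition to the synchronization sketched earlier.
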
@wendigo force-pushed the osz/fix-remove-dangling-race-in-evictable-cache branch from 27c0598 to 72e3055 on September 17, 2024 06:20
@wendigo merged commit 6ac6f9c into trinodb:master on Sep 17, 2024 (2 of 12 checks passed).
The github-actions bot added this to the 458 milestone on Sep 17, 2024.
@wendigo (Contributor) commented Sep 17, 2024

I hope that fixes it @jkylling

@oskar-szwajkowski (Contributor Author) commented Sep 17, 2024

Here is a comparison between 22.0.1 Temurin and 23.ea.29-open:

Most scenarios are unchanged. In the no-load-time scenarios, JDK 23 is better, which might be because of the new locking mechanism that plays a role in such tight-timing scenarios (it is faster, but only by around 50 microseconds per operation). Some scenarios look like they declined by up to 17%, but this could just be noise in the data, as most of them are unchanged.

(screenshots attached)

@mosabua (Member) commented Sep 17, 2024

I feel like we need an RN (release notes) entry for this. Also, @oskar-szwajkowski, are you on the Trino Slack?

@oskar-szwajkowski (Contributor Author), quoting the above:
i feel like we need an RN entry for this.. also @oskar-szwajkowski are you on trino slack?

Just joined. I'll update the description to include the release note, unless you want me to put it somewhere else.

@wendigo (Contributor) commented Sep 17, 2024

@oskar-szwajkowski already done

@oskar-szwajkowski (Contributor Author), quoting the above:
@oskar-szwajkowski already done

I have updated the description; if you did as well, I might have overridden your changes (not sure if there is versioning on GitHub's side that would reject my update).

anyway, thanks

Successfully merging this pull request may close: Concurrent access issue in EvictableCache (#23384)