fix(pageserver): gc-compaction race with read #10543
Conversation
cc @VladLazar I cannot think of a better way to fix it other than hacking somewhere. Please take a look, thanks :)
Force-pushed from 87c5fad to f371186.
7414 tests run: 7063 passed, 0 failed, 351 skipped (full report).
Flaky tests (8): Postgres 17, Postgres 16, Postgres 14.
Code coverage (full report): collected from Rust tests only.
The comment gets automatically updated with the latest test results.
9cb3d1f at 2025-01-29T19:24:08.396Z
I'm ok with merging this to see if it solves the bug.
Regarding deadlock risk: the locking rule that you introduce here is `Timeline::gc_compaction_layer_update_lock` before `Timeline::layers`.
I think it's adhered to in all places right now, so there shouldn't be a deadlock.
But
- that's hard to maintain, and
- tokio deadlocks are practically undebuggable,
so I'm quite wary of this one.
I think semantically, the issue is that bottommost compaction aka gc_compaction isn't integrated with the `latest_gc_cutoff_lsn` RCU machinery, right? Can we make it integrated with it instead of adding a new lock? All we'd need to do is lock out the gc task while doing gc_compaction (?)
neon/pageserver/src/tenant/timeline.rs, lines 5377 to 5383 in f371186:

```rust
// We need to ensure that no one tries to read page versions or create
// branches at a point before latest_gc_cutoff_lsn. See branch_timeline()
// for details. This will block until the old value is no longer in use.
//
// The GC cutoff should only ever move forwards.
let waitlist = {
    let write_guard = self.latest_gc_cutoff_lsn.lock_for_write();
```
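For readers following along, here is a minimal, self-contained sketch of the kind of `lock_for_write`/waitlist pattern quoted above. It is not the pageserver's actual `Rcu` type; all names and the synchronous, stricter-than-RCU implementation are assumptions made only to illustrate how a writer can wait out in-flight readers before old data is removed.

```rust
use std::sync::{Condvar, Mutex};

/// Simplified stand-in for an LSN.
type Lsn = u64;

/// A minimal RCU-like cell: readers pin the current cutoff with a guard, and
/// a writer publishing a new cutoff waits until all outstanding guards are
/// dropped before it returns. The real pageserver type is async-aware and
/// only waits for readers of the *old* value; this sketch is stricter and
/// synchronous, purely to illustrate the pattern.
struct GcCutoff {
    inner: Mutex<Inner>,
    readers_gone: Condvar,
}

struct Inner {
    value: Lsn,
    active_readers: usize,
}

struct ReadGuard<'a> {
    cell: &'a GcCutoff,
    value: Lsn,
}

impl GcCutoff {
    fn new(value: Lsn) -> Self {
        Self {
            inner: Mutex::new(Inner { value, active_readers: 0 }),
            readers_gone: Condvar::new(),
        }
    }

    /// Read path: pin the current cutoff for the duration of the read.
    fn read(&self) -> ReadGuard<'_> {
        let mut inner = self.inner.lock().unwrap();
        inner.active_readers += 1;
        ReadGuard { cell: self, value: inner.value }
    }

    /// GC path: publish a new cutoff, then block until no reader guard is
    /// outstanding; only then is it safe to remove data below the old cutoff.
    fn lock_for_write(&self, new_value: Lsn) {
        let mut inner = self.inner.lock().unwrap();
        assert!(new_value >= inner.value, "the GC cutoff only moves forwards");
        inner.value = new_value;
        while inner.active_readers > 0 {
            inner = self.readers_gone.wait(inner).unwrap();
        }
    }
}

impl Drop for ReadGuard<'_> {
    fn drop(&mut self) {
        let mut inner = self.cell.inner.lock().unwrap();
        inner.active_readers -= 1;
        if inner.active_readers == 0 {
            self.cell.readers_gone.notify_all();
        }
    }
}

fn main() {
    let cutoff = GcCutoff::new(100);
    {
        let guard = cutoff.read();
        assert_eq!(guard.value, 100);
    } // guard dropped here, so the write below does not block
    cutoff.lock_for_write(200);
}
```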
Another idea would be a multi-versioned layer map; I elaborated on it in Slack, let's discuss there: https://neondb.slack.com/archives/C033RQ5SPDH/p1738104970588119?thread_ts=1737996099.942049&cid=C033RQ5SPDH
It's kind-of integrated. Gc-compaction only compacts data below the cutoff LSN.
I refreshed my memory on why the read path worked with the legacy gc and compaction:
Legacy compaction has two stages: (1) level 0 compaction and (2) image layer creation.
(1) adds new delta layers to the layer map and removes the old ones. Let's say that we are reading from one L0 and, when we next check the layer map, it has been replaced with a tall one. This is fine: the read path will read from the tall layer at the correct LSN.
(2) Image layer creation: deltas aren't removed as part of compaction, so all good.
As for GC, it is indeed problematic to remove layers while a read is ongoing, as Chi has discovered. My understanding is that we held an `RcuReadGuard` for `latest_gc_cutoff_lsn` while doing the read (here). This prevented gc from kicking in and removing layers while the read is ongoing. When batching was introduced, we lost this property. I'll open a PR to add it back.
@skyzh is it possible to use `Timeline::latest_gc_cutoff_lsn` in the GC compaction like the legacy compaction does (see here)?
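To make that property concrete, here is a tiny stand-in sketch (assumed names; a plain `std::sync::RwLock` plays the role of the Rcu): the read path holds the guard on the cutoff for the entire read, so the gc side, which needs the write half to advance the cutoff before removing anything, has to wait until the read is done.

```rust
use std::sync::RwLock;

type Lsn = u64;

/// Stand-in for the timeline: a plain RwLock plays the role of the
/// `latest_gc_cutoff_lsn` Rcu, just to show the property that a guard held
/// across the whole read keeps the cutoff from advancing mid-read.
struct Timeline {
    latest_gc_cutoff_lsn: RwLock<Lsn>,
}

impl Timeline {
    fn read_at(&self, request_lsn: Lsn) -> Result<(), String> {
        // Hold the guard for the *whole* read, not just for the LSN check.
        let cutoff = self.latest_gc_cutoff_lsn.read().unwrap();
        if request_lsn < *cutoff {
            return Err(format!(
                "requested lsn {request_lsn} is below the gc cutoff {}",
                *cutoff
            ));
        }
        // ... layer map traversal and page reconstruction would happen here,
        // with `cutoff` still held so the cutoff cannot move mid-read ...
        Ok(())
    } // guard dropped here; gc may now advance the cutoff

    fn gc_advance_cutoff(&self, new_cutoff: Lsn) {
        // GC blocks here until no read guard is outstanding.
        let mut cutoff = self.latest_gc_cutoff_lsn.write().unwrap();
        *cutoff = new_cutoff;
        // ... only now would layers below the old cutoff be removed ...
    }
}

fn main() {
    let tl = Timeline { latest_gc_cutoff_lsn: RwLock::new(100) };
    assert!(tl.read_at(150).is_ok());
    assert!(tl.read_at(50).is_err());
    tl.gc_advance_cutoff(150);
}
```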
Force-pushed from f371186 to 95250e9.
@VladLazar I did a little experiment and I don't think using `latest_gc_cutoff_lsn` solves the problem. I've updated the code with the two scenarios I'm thinking about and also the constraints I want to enforce during gc-compaction. Let me know if you have any ideas to make this simpler. Otherwise I'd like to get this patch merged first to see whether it resolves the "key not found" issue and whether we have fixed all existing gc-compaction bugs. We can optimize this lock away later with layer map refactors or other better methods. Note that if gc-compaction doesn't run at all, no tasks will be blocked on the newly-added rwlock.
This prevents gc from kicking in, but it doesn't prevent gc from removing layers while a read is ongoing. Note that gc only holds the Rcu write guard for a small amount of time when updating the cutoff, and then it proceeds without holding the lock.

The reason this doesn't trigger a race condition in the current codebase: gc only removes layers that are fully covered by image layers. In other words, gc only removes layers that won't be accessed by the read path, because we know the read path stops at the image layer. However, gc-compaction will rewrite layers that can be accessed by the read path.

In case (1) described in the code comment, we could do something like: update the layer map to add image layers, sleep for a few seconds, and then update the layer map to remove stale delta layers. However, in case (2), where gc-compaction operates on a branch, it has to rewrite layers that are accessible on the read path, and I think we will have to fix the problem on the read path in the end.
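A rough sketch of the case (1) idea, with made-up layer types and function names, and with a sleep standing in for "wait until every read that started before the image layers were published has finished":

```rust
use std::collections::BTreeSet;
use std::sync::{Arc, RwLock};
use std::time::Duration;

/// Made-up layer representation; only here to make the two phases visible.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Layer {
    Delta(String),
    Image(String),
}

type LayerMap = Arc<RwLock<BTreeSet<Layer>>>;

/// Case (1) idea: publish the image layers first, wait out in-flight reads,
/// and only then drop the delta layers the images replaced.
fn two_phase_gc_compaction(layers: &LayerMap, new_images: Vec<Layer>, stale_deltas: Vec<Layer>) {
    // Phase 1: add image layers. Reads now stop at the images and no longer
    // need the deltas below them.
    layers.write().unwrap().extend(new_images);

    // Grace period: placeholder for a real "all pre-publish reads finished"
    // wait (e.g. via the gc-cutoff RCU).
    std::thread::sleep(Duration::from_millis(100));

    // Phase 2: it is now safe to remove the replaced deltas.
    let mut map = layers.write().unwrap();
    for layer in &stale_deltas {
        map.remove(layer);
    }
}

fn main() {
    let layers: LayerMap = Arc::new(RwLock::new(BTreeSet::from([
        Layer::Delta("delta-0".to_string()),
        Layer::Delta("delta-1".to_string()),
    ])));
    two_phase_gc_compaction(
        &layers,
        vec![Layer::Image("image-0".to_string())],
        vec![
            Layer::Delta("delta-0".to_string()),
            Layer::Delta("delta-1".to_string()),
        ],
    );
    assert_eq!(layers.read().unwrap().len(), 1);
}
```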
Force-pushed from 95250e9 to f91c61e.
Signed-off-by: Alex Chi Z <[email protected]>
Force-pushed from f91c61e to 9cb3d1f.
An alternative idea is to hack the Rcu and add new functionality that first waits for all reads to finish, then allows the user to run some code (i.e., update the layer map), and then unblocks reads. However, note that the layer map update code is async, and ...
The Rcu satisfies this as far as I can tell.
The layer map rw lock gives you this property. If we are updating the layer map, we can't get a read lock and begin traversal.
Again, the layer map lock covers this.
The important bit there is not the write guard, but the ...
Approving since I don't want to block your fix, but I'm not convinced that it can't be done with the primitives that we already have. I'm not a fan of adding a new lock, but perhaps I'm missing something - happy to jump on a call.
Okay, I'll merge this patch first and let's have a call at some point :)
Had a call with Vlad and we agreed that this is the only way to quickly fix it; in the long term we need to make the layer map copy-on-write.
Problem
close #10482
Summary of changes
Add an extra lock on the read path to protect against races. The read path now has the implication that only certain kinds of compaction can be performed: garbage keys must first get an image layer covering the range, and only then be gc-ed -- they cannot be done in one operation. An alternative fix would be to move the `layers` read guard so that it is acquired at the beginning of `get_vectored_reconstruct_data_timeline`, but that was intentionally optimized out and I don't want to regress.

The race is not limited to image layers. Gc-compaction will consolidate deltas automatically and produce a flat delta layer (i.e., when we have retain_lsns below the gc-horizon). The same race would also cause behaviors like getting an un-replayable key history, as in #10049.
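For illustration, here is a minimal sketch of the shape described above. It reuses the lock name from the review discussion, but the types and method names are otherwise assumptions (the real read path is `get_vectored_reconstruct_data_timeline` in `timeline.rs`), and it assumes the tokio crate: the read path takes the new lock in shared mode before touching the layer map, and gc-compaction takes it exclusively for the duration of its layer map update, so the two can never interleave.

```rust
use tokio::sync::RwLock;

#[derive(Default)]
struct LayerMap {
    layers: Vec<String>, // stand-in for the real layer map contents
}

struct Timeline {
    /// The extra lock: the read path takes it shared, gc-compaction takes it
    /// exclusive while it swaps layers. Lock order is always this lock first,
    /// then `layers`.
    gc_compaction_layer_update_lock: RwLock<()>,
    layers: RwLock<LayerMap>,
}

impl Timeline {
    async fn get_vectored_reconstruct_data(&self) {
        // Read path: hold the shared guard across the whole traversal so
        // gc-compaction cannot replace layers underneath us.
        let _update_guard = self.gc_compaction_layer_update_lock.read().await;
        let layers = self.layers.read().await;
        // ... traverse `layers.layers` and collect reconstruct data ...
        let _ = layers.layers.len();
    }

    async fn gc_compaction_update_layer_map(&self, new_layers: Vec<String>) {
        // gc-compaction: taking the exclusive guard waits for in-flight reads
        // to finish and blocks new ones until the layer map swap is done.
        let _update_guard = self.gc_compaction_layer_update_lock.write().await;
        let mut layers = self.layers.write().await;
        layers.layers = new_layers;
    }
}

#[tokio::main]
async fn main() {
    let tl = Timeline {
        gc_compaction_layer_update_lock: RwLock::new(()),
        layers: RwLock::new(LayerMap::default()),
    };
    tl.gc_compaction_update_layer_map(vec!["image-0".into()]).await;
    tl.get_vectored_reconstruct_data().await;
}
```

If gc-compaction never runs, the exclusive side is never taken and reads only ever acquire an uncontended read guard, which matches the note in the discussion that no tasks block on the new rwlock unless gc-compaction is active.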