
fix(pageserver): gc-compaction race with read #10543

Merged — 1 commit merged into main on Jan 30, 2025

Conversation

@skyzh (Member) commented Jan 28, 2025

Problem

close #10482

Summary of changes

Add an extra lock on the read path to protect against races. The read path implicitly assumes that only certain kinds of compaction can be performed. Garbage keys must first be covered by an image layer over the range, and only then gc-ed -- the two steps cannot be done in one operation. An alternative fix is to acquire the layers read guard at the beginning of get_vectored_reconstruct_data_timeline, but that was intentionally optimized out and I don't want to regress it.
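The intent of the extra read-path lock can be sketched as follows. This is an illustrative model only -- TimelineSketch, read_page, and gc_compaction_update are hypothetical names standing in for the real pageserver code, not its actual API. Reads hold the new lock in shared mode for the whole traversal; gc-compaction takes it exclusively around the layer-map swap.

```rust
use std::sync::RwLock;

// Hypothetical sketch, not the pageserver's real types.
struct TimelineSketch {
    gc_compaction_layer_update_lock: RwLock<()>,
    layers: RwLock<Vec<String>>, // stand-in for the real layer map
}

impl TimelineSketch {
    fn read_page(&self) -> Vec<String> {
        // Read path: the shared guard prevents gc-compaction from swapping
        // layers out from under an in-flight read.
        let _guard = self.gc_compaction_layer_update_lock.read().unwrap();
        self.layers.read().unwrap().clone()
    }

    fn gc_compaction_update(&self, new_layers: Vec<String>) {
        // Compaction: the exclusive guard waits for ongoing reads and blocks
        // pending reads until the layer-map swap is complete.
        let _guard = self.gc_compaction_layer_update_lock.write().unwrap();
        *self.layers.write().unwrap() = new_layers;
    }
}

fn main() {
    let t = TimelineSketch {
        gc_compaction_layer_update_lock: RwLock::new(()),
        layers: RwLock::new(vec!["delta-1".to_string()]),
    };
    t.gc_compaction_update(vec!["image-1".to_string()]);
    println!("{:?}", t.read_page());
}
```

Note the lock ordering this implies: gc_compaction_layer_update_lock is always taken before layers, on both paths.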

The race is not limited to image layers. Gc-compaction will consolidate deltas automatically and produce a flat delta layer (i.e., when we have retain_lsns below the gc-horizon). The same race can also produce an un-replayable key history, as in #10049.

@skyzh skyzh requested a review from a team as a code owner January 28, 2025 19:05
@skyzh skyzh requested review from arpad-m and VladLazar January 28, 2025 19:05
@skyzh (Member, Author) commented Jan 28, 2025

cc @VladLazar I cannot think of a better way to fix it other than hacking somewhere. Please help take a look, thanks :)

github-actions bot commented Jan 28, 2025

7414 tests run: 7063 passed, 0 failed, 351 skipped (full report)


Flaky tests (8): Postgres 17, Postgres 16, Postgres 14

Code coverage* (full report)

  • functions: 33.3% (8491 of 25495 functions)
  • lines: 49.1% (71411 of 145539 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9cb3d1f at 2025-01-29T19:24:08.396Z :recycle:

@arpad-m arpad-m removed their request for review January 28, 2025 21:17
@problame (Contributor) left a comment:

I'm ok with merging this to see if it solves the bug.

Regarding deadlock risk: the locking rule you introduce here is to acquire
Timeline::gc_compaction_layer_update_lock before Timeline::layers.

I think it's adhered to in all places right now, so there shouldn't be a deadlock.

But

  1. that's hard to maintain, and
  2. tokio deadlocks are practically undebuggable,

so I'm quite wary of this one.


I think semantically, the issue is that bottommost-compaction aka gc_compaction isn't integrated with the latest_gc_cutoff_lsn RCU machinery, right? Can we integrate it with that machinery instead of adding a new lock? All we'd need to do is lock out the gc task while doing gc_compaction (?)

```rust
// We need to ensure that no one tries to read page versions or create
// branches at a point before latest_gc_cutoff_lsn. See branch_timeline()
// for details. This will block until the old value is no longer in use.
//
// The GC cutoff should only ever move forwards.
let waitlist = {
    let write_guard = self.latest_gc_cutoff_lsn.lock_for_write();
```
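The "lock out the gc task while doing gc_compaction" idea could be modeled as a single mutex shared by the two tasks. A minimal sketch, with invented names (GcCoordination, run_gc, run_gc_compaction) standing in for the real task entry points:

```rust
use std::sync::Mutex;

// Hypothetical sketch: gc and gc_compaction share one mutex, so bottommost
// compaction excludes the gc task for its whole duration instead of adding
// a new lock on the read path. The Vec is just a log of who ran.
struct GcCoordination {
    gc_mutex: Mutex<Vec<&'static str>>,
}

impl GcCoordination {
    fn run_gc(&self) {
        let mut log = self.gc_mutex.lock().unwrap();
        // Layer removal would happen here; gc_compaction is excluded.
        log.push("gc");
    }

    fn run_gc_compaction(&self) {
        let mut log = self.gc_mutex.lock().unwrap();
        // Layer rewrite would happen here; the gc task is excluded.
        log.push("gc_compaction");
    }

    fn log(&self) -> Vec<&'static str> {
        self.gc_mutex.lock().unwrap().clone()
    }
}

fn main() {
    let c = GcCoordination { gc_mutex: Mutex::new(Vec::new()) };
    c.run_gc_compaction();
    c.run_gc();
    println!("{:?}", c.log());
}
```

Mutual exclusion alone does not protect in-flight reads, which is the gap the rest of the thread discusses.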


Another idea would be multi-versioned layer map; elaborated on it in Slack, let's discuss there: https://neondb.slack.com/archives/C033RQ5SPDH/p1738104970588119?thread_ts=1737996099.942049&cid=C033RQ5SPDH

pageserver/src/tenant/timeline.rs — review comment (outdated, resolved)
@skyzh (Member, Author) commented Jan 28, 2025

I think semantically, the issue is that bottommost-compaction aka gc_compaction isn't integrated with the latest_gc_cutoff_lsn RCU machinery, right? Can we integrate it with that machinery instead of adding a new lock? All we'd need to do is lock out the gc task while doing gc_compaction (?)

It's kind-of integrated. Gc-compaction only compacts data below the cutoff lsn.

@VladLazar (Contributor) left a comment:

I refreshed my memory on why the read path worked with the legacy gc and compaction:

Legacy compaction has two stages: (1) level 0 compaction and (2) image layer creation.
(1) Adds new delta layers to the layer map and removes the old ones. Let's say that we are reading from one L0 and when we next check the layer map it was replaced with a tall one. This is fine, the read path will read from the tall layer at the correct LSN.
(2) Image layer creation. Deltas aren't removed as part of compaction so all good.

As for GC, it is indeed problematic to remove layers while a read is on-going as Chi has discovered. My understanding is that we held an RcuReadGuard for latest_gc_cutoff_lsn while doing the read (here). This prevented gc from kicking in and removing layers while the read is ongoing.

When batching was introduced, we lost this property. I'll open a PR to add it back.
@skyzh is it possible to use Timeline::latest_gc_cutoff_lsn in GC compaction like the legacy compaction does (see here)?

@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from f371186 to 95250e9 on January 29, 2025 at 17:42
@skyzh (Member, Author) commented Jan 29, 2025

@VladLazar I did a little experiment and I don't think using latest_gc_cutoff would work. The constraints I would like to achieve with the locks are: (1) all ongoing reads should finish before we update the gc-compaction layer map, (2) all pending reads should block before we update the gc-compaction layer map, and (3) pending reads should only proceed after we finish updating the layer map. Rcu can only achieve (2) but cannot do (1) and (3). Furthermore, gc-compaction doesn't update latest_gc_cutoff at all, so doing locking around latest_gc_cutoff feels more like mocking an RwLock...

I've updated the code with the two scenarios I'm thinking about and the constraints I want to enforce during gc-compaction. Let me know if you have any ideas to make this simpler. Otherwise, I'd like to get this patch merged first to see whether it resolves the "key not found" issue and whether we have fixed all existing gc-compaction bugs. We can optimize this lock away later with layer map refactors or other methods. Note that if gc-compaction doesn't run at all, no tasks will block on the newly-added rwlock.

@skyzh (Member, Author) commented Jan 29, 2025

This prevented gc from kicking in and removing layers while the read is ongoing.

This prevented gc from kicking in, but it didn't prevent gc from removing layers while a read is ongoing. Note that gc only holds the Rcu write guard for a short time while updating the cutoff, and then proceeds without holding the lock.

The reason this doesn't trigger a race condition in the current codebase: gc only removes layers that are fully covered by image layers. In other words, gc only removes layers that won't be accessed by the read path -- we know the read path stops at the image layer.

However, gc-compaction will rewrite layers that can be accessed by the read path. In case (1) described in the code comment, we could do something like: update the layer map to add image layers, sleep for a few seconds, then update the layer map again to remove stale delta layers. However, in case (2), where gc-compaction operates on a branch, it has to rewrite layers that are accessible on the read path, and I think we will have to fix the problem on the read path in the end.

@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from 95250e9 to f91c61e on January 29, 2025 at 17:51
@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from f91c61e to 9cb3d1f on January 29, 2025 at 17:53
@skyzh (Member, Author) commented Jan 29, 2025

An alternative idea is to hack the Rcu and add new functionality that first waits for all reads to finish, then allows the user to run some code (i.e., update the layer map), and then unblocks reads. However, note that the layer-map update code is async, and RcuGuard is not Send.

@VladLazar (Contributor) commented:

(1) all ongoing reads should finish before we update the gc-compaction layer map

The Rcu satisfies this as far as I can tell. RcuWriteGuard::store_and_unlock followed by RcuWaitList::wait will wait until all the ongoing reads are done since each read holds one RcuReadGuard.

(2) all pending reads should block before we update the gc-compaction layer map

The layer map rw lock gives you this property. If we are updating the layer map, we can't get a read lock and begin traversal.

(3) pending reads should only proceed after we finish update the layer map

Again, layer map lock covers this.


This prevented gc from kicking in and removing layers while the read is ongoing.

This prevented gc from kicking in, but it didn't prevent gc from removing layers while a read is ongoing. Note that gc only holds the Rcu write guard for a short time while updating the cutoff, and then proceeds without holding the lock.

The important bit there is not the write guard, but the RcuWaitList::wait call. That's where the waiting happens.
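The store-then-wait pattern being described can be modeled with a toy RCU. This is a sketch only -- ToyRcu and its methods are invented names, not the pageserver's Rcu API, and a real Rcu would only wait for readers of the *old* value, while this sketch waits for all in-flight readers:

```rust
use std::sync::{Condvar, Mutex};

// Toy model of "publish the new value, then wait until every reader that
// might still see the old value has finished".
struct ToyRcu {
    value: Mutex<u64>,
    readers: Mutex<usize>, // how many "read guards" are outstanding
    drained: Condvar,
}

impl ToyRcu {
    fn read<R>(&self, f: impl FnOnce(u64) -> R) -> R {
        *self.readers.lock().unwrap() += 1; // like taking an RcuReadGuard
        let v = *self.value.lock().unwrap();
        let result = f(v);
        let mut n = self.readers.lock().unwrap();
        *n -= 1; // like dropping the guard
        if *n == 0 {
            self.drained.notify_all();
        }
        result
    }

    // The analogue of store_and_unlock followed by the wait-list wait:
    // publish the new value, then block until all readers are done.
    fn store_and_wait(&self, new: u64) {
        *self.value.lock().unwrap() = new;
        let mut n = self.readers.lock().unwrap();
        while *n > 0 {
            n = self.drained.wait(n).unwrap();
        }
    }
}

fn main() {
    let rcu = ToyRcu {
        value: Mutex::new(1),
        readers: Mutex::new(0),
        drained: Condvar::new(),
    };
    rcu.store_and_wait(42);
    println!("{}", rcu.read(|v| v));
}
```

The waiting happens in the condvar loop, not while holding the value lock -- mirroring the point that the write guard itself is only held briefly.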

@VladLazar (Contributor) left a comment:

Approving since I don't want to block your fix, but I'm not convinced that it can't be done with the primitives that we already have. I'm not a fan of adding a new lock, but perhaps I'm missing something - happy to jump on a call.

@skyzh (Member, Author) commented Jan 30, 2025

Okay I'll merge this patch first and let's have a call at some point :)

@skyzh skyzh added this pull request to the merge queue Jan 30, 2025
Merged via the queue into main with commit cf6dee9 Jan 30, 2025
84 checks passed
@skyzh skyzh deleted the skyzh/fix-gc-compaction-read branch January 30, 2025 15:26
@skyzh (Member, Author) commented Jan 30, 2025

Had a call with Vlad and we agreed that this is the only way to quickly fix it; in the long term we need to make the layer map copy-on-write.


Successfully merging this pull request may close these issues.

test_pageserver_gc_compaction_smoke: could not find data for key