
fix(pageserver): gc-compaction race with read #10543

Merged — 1 commit merged into main on Jan 30, 2025

Conversation

@skyzh (Member) commented Jan 28, 2025

Problem

close #10482

Summary of changes

Add an extra lock on the read path to protect against races. The read path implicitly assumes that only certain kinds of compaction can be performed. Garbage keys must first be covered by an image layer over the range, and only then gc-ed -- the two steps cannot be done in one operation. An alternative fix is to acquire the layers read guard at the beginning of get_vectored_reconstruct_data_timeline, but that was intentionally optimized out and I don't want to regress it.
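The intent of the extra read-path lock can be sketched as follows. This is an illustrative model only -- TimelineSketch, read_page, and gc_compaction_update are hypothetical names standing in for the real pageserver code, not its actual API. Reads hold the new lock in shared mode for the whole traversal; gc-compaction takes it exclusively around the layer-map swap.

```rust
use std::sync::RwLock;

// Hypothetical sketch, not the pageserver's real types.
struct TimelineSketch {
    gc_compaction_layer_update_lock: RwLock<()>,
    layers: RwLock<Vec<String>>, // stand-in for the real layer map
}

impl TimelineSketch {
    fn read_page(&self) -> Vec<String> {
        // Read path: the shared guard prevents gc-compaction from swapping
        // layers out from under an in-flight read.
        let _guard = self.gc_compaction_layer_update_lock.read().unwrap();
        self.layers.read().unwrap().clone()
    }

    fn gc_compaction_update(&self, new_layers: Vec<String>) {
        // Compaction: the exclusive guard waits for ongoing reads and blocks
        // pending reads until the layer-map swap is complete.
        let _guard = self.gc_compaction_layer_update_lock.write().unwrap();
        *self.layers.write().unwrap() = new_layers;
    }
}

fn main() {
    let t = TimelineSketch {
        gc_compaction_layer_update_lock: RwLock::new(()),
        layers: RwLock::new(vec!["delta-1".to_string()]),
    };
    t.gc_compaction_update(vec!["image-1".to_string()]);
    println!("{:?}", t.read_page());
}
```

Note the lock ordering this implies: gc_compaction_layer_update_lock is always taken before layers, on both paths.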

The race is not limited to image layers. Gc-compaction will consolidate deltas automatically and produce a flat delta layer (i.e., when we have retain_lsns below the gc-horizon). The same race can also produce an un-replayable key history, as in #10049.

@skyzh skyzh requested a review from a team as a code owner January 28, 2025 19:05
@skyzh skyzh requested review from arpad-m and VladLazar January 28, 2025 19:05
@skyzh (Member, Author) commented Jan 28, 2025

cc @VladLazar I cannot think of a better way to fix it other than hacking somewhere. Please help take a look, thanks :)

github-actions bot commented Jan 28, 2025

7414 tests run: 7063 passed, 0 failed, 351 skipped (full report)


Flaky tests (8): Postgres 17, Postgres 16, Postgres 14

Code coverage* (full report)

  • functions: 33.3% (8491 of 25495 functions)
  • lines: 49.1% (71411 of 145539 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
9cb3d1f at 2025-01-29T19:24:08.396Z :recycle:

@arpad-m arpad-m removed their request for review January 28, 2025 21:17
@problame (Contributor) left a comment:

I'm ok with merging this to see if it solves the bug.

Regarding deadlock risk: the locking rule you introduce here is to acquire
Timeline::gc_compaction_layer_update_lock before Timeline::layers.

I think it's adhered to in all places right now, so there shouldn't be a deadlock.

But

  1. that's hard to maintain, and
  2. tokio deadlocks are practically undebuggable,

so I'm quite wary of this one.


I think semantically, the issue is that bottommost-compaction aka gc_compaction isn't integrated with the latest_gc_cutoff_lsn RCU machinery, right? Can we integrate it with that machinery instead of adding a new lock? All we'd need to do is lock out the gc task while doing gc_compaction (?)

```rust
// We need to ensure that no one tries to read page versions or create
// branches at a point before latest_gc_cutoff_lsn. See branch_timeline()
// for details. This will block until the old value is no longer in use.
//
// The GC cutoff should only ever move forwards.
let waitlist = {
    let write_guard = self.latest_gc_cutoff_lsn.lock_for_write();
```
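The "lock out the gc task while doing gc_compaction" idea could be modeled as a single mutex shared by the two tasks. A minimal sketch, with invented names (GcCoordination, run_gc, run_gc_compaction) standing in for the real task entry points:

```rust
use std::sync::Mutex;

// Hypothetical sketch: gc and gc_compaction share one mutex, so bottommost
// compaction excludes the gc task for its whole duration instead of adding
// a new lock on the read path. The Vec is just a log of who ran.
struct GcCoordination {
    gc_mutex: Mutex<Vec<&'static str>>,
}

impl GcCoordination {
    fn run_gc(&self) {
        let mut log = self.gc_mutex.lock().unwrap();
        // Layer removal would happen here; gc_compaction is excluded.
        log.push("gc");
    }

    fn run_gc_compaction(&self) {
        let mut log = self.gc_mutex.lock().unwrap();
        // Layer rewrite would happen here; the gc task is excluded.
        log.push("gc_compaction");
    }

    fn log(&self) -> Vec<&'static str> {
        self.gc_mutex.lock().unwrap().clone()
    }
}

fn main() {
    let c = GcCoordination { gc_mutex: Mutex::new(Vec::new()) };
    c.run_gc_compaction();
    c.run_gc();
    println!("{:?}", c.log());
}
```

Mutual exclusion alone does not protect in-flight reads, which is the gap the rest of the thread discusses.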


Another idea would be multi-versioned layer map; elaborated on it in Slack, let's discuss there: https://neondb.slack.com/archives/C033RQ5SPDH/p1738104970588119?thread_ts=1737996099.942049&cid=C033RQ5SPDH

pageserver/src/tenant/timeline.rs — review comment (outdated, resolved)
@skyzh (Member, Author) commented Jan 28, 2025

I think semantically, the issue is that bottommost-compaction aka gc_compaction isn't integrated with the latest_gc_cutoff_lsn RCU machinery, right? Can we integrate it with that machinery instead of adding a new lock? All we'd need to do is lock out the gc task while doing gc_compaction (?)

It's kind-of integrated. Gc-compaction only compacts data below the cutoff lsn.

@VladLazar (Contributor) left a comment:

I refreshed my memory on why the read path worked with the legacy gc and compaction:

Legacy compaction has two stages: (1) level 0 compaction and (2) image layer creation.
(1) Adds new delta layers to the layer map and removes the old ones. Let's say that we are reading from one L0 and when we next check the layer map it was replaced with a tall one. This is fine, the read path will read from the tall layer at the correct LSN.
(2) Image layer creation. Deltas aren't removed as part of compaction so all good.

As for GC, it is indeed problematic to remove layers while a read is on-going as Chi has discovered. My understanding is that we held an RcuReadGuard for latest_gc_cutoff_lsn while doing the read (here). This prevented gc from kicking in and removing layers while the read is ongoing.

When batching was introduced, we lost this property. I'll open a PR to add it back.
@skyzh is it possible to use Timeline::latest_gc_cutoff_lsn in GC compaction like the legacy compaction does (see here)?

@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from f371186 to 95250e9 on January 29, 2025 at 17:42
@skyzh (Member, Author) commented Jan 29, 2025

@VladLazar I did a little experiment and I don't think using latest_gc_cutoff would work. The constraints I would like to achieve with the locks are: (1) all ongoing reads should finish before we update the gc-compaction layer map, (2) all pending reads should block before we update the gc-compaction layer map, and (3) pending reads should only proceed after we finish updating the layer map. Rcu can only achieve (2) but cannot do (1) and (3). Furthermore, gc-compaction doesn't update latest_gc_cutoff at all, so doing locking around latest_gc_cutoff feels more like mocking an RwLock...

I've updated the code with the two scenarios I'm thinking about and the constraints I want to enforce during gc-compaction. Let me know if you have any ideas to make this simpler. Otherwise, I'd like to get this patch merged first to see whether it resolves the "key not found" issue and whether we have fixed all existing gc-compaction bugs. We can optimize this lock away later with layer map refactors or other methods. Note that if gc-compaction doesn't run at all, no tasks will block on the newly-added rwlock.

@skyzh (Member, Author) commented Jan 29, 2025

This prevented gc from kicking in and removing layers while the read is ongoing.

This prevented gc from kicking in, but it didn't prevent gc from removing layers while a read is ongoing. Note that gc only holds the Rcu write guard for a short time while updating the cutoff, and then proceeds without holding the lock.

The reason this doesn't trigger a race condition in the current codebase: gc only removes layers that are fully covered by image layers. In other words, gc only removes layers that won't be accessed by the read path -- we know the read path stops at the image layer.

However, gc-compaction will rewrite layers that can be accessed by the read path. In case (1) described in the code comment, we could do something like: update the layer map to add image layers, sleep for a few seconds, then update the layer map again to remove stale delta layers. However, in case (2), where gc-compaction operates on a branch, it has to rewrite layers that are accessible on the read path, and I think we will have to fix the problem on the read path in the end.

@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from 95250e9 to f91c61e on January 29, 2025 at 17:51
@skyzh force-pushed the skyzh/fix-gc-compaction-read branch from f91c61e to 9cb3d1f on January 29, 2025 at 17:53
@skyzh (Member, Author) commented Jan 29, 2025

An alternative idea is to hack the Rcu and add new functionality that first waits for all reads to finish, then allows the user to run some code (i.e., update the layer map), and then unblocks reads. However, note that the layer-map update code is async, and RcuGuard is not Send.

@VladLazar (Contributor) commented:

(1) all ongoing reads should finish before we update the gc-compaction layer map

The Rcu satisfies this as far as I can tell. RcuWriteGuard::store_and_unlock followed by RcuWaitList::wait will wait until all the ongoing reads are done since each read holds one RcuReadGuard.

(2) all pending reads should block before we update the gc-compaction layer map

The layer map rw lock gives you this property. If we are updating the layer map, we can't get a read lock and begin traversal.

(3) pending reads should only proceed after we finish update the layer map

Again, layer map lock covers this.


This prevented gc from kicking in and removing layers while the read is ongoing.

This prevented gc from kicking in, but it didn't prevent gc from removing layers while a read is ongoing. Note that gc only holds the Rcu write guard for a short time while updating the cutoff, and then proceeds without holding the lock.

The important bit there is not the write guard, but the RcuWaitList::wait call. That's where the waiting happens.
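The store-then-wait pattern being described can be modeled with a toy RCU. This is a sketch only -- ToyRcu and its methods are invented names, not the pageserver's Rcu API, and a real Rcu would only wait for readers of the *old* value, while this sketch waits for all in-flight readers:

```rust
use std::sync::{Condvar, Mutex};

// Toy model of "publish the new value, then wait until every reader that
// might still see the old value has finished".
struct ToyRcu {
    value: Mutex<u64>,
    readers: Mutex<usize>, // how many "read guards" are outstanding
    drained: Condvar,
}

impl ToyRcu {
    fn read<R>(&self, f: impl FnOnce(u64) -> R) -> R {
        *self.readers.lock().unwrap() += 1; // like taking an RcuReadGuard
        let v = *self.value.lock().unwrap();
        let result = f(v);
        let mut n = self.readers.lock().unwrap();
        *n -= 1; // like dropping the guard
        if *n == 0 {
            self.drained.notify_all();
        }
        result
    }

    // The analogue of store_and_unlock followed by the wait-list wait:
    // publish the new value, then block until all readers are done.
    fn store_and_wait(&self, new: u64) {
        *self.value.lock().unwrap() = new;
        let mut n = self.readers.lock().unwrap();
        while *n > 0 {
            n = self.drained.wait(n).unwrap();
        }
    }
}

fn main() {
    let rcu = ToyRcu {
        value: Mutex::new(1),
        readers: Mutex::new(0),
        drained: Condvar::new(),
    };
    rcu.store_and_wait(42);
    println!("{}", rcu.read(|v| v));
}
```

The waiting happens in the condvar loop, not while holding the value lock -- mirroring the point that the write guard itself is only held briefly.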

@VladLazar (Contributor) left a comment:

Approving since I don't want to block your fix, but I'm not convinced that it can't be done with the primitives that we already have. I'm not a fan of adding a new lock, but perhaps I'm missing something - happy to jump on a call.

@skyzh (Member, Author) commented Jan 30, 2025

Okay I'll merge this patch first and let's have a call at some point :)

@skyzh skyzh added this pull request to the merge queue Jan 30, 2025
Merged via the queue into main with commit cf6dee9 Jan 30, 2025
84 checks passed
@skyzh skyzh deleted the skyzh/fix-gc-compaction-read branch January 30, 2025 15:26
@skyzh (Member, Author) commented Jan 30, 2025

Had a call with Vlad and we agreed that this is the only way to quickly fix it; in the long term we need to make the layer map copy-on-write.


Successfully merging this pull request may close these issues.

test_pageserver_gc_compaction_smoke: could not find data for key