pageserver: fix race cleaning up timeline files when shut down during bootstrap #10532

Merged: 3 commits into main from jcsp/issue-10389-cleanup-race, Jan 30, 2025

Conversation

jcsp
Collaborator

@jcsp jcsp commented Jan 28, 2025

Problem

Timeline bootstrap starts a flush loop, but doesn't reliably shut down the timeline (including waiting for the flush loop to exit) before destroying UninitializedTimeline, whose destructor tries to clean up local storage. If local storage is still being written to at that point, the cleanup is unsound.

Currently the symptom is that we see a "Directory not empty" error log, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries

Summary of changes

  • Move the fallible IO part of bootstrap into its own function (notably, it can fail if the tenant is shut down while creation is in progress)
  • When that function returns an error, call shutdown() on the timeline
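The two changes above can be illustrated with a minimal sketch. It is not the pageserver code: `Timeline`, `bootstrap_io`, and `bootstrap` are hypothetical names, and a plain `std::thread` stands in for the real flush loop. The point it shows is only the ordering: the fallible step lives in one function, and an error from it triggers `shutdown()` (which waits for the flush loop to exit) before anything else can touch the timeline's files.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a timeline that owns a background flush loop.
struct Timeline {
    stop: Arc<AtomicBool>,
    flush_loop: Option<JoinHandle<()>>,
}

impl Timeline {
    /// Start the timeline together with its flush loop (a plain thread here).
    fn spawn() -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let stop_flag = Arc::clone(&stop);
        let flush_loop = thread::spawn(move || {
            // Pretend to flush until asked to stop.
            while !stop_flag.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
        });
        Timeline { stop, flush_loop: Some(flush_loop) }
    }

    /// Signal the flush loop and wait for it to exit. Safe to call twice.
    fn shutdown(&mut self) {
        self.stop.store(true, Ordering::Release);
        if let Some(handle) = self.flush_loop.take() {
            handle.join().expect("flush loop panicked");
        }
    }

    fn is_shut_down(&self) -> bool {
        self.flush_loop.is_none()
    }
}

/// Assumption: the fallible IO part of bootstrap, reduced to a flag that
/// simulates the tenant being shut down while creation is in progress.
fn bootstrap_io(cancelled: bool) -> Result<(), String> {
    if cancelled {
        Err("tenant shutting down".to_string())
    } else {
        Ok(())
    }
}

/// Returns the IO result and whether the error path shut the timeline down.
fn bootstrap(cancelled: bool) -> (Result<(), String>, bool) {
    let mut timeline = Timeline::spawn();
    let result = bootstrap_io(cancelled);
    if result.is_err() {
        // Stop the flush loop before any cleanup of local files can run.
        timeline.shutdown();
    }
    let stopped_on_error_path = timeline.is_shut_down();
    // Sketch only: stop the thread on the success path too so the demo exits.
    timeline.shutdown();
    (result, stopped_on_error_path)
}
```

Calling `bootstrap(true)` exercises the error path, where the flush loop has been joined by the time the function returns; `bootstrap(false)` leaves it running until the sketch's final `shutdown()`.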

@jcsp jcsp requested a review from a team as a code owner January 28, 2025 10:44
@jcsp jcsp requested a review from skyzh January 28, 2025 10:44
@jcsp jcsp changed the title from "Jcsp/issue 10389 cleanup race" to "pageserver: fix race cleaning up timeline files when shut down during bootstrap" January 28, 2025
@jcsp jcsp force-pushed the jcsp/issue-10389-cleanup-race branch from e0c6856 to e27ba6f Compare January 28, 2025 10:52

github-actions bot commented Jan 28, 2025

7414 tests run: 7063 passed, 0 failed, 351 skipped (full report)


Flaky tests (7): Postgres 17, Postgres 14

Code coverage* (full report)

  • functions: 33.3% (8496 of 25492 functions)
  • lines: 49.1% (71432 of 145523 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
65b60cb at 2025-01-29T22:18:53.197Z

@skyzh
Member

skyzh commented Jan 28, 2025

the test_import_from_vanilla test failure doesn't seem transient?

@jcsp jcsp force-pushed the jcsp/issue-10389-cleanup-race branch from e27ba6f to d5c8e08 Compare January 29, 2025 15:16
@jcsp jcsp marked this pull request as draft January 29, 2025 15:16
@jcsp
Collaborator Author

jcsp commented Jan 29, 2025

> the test_import_from_vanilla test failure doesn't seem transient?

Yeah, the issue was more pervasive than I realised, so once I made the drop() strict about checking for shutdown, other things started failing.

Reworked this to enforce the shutdown rules more reliably, by wrapping everything in an UninitializedTimeline.write() helper that spawns the flush loop and also ensures it gets shut down on failure.
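A rough sketch of what such a helper could look like follows. The real `UninitializedTimeline.write()` signature is not shown in this conversation, so everything here is an assumption: `UninitTimeline`, `FlushLoop`, `finish`, the `cleaned_up` flag, and the closure-based `write` are hypothetical, and a `std::thread` again stands in for the flush loop. The sketch pairs a strict `Drop` (no cleanup while a flush loop may still be writing) with a wrapper that always joins the loop before returning an error.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle to a running flush loop.
struct FlushLoop {
    stop: Arc<AtomicBool>,
    handle: JoinHandle<()>,
}

/// Hypothetical stand-in for a timeline that is still being created.
struct UninitTimeline {
    flush_loop: Option<FlushLoop>,
    finished: bool,
    /// Set by `Drop` when local files would be removed (observable in tests).
    cleaned_up: Arc<AtomicBool>,
}

impl UninitTimeline {
    fn new(cleaned_up: Arc<AtomicBool>) -> Self {
        UninitTimeline { flush_loop: None, finished: false, cleaned_up }
    }

    fn spawn_flush_loop(&mut self) {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let handle = thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
        });
        self.flush_loop = Some(FlushLoop { stop, handle });
    }

    /// Stop the flush loop and wait for it to exit.
    fn shutdown(&mut self) {
        if let Some(flush_loop) = self.flush_loop.take() {
            flush_loop.stop.store(true, Ordering::Release);
            flush_loop.handle.join().expect("flush loop panicked");
        }
    }

    /// Spawn the flush loop, run the caller's writes, and on failure shut the
    /// loop down before `self` is dropped (and its files cleaned up).
    fn write<F>(mut self, f: F) -> Result<Self, String>
    where
        F: FnOnce(&Self) -> Result<(), String>,
    {
        self.spawn_flush_loop();
        match f(&self) {
            Ok(()) => Ok(self),
            Err(e) => {
                self.shutdown();
                Err(e) // `self` drops here, after the flush loop has exited
            }
        }
    }

    /// Sketch only: mark creation complete and stop the demo thread.
    fn finish(mut self) {
        self.shutdown();
        self.finished = true;
    }
}

impl Drop for UninitTimeline {
    fn drop(&mut self) {
        if self.finished {
            return; // creation succeeded, keep the files
        }
        if self.flush_loop.is_some() {
            return; // strict check: never clean up under a live writer
        }
        self.cleaned_up.store(true, Ordering::Release);
    }
}

/// Returns (write failed, cleanup ran).
fn run(fail: bool) -> (bool, bool) {
    let cleaned_up = Arc::new(AtomicBool::new(false));
    let result = UninitTimeline::new(Arc::clone(&cleaned_up)).write(|_timeline| {
        if fail { Err("cancelled".to_string()) } else { Ok(()) }
    });
    let failed = result.is_err();
    if let Ok(timeline) = result {
        timeline.finish();
    }
    (failed, cleaned_up.load(Ordering::Acquire))
}
```

Routing all writes through one wrapper means callers cannot forget the shutdown step on an error path, which is the property the strict `drop()` check relies on.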

@jcsp jcsp force-pushed the jcsp/issue-10389-cleanup-race branch 2 times, most recently from ee2a0c2 to d837aa7 Compare January 29, 2025 18:04
@jcsp jcsp force-pushed the jcsp/issue-10389-cleanup-race branch from d837aa7 to f032330 Compare January 29, 2025 19:57
@jcsp jcsp marked this pull request as ready for review January 30, 2025 09:18
@jcsp jcsp added this pull request to the merge queue Jan 30, 2025
Merged via the queue into main with commit 6da7c55 Jan 30, 2025
85 checks passed
@jcsp jcsp deleted the jcsp/issue-10389-cleanup-race branch January 30, 2025 20:34
2 participants