pageserver: fix race cleaning up timeline files when shut down during bootstrap #10532
Yeah, the issue was more pervasive than I realised: once I made the drop() strict about checking for shutdown, other things started failing. Reworked this to enforce the shutdown rules more reliably, by wrapping everything in an UninitializedTimeline.write() helper that spawns the flush loop but also ensures it gets shut down on failures.
Problem
Timeline bootstrap starts a flush loop, but doesn't reliably shut down the timeline (incl. waiting for flush loop to exit) before destroying UninitializedTimeline, and that destructor tries to clean up local storage. If local storage is still being written to, then this is unsound.
Currently the symptom is that we see a "Directory not empty" error log, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/main/12966756686/index.html#testresult/5523f7d15f46f7f7/retries
Summary of changes
- Move fallible IO part of bootstrap into a function (notably, this is fallible in the case of the tenant being shut down while creation is happening)
- When that function returns an error, call shutdown() on the timeline
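The ordering this PR enforces (stop the flush loop, wait for it to exit, only then remove local files) can be sketched in isolation. This is a minimal, synchronous illustration using only the standard library, not the pageserver code: `Timeline`, `bootstrap`, `import_data`, and the `layer-N` file names are hypothetical stand-ins, and the real implementation is async and structured differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a timeline that owns a background flush loop.
struct Timeline {
    dir: PathBuf,
    tx: Option<Sender<Vec<u8>>>,
    flush_loop: Option<JoinHandle<()>>,
}

impl Timeline {
    /// Create the local directory and spawn the flush loop.
    fn create(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        let (tx, rx) = mpsc::channel::<Vec<u8>>();
        let loop_dir = dir.to_path_buf();
        let flush_loop = thread::spawn(move || {
            // Keeps writing files until every sender has been dropped.
            for (i, buf) in rx.iter().enumerate() {
                let _ = fs::write(loop_dir.join(format!("layer-{i}")), buf);
            }
        });
        Ok(Timeline {
            dir: dir.to_path_buf(),
            tx: Some(tx),
            flush_loop: Some(flush_loop),
        })
    }

    /// Queue data for the flush loop to write.
    fn put(&self, data: &[u8]) {
        if let Some(tx) = &self.tx {
            let _ = tx.send(data.to_vec());
        }
    }

    /// Stop the flush loop and wait for it to exit.
    fn shutdown(&mut self) {
        drop(self.tx.take()); // closing the channel ends the loop
        if let Some(handle) = self.flush_loop.take() {
            let _ = handle.join();
        }
    }
}

/// The fallible IO part, kept in its own function (assumed shape).
fn import_data(tl: &Timeline, fail: bool) -> io::Result<()> {
    tl.put(b"initial data");
    if fail {
        return Err(io::Error::new(io::ErrorKind::Other, "tenant shutting down"));
    }
    Ok(())
}

/// On error, shut the timeline down *before* cleaning up its files.
fn bootstrap(dir: &Path, fail: bool) -> io::Result<Timeline> {
    let mut tl = Timeline::create(dir)?;
    match import_data(&tl, fail) {
        Ok(()) => Ok(tl),
        Err(e) => {
            tl.shutdown(); // no writer can race with the removal below
            fs::remove_dir_all(&tl.dir)?;
            Err(e)
        }
    }
}

fn main() {
    let dir = std::env::temp_dir().join(format!("bootstrap-sketch-{}", std::process::id()));
    assert!(bootstrap(&dir, true).is_err());
    assert!(!dir.exists());
}
```

If the `tl.shutdown()` call in the error branch were skipped, the flush thread could create a file while `remove_dir_all` is running, which is the kind of race that would surface as a "Directory not empty" error.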