Implementation of a Primary/Archive setup for object storage. #13397
Not an exhaustive review, by any means, but looks like a reasonable approach to start with.
There's a little bit of risk of disk exhaustion if too many tasks run concurrently on a worker and download too many files before they can be closed out and removed, but I guess we'll have other alarms go off if that happens. We have disk alarms, right?
I don't see the "reconciliation" task in here yet - are you imagining this automated, or potentially an Admin page button to help with the occasional manual overrides?
In regular operation it's not really any more concerning than our current exposure on the upload nodes (which write the uploaded file out to temporary storage) unless we're in a "catchup" scenario.
I'm imagining a weekly (ish) automated run with reporting via metrics.
I hope to get that done rapidly
OK, this is ready for full review. The metric from https://github.com/pypi/warehouse/pull/13397/files#diff-d1bab9cf44a4f8cead347679b4d5efc2949fe9a7cbaf97a381e26d90c08dff2dR47-R58 combined with the logs from pypi/infra@2b101c5...6c0e47f should be enough to give us visibility into failover, at least to start. I plan to postpone implementing the automatic reconciliation task for the time being, as I am 1) exhausted, 2) out of time for this, and 3) interested to see what failure modes arise before trying to imagine them.
Wow, this is looking awesome!
I had one non-blocking question - if it works, it works! 😁
This implements a Primary/Archive setup for our object storage.
Files are uploaded synchronously to the Primary object storage (with Backblaze B2 as the target) and then copied to the Archive object storage by a background task.
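As a rough illustration of that flow (not the code in this PR), here is a minimal sketch using in-memory stand-ins. `InMemoryStore`, `FileRecord`, `upload`, and `archive_file` are hypothetical names made up for the example; only the `archived` flag corresponds to something named in this description.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for an object storage bucket."""

    objects: dict = field(default_factory=dict)

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


@dataclass
class FileRecord:
    """Hypothetical stand-in for a database row; only `archived` mirrors the PR."""

    path: str
    archived: bool = False


def upload(primary, task_queue, record, data):
    # Synchronous step: the file is written to Primary during the upload itself.
    primary.put(record.path, data)
    # Archival is deferred; here a plain list plays the role of the task queue.
    task_queue.append(record)


def archive_file(primary, archive, record):
    # Background step: copy the object from Primary to Archive, then mark the row.
    archive.put(record.path, primary.get(record.path))
    record.archived = True
```

The flag is only flipped after the copy to Archive succeeds, so a failed or dropped task leaves the row visibly unarchived.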
A task is implemented here which reports the number of files marked as having not been archived yet, by way of the new database column File.archived.
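For readers who want a concrete picture of such a reporting task, the sketch below counts unarchived rows and hands the number to a gauge callback. It uses `sqlite3` purely so the example is self-contained; the table name `files`, the metric name `files.not_archived`, and the `gauge` callback are assumptions for illustration and are not taken from the PR.

```python
import sqlite3


def report_unarchived_files(conn, gauge):
    """Count rows not yet archived and report the count as a gauge metric."""
    # Assumed schema: a `files` table with a boolean `archived` column.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM files WHERE archived = 0"
    ).fetchone()
    # `gauge` is a hypothetical callable taking (metric_name, value).
    gauge("files.not_archived", count)
    return count


def make_demo_db():
    """Build a small in-memory database to exercise the task."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (path TEXT, archived BOOLEAN NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO files (path, archived) VALUES (?, ?)",
        [("a.whl", 0), ("b.whl", 1), ("c.tar.gz", 0)],
    )
    return conn
```

A gauge that stays above zero for long would suggest archive tasks are failing or being dropped.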
Additionally, our CDN configuration is updated in pypi/infra@2b101c5...6c0e47f to report a log event to our metrics provider whenever a file is fetched from Archive storage rather than Primary. This will be useful to determine whether we missed any files in migration, and in the long run to catch missed archive tasks.
One notable piece missing from the design is a task to automatically reconcile the state between the two buckets. I plan to use the visibility from the task and logs to design something that handles real failure modes, rather than trying to preempt them.