Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of a Primary/Archive setup for object storage. #13397

Merged
merged 14 commits into from
Apr 11, 2023

Conversation

ewdurbin
Copy link
Member

@ewdurbin ewdurbin commented Apr 9, 2023

This implements a Primary/Archive setup for our object storage.

Uploads are synchronously uploaded to the Primary object storage (namely Backblaze B2 as target) and then archived to Archive object store via task.

A task is implemented here which reports the number of files marked as having not been archived yet, by way of the new database column File.archived.

Additionally our CDN configuration is updated in pypi/infra@2b101c5...6c0e47f to report a log event to our metrics provider whenever a file is fetched from Archival storage rather than Primary. This will be useful to determine if we missed any files in migration, but also in the long run to catch missed archive tasks.

One notable missing piece from design is a task to automatically reconcile the state between the two buckets. I plan to use the visibility from the task and logs to design something that handles real failure modes, rather than trying to preempt.

Copy link
Member

@miketheman miketheman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an exhaustive review, by any means, but looks like a reasonable approach to start with.

There's a little bit of risk for disk exhaustion of too many tasks run concurrently on a worker and download too many files before they can be closed out and removed, but I guess we'll have other alarms go off if that happens. We have disk alarms, right?

I don't see the "reconciliation" task in here yet - are you imagining this automated, or potentially an Admin page button to help with the occasional manual overrides?

docker-compose.yml Show resolved Hide resolved
requirements/main.in Show resolved Hide resolved
warehouse/packaging/models.py Outdated Show resolved Hide resolved
warehouse/packaging/tasks.py Show resolved Hide resolved
@ewdurbin
Copy link
Member Author

ewdurbin commented Apr 9, 2023

There's a little bit of risk for disk exhaustion of too many tasks run concurrently on a worker and download too many files before they can be closed out and removed, but I guess we'll have other alarms go off if that happens. We have disk alarms, right?

In regular operation it's not really any more concerning than our current exposure on the upload nodes (which write the uploaded file out to temporary storage) unless we're in a "catchup" scenario.

I don't see the "reconciliation" task in here yet - are you imagining this automated, or potentially an Admin page button to help with the occasional manual overrides?

I'm imagining a weekly (ish) automated run with reporting via metrics.

@ewdurbin ewdurbin force-pushed the primary_archive_storage branch from f842eed to fef6dd2 Compare April 9, 2023 16:45
@ppolewicz
Copy link

ppolewicz commented Apr 10, 2023 via email

@ewdurbin ewdurbin force-pushed the primary_archive_storage branch from 1f9c8f7 to 83677a7 Compare April 10, 2023 22:32
@ewdurbin ewdurbin force-pushed the primary_archive_storage branch 2 times, most recently from 2d6c458 to fc56c9f Compare April 10, 2023 22:47
@ewdurbin ewdurbin force-pushed the primary_archive_storage branch from fc56c9f to 2228129 Compare April 10, 2023 22:51
@ewdurbin ewdurbin marked this pull request as ready for review April 10, 2023 23:05
@ewdurbin ewdurbin requested a review from a team as a code owner April 10, 2023 23:05
@ewdurbin
Copy link
Member Author

OK, this is ready for full review.

The metric from https://github.com/pypi/warehouse/pull/13397/files#diff-d1bab9cf44a4f8cead347679b4d5efc2949fe9a7cbaf97a381e26d90c08dff2dR47-R58 combined with the logs from pypi/infra@2b101c5...6c0e47f should be enough to give us visibility into failover at least to start.

I plan to postpone implementation of the auto reconciliation task for the time being as I am 1) exhausted 2) out of time for this 3) interested to see what failure modes arise before trying to imagine them.

@ewdurbin ewdurbin changed the title initial working primary/archive setup Implementation of a Primary/Archive setup for object storage. Apr 10, 2023
@ewdurbin ewdurbin force-pushed the primary_archive_storage branch from 70226fc to 906a1e7 Compare April 11, 2023 10:17
pyproject.toml Outdated Show resolved Hide resolved
Copy link
Member

@miketheman miketheman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is looking awesome!

I had one non-blocking question - if it works, it works! 😁

tests/unit/packaging/test_services.py Show resolved Hide resolved
@ewdurbin ewdurbin merged commit 629e546 into main Apr 11, 2023
@ewdurbin ewdurbin deleted the primary_archive_storage branch April 11, 2023 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants