Currently, our replication audit code compares the checksum we have stored for an archive part with the checksum the S3 provider stores in its metadata for the zip part (see `PreservationCatalog::S3::Audit#compare_checksum_metadata`).
However, the checksum stored in AWS metadata is just the one we computed and provided at upload time, so we're only checking that the metadata hasn't drifted between the two sources. This check is cheap, since we're already reaching out to AWS to confirm the archived part is still available from the cloud as expected, but it's also not an especially meaningful check to have pass.
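For reference, a minimal sketch of that kind of metadata comparison using aws-sdk-s3. The bucket name, key, region, `stored_md5`, and the `checksum_md5` metadata key are all illustrative assumptions here, not necessarily what preservation_catalog actually uses:

```ruby
require 'aws-sdk-s3'

# hypothetical stand-ins for values that would come from our catalog records
bucket = 'example-archive-bucket'
zip_part_s3_key = 'bj102hs9687.v0001.zip'
stored_md5 = 'md5-from-our-zip_parts-record'

s3 = Aws::S3::Client.new(region: 'us-west-2')

# HEAD request: cheap, no object download
head = s3.head_object(bucket: bucket, key: zip_part_s3_key)
replicated_md5 = head.metadata['checksum_md5'] # assumed metadata key

# this only proves the two copies of the checksum *string* still agree;
# it says nothing about the bytes of the archived zip itself
unless replicated_md5 == stored_md5
  # flag the part for investigation
end
```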
More meaningful would be random spot checks of archive contents for fixity: every so often, pull down a randomly selected archived copy, make sure the checksums we recompute for the retrieved parts match the checksums we have stored, and make sure the internal checksums all match the content in the Moab when the zip parts are put back together and re-inflated. We don't want to do that for every zip during regular replication auditing, because that would be expensive, and overkill.
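A spot check along those lines might look something like the sketch below: a streaming download plus MD5 recomputation for one part, with the reassembly/inflation and Moab validation steps only gestured at in comments. All names are again illustrative assumptions:

```ruby
require 'aws-sdk-s3'
require 'digest'

bucket = 'example-archive-bucket'
zip_part_s3_key = 'bj102hs9687.v0001.zip' # illustrative
stored_md5 = 'md5-from-our-zip_parts-record'

s3 = Aws::S3::Client.new(region: 'us-west-2')
digest = Digest::MD5.new

# stream the object in chunks so a large zip part never has to fit in memory
s3.get_object(bucket: bucket, key: zip_part_s3_key) do |chunk|
  digest.update(chunk)
end

raise "fixity failure for #{zip_part_s3_key}" unless digest.hexdigest == stored_md5

# a full check would then fetch all parts for the version, reassemble them
# (e.g. `cat foo.z01 foo.z02 foo.zip > whole.zip`, or `zip -s 0`), inflate,
# and verify the unzipped content against the Moab's own manifest checksums
```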
But some occasional retrieval of content and recomputation of checksums would provide extra peace of mind that our replication strategy is working and that the cloud archives will be usable if needed.
ndushay changed the title from "replication audit: perform spot checks of cloud archives for fixity" to "replication audit: perform spot checks of cloud archives for entire moab fixity" on Dec 2, 2019.
Also worth considering: a Fargate task that computes the checksums within AWS infrastructure, so the verification doesn't incur egress charges.
Either way, good to keep on the radar, but likely out of scope for the 2022 maintenance work, which is more about making what's there more maintainable.