Checksum Verification
The fastest way to verify that all the files from Scholarsphere 3 were correctly migrated is to compare the etag calculated by Amazon's S3 service with the original md5 checksum that was calculated by Fits when the file was added to Scholarsphere 3.
While the etag verification process can be used for other purposes, it will not be the only checksum verification method in Scholarsphere. We will need to create separate checksums, such as sha256, and store those with the file's metadata for future reference.
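As a minimal sketch of the separate checksum mentioned above, a SHA-256 digest could be calculated with Ruby's standard `digest` library. The `sha256_for` helper name is hypothetical, and storing the value with the file's metadata is left out because it depends on the Scholarsphere data model.

```ruby
require 'digest'

# Hypothetical helper (not part of Scholarsphere): returns the hex-encoded
# SHA-256 digest of the file at +path+. Digest::SHA256.file streams the file,
# so large files are not loaded into memory all at once.
def sha256_for(path)
  Digest::SHA256.file(path).hexdigest
end
```

The result is a 64-character hexadecimal string that could be saved alongside the existing md5.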
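# Flag FileSets whose Fedora object is gone (Ldp::Gone) so that later steps,
# which only select rows where exception IS NULL, skip them.
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    FileSet.find(resource.pid)
  rescue Ldp::Gone
    resource.update(exception: 'Ldp::Gone', error: nil)
  end
end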
These are files that were uploaded via Box but were never ingested due to timeout issues.
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    resource.update(exception: 'ArgumentError', error: 'original_file is nil') if FileSet.find(resource.pid).original_file.nil?
  rescue StandardError => e
    puts "Failed to update #{resource.pid}: #{e.message}"
  end
end
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  FileSet.find(resource.pid).original_checksum
end
require 'net/http'
require 'json'
class EtagError < StandardError; end
# Ask the Scholarsphere 4 GraphQL API for the file's etag and record the raw response.
def update_file_set(resource)
  url = URI("https://scholarsphere.psu.edu/api/public")
  https = Net::HTTP.new(url.host, url.port)
  https.use_ssl = true
  request = Net::HTTP::Post.new(url)
  request["x-api-key"] = ENV['SS4_API_KEY']
  request["Content-Type"] = "application/json"
  # Build the JSON body with to_json instead of hand-escaping the query string.
  request.body = { query: "{ file(pid: \"#{resource.pid}\") { etag } }", variables: {} }.to_json
  result = https.request(request)
  resource.update(client_status: result.code, client_message: result.read_body)
rescue StandardError => e
  # Store the failure in the same error column used by the earlier steps.
  resource.update(exception: "EtagError", error: e.message)
end
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL and client_status IS NULL').each do |resource|
  update_file_set(resource)
end
See https://teppen.io/2018/06/23/aws_s3_etags/ for background on how S3 calculates etags.
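For a single-part upload, the etag is normally the file's md5 as 32 hexadecimal characters. For a multipart upload, it is a digest followed by a hyphen and the number of parts. The sketch below shows one way to tell the two shapes apart. The `etag_kind` helper is hypothetical, and it only inspects the string; objects encrypted with SSE-KMS or SSE-C can have a 32-character etag that is not an md5.

```ruby
# Hypothetical helper: classifies an S3 etag by its shape.
# Returns :md5, :multipart, or :unknown.
def etag_kind(etag)
  # S3 often returns the etag wrapped in double quotes, so strip them first.
  value = etag.to_s.delete('"')
  case value
  when /\A[0-9a-f]{32}\z/i then :md5
  when /\A[0-9a-f]{32}-\d+\z/i then :multipart
  else :unknown
  end
end
```

Only etags classified as `:md5` can be compared directly with an md5 checksum.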
# Compare the md5 recorded in Scholarsphere 3 with the etag from Scholarsphere 4.
def check_md5(pid:, etag:)
  file_set = FileSet.find(pid)
  if file_set.original_checksum.first == etag
    "passed"
  else
    "failed"
  end
rescue StandardError
  "unknown"
end
report = {}
Scholarsphere::Migration::Resource.where(model: 'FileSet').each do |resource|
  if resource.client_status == "200"
    etag = resource.message.dig('data', 'file', 'etag')
    if etag.nil?
      report[resource.pid] = "Etag is missing. It's been removed from Scholarsphere 4"
    elsif etag.include?('-')
      # A hyphen marks a multipart etag, which is not a plain md5.
      report[resource.pid] = "Skipped"
    else
      report[resource.pid] = check_md5(pid: resource.pid, etag: etag)
    end
  else
    report[resource.pid] = "#{resource.exception}: #{resource.error}"
  end
end
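Once the report hash is populated, a quick tally per status shows how many files still need attention. This is a minimal sketch; the `summarize` helper and the sample pids are made up for illustration.

```ruby
# Hypothetical helper: counts how many pids ended up with each status.
def summarize(report)
  report.values.each_with_object(Hash.new(0)) { |status, counts| counts[status] += 1 }
end

# Made-up pids, for illustration only.
sample = { 'abc123' => 'passed', 'def456' => 'failed', 'ghi789' => 'passed' }
summary = summarize(sample)
summary.each { |status, count| puts "#{status}: #{count}" }
```

With the sample above, "passed" is counted twice and "failed" once.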
When the md5 checksum from the Fits report does not match the etag in Scholarsphere 4, we recalculate the checksum from the existing file in Scholarsphere 3. The mismatch most likely occurs because a newer version of the file was uploaded but Fits was never rerun or updated.
require 'digest'
report.select { |_pid, status| status == "failed" }.each_key do |pid|
  resource = Scholarsphere::Migration::Resource.find_by(pid: pid)
  etag = resource.message.dig('data', 'file', 'etag')
  file_set = FileSet.find(pid)
  location = FileSetDiskLocation.new(file_set)
  # Stream the file through MD5 rather than reading it all into memory.
  md5 = Digest::MD5.file(location.path).hexdigest
  if etag == md5
    report[resource.pid] = "passed"
  else
    report[resource.pid] = "failed"
  end
end
In some cases the etag is not the file's actual md5 but a custom etag that S3 creates for multipart uploads. For these files, we calculate the custom etag locally and compare it to the original from Amazon.
require 'open3'
failed_pids = report.select { |_pid, status| status == "failed" }.keys
total = failed_pids.count
failed_pids.each.with_index(1) do |pid, counter|
  resource = Scholarsphere::Migration::Resource.find_by(pid: pid)
  etag = resource.message.dig('data', 'file', 'etag')
  file_set = FileSet.find(pid)
  location = FileSetDiskLocation.new(file_set)
  print "Checking #{counter}/#{total}..."
  # Pass the arguments as a list so the path is never interpreted by a shell.
  _stdout, _stderr, status = Open3.capture3("./s3md5", "--etag", etag, "50", location.path)
  # A zero exit status from s3md5 is treated as a match.
  if status.success?
    report[resource.pid] = "passed"
  else
    report[resource.pid] = "failed"
  end
  puts "done!"
end
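The check above relies on the external s3md5 script. As a rough illustration of what a multipart etag calculation involves, the sketch below takes the md5 of each part, concatenates the raw digests, takes the md5 of that, and appends the part count. The `multipart_etag` helper is hypothetical. It assumes the size argument is in MiB and that every part except the last had the same size; if the upload used a different part size, the result will not match.

```ruby
require 'digest'

# Hypothetical helper: computes the etag S3 would report for a multipart upload
# of the file at +path+, assuming fixed-size parts of +part_size_mb+ MiB.
def multipart_etag(path, part_size_mb)
  part_size = part_size_mb * 1024 * 1024
  part_digests = []
  File.open(path, 'rb') do |file|
    # read returns nil at end of file, which ends the loop.
    while (chunk = file.read(part_size))
      part_digests << Digest::MD5.digest(chunk) # raw 16-byte digest per part
    end
  end
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end
```

A call such as `multipart_etag(location.path, 50)` would then be compared with the etag returned by Scholarsphere 4.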