
Checksum Verification

Adam Wead edited this page Apr 27, 2021 · 17 revisions

Strategy

The fastest way to verify that all the files from Scholarsphere 3 were correctly migrated is to compare the etag calculated by Amazon's S3 service with the original md5 checksum that was calculated by Fits when the file was added to Scholarsphere 3.
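
As a minimal sketch of that comparison: for an object uploaded in a single part (and not encrypted with KMS), S3's etag is the hex MD5 digest of the content, usually wrapped in double quotes, so the check reduces to a normalized string comparison. The helper name `etag_matches_md5?` is hypothetical and not part of Scholarsphere.

```ruby
require 'digest'

# Hypothetical helper: true when an S3 etag equals the given MD5 hex digest.
# Only meaningful for single-part uploads; multipart etags contain a dash
# and are never a plain MD5 of the content, so they return false here.
def etag_matches_md5?(etag, md5)
  normalized = etag.to_s.delete('"').downcase
  return false if normalized.include?('-')

  normalized == md5.to_s.downcase
end

# Example: the etag S3 would report for a small single-part object.
md5 = Digest::MD5.hexdigest('hello world')
puts etag_matches_md5?("\"#{md5}\"", md5)
```

A `false` result for an etag containing a dash does not mean the file is corrupt, only that it cannot be verified this way.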

While the etag verification process can be used for other purposes, it will not be the only checksum verification method in Scholarsphere. We will need to create separate checksums, such as sha256, and store those with the file's metadata for future reference.
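
A separate sha256 checksum could be produced with Ruby's standard `Digest` library, as sketched below. The helper name `sha256_for` is hypothetical, and where the digest gets stored in the file's metadata is left open.

```ruby
require 'digest'

# Hypothetical helper: returns the SHA-256 hex digest of a file on disk.
# Digest::SHA256.file streams the file rather than loading it all at once.
def sha256_for(path)
  Digest::SHA256.file(path).hexdigest
end

# Example: checksum of this script itself.
puts sha256_for(__FILE__)
```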

Getting Fits Checksums

Find Deleted Files

Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    FileSet.find(resource.pid)
  rescue Ldp::Gone
    resource.update(exception: 'Ldp::Gone', error: nil)
  end
end

Find Blank Files

These are files that were uploaded via Box but were never ingested due to timeout issues.

Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  begin
    resource.update(exception: 'ArgumentError', error: 'original_file is nil') if FileSet.find(resource.pid).original_file.nil?
  rescue StandardError => e
    puts "Failed to update #{resource.pid}: #{e.message}"
  end
end

MD5 Checksums

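# Print the MD5 checksum that Fits recorded for each file set's original file.
Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL').each do |resource|
  puts "#{resource.pid}: #{FileSet.find(resource.pid).original_checksum}"
end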

Get Etag

require 'net/http'
require 'json'

class EtagError < StandardError; end

# Ask the public API for the file's etag and record the raw response on the resource.
def update_file_set(resource)
  url = URI('https://scholarsphere.psu.edu/api/public')

  https = Net::HTTP.new(url.host, url.port)
  https.use_ssl = true

  request = Net::HTTP::Post.new(url)
  request['x-api-key'] = ENV['SS4_API_KEY']
  request['Content-Type'] = 'application/json'
  query = "{\n  file(pid: \"#{resource.pid}\") {\n    etag\n  }\n}\n"
  request.body = { query: query, variables: {} }.to_json

  result = https.request(request)

  resource.update(client_status: result.code, client_message: result.read_body)
rescue StandardError => e
  # Store the failure in the same exception/error columns used by the steps above.
  resource.update(exception: 'EtagError', error: e.message)
end

Scholarsphere::Migration::Resource.where(model: 'FileSet').where('exception IS NULL and client_status IS NULL').each do |resource|
  update_file_set(resource)
end

Compare Checksums

See https://teppen.io/2018/06/23/aws_s3_etags/

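checked = [] # pids whose etag contains a dash, i.e. the file was a multipart upload
missing = [] # pids whose API response did not include an etag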

Scholarsphere::Migration::Resource.where(model: 'FileSet').where(client_status: "200").each do |resource|
  etag = resource.message.dig('data', 'file', 'etag')

  if etag
    checked << resource.pid if etag.match?('-')
  else
    missing << resource.pid
  end
end
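
Etags that contain a dash cannot be compared to the Fits MD5 directly. The sketch below follows the scheme described in the article linked above: take the MD5 of each part, concatenate the raw binary digests, take the MD5 of that, and append the part count. This is observed behavior rather than a documented guarantee, and it only reproduces the etag if `part_size` matches the part size used for the original upload, which is an assumption here. The helper name `multipart_etag` is hypothetical.

```ruby
require 'digest'
require 'stringio'

# Hypothetical helper: recomputes a multipart-style etag for an IO object.
# part_size is in bytes and must equal the part size of the original upload.
def multipart_etag(io, part_size)
  digests = []
  # IO#read(length) returns nil at end of file, which ends the loop.
  while (chunk = io.read(part_size))
    digests << Digest::MD5.digest(chunk) # raw 16-byte digest, not hex
  end
  "#{Digest::MD5.hexdigest(digests.join)}-#{digests.length}"
end

# Example: ten bytes split into parts of four bytes gives three parts.
puts multipart_etag(StringIO.new('a' * 10), 4)
```

Given the original file and the correct part size, the result can be compared with the etag of each pid collected in `checked`.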