This repository has been archived by the owner on Nov 18, 2020. It is now read-only.

Fixity in Sufia with Fedora 4

Hector edited this page Aug 22, 2014 · 7 revisions

Fedora 4 provides Fixity checks out of the box. Below is a description of the steps to update Sufia's implementation of Fixity checks to use Fedora 4's built-in mechanism.

Current Implementation

In Sufia we use a queue to schedule Fixity checks on documents. Queueing and execution are both handled by Resque.

The code to queue a job is in sufia-models/app/models/concerns/sufia/generic_file/audit.rb, and it is (conditionally?) executed when users view a file:

def audit.audit(version, force = false)
	Sufia.queue.push(AuditJob.new(version.pid, version.dsid, version.versionID))
end

A background worker picks jobs from the queue and executes them. This code is in sufia-models/app/jobs/audit_job.rb:

def ActiveJob.run
	datastream = generic_file.datastreams[datastream_id]
	version = datastream.versions.find { |v| v.versionID == version_id }
	log = run_audit(version)
end

def run_audit(version)
	object.class.run_audit(version)
end

The call to object.class.run_audit(version) goes back to sufia-models/app/models/concerns/sufia/generic_file/audit.rb to perform the actual Fixity check via dsChecksumValid:

def audit.run_audit(version)
	if version.dsChecksumValid
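		# ... (elided) success branch; sets `passing` to record a valid checksum
	else
		# ... (elided) failure branch; sets `passing` to record an invalid checksum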
	end
	check = ChecksumAuditLog.create!(pass: passing, pid: version.pid, dsid: version.dsid, version: version.versionID)
	check
end

The call to version.dsChecksumValid in turn goes to Rubydora (https://github.com/projecthydra/rubydora/blob/master/lib/rubydora/datastream.rb#L54):

def dsChecksumValid
    profile(:validateChecksum=>true)['dsChecksumValid']
end

The profile method is also defined in Rubydora's Datastream class (https://github.com/projecthydra/rubydora/blob/master/lib/rubydora/datastream.rb#L256-268):

# Retrieve the datastream profile as a hash (and cache it)
# @param opts [Hash] :validateChecksum if you want fedora to validate the checksum
# @return [Hash] see Fedora #getDatastream documentation for keys
def profile opts= {}
    if @profile && !(opts[:validateChecksum] && !@profile.has_key?('dsChecksumValid'))
        ## Force a recheck of the profile if they've passed :validateChecksum and we don't have dsChecksumValid
        return @profile
    end
    return @profile = {} unless digital_object.respond_to? :repository
    @profile = repository.datastream_profile(pid, dsid, opts[:validateChecksum], asOfDateTime)
end

The following quote describes how the checksumming process works in Fedora 2.2 and seems to be relevant for Fedora 3:

When automatic checksumming is enabled, whenever a object is ingested into Fedora, as each datastream is processed, all of the bytes comprising the content of the datastream are passed to the appropriate checksumming algorithm. This algorithm will compute and return a digital signature for the content of the datastream. [...] These computed datastream checksums will then be stored in the XML representation of the digital object. Additionally, whenever a new datastream is added to an existing object (via addDatastream), and whenever a existing datastream is modified (via modifyDatastreamByValue or modifyDatastreambyReference) a new checksum will be computed and stored in the object. http://www.fedora.info/download/2.2/userdocs/server/features/checksumming.html

TODO: I am not sure yet how dsChecksumValid in Rubydora (Fedora 3?) is different from the new Fixity endpoint in Fedora 4.

Fixity in Fedora 4

Fedora 4 provides a built-in mechanism to perform Fixity checks.

If your repository is configured to retain multiple copies of binary content, when you request a fixity check of that content, Fedora will run fixity checks against each copy it stores. It will also "self-heal" all copies of the content, if it has a good copy available. https://wiki.duraspace.org/display/FF/Durability

We can execute these checks via calls to the HTTP API, as described on this page https://wiki.duraspace.org/display/FF/RESTful+HTTP+API#RESTfulHTTPAPI-Fixity. The basic call is an HTTP GET request to the path of a document plus the "/fcr:fixity" suffix, for example:

HTTP GET http://someserver/rest/path/to/document/fcr:fixity
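
From Ruby, the request above could be issued roughly as follows. This is a minimal sketch using only the standard library; the helper names fixity_uri and request_fixity are made up for illustration and are not part of Sufia, ActiveFedora, or Fedora.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: appends the fcr:fixity suffix to a document path,
# tolerating stray leading or trailing slashes.
def fixity_uri(base_url, path)
  segments = [base_url.chomp('/'), path.gsub(%r{\A/+|/+\z}, ''), 'fcr:fixity']
  URI.parse(segments.join('/'))
end

# Hypothetical helper: performs the GET and returns [status_code, body].
# It is only defined here, not called, because it needs a running Fedora 4.
def request_fixity(base_url, path)
  response = Net::HTTP.get_response(fixity_uri(base_url, path))
  [response.code.to_i, response.body]
end
```

With a Fedora 4 instance available, request_fixity('http://someserver/rest', 'path/to/document') would hit the URL shown above.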

The response will include the result of the check, either SUCCESS or BAD_CHECKSUM. (Question: are there kinds of failure other than BAD_CHECKSUM?)

Below is a partial XML+RDF response for a successful Fixity check. Notice the SUCCESS text in the status element.

<rdf:Description rdf:about="http://localhost:8080/rest/good_datastream#fixity/1408035440423">
	<status xmlns="http://fedora.info/definitions/v4/repository#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">SUCCESS</status>
	<hasMessageDigest xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:resource="urn:sha1:fef212288b3c2423ab2c43a39f73031de6c0b057"/>
	<hasSize xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:datatype="http://www.w3.org/2001/XMLSchema#int">38</hasSize>
</rdf:Description>

Below is a partial XML+RDF response for a Fixity check that found errors. Notice the BAD_CHECKSUM text in the status element.

<rdf:Description rdf:about="http://localhost:8080/rest/bad_datastream#fixity/1408035437845">
	<status xmlns="http://fedora.info/definitions/v4/repository#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BAD_CHECKSUM</status>
	<hasMessageDigest xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:resource="urn:sha1:1ad61d5f1cdacac66c10051ab55c37130aa849c1"/>
	<hasSize xmlns="http://www.loc.gov/premis/rdf/v1#" rdf:datatype="http://www.w3.org/2001/XMLSchema#int">27</hasSize>
</rdf:Description>
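
To read the status out of a response like the ones above, something along these lines might work. It is a sketch that assumes the body is RDF/XML and uses REXML; the method name fixity_statuses is hypothetical, and the sample is the failed response trimmed down and wrapped in an rdf:RDF element so that the rdf: prefix is declared.

```ruby
require 'rexml/document'

FEDORA_NS = 'http://fedora.info/definitions/v4/repository#'

# Returns every status value (in the Fedora namespace) found in the response.
# An array is returned because a response may describe more than one result.
def fixity_statuses(rdf_xml)
  doc = REXML::Document.new(rdf_xml)
  REXML::XPath.match(doc, '//f:status', { 'f' => FEDORA_NS }).map { |el| el.text.strip }
end

# Trimmed sample based on the failed check shown above.
SAMPLE = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://localhost:8080/rest/bad_datastream#fixity/1408035437845">
      <status xmlns="http://fedora.info/definitions/v4/repository#" rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BAD_CHECKSUM</status>
    </rdf:Description>
  </rdf:RDF>
XML

p fixity_statuses(SAMPLE)
```

Matching on the namespace URI rather than on the bare element name avoids picking up an unrelated status element from another vocabulary.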

The API returns HTTP 200 status code for both successful and failed Fixity checks.

If the path/to/document in the HTTP GET points to a nonexistent document, the response will have an HTTP 404 status code and the body of the response will be an HTML page (which we probably want to ignore).

You can only execute Fixity checks on "datastream" nodes. Attempting to execute a Fixity check on an "object" node will result in an HTTP 404 response. This shouldn't be a problem for us, since we should only need to execute Fixity checks on datastreams. It would be nicer if Fedora returned HTTP 400 (Bad Request) or 415 (Unsupported Media Type) in these instances, though.
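
Because HTTP 200 covers both passing and failing checks, and HTTP 404 covers both missing documents and object nodes, the caller has to combine the status code with the body. A possible mapping is sketched below. The method name and the outcome symbols are invented for illustration, and treating a check as passed only when every reported status is SUCCESS is an assumption about responses that describe multiple copies.

```ruby
# Maps an HTTP status code plus the status strings taken from the response
# body to a single outcome symbol. The symbols are illustrative, not Fedora's.
def fixity_outcome(http_code, statuses)
  case http_code
  when 200
    return :unknown if statuses.empty? # 200 but no status element found
    statuses.all? { |s| s == 'SUCCESS' } ? :passed : :failed
  when 404
    :not_checkable # missing document, or an object node rather than a datastream
  else
    :unknown
  end
end
```

For example, fixity_outcome(200, ['BAD_CHECKSUM']) yields :failed, while fixity_outcome(404, []) yields :not_checkable without looking at the HTML body.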

A word of caution: on Fedora version 4.0.0-beta-01, executing the "check fixity" option in the REST API web page does not seem to report the result of the check, even though the API supports the functionality.

According to the documentation, Fedora 4 will include a fixity queueing and reporting service, but as of August 2014 the current implementation has been marked as outdated, since it was last changed a year ago. https://github.com/fcrepo4-labs/fcrepo-fixity

New Implementation

If we are going to stop using Rubydora once we migrate to Fedora 4, we will need to replace the call to version.dsChecksumValid with a call to our new mechanism for kicking off the Fixity check in Fedora 4. I am not sure whether this new mechanism will go into ActiveFedora, Sufia, or somewhere else.

We might still need to leave the Fixity checks as background jobs since, according to the Fedora documentation:

Checking fixity requires retrieving the content from the binary store and may take some time. https://wiki.duraspace.org/display/FF/RESTful+HTTP+API#RESTfulHTTPAPI-Fixity
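
If the checks do stay in the background, the job could keep the same shape as the current AuditJob: an object created when the check is requested, with a run method that the worker calls later. The sketch below is hypothetical throughout. FixityJob, its checker argument, and the Array used in place of the real queue and of ChecksumAuditLog are all stand-ins so that the example runs on its own; a job pushed through a real queue would likely need to carry only serializable data and look up its checker at run time.

```ruby
# Hypothetical job modelled on the AuditJob pattern: built now, run later.
class FixityJob
  attr_reader :uri

  def initialize(uri, checker:, log:)
    @uri = uri
    @checker = checker # callable uri -> true/false; assumed to wrap the fcr:fixity request
    @log = log         # stand-in for ChecksumAuditLog; anything responding to <<
  end

  # Called by the background worker; records and returns the result.
  def run
    passing = @checker.call(uri)
    @log << { uri: uri, pass: passing }
    passing
  end
end

# Plain arrays stand in for the queue and the audit log.
queue = []
audit_log = []
always_ok = ->(_uri) { true }

queue.push(FixityJob.new('http://someserver/rest/path/to/document', checker: always_ok, log: audit_log))
queue.shift.run while queue.any?
```

Injecting the checker keeps the slow HTTP call out of the job's constructor, so creating and queueing a job stays cheap even though running it may take some time.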

In our current Fedora 3 data model we have datastreams for many things besides content (e.g. rights, FITS, Rels-Ext, desc_metadata), and I understand that we currently perform checksum checks on them. If my understanding is correct and the plan is to migrate those other datastreams to Fedora 4 properties, we won't be able to perform Fixity checks on them, as Fixity checks are only available for datastreams. Will this be a problem for us? Do we need Fixity checks for those non-content datastreams?
