Investigating timeouts from large objects (many files)

Objects with many files, and therefore very large cocina representations, can time out during processing. This is because some steps are asynchronous and need to load and pass along very large data structures over API calls, which can exceed client timeouts.

An investigation started when an object with >6k resources (and >24k files) timed out during assemblyWF. See the ticket (https://github.com/sul-dlss/common-accessioning/issues/1005), the object (https://argo.stanford.edu/view/druid:rm242zg2446), and the HB alert (https://app.honeybadger.io/projects/52894/faults/88727290).

The fix was to extend the timeout for dor-services-app calls made via dor-services-client: https://github.com/sul-dlss/dor-services-client/pull/306
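
For context, dor-services-client makes its HTTP calls via Faraday, so extending the timeout amounts to raising the connection options. A minimal sketch of the idea (the URL and the 900-second value here are illustrative, not the values chosen in the PR):

require 'faraday'

# Give large cocina payloads enough time to upload and be processed before
# Faraday gives up. (Hypothetical URL; see the PR above for the real change.)
conn = Faraday.new(url: 'https://dor-services.example.edu') do |f|
  f.options.timeout = 900      # seconds to wait for the whole request/response
  f.options.open_timeout = 60  # seconds to wait for the connection to open
end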

Investigation into the issue involved looking at the HB alert, determining that the timeout occurred when posting data to dor-services-app, and then walking through the code in both the common-accessioning console and the dor-services-app console to see what was taking a long time. This also involved saving large data structures to disk and reloading them on DSA.

Create the checksum data (this is what the checksum-compute robot does; see the code in common-accessioning):

ssh lyberadmin@common-accessioning-prod-b
ROBOT_ENVIRONMENT=production bin/console
druid='druid:rm242zg2446'

robot = Robots::DorRepo::Assembly::ChecksumCompute.new
assembly_item = robot.send(:item, druid)
cocina_model = assembly_item.cocina_model;nil

# this took about 1 hour and produced data to save (which would have then been posted to DSA)
file_sets = robot.send(:compute_checksums, assembly_item, cocina_model);nil

# save data to a file
File.open("../../shared/tmp/#{druid}_file_sets.dump", 'wb') { |f| f.write(Marshal.dump(file_sets)) }

# if you want to keep working with this data later on common-accessioning without re-creating it, you can reload it
# file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump"));nil # load up the data

# this generates the updated cocina that gets posted to DSA, and is what was timing out
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));nil
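
If you need hard numbers on which step dominates, Ruby's stdlib Benchmark can wrap any of the calls above. A sketch (not part of the original session):

require 'benchmark'

# compute_checksums was the roughly one-hour step for this object,
# so expect a long wait while this runs
elapsed = Benchmark.realtime do
  robot.send(:compute_checksums, assembly_item, cocina_model)
end
puts "compute_checksums took #{(elapsed / 60).round(1)} minutes"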

Now move the data to DSA:

ssh [email protected]
cd dor_services/shared/tmp
sftp lyberadmin@common-accessioning-prod-b
cd common-accessioning/shared/tmp
get druid:rm242zg2446_file_sets.dump tmp/druid:rm242zg2446_file_sets.dump
exit
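
If you end up repeating this transfer, it can also be scripted from Ruby; a sketch using the net-sftp gem (an assumption here, it may not be installed on these boxes):

require 'net/sftp'

druid = 'druid:rm242zg2446'

# pull the Marshal dump from the common-accessioning box into the local tmp dir
Net::SFTP.start('common-accessioning-prod-b', 'lyberadmin') do |sftp|
  sftp.download!("common-accessioning/shared/tmp/#{druid}_file_sets.dump",
                 "tmp/#{druid}_file_sets.dump")
end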

Now load up the data in DSA and try the update:

bundle exec rails c -e p

conf.echo = false # prevents echoing of large data structures to the console

druid='druid:rm242zg2446'

file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump")) # load up the data

# this is the first thing DSA does on the API call and can take a while, potentially causing a timeout
cocina_model = CocinaObjectStore.find(druid) 

# this is what the checksum-compute robot creates to post to DSA 
# (i.e. this bit happens in the robot; we are just redoing it on DSA so we can use the data):
# https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L17
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));

# this is done by dor-services-client to prep the data to post to DSA
params = Cocina::Models.without_metadata(updated).to_json
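
# optional sanity check (not in the original session): confirm the serialized
# payload really is huge; `params` is a JSON String here, so this is just
# String#bytesize from Ruby core
puts "payload is #{(params.bytesize / 1024.0 / 1024.0).round(1)} MB"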

# this is what the update action in DSA does (when it gets called from dor-services-client):
# https://github.com/sul-dlss/dor-services-app/blob/main/app/controllers/objects_controller.rb#L33
# (in the controller, `params` is ActionController::Parameters and gets
#  .except(:action, :controller, :id).to_unsafe_h; in the console we have a JSON
#  string, so parse it back into a hash first)
# this line should build the updated object, but just blows up with a ton of output to the console,
# which we have not figured out how to suppress...suspect it is from a conversion from Struct to Hash under the hood somewhere
cocina_object = Cocina::Models.build(JSON.parse(params))

# save the object (we could never actually get here because of all the output above)
UpdateObjectService.update(cocina_object, skip_lock: true);
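
One untested idea for suppressing the console flood: temporarily point $stdout at the null device while the noisy call runs (a sketch; it only helps if the output is actually written to $stdout rather than, say, $stderr):

orig_stdout = $stdout
$stdout = File.open(File::NULL, 'w')
begin
  cocina_object = Cocina::Models.build(JSON.parse(params))
ensure
  $stdout.close
  $stdout = orig_stdout
end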