Investigating timeouts from large objects (many files)

Objects with many files, and therefore very large cocina representations, can time out during processing. This is because some steps are asynchronous and need to load and pass along very large data structures over API calls, which can exceed client timeouts.

An investigation started when an object with >6k resources (and >24k files) timed out during assemblyWF. See the ticket (https://github.com/sul-dlss/common-accessioning/issues/1005), the object (https://argo.stanford.edu/view/druid:rm242zg2446), and the HB alert (https://app.honeybadger.io/projects/52894/faults/88727290).

The fix was to extend the timeout for dor-services-app calls made via dor-services-client: https://github.com/sul-dlss/dor-services-client/pull/306
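
For context, dor-services-client makes its HTTP calls via Faraday, so extending the timeout amounts to raising the connection options. A minimal sketch of the idea (the URL and the 900-second value here are illustrative, not the values chosen in the PR):

require 'faraday'

# Give large cocina payloads enough time to upload and be processed before
# Faraday gives up. (Hypothetical URL; see the PR above for the real change.)
conn = Faraday.new(url: 'https://dor-services.example.edu') do |f|
  f.options.timeout = 900      # seconds to wait for the whole request/response
  f.options.open_timeout = 60  # seconds to wait for the connection to open
end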

Investigation into the issue involved looking at the HB alert, determining that the timeout occurred when posting data to dor-services-app, and then walking through the code in both the common-accessioning console and the dor-services-app console to see what was taking a long time. This also involved saving large data structures to disk and reloading them on DSA.

Create the checksum data (this is what the checksum-compute robot does; see the code in common-accessioning):

ssh lyberadmin@common-accessioning-prod-b
ROBOT_ENVIRONMENT=production bin/console
druid='druid:rm242zg2446'

robot = Robots::DorRepo::Assembly::ChecksumCompute.new
assembly_item = robot.send(:item, druid)
cocina_model = assembly_item.cocina_model;nil

# this took about 1 hour and produced data to save (which would have then been posted to DSA)
file_sets = robot.send(:compute_checksums, assembly_item, cocina_model);nil

# save data to a file
File.open("../../shared/tmp/#{druid}_file_sets.dump", 'wb') { |f| f.write(Marshal.dump(file_sets)) }

# if you want to keep working with this data later on common-accessioning without re-creating it, you can reload it
# file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump"));nil # load up the data

# this generates the updated cocina that gets posted to DSA, and is what was timing out
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));nil
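
If you need hard numbers on which step dominates, Ruby's stdlib Benchmark can wrap any of the calls above. A sketch (not part of the original session):

require 'benchmark'

# compute_checksums was the roughly one-hour step for this object,
# so expect a long wait while this runs
elapsed = Benchmark.realtime do
  robot.send(:compute_checksums, assembly_item, cocina_model)
end
puts "compute_checksums took #{(elapsed / 60).round(1)} minutes"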

Now move the data to DSA:

ssh [email protected]
cd dor_services/shared/tmp
sftp lyberadmin@common-accessioning-prod-b
cd common-accessioning/shared/tmp
get druid:rm242zg2446_file_sets.dump tmp/druid:rm242zg2446_file_sets.dump
exit
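
If you end up repeating this transfer, it can also be scripted from Ruby; a sketch using the net-sftp gem (an assumption here, it may not be installed on these boxes):

require 'net/sftp'

druid = 'druid:rm242zg2446'

# pull the Marshal dump from the common-accessioning box into the local tmp dir
Net::SFTP.start('common-accessioning-prod-b', 'lyberadmin') do |sftp|
  sftp.download!("common-accessioning/shared/tmp/#{druid}_file_sets.dump",
                 "tmp/#{druid}_file_sets.dump")
end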

Now load up the data in DSA and try the update:

bundle exec rails c -e p

conf.echo = false # prevents echoing of large data structures to the console

druid='druid:rm242zg2446'

file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump")) # load up the data

# this is the first thing DSA does on the API call and can take a while, potentially causing a timeout
cocina_model = CocinaObjectStore.find(druid) 

# this is what the checksum-compute robot creates to post to DSA 
# (i.e. this bit happens in the robot; we are just redoing it on DSA so we can use the data):
# https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L17
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));

# this is done by dor-services-client to prep the data to post to DSA
params = Cocina::Models.without_metadata(updated).to_json
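
# optional sanity check (not in the original session): confirm the serialized
# payload really is huge; `params` is a JSON String here, so this is just
# String#bytesize from Ruby core
puts "payload is #{(params.bytesize / 1024.0 / 1024.0).round(1)} MB"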

# this is what the update action in DSA does (when it gets called from dor-services-client):
# https://github.com/sul-dlss/dor-services-app/blob/main/app/controllers/objects_controller.rb#L33
# (in the controller, `params` is ActionController::Parameters and gets
#  .except(:action, :controller, :id).to_unsafe_h; in the console we have a JSON
#  string, so parse it back into a hash first)
# this line should build the updated object, but just blows up with a ton of output to the console,
# which we have not figured out how to suppress...suspect it is from a conversion from Struct to Hash under the hood somewhere
cocina_object = Cocina::Models.build(JSON.parse(params))

# save the object (we could never actually get here because of all the output above)
UpdateObjectService.update(cocina_object, skip_lock: true);
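
One untested idea for suppressing the console flood: temporarily point $stdout at the null device while the noisy call runs (a sketch; it only helps if the output is actually written to $stdout rather than, say, $stderr):

orig_stdout = $stdout
$stdout = File.open(File::NULL, 'w')
begin
  cocina_object = Cocina::Models.build(JSON.parse(params))
ensure
  $stdout.close
  $stdout = orig_stdout
end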