Improve handling of publish/shelve for deposits with 5,000+ files #5187
Comments
Some additional observations: I noticed today that files continue to be written to Stacks for tm782sf2963 even when Sidekiq says there is no work in progress. I also confirmed that there are a lot of duplicates.
The core problem is that DSA makes a synchronous call to purl-fetcher for publish/shelving, during which it waits for purl-fetcher to complete shelving. Since shelving a large object involves copying a large number of files, this synchronous call is prone to timing out. A timeout may leave behind a lock file (which blocks republishing attempts) and extra files on disk. While the timeout on the call from DSA can be increased, that is a fragile solution. The most straightforward fix is to have purl-fetcher perform its publish/shelve work in an asynchronous job. DSA can either poll for the status or be notified by a webhook when the work completes.
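For illustration, here's a rough sketch of what the asynchronous approach could look like on the purl-fetcher side, assuming a Sidekiq-backed ActiveJob; the job, controller, and status-tracking names are hypothetical, not existing purl-fetcher code:

```ruby
# Hypothetical sketch only; PublishShelveJob and PublishShelveStatus do not exist today.
class PublishShelveJob < ApplicationJob
  queue_as :default

  def perform(druid)
    PublishShelveStatus.start!(druid)
    # ... the existing synchronous shelving work would move here ...
    PublishShelveStatus.complete!(druid)
  rescue StandardError => e
    PublishShelveStatus.fail!(druid, error: e.message)
    raise
  end
end

# The publish endpoint enqueues the job and returns immediately instead of
# blocking while thousands of files are copied.
class PublishController < ApplicationController
  def create
    PublishShelveJob.perform_later(params[:druid])
    head :accepted
  end

  # DSA could poll this, or purl-fetcher could POST a webhook back to DSA on completion.
  def show
    render json: PublishShelveStatus.for(params[:druid])
  end
end
```

Either way, DSA would no longer hold an HTTP connection open for the duration of the copy, so a client-side timeout couldn't trigger the lock-file and retry problems described above.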
I haven't confirmed this by examining the logs, but I suspect that purl-fetcher continues its work even after DSA's request has timed out. When DSA's request times out, it turns around and tries again, thus encountering the lock file.
@justinlittman is there a way to prevent the unnecessary recopying of files?
@edsu with the current technology in place (the ActiveStorage disk driver), that is not possible. It generates a unique token for each anticipated upload.
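As a rough illustration of that constraint (assuming a Rails console against an app using the ActiveStorage disk service; this is not purl-fetcher or sdr-api code): two blobs created from identical bytes still get distinct keys, so there is no stored path that a retried upload could simply reuse.

```ruby
require "stringio"

io_a = StringIO.new("same bytes")
io_b = StringIO.new("same bytes")

# Each call mints a fresh key/token, even though the content is identical.
blob_a = ActiveStorage::Blob.create_and_upload!(io: io_a, filename: "a.txt")
blob_b = ActiveStorage::Blob.create_and_upload!(io: io_b, filename: "a.txt")

blob_a.checksum == blob_b.checksum # => true  (identical content)
blob_a.key == blob_b.key           # => false (distinct keys, distinct paths on disk)
```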
I've run a bunch of tests on stage to try to identify where the time is going.
How I tested
I used the sdr-api client to send in incrementally larger deposits and then watched how long publishing took. It's not the easiest to monitor the publish step without doing something like watching the logs, but I tested enough to identify patterns.
A typical set of files is here: https://argo-stage.stanford.edu/view/xt030km6983
What I found
Each publish involves two copying operations:
The first copying operation is consistently fast, or at least fast enough:
The real slowdown in publish happens when files are copied from the transfer area into Stacks.
I'm not sure why that is the case, but it's consistent across dozens of tests. It happens when the druid with many files is the only thing publishing, and it happens when there's another druid publishing. It seems to be a per-druid slowdown that's not clearly affected by other activity on the file system. Could there be a process that is doing something like checking all of the previous shelving work each time an individual file is copied into Stacks? (See the sketch below.)
tl;dr
Anyway, if I were targeting one aspect of this, it would be that second copying operation.
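To make that suspicion concrete, here is the kind of pattern that would produce a per-druid slowdown like this; it is purely illustrative and not actual purl-fetcher code:

```ruby
require "fileutils"

# Illustrative only: a copy loop that re-examines the whole destination after
# every file. With n files this does on the order of n^2/2 stat calls, so each
# additional file is slower to shelve than the one before it.
def copy_with_full_recheck(paths, dest_dir)
  paths.each do |path|
    FileUtils.cp(path, dest_dir)
    # The suspected culprit: per-file bookkeeping that walks everything copied so far.
    Dir.children(dest_dir).each { |name| File.stat(File.join(dest_dir, name)) }
  end
end
```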
Trying to summarize the issues that have been identified while working through these many-file deposits. This is roughly in order of severity as I see it, but I recognize it may not be the best order in which to approach addressing the issues.
As an additional data point, I just came across an item that contains 36,000 files and took only about an hour to shelve/publish in July 2024, before the versioning changes were released.
Based on a discussion with Andrew, I ran a few quick filesystem write tests using the sample druid mentioned above.
I then checked file write latencies. NFS (NetApp) was faster, but not by enough to explain the discrepancies seen by Andrew. The headline from those tests was:
This was roughly in line with the timings I was seeing from my manual test copying of the sample druid, e.g.:
Copying from local filesystem to NFS mount:
Copying from local filesystem to Weka mount:
This all makes sense to me; the Weka system is better at large block writes than NetApp. The same file with a 1 MB block size:
The TL;DR is: it doesn't appear to be an issue with the storage layer. Given Andrew's description of increasing slowness, my suspicion would be that there's some kind of loop in the code that's taking longer to finish after each execution. Guess: after every file is written, the code's not doing an
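For reference, here's a Ruby equivalent of that kind of block-size comparison (the original tests were presumably run with something like dd; the target path and sizes below are placeholders):

```ruby
require "benchmark"

# Write the same amount of data with small vs. large blocks and compare throughput.
# Point TARGET at an NFS or Weka mount to reproduce the comparison described above.
TARGET = ARGV.fetch(0, "/tmp/blocksize_test.bin")
TOTAL_BYTES = 256 * 1024 * 1024 # 256 MB

[4 * 1024, 1024 * 1024].each do |block_size|
  chunk = "x" * block_size
  seconds = Benchmark.realtime do
    File.open(TARGET, "wb") do |f|
      (TOTAL_BYTES / block_size).times { f.write(chunk) }
      f.fsync
    end
  end
  puts format("block size %7d bytes: %.1f MB/s", block_size, (TOTAL_BYTES / 1_048_576.0) / seconds)
end
File.delete(TARGET)
```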
Thanks @julianmorley for that analysis. Here's the code: https://github.com/sul-dlss/purl-fetcher/blob/main/app/services/versioned_files_service/update_action.rb#L71-L78, where the actual move of a file is performed at https://github.com/sul-dlss/purl-fetcher/blob/main/app/services/versioned_files_service/contents.rb#L16-L19. If a move across filesystems is a copy and then a delete, then perhaps the delete is getting slower? It seems like the next step is to add some benchmark logging to isolate the exact line of code that is slow. If that proves to be the move, then we can try splitting it into separate copy and delete operations with logging.
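A minimal sketch of that benchmark logging, splitting the cross-filesystem move into a copy and a delete so each half gets its own timing (the method name and log format here are illustrative, not the actual purl-fetcher change):

```ruby
require "benchmark"
require "fileutils"
require "logger"

# Illustrative sketch: time the copy and the delete separately so the logs show
# which half of the move across filesystems is the slow part.
def timed_move(source, destination, logger: Logger.new($stdout))
  copy_seconds   = Benchmark.realtime { FileUtils.cp(source, destination) }
  delete_seconds = Benchmark.realtime { FileUtils.rm(source) }
  logger.info("moved #{source}: copy=#{copy_seconds.round(3)}s delete=#{delete_seconds.round(3)}s")
end
```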
Yes, getting some timings into the logs is a good idea. It could be the delete portion of the move that's slow. I bet if I re-wrote that move with rsync, it would be faster.
Turns out it was neither the copy nor the delete that was slow.
The issue
Deposits with 5,000+ files are taking a long time to publish. (I updated this from 15,000 to 5,000 based on testing on stage.) Even if the files are very small, it can take multiple hours to shelve 5,000 files.
We recently changed a timeout from 15 minutes to 10 hours in an attempt to get an item with 23,000 files deposited, and even then it took over 20 hours (two timeouts and three retries) to shelve the last 10,000 files for that item. The item happened to be in progress when there was a systems outage, so it is not clear how long it had spent in shelving before the outage.
Production items that have had this problem recently:
There are multiple HB errors associated with these many-file deposits:
Observing the progress of these deposits, I think this is what's happening: files are written to /access-transfer - many more than are contained in the original deposit. It's likely the retries are rewriting the same files repeatedly. The result is that transfers take much longer than they should, with the most extreme cases taking multiple days and requiring careful monitoring and attention.
See my comment below for more data on how long the copying steps are taking.
To reproduce
Accession an item on stage that has 5,000+ files and watch the shelving progress. You can see that copying files into Stacks takes multiple hours. For example, publish took 2 hours on one item that has 5,000 files but only 23 KB of data. An item with 16,000 files took almost an entire day.
Additional background
We don't get many deposits over 5,000 files - and we ask people not to exceed 25,000 files in the H2 deposit form - but we have been supporting deposits in that range since we launched the H2-Globus deposit integration. We did not see this constellation of issues before we re-did the publish/shelve approach during the recent versioning work.