
Improve handling of publish/shelve for deposits with 5,000+ files #5187

Open
andrewjbtw opened this issue Oct 9, 2024 · 12 comments

@andrewjbtw

andrewjbtw commented Oct 9, 2024

The issue

Deposits with 5,000+ files are taking a long time to publish. (I updated this from 15,000 to 5,000 based on testing on stage.) Even if the files are very small, it can take multiple hours to shelve 5,000 files.

We recently changed a timeout from 15 minutes to 10 hours in an attempt to get an item with 23,000 files deposited, and even then it took over 20 hours (two timeouts and three retries) to shelve the last 10,000 files for that item. The item happened to be in progress during a systems outage, so it is not clear how long it had spent in shelving before the outage.

Production items that have had this problem recently:

  • zg292rq7608 (recently resolved by increasing timeout)
  • tm782sf2963 (finished after timing out twice with a 10 hour timeout after the item had already been running for 15+ hours previously)

There are multiple HB errors associated with these many-file deposits.

Observing the progress of these deposits, I think this is what's happening:

  • while shelving: the publish timeout results in the job retrying before it can complete, which has led us to increase the timeout to 10 hours in order to avoid rapid retries
  • files get copied to Stacks very slowly (the process starts out fast, but slows to <100 files every 5 minutes)
  • if there are retries, many files are created in /access-transfer - many more than are contained in the original deposit. It's likely the retries are rewriting the same files repeatedly.
  • when the files finally are all copied to Stacks, the job attempts to publish the purl metadata
  • the purl metadata publish takes >15 minutes, which led us to increase a timeout there to 30 minutes (possibly more now?)

The result is that transfers are taking much longer than they should, with the most extreme cases taking multiple days and requiring careful monitoring and attention.

See my comment below for more data on how long the copying steps are taking.

To reproduce

Accession an item on stage that has 5,000+ files and watch the shelving progress. You can see that copying files into Stacks takes multiple hours. For example, publish took 2 hours on this item that has 5000 files but only 23 KB of data. An item with 16,000 files took almost an entire day.

Additional background

We don't get many deposits over 5,000 files - and we ask people not to exceed 25,000 files in the H2 deposit form - but we have been supporting deposits in that range since we launched the H2-Globus deposit integration. We did not see this constellation of issues before we re-did the publish/shelve approach during the recent versioning work.

@andrewjbtw
Author

andrewjbtw commented Oct 10, 2024

Some additional observations:

I noticed today that files continue to be written to Stacks for tm782sf2963 even when sidekiq says there's no publish job running. A few thousand files have been written since about 5 PM today (it's now midnight) even though I killed the publish job in the sidekiq dashboard. That makes me think jobs continue to run after an error occurs and it seems likely that multiple retries have been writing files at the same time.

I confirmed that there are a lot of duplicates in /access-transfers. I made a list of ~35,000 files under 1 MB in size, then generated MD5 checksums for them. Many files have been written 5+ times. These are the most frequently occurring hashes (each representing a duplicated file) in my sample (a sketch of this kind of check follows the list):

      7 e0f4840994c4c9d61471167d9c08d17c
      7 e693e2e76c85e331c5adef362464fbfb
      7 eb363fb32762ae24766b175794a3f858
      7 ec02fcacee1d8f14ea2a1bef1b2adb3b
      7 ee9006556d9e52bb11afa42d3bade517
      7 f047ad6338d8781affeaad92e9ac9e6f
      7 f104e16c7054ff0d87870d31d8862640
      7 f2b60bf782507ed6caa5db7807539275
      7 f32376261c00309395f9e4d2c1bdb336
      7 f32c5b47051a1b01b7679ca31ddd2055
      7 f385cf3217c97da56fd39a356186e057
      7 f4be3e869db7e50513ebff1527ff9323
      7 f5fea83e6ca111e189dd94c170271965
      7 fc5e1796a511d542ba28f2559465484c
      7 fc65fa6e23e715875357c356228811c9
      7 fca07daac869f4029e78100468e1685c
      7 fccb6254959b36a6f64c5f6e87eba8b0
      7 fd4b6d4c5aaddb606f744f25456a6aec
      8 019249d843b38662c5267491fea866d0
      8 01ae3b389207afbce9452023bb720e19
      8 0ce8c834154420f8577726405465c410
      8 0f8687440780e694a6c1a6bc69dce77c
      8 1b313cdfdd44b8687c893742427157b9
      8 274f98e8cf959ae4a3013845eefabdf6
      8 32ac8c39805ef474c16a31a833b51f78
      8 45e90a76fea53ca343472250fbeb8ec4
      8 4d7df8a1df670059661637faf560209e
      8 4f7ecfb977801cf6013ac01596ead441
      8 6b52b6a47ef5af678c92f52491b21910
      8 6c98060fa71cf54b439b21a34be59225
      8 72b5b40cfc7280de740810786ff046cd
      8 749f43c9f6bc20c9aacae04d3b80d8b6
      8 8096a323ab1b5acb8510a419f5fef8d9
      8 8149eb4f6a7940e422c68923392eb89f
      8 855f5e2a607a046ee43618565266c656
      8 89b32fd2a4e06ebc6c444a81d99ee8d6
      8 8e948776d7744629e068eec645d4e30b
      8 a8e270344d441b021c24dc3329b35449
      8 ac44af4b9095c2b90a08ea6e077d5fdf
      8 ac7b04f1427813b67e0210f45e26d4d6
      8 ae8e350b47bcab50e4e760b6e9624e02
      8 bb9c5c46c9f26b722ad5264e72799ca1
      8 c1f21e9e996cf8b69eed2390754bcde0
      8 c71390d1c7e9f9ce945fcbd47524f985
      8 cd67f1bc593dd57e48e3f665b63b997b
      8 d2f56c3d7a11c0c44ec18630b8fc4c3e
      8 ea5ca469a6d144015e273754d879d43f
      9 2c90d38043382c3cfca399b3149d55b7
      9 a07c8962e02fb5b0b58395b3339e8373
     37 d41d8cd98f00b204e9800998ecf8427e
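
For reference, a minimal sketch of the kind of duplicate check described above, in Ruby (the directory path and the size cutoff are illustrative, not the real mount points):

```ruby
#!/usr/bin/env ruby
# Sketch: count duplicate files under a transfer directory by MD5 checksum.
# The directory path and the 1 MB cutoff are illustrative.
require 'digest'
require 'find'

transfer_dir = ARGV.fetch(0, '/access-transfers')
counts = Hash.new(0)

Find.find(transfer_dir) do |path|
  next unless File.file?(path) && File.size(path) < 1_000_000
  counts[Digest::MD5.file(path).hexdigest] += 1
end

# Print the most frequently occurring checksums, like `sort | uniq -c | sort -n`.
counts.sort_by { |_md5, n| n }.last(20).each do |md5, n|
  printf("%7d %s\n", n, md5)
end
```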

@justinlittman
Contributor

The core problem is that DSA is making a synchronous call to purl-fetcher for publish/shelving, during which it waits for purl-fetcher to complete shelving. Since shelving for large objects involves copying a large number of files, this synchronous call is prone to timeout. This may leave behind a lock file (which blocks republishing attempts) and extra files in the /access-transfer file system (which are cleaned up automatically after a week).

While the timeout on the call from DSA can be increased, that is a fragile solution. The most straightforward fix is to perform purl-fetcher's publish/shelve work in an asynchronous job. DSA can either poll for the status or be notified by a webhook when it completes.
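
A rough sketch of what the polling option could look like from the DSA side, assuming purl-fetcher enqueued the work and returned 202 Accepted with a status URL; the endpoint names and response shape here are hypothetical, not the current API:

```ruby
# Hypothetical polling flow: purl-fetcher enqueues the shelve/publish work and
# returns 202 with a Location header for a status resource; the caller polls it.
require 'net/http'
require 'json'
require 'uri'

def publish_async(publish_url, timeout: 60 * 60, interval: 30)
  response = Net::HTTP.post(URI(publish_url), '', 'Content-Type' => 'application/json')
  raise "unexpected response: #{response.code}" unless response.is_a?(Net::HTTPAccepted)

  status_uri = URI(response['Location']) # hypothetical status resource
  deadline = Time.now + timeout

  loop do
    status = JSON.parse(Net::HTTP.get(status_uri))['status'] # e.g. "queued", "running", "complete", "failed"
    return true if status == 'complete'
    raise 'publish failed' if status == 'failed'
    raise 'gave up waiting for publish' if Time.now > deadline

    sleep interval # no long-lived HTTP request to time out while files are copied
  end
end
```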

@edsu
Contributor

edsu commented Oct 10, 2024

I haven't confirmed this from examining the logs, but I suspect that purl-fetcher continues its work even after DSA's request has timed out. When DSA's request times out it turns around and tries again, thus encountering the .lock problem until purl-fetcher has finished and removed the .lock.

@edsu
Contributor

edsu commented Oct 10, 2024

@justinlittman to prevent unnecessary recopying of files to /access-transfer would it be feasible for ShelvableFilesStager or TransferStager to know if the files have already been copied, and turn the copy into a no-op?

https://github.com/sul-dlss/dor-services-app/blob/main/app/services/publish/metadata_transfer_service.rb#L74-L77
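
As a sketch of the idea only (and independent of the ActiveStorage token issue raised in the next comment), a stager could skip files that already exist at the destination with the expected size; the method name and the size-only comparison are illustrative:

```ruby
require 'fileutils'

# Illustrative idempotent copy: skip the transfer when the destination already
# has a file of the same size (a checksum comparison would be stricter).
def copy_unless_present(source_path, dest_path)
  return :skipped if File.exist?(dest_path) && File.size(dest_path) == File.size(source_path)

  FileUtils.mkdir_p(File.dirname(dest_path))
  FileUtils.cp(source_path, dest_path)
  :copied
end
```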

@jcoyne
Contributor

jcoyne commented Oct 10, 2024

@edsu with the current technology in place (activestorage disk driver), that is not possible. It is making a unique token for each anticipated upload.

@andrewjbtw
Author

I've run a bunch of tests on stage to try to identify

  • where publish/shelve is taking so long
  • possible reasons for the lengthy jobs

How I tested

I used the sdr-api client to send in incrementally larger deposits and then watched how long publishing took. It's not easy to monitor the publish step without doing something like watching the logs, but I tested enough to identify patterns.

  • Size of deposits: started with 100 files and tested in increments of 100 until reaching 3000 files.
  • Conditions under which I tested:
    • I started with a "clean" filesystem (only a few files in /access-transfer each time)
    • Ran the tests again while the /access-transfer filesystem was in use by other processes. The last set of tests ran while there were already 40,000+ files in /access-transfer
  • file size: all the files are small text files; the only variable across tests is the number of files.

A typical set of files is here: https://argo-stage.stanford.edu/view/xt030km6983

What I found

Each publish involves two copying operations:

  1. copy from /dor/workspace into /access-transfer (on stage it's /access-transfer-stage)
  2. copy from /access-transfer into the /stacks/druid/tree/content location

The first copying operation is consistently fast or at least fast enough:

  • I did not see any noticeable delays in copying into /access-transfer
  • items with <1000 files copied almost instantly and all of accessioning was completed in less than 5 minutes
  • on prod, I've seen 16k - 23k files take about an hour or so to copy into /access-transfer, which is not terrible given that's real, multi-GB data

The real slowdown in publish happens when files are copied from /access-transfer to /stacks. This behavior is somewhat unexpected because copying doesn't happen at a constant rate. Instead, it gets slower over time. I had a script check selected druids on /stacks every minute and count the number of shelved files each time (a sketch of that kind of check follows the list below).

  • Around 500 files would copy in the first minute
  • Fewer files would be copied in each successive minute: 500, 300, 200, etc.
  • By the time you get to 2000 files, the rate is closer to 50 per minute
  • The more files there are, the further the rate drops
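
The per-minute check was essentially a counter over the druid's content directory. A minimal sketch of that kind of monitor (the stacks path layout passed in is illustrative):

```ruby
# Sketch: count shelved files for a druid once a minute and print the delta,
# to watch the copy rate fall off over time. Pass the druid's content directory
# on /stacks as the first argument (the exact path layout is illustrative).
content_dir = ARGV.fetch(0)
previous = 0

loop do
  current = Dir.glob(File.join(content_dir, '**', '*')).count { |path| File.file?(path) }
  puts "#{Time.now} total=#{current} copied_in_last_minute=#{current - previous}"
  previous = current
  sleep 60
end
```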

I'm not sure why that is the case, but it's consistent across dozens of tests. It happens when the druid with many files is the only thing publishing and it happens when there's another druid publishing. It seems to be a per-druid slowdown that's not clearly affected by other activity on the file system.

Could there be a process that is doing something like checking all the previous shelving work each time each individual file is copied to /stacks? I could see that getting incrementally slower because there's more previous work to check the longer the shelving runs.

tl;dr

Anyway, if I were targeting one aspect of publish to speed up, it would be the copying from /access-transfer to /stacks.

@andrewjbtw andrewjbtw changed the title Improve handling of publish/shelve for deposits with 15,000+ files Improve handling of publish/shelve for deposits with 5,000+ files Oct 16, 2024
@andrewjbtw
Author

Trying to summarize the issues that have been identified while working through these many-file deposits. This is roughly in order of severity as I see it, but I recognize it may not be the best order in which to approach addressing the issues.

  1. Include a file size check when shelving and retry incomplete files, so incorrect files are not served to the public (#5191: Retry shelving a file if the file size on /stacks doesn't match the original file)
  2. Address slow copying from /access-transfer to /stacks
  3. Make shelving asynchronous (get away from timeout issues)
  4. Avoid recopying data into /access-transfer when there's a retry (which can be caused by timeout)

@andrewjbtw
Author

As an additional data point, I just came across an item that contains 36,000 files and took only about an hour to shelve/publish in July 2024, before the versioning changes were released.

@julianmorley
Member

Based on a discussion with Andrew, I ran a few quick filesystem write tests.

I used the sample druid dk313fd838 on Stage. I tested the time it takes to copy the druid using filesystem tools from local disk to Weka, local disk to NFS, NFS disk to NFS and Weka disk to Weka.

  1. 33,000 files were copied each time. There was no incremental slowdown of the copying speed seen in any of the tests.
  2. Weka was slower than NFS, but not by an amount that should make any reasonable difference.

I then checked file write latencies using fio against NFS and Weka mounts. I used an 86 MB test file (the same size as all content in dk313fd838) and a 4 KB block size, with a random write pattern, to emulate writing 86 MB of 4 KB files.

NFS (NetApp) was faster, but not by enough to explain the discrepancies seen by Andrew. The headline from those tests was:
NetApp: [w=9253KiB/s][w=2313 IOPS]
Weka: [w=5745KiB/s][w=1436 IOPS]

This was roughly in line with the timings I was seeing from my manual test copying of the sample druid, e.g.:

Copying from local filesystem to NFS mount:

root@dor-techmd-qa-a:/home/jmorley# time cp -R dk313fd838 /services-disk-stage/tmp/
real	1m24.178s
user	0m0.276s
sys	0m4.619s

Copying from local filesystem to Weka mount:

root@sul-purl-stage:/home/jmorley/tmp# time cp -R /home/jmorley/tmp/dk313fd838 /purl/tmp/
real	2m36.405s
user	0m0.284s
sys	0m6.508s

This all makes sense to me; the Weka system is better at large block writes than NetApp. The same file with a 1 MB block size:
NetApp write: IOPS=206, BW=207MiB/s (217MB/s)(86.0MiB/416msec); 0 zone resets
Weka write: IOPS=1409, BW=1410MiB/s (1478MB/s)(86.0MiB/61msec); 0 zone resets <--- x7 faster than NetApp

The TL;DR is: it doesn't appear to be an issue with the storage layer. Given Andrew's description of increasing slowness, my suspicion is that there's some kind of loop in the code that takes longer to finish after each execution. Guess: after every file is written, the code isn't doing an ls or other similar action on the destination directory, is it? That would get worse over time.

@justinlittman
Contributor

Thanks @julianmorley for that analysis.

Here's the code: https://github.com/sul-dlss/purl-fetcher/blob/main/app/services/versioned_files_service/update_action.rb#L71-L78, where the actual move of a file is performed at https://github.com/sul-dlss/purl-fetcher/blob/main/app/services/versioned_files_service/contents.rb#L16-L19.

If a move across filesystems is a copy then a delete, then perhaps the delete is getting slower?

Seems like the next step is to add some benchmark logging to isolate the exact line of code that is slow. If that proves to be the move, then we can try splitting it into separate copy and delete operations with logging.
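
A minimal sketch of that kind of instrumentation, using the standard-library Benchmark around the move and an optional copy-then-delete split; the logger here stands in for whatever purl-fetcher actually logs with:

```ruby
require 'benchmark'
require 'fileutils'

# Sketch: time the move, and optionally split it into copy + delete to see
# which half is slow. The logger argument stands in for the app's real logger.
def timed_move(source_path, dest_path, logger:)
  elapsed = Benchmark.realtime { FileUtils.mv(source_path, dest_path) }
  logger.info("mv #{source_path} -> #{dest_path} took #{elapsed.round(3)}s")
end

def timed_copy_then_delete(source_path, dest_path, logger:)
  copy_time = Benchmark.realtime { FileUtils.cp(source_path, dest_path) }
  delete_time = Benchmark.realtime { FileUtils.rm(source_path) }
  logger.info("cp=#{copy_time.round(3)}s rm=#{delete_time.round(3)}s for #{dest_path}")
end
```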

@julianmorley
Member

Yes, getting some timings into logs is a good idea.

https://github.com/sul-dlss/purl-fetcher/blob/6aae8a76d5d931301cccbe8337f7550d468a766a/app/services/versioned_files_service/contents.rb#L17C5-L17C38

It could be the delete portion of the mv causing issues, but having a mkdir -p in each mv block also implies a directory existence check for each move, which has the potential to snowball depending on how, exactly, this is happening. That implied "does the dir exist?" check could become increasingly slow as the contents of the directory in question get larger.

I bet if I re-wrote that cp test to instead do an ls of all the files I wanted to move from the source dir, then looped through them one at a time starting with a mkdir -p it'd be much slower.
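
Something like the following would exercise that pattern: for every file moved, do a mkdir -p plus a listing of the destination before the move. This is a sketch of the experiment, not a claim about what the service actually does; the listing cost grows as the destination fills up.

```ruby
require 'fileutils'

# Sketch: emulate a per-file loop that re-checks the destination each time.
# The Dir.children call is the kind of "ls" that gets slower as the
# destination directory accumulates files.
def slow_loop_move(source_dir, dest_dir)
  Dir.children(source_dir).each do |name|
    FileUtils.mkdir_p(dest_dir)             # implied "does the dir exist?" check per file
    next if Dir.children(dest_dir).include?(name)

    FileUtils.mv(File.join(source_dir, name), File.join(dest_dir, name))
  end
end
```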

@justinlittman
Contributor

Turns out it was neither the mv nor the mkdir -- there was a hidden ls.

See sul-dlss/purl-fetcher#942
