BLOCKED: Large collection (druid:qr157jm7296) is failing to complete collection release #4283

Open
andrewjbtw opened this issue Nov 21, 2023 · 4 comments

andrewjbtw commented Nov 21, 2023

NOTE: this ticket is blocked while we wait for a resolution of the load-balancer/networking issue being worked on by Ops and UIT.

Describe the bug
A collection object with 14,000+ items is failing to complete the releaseWF: https://argo.stanford.edu/view/druid:qr157jm7296

The release-members step runs for a few minutes and then hits an error. Retrying the step produces the same error a few minutes later. The error message is always:

unable to reach dor-services-app: Connection reset by peer

User Impact
Until this is resolved, it won't be possible to release large collections.

To Reproduce
Steps to reproduce the behavior:

  1. Go to https://argo.stanford.edu/view/druid:qr157jm7296
  2. Click on releaseWF to view the workflow
  3. Choose "rerun" from the dropdown menu.
  4. Wait a few minutes
  5. See the error again

Expected behavior
The release-members step for a collection release should trigger the release of all items in the collection. That step may take a while; once it completes, the collection moves on to the next step in the workflow.

In the past, we have successfully released much larger collections; the largest I remember had 140,000 items.

Additional context
This might be a similar class of errors to what we've seen in accessionWF.

andrewjbtw added the bug label on Nov 21, 2023
andrewjbtw moved this to New Issues (Needs Triage) in Infrastructure Portfolio Production Priorities on Nov 21, 2023
andrewjbtw changed the title from "Large collection release (druid:qr157jm7296) is failing to complete collection release" to "Large collection (druid:qr157jm7296) is failing to complete collection release" on Nov 21, 2023
@justinlittman
Contributor

@andrewjbtw
Author

Based on further observation following repeated retries, release-members runs for about two minutes before failing.

@ndushay
Contributor

ndushay commented Dec 15, 2023

Julian is supposed to be pinging UIT for help with this.

Under load, the load-balancer drops connections without queuing the requests. At a minimum, it should send 503 responses.
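
For illustration only, here is a minimal Python sketch of how a client could treat a reset connection the same way as a 503 and retry with exponential backoff. The names `call_with_retry` and `send` are hypothetical; this is not the code that release-members or dor-services-app actually runs, and it would not fix the load-balancer behavior itself.

```python
import time

# Statuses treated as "server is overloaded, try again later" (an assumption).
RETRYABLE_STATUS = {503}


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical zero-argument callable that performs one request
    and returns an HTTP status code. It may raise ConnectionResetError, which
    is what "Connection reset by peer" surfaces as in Python.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status = send()
        except ConnectionResetError:
            status = None  # dropped connection: handle it like an overload signal

        if status is not None and status not in RETRYABLE_STATUS:
            return status

        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without waiting. A caller would wrap a single per-item request in `send`.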

@ndushay
Contributor

ndushay commented Dec 15, 2023

Does this open a new HTTP connection for each item, or does it send all the requests over one connection? It should work either way.
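
To make the two options in the question concrete, here is a rough Python sketch using only the standard library. Everything in it is an assumption for illustration: the `/release/<id>` path, the function names, and the local stand-in server are hypothetical and do not describe how release-members or dor-services-app actually behave.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Local stand-in server that counts TCP connections (not the real service)."""

    protocol_version = "HTTP/1.1"  # allow keep-alive so a connection can be reused
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_POST(self):
        # Drain any request body, then send an empty 200 response.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def release_per_item(host, port, item_ids):
    """Option 1: open and close a new connection for every item."""
    for item_id in item_ids:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        try:
            conn.request("POST", f"/release/{item_id}")  # hypothetical path
            conn.getresponse().read()
        finally:
            conn.close()


def release_keep_alive(host, port, item_ids):
    """Option 2: send every request over one persistent connection."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        for item_id in item_ids:
            conn.request("POST", f"/release/{item_id}")  # hypothetical path
            conn.getresponse().read()  # read fully before reusing the connection
    finally:
        conn.close()


def count_connections(release, item_ids):
    """Run one strategy against the stand-in server; return connections opened."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    CountingHandler.connections = 0
    try:
        release("127.0.0.1", server.server_address[1], item_ids)
        return CountingHandler.connections
    finally:
        server.shutdown()
        server.server_close()
```

Calling `count_connections(release_per_item, ids)` and `count_connections(release_keep_alive, ids)` with a non-empty list of IDs shows how many connections each pattern opens against the stand-in server. With thousands of items, the first pattern presents far more new connections to whatever sits in front of the service.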

andrewjbtw changed the title from "Large collection (druid:qr157jm7296) is failing to complete collection release" to "BLOCKED: Large collection (druid:qr157jm7296) is failing to complete collection release" on Mar 5, 2024