Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review and improve/correct our Celery task retry logic #1391

Open
3 tasks
ccostino opened this issue Nov 4, 2024 · 0 comments
Open
3 tasks

Review and improve/correct our Celery task retry logic #1391

ccostino opened this issue Nov 4, 2024 · 0 comments
Assignees

Comments

@ccostino
Copy link
Contributor

ccostino commented Nov 4, 2024

With the increased amount of Celery worker processes and memory available to each of them, we're starting to notice a new pattern emerge with the task retries in Celery for trying to send messages or retrieve their information.

There's a fairly consistent spike of celery.exceptions:Retry exceptions being thrown at set intervals (see New Relic for more details), and this points to our handling of retrying tasks. Looking into the Celery docs around message (task) sending, it would seem that tasks should generally be retried automatically without needing to do a lot on our part. This is a bit contradictory with how to configure a task though, so I looked a little further into this.

What I found in this Stack Overflow post seems to point to us doing a bit too much with our retry logic in the Celery tasks. A bit more investigation led me to this article about Celery retry logic, which points to a few different approaches (and some based on the version you're using - we're on 5.4.x).

We need to make some adjustments to make sure we're letting Celery properly handle task retries and that we're not causing additional churn with extra exception handling.

Implementation Sketch and Acceptance Criteria

  • Look further into these articles and other sources to see what would be the right approach for handling retry logic of our Celery tasks, especially accounting for things like max retries, retry delays, and retry back off/anti-jitter (especially for things like S3 interactions)
  • Take a look at our code in app/celery (especially tasks.py and provider_tasks.py) and see where we might need to make adjustments
  • Create a PR with the proposed adjustments so we can talk through them and determine if they'd be appropriate, and how we're going to test them and validate that the changes will have the desired effect

Security Considerations

  • We want to make sure our asynchronous tasks are handled correctly so we don't introduce additional errors or performance issues in the system that cause instability.
@ccostino ccostino moved this from 🌱 New to ⬇ Up-Next in Notify.gov product board Nov 4, 2024
@xlorepdarkhelm xlorepdarkhelm self-assigned this Nov 18, 2024
@xlorepdarkhelm xlorepdarkhelm moved this from ⬇ Up-Next to 🏗 In progress (WIP: ≤ 3 per person) in Notify.gov product board Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 🏗 In progress (WIP: ≤ 3 per person)
Development

No branches or pull requests

2 participants