Helper task aggregation counters are currently updated in a separate DB transaction from the rest of the aggregation writes, to avoid unnecessary write contention. Each update is spawned in its own background (tokio) task and uses its own DB transaction.
During a recent incident, we saw these transactions fail because a DB connection could not be retrieved from the pool. Under overload, we could queue up an arbitrary number of these tasks, each awaiting a DB transaction -- there is no limit on the number of background tasks we might spawn.
Instead, we should switch to a model where a single background task receives counter updates and periodically writes them to the task aggregation counters, similar to how application-level metrics are handled for report uploads; see the sketch below. This would cap the number of tasks & DB transactions used for these updates at one. The background task could also batch writes to the counters, as well as handle retries when the counters cannot be written.
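A minimal sketch of what this might look like (all names here -- `TaskId`, `CounterDelta`, `write_counters` -- are hypothetical stand-ins, not the real types): request handlers send deltas over a bounded mpsc channel to a single task that accumulates them in memory and flushes them on a timer, retaining unflushed deltas for retry on the next tick.

```rust
// requires tokio = { version = "1", features = ["full"] }
use std::collections::HashMap;
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::interval;

/// Hypothetical stand-ins for the real task ID & counter types.
type TaskId = u64;

#[derive(Default, Clone, Copy)]
struct CounterDelta {
    reports_aggregated: u64,
    reports_failed: u64,
}

/// Hypothetical stand-in for the real datastore write; assumed fallible.
/// In the real implementation, this would open one DB transaction and
/// apply all accumulated deltas at once.
async fn write_counters(batch: &HashMap<TaskId, CounterDelta>) -> Result<(), ()> {
    println!("flushing {} counter update(s)", batch.len());
    Ok(())
}

/// Single background task: receives counter updates over a channel,
/// accumulates them, and flushes periodically. If a flush fails, the
/// pending deltas are kept and retried on the next tick.
async fn counter_aggregator(mut rx: mpsc::Receiver<(TaskId, CounterDelta)>) {
    let mut pending: HashMap<TaskId, CounterDelta> = HashMap::new();
    let mut flush_tick = interval(Duration::from_secs(5));

    loop {
        tokio::select! {
            maybe_update = rx.recv() => match maybe_update {
                Some((task_id, delta)) => {
                    let entry = pending.entry(task_id).or_default();
                    entry.reports_aggregated += delta.reports_aggregated;
                    entry.reports_failed += delta.reports_failed;
                }
                // All senders dropped: final flush, then exit.
                None => {
                    let _ = write_counters(&pending).await;
                    return;
                }
            },
            _ = flush_tick.tick() => {
                if !pending.is_empty() && write_counters(&pending).await.is_ok() {
                    pending.clear();
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // Bounded channel: under overload, senders get backpressure instead of
    // spawning an unbounded number of tasks each awaiting a DB connection.
    let (tx, rx) = mpsc::channel(1024);
    let aggregator = tokio::spawn(counter_aggregator(rx));

    // Request handlers send deltas instead of opening their own transactions.
    tx.send((42, CounterDelta { reports_aggregated: 10, reports_failed: 1 }))
        .await
        .unwrap();

    drop(tx); // shutdown: aggregator performs a final flush and exits
    aggregator.await.unwrap();
}
```

One nice property of the bounded channel: it directly addresses the incident failure mode, since overload shows up as backpressure on senders rather than as an ever-growing pile of tasks contending for DB connections.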
> similar to how application-level metrics are handled for report uploads
FWIW, report upload metrics don't run in their own task; they're handled as part of report uploading. (In that sense, they're handled more inline than even aggregation job handling is at the moment.) So for this issue we may also want to align report uploads with this model.
Well, report upload metrics are handled in a batched fashion, in a single background task, with a controlled amount of concurrency. You are correct that the report uploads themselves happen in that background task as well -- I'm less confident we should adopt a separate task for report upload metrics just to separate the two, though it wouldn't be too hard to implement.
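For reference, "a controlled amount of concurrency" here might look like the following sketch (names are hypothetical, not the actual upload code): a tokio `Semaphore` bounds how many batched writes are in flight at once, regardless of how many batches are queued.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;
use tokio::task::JoinSet;

/// Hypothetical stand-in for a batched metrics write.
async fn write_metrics_batch(batch_id: usize) {
    println!("writing metrics batch {batch_id}");
}

#[tokio::main]
async fn main() {
    // At most 4 writes in flight at once, however many batches are queued.
    let permits = Arc::new(Semaphore::new(4));
    let mut writes = JoinSet::new();

    for batch_id in 0..16 {
        let permits = Arc::clone(&permits);
        writes.spawn(async move {
            // acquire_owned ties the permit's lifetime to this task; the
            // permit is released when the task completes and drops it.
            let _permit = permits.acquire_owned().await.unwrap();
            write_metrics_batch(batch_id).await;
        });
    }

    while writes.join_next().await.is_some() {}
}
```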