Liveness issue when no reports are being uploaded #3427

divergentdave · 2024-10-03T23:22:02Z

Currently, if a time interval task receives some number of reports and report uploads then stop, it's possible for aggregation and collection of the existing reports to get stuck, at least until clients upload more reports.

When the report uploads stop, if there are fewer unaggregated reports than min_aggregation_job_size, then the aggregation job creator will not create any aggregation jobs. Thus, these reports will remain unaggregated. If a collection job is submitted with an interval that includes any such unaggregated report, the collection job driver will not process the job until all unaggregated reports in the batch interval have been processed (and all outstanding aggregation jobs have been finished or abandoned). Taken together, this means it's possible for a collection job to get stuck, even if we have sufficient valid reports to complete it. Getting into this state depends on race conditions between the clients' uploads and the aggregation job creator. We expect that tasks using the time interval query type will typically be for continuous metrics tasks, so extended periods with zero uploaded reports may be unusual.
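To make the condition concrete, here is a minimal toy model of the stuck state. It is not Janus code: `TaskSnapshot` and its field names are hypothetical, and it assumes the creator's size threshold is applied to the task's total count of unaggregated reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSnapshot:
    """Hypothetical snapshot of a time interval task after uploads have stopped."""

    min_aggregation_job_size: int
    unaggregated_reports: int          # unaggregated reports across the task
    unaggregated_in_interval: int      # subset inside the collection job's batch interval
    outstanding_aggregation_jobs: int  # jobs not yet finished or abandoned


def creator_makes_progress(s: TaskSnapshot) -> bool:
    # The creator only creates a job once enough unaggregated reports exist.
    return s.unaggregated_reports >= s.min_aggregation_job_size


def collection_job_ready(s: TaskSnapshot) -> bool:
    # The collection job driver waits for every report in the interval to be
    # processed and for all outstanding aggregation jobs to finish.
    return s.unaggregated_in_interval == 0 and s.outstanding_aggregation_jobs == 0


def collection_job_stuck(s: TaskSnapshot) -> bool:
    # Stuck: reports in the interval are waiting on an aggregation job that
    # will never be created unless more reports arrive.
    return s.unaggregated_in_interval > 0 and not creator_makes_progress(s)
```

With `min_aggregation_job_size = 10` and 3 leftover reports in the interval, `collection_job_stuck` returns `True` even when no aggregation jobs are outstanding.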

We could fix this with new heuristics or conditions that allow creating an under-sized aggregation job. How we do so matters, though: it may add overhead from a larger number of smaller aggregation jobs, and it may increase write contention during the ensuing aggregation. Thus, we'll want to create under-sized aggregation jobs only in limited situations.


branlwyd · 2024-10-03T23:54:09Z

Implementation idea, based on off-issue discussion:

I think we'd implement it as: after "normal" creation of aggregation jobs, we might have a few "straggler" reports left in-hand that aren't numerous enough to permit creation of another aggregation job. Check for existing collection jobs for the time windows associated with these straggler reports; create an aggregation job using the reports whose time windows have a collection job.
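A rough sketch of the selection step described above, not actual Janus code. `Report`, `Interval`, `time_window_start`, and `select_stragglers` are hypothetical names, timestamps are plain integer seconds, and collection intervals are assumed to be aligned to the task's time precision.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Report:
    report_id: str
    timestamp: int  # seconds since the epoch


@dataclass(frozen=True)
class Interval:
    start: int
    duration: int

    def contains(self, t: int) -> bool:
        return self.start <= t < self.start + self.duration


def time_window_start(report: Report, time_precision: int) -> int:
    # Round the report timestamp down to the start of its time window.
    return report.timestamp - report.timestamp % time_precision


def select_stragglers(
    stragglers: List[Report],
    collection_intervals: List[Interval],
    time_precision: int,
) -> List[Report]:
    """Keep only stragglers whose time window is covered by a collection job."""
    selected = []
    for report in stragglers:
        window = time_window_start(report, time_precision)
        if any(iv.contains(window) for iv in collection_intervals):
            selected.append(report)
    return selected
```

The returned reports would go into the under-sized aggregation job; stragglers in windows without a collection job stay unaggregated until more reports arrive.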

Things I'd want to think about more deeply before implementing:

- If we're going to create a "stragglers" agg job, maybe we want to go ahead and throw in as many reports as possible, including remaining reports for time windows that don't have a collection job, to increase the overall average agg job size. This would increase the number of batches touched by these aggregation jobs, however.
- Do we really just want one straggler agg job, or would multiple agg jobs be better for write contention? Creating multiple aggregation jobs, one per batch, would increase the number of aggregation jobs but reduce the number of batches touched by each aggregation job.

(These two points are in contention with one another.)
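The tradeoff between the two options can be illustrated with a small comparison. This is a toy sketch under stated assumptions, not a proposal for the actual implementation: stragglers are modeled as `(report_id, window_start)` tuples, and `single_job_plan`, `per_batch_plan`, and `plan_stats` are hypothetical helpers.

```python
from collections import defaultdict


def single_job_plan(stragglers):
    """One aggregation job holding every straggler, collected or not."""
    return [list(stragglers)] if stragglers else []


def per_batch_plan(stragglers, collected_windows):
    """One aggregation job per time window that has a collection job."""
    by_window = defaultdict(list)
    for report_id, window in stragglers:
        if window in collected_windows:
            by_window[window].append((report_id, window))
    return [by_window[w] for w in sorted(by_window)]


def plan_stats(plan):
    """Summarize a plan: job count, reports covered, worst-case batches per job."""
    return {
        "jobs": len(plan),
        "reports": sum(len(job) for job in plan),
        "max_batches_per_job": max(
            (len({window for _, window in job}) for job in plan), default=0
        ),
    }
```

For example, with four stragglers spread over three windows, two of which have collection jobs, the single-job plan yields one job touching three batches, while the per-batch plan yields two jobs touching one batch each and leaves one report unaggregated.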
