Fix writing aggregation jobs touching GC'ed batches. #2467

branlwyd · 2024-01-10T04:43:29Z

This issue should only exist in the time-interval query type, as fixed-size is arranged such that aggregation jobs touching a given batch must be GC'ed before the batch. I include a guard to ensure that the new codepath is only taken in the expected case of an already-GC'ed batch for a time-interval query, as otherwise we might drop batch writes if we fell into it unexpectedly.

branlwyd · 2024-01-10T04:48:14Z

Closes #2464 (though I still need to evaluate the Helper codepath, which I should split out into a separate issue if any work is needed there).

I'm not totally happy with this fix as it's something of a hack, but I think it's about the best we can do -- the best other solution discussed was to return an is_gc'ed flag from get_batch, but we can't accurately compute that flag for the fixed-size query type.

I verified that the new test reproduces the issue by removing the fix & observing the test fail with an underflow error.

tgeoghegan · 2024-01-10T15:41:38Z

aggregator/src/aggregator/aggregation_job_driver.rs

+                    tx.put_aggregator_task(&leader_task).await?;
+                    tx.put_client_report(vdaf.borrow(), &first_report).await?;
+                    tx.put_client_report(vdaf.borrow(), &second_report).await?;


nit: I think it's better to call unwrap() here, because then if any of these three puts fails, it's obvious which one it was. If we propagate the error, it only blows up when we unwrap the result from run_unnamed_tx, which is less helpful.

branlwyd · 2024-01-10

Good call, I've been making this change as I touch code -- it is indeed much better to fail sooner rather than later in tests. I missed these because I adapted a lot of this code from a previously existing test.
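For readers following the nit, the difference between the two styles can be sketched as below. This is a minimal illustration under stated assumptions: `put`, `PutError`, `setup_propagating`, and `setup_unwrapping` are hypothetical stand-ins, not Janus's datastore API, and the failing item is hard-coded to simulate an error.

```rust
// Hypothetical stand-ins for illustration only; not Janus's datastore API.
#[derive(Debug, PartialEq)]
struct PutError(&'static str);

// Simulates a datastore write that fails for one particular item.
fn put(item: &'static str) -> Result<(), PutError> {
    if item == "second_report" {
        Err(PutError("write failed"))
    } else {
        Ok(())
    }
}

// Style 1: propagate with `?`. The caller receives one combined Result, so a
// panic from unwrapping it points at the caller, not at the put that failed.
fn setup_propagating() -> Result<(), PutError> {
    put("task")?;
    put("first_report")?;
    put("second_report")?;
    Ok(())
}

// Style 2: unwrap each put. A failure panics on the line of the put that
// failed, so the panic location identifies it directly.
fn setup_unwrapping() {
    put("task").unwrap();
    put("first_report").unwrap();
    put("second_report").unwrap();
}
```

With the first style, the test learns only that something in the setup failed; with the second, the panic's file and line number single out the third put.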

tgeoghegan · 2024-01-10T15:43:54Z

aggregator/src/aggregator/aggregation_job_writer.rs

+                            if !Q::to_batch_interval(batch_identifier)
+                                .map(|interval| interval.end() < tx.clock().now())
+                                .unwrap_or(false)


This works fine, but we have self.task in scope, so can't we more directly check the query type?

branlwyd · 2024-01-10

I suppose we should check both -- I also want to guard against a time-interval task unexpectedly not finding a batch that shouldn't be GC'ed.
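A guard that checks both conditions might look roughly like the sketch below. It is not the code that was merged: `QueryType`, `Interval`, and `missing_batch_is_expected` are simplified, hypothetical names, and timestamps are plain seconds rather than Janus's time types. The interval comparison mirrors the condition in the diff above (the interval ended before the current clock time).

```rust
// Hypothetical, simplified types for illustration; Janus's real types differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum QueryType {
    TimeInterval,
    FixedSize,
}

// A batch interval, reduced to its end time in seconds since the epoch.
#[derive(Clone, Copy, Debug)]
struct Interval {
    end: u64,
}

/// Returns true only when a missing batch matches the expected case:
/// the task is time-interval AND the batch interval ended before `now`.
/// `batch_interval` is None when the batch identifier has no interval.
fn missing_batch_is_expected(
    query_type: QueryType,
    batch_interval: Option<Interval>,
    now: u64,
) -> bool {
    query_type == QueryType::TimeInterval
        && batch_interval
            .map(|interval| interval.end < now)
            .unwrap_or(false)
}
```

Any case where this returns false (a fixed-size task, or a time-interval batch whose interval has not yet ended) would be treated as an unexpected missing batch rather than silently dropping the write.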

branlwyd requested a review from a team as a code owner January 10, 2024 04:43

tgeoghegan approved these changes Jan 10, 2024

tgeoghegan mentioned this pull request Jan 10, 2024

Leader: Batch GC can cause an aggregation job to become unwritable. #2464

Closed

inahga approved these changes Jan 10, 2024

code review

dcd4bc0

branlwyd enabled auto-merge (squash) January 10, 2024 19:19

branlwyd merged commit f257d1e into main Jan 10, 2024
7 checks passed

branlwyd deleted the bran/fix-batch-gc-bug branch January 10, 2024 20:00
