DAP-05 ping-pong #1811

tgeoghegan · 2023-08-26T00:54:22Z

This change implements the DAP-05 ping-pong topology in which aggregators take turns preprocessing prepare shares into prepare messages. While this topology first appeared in DAP-05, this implementation follows the changes in 1, which should appear in DAP-06.

This change depends on the implementation of the VDAF ping-pong topology added to crate prio in 2, which in turn conforms to the specification added after VDAF-06 and further tweaked in 3 (we expect this to be published soon as VDAF-07).

This commit makes some changes to what intermediate values are stored by aggregators. In the case where an aggregator is continuing, it will have computed a prepare state, a prepare message for the current round and a
prepare share for the next round. The existing implementation would store all three objects in the database, significantly increasing the per-report storage requirements. In particular, this makes things worse for the Helper, which previously never needed to store a prepare share because the Leader always took responsibility for combining prepare
shares.

To mitigate this, we instead have aggregators store a prio::ping_pong::topology::Transition, which will contain a prepare
state and a prepare message (both of which are generally much smaller than prepare shares), from which the next prepare state and importantly prepare share can be recomputed.

The main benefit of this change is to reduce how many round trips between aggregators are needed to prepare reports. Quite a few tests used Prio3 but depended on having the leader or helper in the Waiting state after running aggregation initialization. Accordingly, those tests are changed to run Poplar1, which now takes 2 rounds.

Part of #1669

tgeoghegan · 2023-08-27T14:55:35Z

This is ready for review. Sadly the diff is large because the changes to test support code (like core::test_util::VdafTranscript) and moving a number of tests to Poplar1 introduces a lot of noise, compounded by cargo fmt changing a lot of code layout even though nothing has really changed.

tgeoghegan · 2023-08-28T14:51:16Z

aggregator/src/aggregator/error.rs

+        ref error @ PingPongError::StateMismatch(_, _) => (
+            format!("{error}"),
+            format!("{peer_role}_ping_pong_message_state_mismatch"),
+            // TODO(timg): is this the right error if state mismatch?


This needs to be resolved: VdafPrepError is correct in this case, because the DAP text says that any time a VDAF-level ping-pong routine yields state Rejected(), the aggregator should construct a PrepareResp with VdafPrepError. So for that reason, the cases above where we return UnrecognizedMessage need to be fixed. It's a shame, because in libprio we actually have specific knowledge of what went wrong, but per the specification we have to discard that information and flatten everything into VdafPrepError. At least we can still increment a specific Prometheus metric.

branlwyd

From the PR description:

The existing implementation would store all three objects in the database, significantly increasing the per-report storage requirements.

What existing implementation is this referring to? The pre-ping-pong implementation didn't store all three objects (since only two objects were relevant), and the experimental ping-pong implementation didn't store all three objects (since it implemented something like the same optimization we now implement by splitting continued to output a Transition which has an evaluate which finishes the continuation process).

branlwyd · 2023-08-28T21:53:59Z

Cargo.toml

@@ -42,7 +42,9 @@ janus_messages = { version = "0.5", path = "messages" }
 k8s-openapi = { version = "0.18.0", features = ["v1_24"] }  # keep this version in sync with what is referenced by the indirect dependency via `kube`
 kube = { version = "0.82.2", default-features = false, features = ["client", "rustls-tls"] }
 opentelemetry = { version = "0.20", features = ["metrics"] }
-prio = { version = "0.14.1", features = ["multithreaded"] }
+# TODO(timg): go back to a released version of prio


reminder: need to resolve this TODO before merging (which will require merging divviup/libprio-rs#683 & cutting a libprio-rs release)

aggregator/src/aggregator.rs

aggregator/src/aggregator/error.rs

aggregator/src/aggregator.rs

db/00000000000001_initial_schema.up.sql

branlwyd · 2023-08-29T21:26:34Z

messages/Cargo.toml

@@ -20,7 +20,9 @@ hex = "0.4"
 num_enum = "0.7.0"
 # We can't pull prio in from the workspace because that would enable default features, and we do not
 # want prio/crypto-dependencies
-prio = { version = "0.14.1", features = ["multithreaded"] }
+# TODO(timg): go back to a released version of prio


reminder: need to resolve this TODO before merging (which will require merging divviup/libprio-rs#683 & cutting a libprio-rs release)

aggregator_core/src/datastore/models.rs

tgeoghegan · 2023-08-31T22:19:11Z

From the PR description:

The existing implementation would store all three objects in the database, significantly increasing the per-report storage requirements.

What existing implementation is this referring to? The pre-ping-pong implementation didn't store all three objects (since only two objects were relevant), and the experimental ping-pong implementation didn't store all three objects (since it implemented something like the same optimization we now implement by splitting continued to output a Transition which has an evaluate which finishes the continuation process).

"Existing" is a poor word choice on my part. I think I meant "the implementation I had before I introduced PingPongTransition".

tgeoghegan · 2023-08-31T22:44:24Z

I've addressed the outstanding comments here, but this needs further changes to deal with whatever we end up doing about Transition::evaluate in divviup/libprio-rs#683, so it's not ready for re-review yet.

tgeoghegan · 2023-09-11T18:49:08Z

There are merge conflicts I need to clean up here. Also, there have been some changes to libprio-rs that are causing compilation issues. A lot of that has nothing to do with ping-pong, so I'm going to make a new PR that adopts libprio 0.15 (which will have most of VDAF-07 in it), then rebase this change on that.

branlwyd

Only one comment left, but it might require some refactoring. Getting ideal (minimal) storage & compute is tricky with this abstraction.

branlwyd

Looks good, just a few final nitpicky comments.

branlwyd · 2023-09-13T17:08:02Z

aggregator/src/aggregator/aggregation_job_driver.rs

                ReportAggregationState::Finished => saw_finished = true,
-                ReportAggregationState::Failed(_) => (), // ignore failed aggregations
+                _ => (), // ignore failed aggregations


Should we return an error (or maybe panic, though I think returning an error is wiser) if we see WaitingHelper in Leader code or vice-versa?

aggregator_core/src/datastore/tests.rs

db/00000000000001_initial_schema.up.sql

This change implements the DAP-05 ping-pong topology in which aggregators take turns preprocessing prepare shares into prepare messages. While this topology first appeared in DAP-05, this implementation follows DAP-06. This change depends on the implementation of the VDAF ping-pong topology added to crate `prio` in [1], which in turn conforms to the specification in VDAF-07. In the ping-pong topology, each DAP-layer step of aggregation now spans two VDAF rounds. An aggregator will use the prepare message it gets from its peer to advance by one VDAF round, and then can use the prepare share it just computed along with the peer's prepare share to advance by another. This incurs some changes to what intermediate values are stored by aggregators. In the case where a leader is continuing/waiting, it will have computed a prepare state, a prepare message for the current round and a prepare share for the next round. The naive implementation would store all three objects in the database, significantly increasing the per-report storage use. To mitigate this, the leader stores a `prio::topology::ping_pong::PingPongTransition`, which will contain a prepare state and a prepare message (both of which are generally much smaller than prepare shares), from which the next prepare state and importantly prepare share can be recomputed. On the helper side, there's no way around storing the prepare share: we store the most recently computed `PrepareResp` so that we can handle aggregation jobs idempotently. But to avoid storing prepare messages twice, the continuing/waiting helper stores just a prepare state and a `PingPongMessage`. The main benefit of this change is to reduce how many round trips between aggregators are needed to prepare reports. Quite a few tests used Prio3 but depended on having the leader or helper in the `Waiting` state after running aggregation initialization. Accordingly, those tests are changed to run Poplar1, which now takes 2 rounds. [1]: divviup/libprio-rs#683 Co-authored-by: Tim Geoghegan <[email protected]> Part of #1669

tgeoghegan mentioned this pull request Aug 26, 2023

Draft: DAP-05 ping-pong #1741

Closed

tgeoghegan added the allow-changed-migrations Override the ci-migrations check to allow migrations that have changed. label Aug 26, 2023

tgeoghegan force-pushed the timg/ping-pong branch from 41fd0b6 to c5b91f8 Compare August 27, 2023 14:51

tgeoghegan marked this pull request as ready for review August 27, 2023 14:53

tgeoghegan requested a review from a team as a code owner August 27, 2023 14:53

tgeoghegan changed the title ~~Draft: DAP-05 ping-pong~~ DAP-05 ping-pong Aug 27, 2023

tgeoghegan commented Aug 28, 2023

View reviewed changes

tgeoghegan mentioned this pull request Aug 29, 2023

Adopt DAP-05 error types #1853

Merged

branlwyd suggested changes Aug 29, 2023

View reviewed changes

tgeoghegan mentioned this pull request Aug 29, 2023

VDAF-06+ ping-pong topology divviup/libprio-rs#683

Merged

tgeoghegan force-pushed the timg/ping-pong branch from 90efb46 to 384ed5e Compare August 31, 2023 22:42

tgeoghegan requested a review from branlwyd September 1, 2023 19:04

tgeoghegan mentioned this pull request Sep 1, 2023

Implement draft-ietf-ppm-dap-07 #1669

Closed

11 tasks

branlwyd suggested changes Sep 11, 2023

View reviewed changes

tgeoghegan requested a review from branlwyd September 13, 2023 01:19

branlwyd approved these changes Sep 13, 2023

View reviewed changes

tgeoghegan force-pushed the timg/ping-pong branch from fff1d93 to 07d8ce8 Compare September 13, 2023 18:44

tgeoghegan force-pushed the timg/ping-pong branch from 9e74231 to a155d81 Compare September 13, 2023 21:25

tgeoghegan merged commit e1ffcf0 into main Sep 13, 2023
7 checks passed

tgeoghegan deleted the timg/ping-pong branch September 13, 2023 22:23

tgeoghegan mentioned this pull request Sep 25, 2023

[experimental] Implement one-helper, "ping-pong" aggregation. #1234

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DAP-05 ping-pong #1811

DAP-05 ping-pong #1811

tgeoghegan commented Aug 26, 2023 •

edited

Loading

tgeoghegan commented Aug 27, 2023

tgeoghegan Aug 28, 2023

branlwyd left a comment

branlwyd Aug 28, 2023

branlwyd Aug 29, 2023

tgeoghegan commented Aug 31, 2023

tgeoghegan commented Aug 31, 2023

tgeoghegan commented Sep 11, 2023

branlwyd left a comment

branlwyd left a comment

branlwyd Sep 13, 2023

DAP-05 ping-pong #1811

DAP-05 ping-pong #1811

Conversation

tgeoghegan commented Aug 26, 2023 • edited Loading

tgeoghegan commented Aug 27, 2023

tgeoghegan Aug 28, 2023

Choose a reason for hiding this comment

branlwyd left a comment

Choose a reason for hiding this comment

branlwyd Aug 28, 2023

Choose a reason for hiding this comment

branlwyd Aug 29, 2023

Choose a reason for hiding this comment

tgeoghegan commented Aug 31, 2023

tgeoghegan commented Aug 31, 2023

tgeoghegan commented Sep 11, 2023

branlwyd left a comment

Choose a reason for hiding this comment

branlwyd left a comment

Choose a reason for hiding this comment

branlwyd Sep 13, 2023

Choose a reason for hiding this comment

tgeoghegan commented Aug 26, 2023 •

edited

Loading