Group limit benchmarks #793

Merged
insipx merged 45 commits into main from insipx/benchmark-group-limit on Jun 12, 2024

Conversation

insipx (Contributor) commented May 29, 2024

local gRPC benchmarks: benchmark report

Benchmarks against dev network: report

Message Size Limits

Issue: #812

This isn't an exact threshold, but we eventually hit the gRPC message size limit.
For this PR, I added chunking to the send_welcomes fn, which solves this, but we're still sending a lot of data over the wire and encrypting a large welcome message, and very large groups will still hit the limit (it would require the serialized ratchet tree itself to exceed the gRPC limit).
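
Roughly, the chunking splits the outgoing welcomes into batches that each stay under the size limit. A minimal sketch of the idea (the WelcomePayload type, its field, and the byte budget are illustrative assumptions, not the actual send_welcomes implementation):

```rust
// Hypothetical stand-in for a per-installation welcome payload; the real
// xmtp_mls types differ. The idea: split outgoing welcomes into batches
// that each stay safely under the gRPC message size limit.
struct WelcomePayload {
    data: Vec<u8>,
}

// Keep a margin below the 50 MiB server limit seen in the errors below.
const MAX_CHUNK_BYTES: usize = 40 * 1024 * 1024;

fn chunk_welcomes(welcomes: Vec<WelcomePayload>) -> Vec<Vec<WelcomePayload>> {
    let mut chunks: Vec<Vec<WelcomePayload>> = Vec::new();
    let mut current: Vec<WelcomePayload> = Vec::new();
    let mut current_bytes = 0usize;

    for welcome in welcomes {
        let size = welcome.data.len();
        // Start a new chunk if adding this welcome would exceed the budget.
        if !current.is_empty() && current_bytes + size > MAX_CHUNK_BYTES {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current_bytes += size;
        current.push(welcome);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Each chunk then goes out as its own request; a single welcome larger than the limit would still fail, which is the ratchet-tree caveat above.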

Benchmarking add_to_empty_group/500: Warming up for 3.0000 sAdding 500 members
2024-05-31T02:23:04.084639Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:04.084662Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
2024-05-31T02:23:05.355537Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:05.355565Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
2024-05-31T02:23:06.580907Z ERROR xmtp_mls::groups::sync: post commit error Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))
2024-05-31T02:23:06.580934Z ERROR xmtp_mls::groups::sync: error syncing group Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
thread 'main' panicked at xmtp_mls/benches/group_limit_empty.rs:59:65:
called `Result::unwrap()` on an `Err` value: Sync([Api(xmtp::error::Error(MlsError, Status { code: ResourceExhausted, message: "grpc: received message larger than max (136352000 vs. 52428800)", metadata: MetadataMap { headers: {"content-type": "application/grpc"} }, source: None }))])
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
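
For scale, the limit in that error is exactly 50 MiB and the rejected payload for the 500-member run is roughly 130 MiB, about 2.6x over. A quick check of the numbers:

```rust
fn main() {
    // Values taken from the ResourceExhausted error above.
    let limit_bytes = 52_428_800u64;      // server's max receive size
    let attempted_bytes = 136_352_000u64; // welcome payload in the 500-member run

    let mib = |b: u64| b as f64 / (1024.0 * 1024.0);
    println!("limit     = {:.1} MiB", mib(limit_bytes));     // 50.0 MiB
    println!("attempted = {:.1} MiB", mib(attempted_bytes)); // ~130.0 MiB
    println!("ratio     = {:.1}x", attempted_bytes as f64 / limit_bytes as f64); // ~2.6x
}
```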

Flamegraphs (all flamegraphs are against local gRPC)

CPU Flamegraph for member addition
flamegraph

interactive flamegraph link

Add to empty group

tracing-flamegraph
interactive link

Add to 100 member group

tracing-flamegraph
interactive

Remove all members

tracing-flamegraph

interactive

Add 1 member to group

tracing-flamegraph
interactive

Bugs

There's currently a bug for adding to a populated group. Any benches that add to an already-populated group are not included and annotated with #[allow(dead_code)] until fixed.

Verdict

It's difficult to recommend increasing the group size much above 400 if it takes > 3-4 seconds to add all those members -- although this also depends on how long users would actually be willing to wait to add a large number of members.

Adding members seems to be the slowest operation, so performance work should focus on making that faster. Removing members (even in large batches) performs well compared to member addition.

The flamegraph indicates we spend a lot of time in encrypt_welcome, so reducing the total welcome size could make member addition noticeably more performant.

After generating flamegraphs for each benchmark individually, it's clear that there are a few main areas that would benefit from optimization work:

  • making get_identity_updates_v2 faster would speed up every benchmark
  • making the welcome message constant-size would solve the gRPC limit issue and speed things up, though not as much as optimizing get_identity_updates_v2
  • doing both would likely yield a significant performance improvement
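
For reference, these benchmarks follow criterion's parameterized benchmark-group shape. Below is a minimal, self-contained sketch of that shape using stand-in types; FakeGroup and the generated identity strings are illustrative placeholders, not the actual API used in xmtp_mls/benches/.

```rust
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

// Stand-in for the real (network-bound) group handle; NOT the xmtp_mls API.
struct FakeGroup;

impl FakeGroup {
    // Stand-in for real member addition, which hits identity lookups,
    // commit creation, welcome encryption, and the network.
    fn add_members(&self, members: &[String]) {
        for m in members {
            std::hint::black_box(m.len());
        }
    }
}

fn bench_add_to_empty_group(c: &mut Criterion) {
    let mut bench = c.benchmark_group("add_to_empty_group");
    bench.sample_size(10); // adding hundreds of members is slow; keep samples small

    for size in [100usize, 200, 300, 400, 500] {
        bench.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &size| {
            b.iter_batched(
                // setup (not measured): pre-generated identities + a fresh group
                || {
                    let identities: Vec<String> =
                        (0..size).map(|i| format!("0xmember{i}")).collect();
                    (FakeGroup, identities)
                },
                // measured routine: the member addition itself
                |(group, identities)| group.add_members(&identities),
                BatchSize::SmallInput,
            );
        });
    }
    bench.finish();
}

criterion_group!(benches, bench_add_to_empty_group);
criterion_main!(benches);
```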

fixes #810

insipx added 5 commits June 3, 2024 21:00
- set `MAX_GROUP_SIZE` to 300
- use all identity samples for benchmarks
- reference github issue in TODO comment
insipx marked this pull request as ready for review on June 4, 2024 14:43
insipx requested a review from a team as a code owner on June 4, 2024 14:43
neekolas (Contributor) commented Jun 4, 2024

This is great work @insipx. This has come a long way since my dumb benchmarks. The flamegraphs are super helpful.

While the data would be noisier, I'd be interested in seeing what these numbers look like when we are communicating with the dev network, where network latency plays a bigger factor. Those probably offer a better view of what real-world performance looks like.

insipx (Contributor, Author) commented Jun 4, 2024

Definitely can make a report using the dev network!

saulmc (Member) commented Jun 4, 2024

It's difficult to recommend increasing the group size much above 400 if it takes > 3-4 seconds to add all those members -- although this also depends on how long users would actually be willing to wait to add a large number of members.

This seems quite manageable. As a group creator, if I already have a list of 5,000 addresses that I want to create a group from, there's no way I'm entering them by hand. It's no sweat waiting a minute while that group is created in the background by whatever bulk upload tool I'm using. I just expect to be notified when it's done. On the other hand, waiting an hour would be unacceptable.

There's currently a bug for adding to a populated group. Any benches that add to an already-populated group are not included and annotated with #[allow(dead_code)] until fixed.

I'm very interested to see how this benchmarks once the bug is fixed, as this is the scenario that concerns me more. Adding 1 address to a group of 400 is a by-hand operation that will be expected to complete instantly.

insipx (Contributor, Author) commented Jun 4, 2024

I'm very interested to see how this benchmarks once the bug is fixed, as this is the scenario that concerns me more. Adding 1 address to a group of 400 is a by-hand operation that will be expected to complete instantly.

That makes sense! That bug is an easier fix too, so it's worth getting those benchmarks to make sure adding to an already-large group is fast enough. According to Cryspen's early benchmarks (which don't include the network overhead we have here, since they only exercise the OpenMLS library), adding to a 1000-member group takes 734 ms ± 12 ms, so I would expect it to take at least that long, but most likely more.

insipx marked this pull request as draft on June 4, 2024 21:27
insipx (Contributor, Author) commented Jun 5, 2024

@saulmc

benchmark
adding 1 member to a large group, up to 3200 members (past 3200, it takes way too long to add that many members to a group to make the benchmark worth it for now)

Adding one member to a group is much faster than creating a large group from scratch but could still use improvement. After ~4000 members, it will start to take more than 1 second to add a new member.

benchmarks for encrypt_welcome.

The average size of the welcome message at ~1600 members is 626,120 bytes, which according to these benchmarks takes ~2 ms to encrypt. The network overhead in send_welcomes is more significant in this benchmark, at around ~50 ms from some improvised measurement, and there could be improvements elsewhere in the codepath too. Making the welcome size constant kills two birds with one stone and should bring this down some, but that's still not the whole story for making single-member addition faster.
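
For reference, the kind of improvised measurement meant here is just wrapping the call with std::time::Instant. A minimal sketch (fake_send_welcomes stands in for the real network call; the payload size mirrors the ~626 KB welcome observed at ~1,600 members):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Stand-in for the real network call; pretends the round trip takes ~50 ms.
fn fake_send_welcomes(welcomes: &[Vec<u8>]) {
    let _ = welcomes;
    thread::sleep(Duration::from_millis(50));
}

fn main() {
    // One welcome of ~626 KB, roughly the observed size at ~1,600 members.
    let welcomes = vec![vec![0u8; 626_120]];
    let total_bytes: usize = welcomes.iter().map(|w| w.len()).sum();

    let start = Instant::now();
    fake_send_welcomes(&welcomes);
    let elapsed = start.elapsed();

    println!("sent {total_bytes} bytes in {:.1} ms", elapsed.as_secs_f64() * 1e3);
}
```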

franziskuskiefer (Contributor) commented

past 3200, it takes way too long to add that many members to a group to make the benchmark worth it for now

For the openmls benchmarks I generate groups separately and write them out to a file. Then you only need to read them in the benchmarks. Even that takes some time, but it allows testing larger groups. Maybe that's something you could do here as well?

insipx (Contributor, Author) commented Jun 7, 2024

For the openmls benchmarks I generate groups separately and write them out to a file. Then you only need to read them in the benchmarks. Even that takes some time, but it allows testing larger groups. Maybe that's something you could do here as well?

Yeah, I already have logic to pre-generate the identities and write them to a file, separately for local and our dev networks. I think most of this time is spent between the local libxmtp client and our backend service.
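
A minimal sketch of that pre-generate-and-cache approach (the Identity struct, its fields, and the JSON file format are assumptions for illustration; the real helper in xmtp_mls stores its own data):

```rust
// Requires serde (with the "derive" feature) and serde_json as dev-dependencies.
use std::fs;
use std::path::Path;

use serde::{Deserialize, Serialize};

// Hypothetical identity record; the real pre-generation logic stores its own data.
#[derive(Serialize, Deserialize, Clone)]
struct Identity {
    address: String,
    installation_key: Vec<u8>,
}

fn load_or_generate(path: &Path, count: usize) -> Vec<Identity> {
    // Reuse the cached file if it already holds enough identities.
    if let Ok(bytes) = fs::read(path) {
        if let Ok(cached) = serde_json::from_slice::<Vec<Identity>>(&bytes) {
            if cached.len() >= count {
                return cached[..count].to_vec();
            }
        }
    }

    // Otherwise generate (expensive in the real code, since it registers
    // clients against the network) and cache the result for the next run.
    let identities: Vec<Identity> = (0..count)
        .map(|i| Identity {
            address: format!("0x{i:040x}"),
            installation_key: vec![0u8; 32],
        })
        .collect();
    fs::write(path, serde_json::to_vec(&identities).unwrap()).unwrap();
    identities
}

fn main() {
    let ids = load_or_generate(Path::new("bench_identities.json"), 1_000);
    println!("loaded {} identities", ids.len());
}
```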

Will update this with updated flamegraphs and dev network benchmarks today

insipx marked this pull request as ready for review on June 7, 2024 20:05
insipx requested a review from humanagent as a code owner on June 7, 2024 20:05
insipx (Contributor, Author) commented Jun 7, 2024

@neekolas

dev benches

Updated the PR/verdict with benchmarks against the dev network plus flamegraphs for the individual benchmarks. I was only able to get the group size up to 450; past 450 I ran into broken-pipe errors and the benchmark would not complete. I assume this is because it takes a long time and the gRPC server just closes the connection after a while.

So far, all the MLS stuff is really fast and barely shows up in the benchmarks. Most of the bottlenecks are things around gRPC and fetching identities/sending welcomes etc.

insipx merged commit 00287f6 into main on Jun 12, 2024 (7 checks passed)
insipx deleted the insipx/benchmark-group-limit branch on June 12, 2024 19:07
Successfully merging this pull request may close these issues.

Bug: SQLite stack overflows when adding members to group that already has members