
Fix broken parallelism in quic-client #2526

Merged 2 commits into anza-xyz:master on Oct 2, 2024

Conversation


@ripatel-fd commented Aug 9, 2024

Fixes excessive fragmentation by TPU clients leading to a large
number of streams per conn in 'sending' state simultaneously.
This, in turn, requires excessive in-memory buffering server-side
to reassemble fragmented transactions.

  • Simplifies QuicClient::send_batch to enqueue send operations
    in sequential order
  • Removes the "max_parallel_streams" config option

The quic-client now produces an ordered fragment stream when
scheduling send operations from a single thread.
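
For illustration, a minimal sketch of the sequential pattern (not the actual QuicClient::send_batch code; send_batch_sequential, connection, and transactions are placeholder names, and the quinn calls mirror the diff in this PR):

use quinn::Connection;

// Sketch only: send each transaction on its own unidirectional stream,
// strictly one after another, so the peer sees an ordered, unfragmented
// sequence and at most one stream per connection is in the 'sending' state.
async fn send_batch_sequential(
    connection: &Connection,
    transactions: &[Vec<u8>],
) -> Result<(), Box<dyn std::error::Error>> {
    for tx in transactions {
        let mut send_stream = connection.open_uni().await?;
        send_stream.write_all(tx).await?;
        // Flush this stream before opening the next one.
        send_stream.finish().await?;
    }
    Ok(())
}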

@ripatel-fd requested a review from alessandrod on August 9, 2024 12:39
@ripatel-fd force-pushed the quic-client-improvements branch 2 times, most recently from f6dd865 to 12e9caf on August 9, 2024 15:08
@ripatel-fd (Author) commented Aug 10, 2024

Let's hold off on merging this until we've gathered some benchmarks. I suggest

  1. bench-tps against a local test validator (to measure max throughput under optimal conditions)
  2. bench-tps against a remote validator transatlantic (to see if this change is sensitive to latency)
  3. bench-tps against a Firedancer v0.1 validator. The Firedancer team can help test this against a cluster instead of just individual validators for more realistic results
  4. Blind spam against fd_quic (I already wrote this test)

For each test, I'll attach bench-tps reported counters and fragmentation rates (depends on #2539).

@lijunwangs @alessandrod Can you think of any other tests or datapoints we'd need?

send_stream.write_all(data).await?;
send_stream.finish().await?;
// Finish sending this stream before sending any new stream
_ = send_stream.set_priority(1000);
@ripatel-fd (Author) Aug 10, 2024

We should probably report the bug that requires this priority hack to the quinn maintainers.
There shouldn't be any reordering in a simple loop like this:

loop {
    let mut send_stream = connection.open_uni().await?;
    send_stream.write_all(some_buffer()).await?;
}


@alessandrod left a comment

This looks good, just wondering why the set_priority thing is needed

@alessandrod commented on the set_priority(1000) line:

why is this needed?

@ripatel-fd (Author)

why is [set_priority] needed?

@alessandrod It's a workaround for a quinn bug.

If you do this, you get reordering that is much worse than with the parallel send method used before this patch.

loop {
    let mut send_stream = connection.open_uni().await?;
    send_stream.write_all(data).await?;
}

I suspect the reason is that the WriteAll future can complete while some of the stream's data has not yet been serialized into QUIC frames, so an internal state object for that stream sticks around. As the loop iterations run, those leftover stream objects accumulate in some unordered data structure. I added the set_priority statement as an experiment and it fixes the issue.
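
For reference, a sketch of the full send loop with the workaround applied; it simply combines the loop above with the write_all/finish/set_priority calls from the diff, and some_buffer() is the same placeholder as before:

loop {
    let mut send_stream = connection.open_uni().await?;
    send_stream.write_all(some_buffer()).await?;
    // Finish sending this stream before sending any new stream.
    send_stream.finish().await?;
    // Workaround: raise the priority so quinn flushes any remaining
    // frames for this stream ahead of streams opened later.
    _ = send_stream.set_priority(1000);
}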

Here is a writeup on the issue: Firedancer-2024-08-10 solana-quic-client fragmentation-110824-162447.pdf

@alessandrod force-pushed the quic-client-improvements branch from 12e9caf to 204657f on October 1, 2024 12:44
@alessandrod left a comment

I have rebased this and removed set_priority since I'm not seeing the issue @ripatel-fd was seeing on fd. The PR is obviously correct. I've been doing all my testing in the last month with similar code (no parallel streams).

@alessandrod left a comment

Tests need work, I'll fixup

@alessandrod left a comment

ok good to go now

@lijunwangs

Can you show some of the results of testing with/without the changes?

@KirillLykov commented Oct 1, 2024

I've been using similar code for quite some time (see #2905) and it respects flow control properly (while the multistream approach does not).

  1. bench-tps against a local test validator (to measure max throughput under optimal conditions)

In this case the connection will be limited to 512 concurrent streams and a 512*1232-byte receive window, so this setup will not show max throughput. What can be done instead is to use a mock server which respects the protocol but doesn't limit the connection.

  2. bench-tps against a remote validator transatlantic (to see if this change is sensitive to latency).

Well, what is sensitive to latency is our server side, because we fix the receive window to 512*1232 bytes regardless of RTT.
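
For illustration, a mock server can lift these limits through quinn's TransportConfig, roughly as in the sketch below (permissive_server_config is a hypothetical helper, not code from this repo; the values mirror the mock-server arguments shown further down):

use std::sync::Arc;
use quinn::{ServerConfig, TransportConfig, VarInt};

// Sketch only: raise the per-connection limits so the server side is no
// longer the bottleneck, leaving the client as the limiting factor.
fn permissive_server_config(mut server_config: ServerConfig) -> ServerConfig {
    let mut transport = TransportConfig::default();
    transport
        .max_concurrent_uni_streams(VarInt::from(512_000u32))
        .receive_window(VarInt::from(630_784_000u32))
        .stream_receive_window(VarInt::from(1_232_000u32));
    server_config.transport_config(Arc::new(transport));
    server_config
}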

@KirillLykov commented Oct 2, 2024

Setup

Server is in Korea, client in Amsterdam. RTT: 251ms

Run 1: new vs old with staked connection (100% stake)

The connection is limited by our QoS: 512 streams, a 512*1232-byte receive window.
As expected, there is no difference between the two runs: in both cases TPS is ~16k.

On the plot below the new code is on the left, the old on the right:

[Screenshot: TPS plot, new code on the left, old code on the right, 2024-10-02]

Run 2: new vs old with mock-server

In parallel with agave-validator on the same machine, I launched a mock server listening on port 1234.
On the client side, I modified the TPU client to send txs to port 1234 instead of the port obtained from gossip, so that bench-tps uses the normal RPC but our mock server as the TPU endpoint.

In this case, TPS with the old client is 149520, while with the new one it is 165520 (~10% improvement).

How I was running it

Client arguments:

-u http://<IP>:8899 --identity /home/klykov/wrks_ext/bench-tps-dos-test/test-for-pr2526/testnet-dos-funder.json --read-client-keys accounts.yml --duration 320 --tx_count 5000 --keypair-multiplier 2 --thread-batch-sleep-ms 10 --client-node-id id.json --block-data-file block.csv --use-tpu-client --threads 4 --tpu-connection-pool-size 4 --sustained

Agave server arguments (configured with multinode scripts):

./target/release/agave-validator --enable-rpc-transaction-history --enable-extended-tx-metadata-storage --require-tower --ledger /home/klykov/sol/wrks_ext/solana/net/../config/bootstrap-validator --rpc-port 8899 --snapshot-interval-slots 200 --no-incremental-snapshots --identity /home/klykov/sol/wrks_ext/solana/net/../config/bootstrap-validator/identity.json --vote-account /home/klykov/sol/wrks_ext/solana/net/../config/bootstrap-validator/vote-account.json --rpc-faucet-address 127.0.0.1:9900 --no-poh-speed-test --no-os-network-limits-test --no-wait-for-vote-to-start-leader --full-rpc-api --allow-private-addr --gossip-port 8001 --gossip-host <PUBLIC_IPv4>  --log - 2> err.txt

Mock server arguments:

./target/release/server --listen 0.0.0.0:1234 --receive-window-size 630784000  --max-concurrent-streams 512000 --stream-receive-window-size 1232000

@lijunwangs merged commit 76cbf1a into anza-xyz:master on Oct 2, 2024
40 checks passed
@ripatel-fd deleted the quic-client-improvements branch on November 19, 2024 15:57
@bw-solana added the v2.0 (Backport to v2.0 branch) label on Nov 19, 2024
mergify bot commented Nov 19, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify bot pushed a commit that referenced this pull request Nov 19, 2024
* Fix broken parallelism in quic-client
* quic-client: remove outdated test

Co-authored-by: Richard Patel <[email protected]>
Co-authored-by: Alessandro Decina <[email protected]>
(cherry picked from commit 76cbf1a)
alessandrod pushed a commit that referenced this pull request Nov 21, 2024
Fix broken parallelism in quic-client (#2526)

* Fix broken parallelism in quic-client
* quic-client: remove outdated test

Co-authored-by: Richard Patel <[email protected]>
Co-authored-by: Alessandro Decina <[email protected]>
(cherry picked from commit 76cbf1a)

Co-authored-by: ripatel-fd <[email protected]>
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
* Fix broken parallelism in quic-client
* quic-client: remove outdated test

Co-authored-by: Richard Patel <[email protected]>
Co-authored-by: Alessandro Decina <[email protected]>