Use quinn 0.11.x #1641

Merged
merged 5 commits into from
Sep 14, 2024

@lijunwangs lijunwangs commented Jun 7, 2024

Problem

Update to the Quinn 0.11.x release. Quinn 0.11.x brings an important improvement: the application layer can filter QUIC connections efficiently, before the crypto handshake. Running the integration against a connection load test tool shows that it sheds spam connection load much more effectively.

Summary of Changes

Most of the changes are needed because Quinn 0.11.x requires the newer rustls 0.23.12, which is more explicit about the use of "dangerous" interfaces.

I have annotated the PR with comments explaining why each change was made.

  1. An incoming connection can now be ignored when we rate limit. This is more efficient because it does not require queueing outgoing packets.
  2. The rustls interface changes to ServerCertVerifier are applied to the SkipServerVerification implementations.
  3. Cargo.toml changes handle patching of curve25519-dalek 3.2.1, which is needed because of the zeroize version constraint imposed by the newer rustls. The workaround is also applied to the downstream tests.
  4. Quinn 0.11.x introduces new error codes, which we need to handle.
  5. Stream finish is no longer an async function.

Tests:

bench-tps with the QUIC connection load tool.

Fixes #

@lijunwangs lijunwangs force-pushed the use_quinn_0.11.x branch 6 times, most recently from cd01252 to 657ec2e Compare June 13, 2024 00:01
@lijunwangs lijunwangs marked this pull request as draft June 19, 2024 06:39
@@ -305,7 +305,7 @@ reqwest-middleware = "0.2.5"
rolling-file = "0.2.0"
rpassword = "7.3"
rustc_version = "0.4"
-rustls = { version = "0.21.12", default-features = false, features = ["quic"] }
+rustls = { version = "0.23.9", default-features = false }
Author

Required by the new quinn lib version

@lijunwangs lijunwangs force-pushed the use_quinn_0.11.x branch 2 times, most recently from 9d2e8b0 to 6443eda Compare August 15, 2024 17:47
let mut config = rustls::ClientConfig::builder()
.with_safe_defaults()
.dangerous()
Author

@lijunwangs lijunwangs Aug 16, 2024

This is due to the rustls 0.23.12 change, which made the unsafe interfaces more explicit and moved them under "dangerous".

@@ -564,7 +589,7 @@ async fn send_request(
const READ_TIMEOUT_DURATION: Duration = Duration::from_secs(10);
let (mut send_stream, mut recv_stream) = connection.open_bi().await?;
send_stream.write_all(&bytes).await?;
-send_stream.finish().await?;
+send_stream.finish()?;
Author

Because finish is a synchronous call in Quinn 0.11.x.

Error::WriteError(WriteError::ZeroRttRejected) => {
add_metric!(stats.write_error_zero_rtt_rejected)
}
Error::ConnectError(ConnectError::CidsExhausted) => {
Author

This is due to a new error code introduced in Quinn 0.11.x.

@@ -3032,6 +3061,26 @@ version = "1.0.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "af150ab688ff2122fcef229be89cb50dd66af9e01a4ff320cc137eecc9bacc38"

[[package]]
Author

These cascading changes are all caused by the Quinn version update, which requires a rustls update.

@@ -1274,6 +1274,12 @@ dependencies = [
"libc",
]

[[package]]
Author

Cascading changes due to the Quinn version update, which triggers a rustls version update.

@@ -924,6 +924,12 @@ dependencies = [
"libc",
]

[[package]]
Author

cascading changes due to rustls version update.

@@ -4458,9 +4576,9 @@ dependencies = [

[[package]]
name = "slab"
-version = "0.4.2"
+version = "0.4.6"
Author

The new Quinn 0.11.x requires slab 0.4.6 or later to work

"sct",
]

[[package]]
name = "rustls"
version = "0.23.9"
Author

Why is the rustls version here 0.23.9 while it is 0.23.12 in the main Cargo.lock?

Arc::new(Self)
}
}

-impl rustls::client::ServerCertVerifier for SkipServerVerification {
+impl rustls::client::danger::ServerCertVerifier for SkipServerVerification {
Author

Due to the rustls interface change: the trait has moved under "danger".

@@ -69,6 +69,7 @@ anchor() {
patch_crates_io_solana Cargo.toml "$solana_dir"
patch_spl_crates . Cargo.toml "$spl_dir"

sed -i '/\[patch.crates-io\]/a curve25519-dalek = { git = "https://github.com/anza-xyz/curve25519-dalek.git", rev = "b500cdc2a920cd5bff9e2dd974d7b97349d61464" }' ./Cargo.toml
Author

This is required so that the downstream builds use Anza's curve25519-dalek.


let mut server_config = ServerConfig::with_crypto(Arc::new(server_tls_config));
server_config.concurrent_connections(max_concurrent_connections as u32);
Author

@lijunwangs lijunwangs Aug 16, 2024

The concurrent_connections setting has been removed from the Quinn ServerConfig; the limit can now be enforced by the application layer. This will be improved in a follow-up PR.
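
A minimal sketch of how an application-level connection cap might look, assuming a simple atomic counter with an RAII permit. `ConnectionLimiter` and `ConnectionPermit` are hypothetical names, not types from Quinn or from this PR, and the follow-up PR may well take a different approach.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical application-level cap on concurrently open connections.
#[derive(Clone)]
pub struct ConnectionLimiter {
    open: Arc<AtomicUsize>,
    max: usize,
}

/// Slot held for the lifetime of one connection; released on drop.
pub struct ConnectionPermit {
    open: Arc<AtomicUsize>,
}

impl ConnectionLimiter {
    pub fn new(max: usize) -> Self {
        Self { open: Arc::new(AtomicUsize::new(0)), max }
    }

    /// Try to reserve a slot; `None` means the cap has been reached.
    pub fn try_acquire(&self) -> Option<ConnectionPermit> {
        let max = self.max;
        self.open
            // Increment only while the current count is below the cap.
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < max { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| ConnectionPermit { open: Arc::clone(&self.open) })
    }

    /// Number of permits currently held.
    pub fn open_connections(&self) -> usize {
        self.open.load(Ordering::Acquire)
    }
}

impl Drop for ConnectionPermit {
    fn drop(&mut self) {
        // Give the slot back when the connection goes away.
        self.open.fetch_sub(1, Ordering::AcqRel);
    }
}
```

The accept loop would call `try_acquire()` for each incoming connection and keep the permit alongside the connection state, so the slot is returned automatically when the connection is dropped.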

@@ -91,12 +96,14 @@ pub fn new_dummy_x509_certificate(keypair: &Keypair) -> (rustls::Certificate, ru
]);

(
-rustls::Certificate(cert_der),
-rustls::PrivateKey(key_pkcs8_der),
+rustls::pki_types::CertificateDer::from(cert_der),
Author

@lijunwangs lijunwangs Aug 16, 2024

due to rustls version update

@@ -331,6 +331,7 @@ async fn run_server(
stats
.connection_rate_limited_across_all
.fetch_add(1, Ordering::Relaxed);
connection.ignore();
Author

Use connection.ignore(), which drops the connection more efficiently without notifying the client.
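
As a rough illustration of the decision that precedes the connection.ignore() call, here is a sketch of a per-IP fixed-window limiter built only on the standard library. The commit list mentions the governor rate limiter; this is a simpler stand-in rather than the code in this PR, and `ConnectionRateLimiter` and `Decision` are hypothetical names.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// What the accept loop should do with a new incoming connection.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Continue with the handshake.
    Accept,
    /// Drop silently; this is the branch where the PR calls `connection.ignore()`.
    Ignore,
}

/// Hypothetical fixed-window limiter keyed by source IP (not the governor crate).
pub struct ConnectionRateLimiter {
    limit: u32,
    window: Duration,
    // Per-IP state: start of the current window and connections seen in it.
    seen: HashMap<IpAddr, (Instant, u32)>,
}

impl ConnectionRateLimiter {
    pub fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, seen: HashMap::new() }
    }

    /// Decide from the source address alone, before any TLS work happens.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> Decision {
        let entry = self.seen.entry(ip).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            Decision::Accept
        } else {
            Decision::Ignore
        }
    }
}
```

Because the check needs only the remote address, it can run before the handshake starts, which is the point of ignoring the connection early. A production version would also need to evict stale entries from the map.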

@lijunwangs lijunwangs force-pushed the use_quinn_0.11.x branch 2 times, most recently from e5169f2 to 25c5322 Compare August 16, 2024 18:31
@ripatel-fd ripatel-fd left a comment

Just leaving some thoughts here. I know the PR isn't ready for review yet but maybe some of it is useful. :)

config
.transport_config(Arc::new(new_transport_config()))
.use_retry(true)

Out of curiosity, why disable retry? (Retries are cheap and don't require any X25519 or Ed25519 cryptography)

Author

use_retry has been removed from the config; retry is now implemented by the application layer.

@alessandrod alessandrod self-requested a review August 17, 2024 11:21
@lijunwangs
Author

Can this change be broken into smaller parts?

For example, multiple dependencies are updated: zeroize, rustls, curve25519-dalek, etc. Can any of these be updated separately?

Also, the new error types (and their respective metrics) in {repair,turbine}/quic_endpoint can be ignored for now and added in a follow-up PR. (Or I can add them myself if you prefer.)

The curve25519-dalek and zeroize changes are due to the rustls change. Unfortunately, the rustls version update has to be done together with the Quinn version update: the older Quinn (0.10.x) does not work with the newer rustls 0.23.12.

I have annotated the changes in the PR explaining why they were done.

@lijunwangs lijunwangs force-pushed the use_quinn_0.11.x branch 2 times, most recently from 7cb7c7f to 3669f10 Compare August 26, 2024 15:34
@lijunwangs lijunwangs force-pushed the use_quinn_0.11.x branch 2 times, most recently from e7c65b0 to 040b434 Compare September 4, 2024 19:40
use quinn 0.11.x

Fixed additional comp errors

Try change zeroize version

zeroize depdency issues

Fixed slab dependency issue

revert change to zk-sdk/Cargo.toml

revert change to sdk/program/Cargo.toml

format cargo.toml

Fixed unit tests due to rustls change and stream finish interface change

Updated changes

downstream test fix on curve25519-dalek

Try patching curve25519-dalek for anchor tests

Use anza-xyz for curve25519-dalek

Fixed achor down stream tests

use anza's curve25519-dalek

regenerate Cargo.lock files

Ignore excessive connections

use debug version of quinn

use debug version of quinn in Endpoint::accept

use b8d9a8762deb848a17ccc1651cdb09519d226fb3 -- reduce noise on logging of quinn

use 9a2a51efddc8000482f74b15be09d8b2cd58ee5e

use crude filtering local address to check if the rate limiting itself is impacting performance of polling socket

use governor rate limiter

removed naive rate limter implementations

fix test failures

use release quinn 0.11.x

removed local git dependency for quinn

update variable names to show incoming, connecting

rebuild with the latest quinn 0.11.x release

Fix slab version to 0.4.9

Changed SkipServerVerification and SkipClientVerification to use default CryptoProvider to verify signatures

use retry

disable retry logic to investigate a panic in dropping Endpoint in local cluster test

reenable retry to investigate the panic

Updated quinn to fix the stateless retry connection corruption

Fixed cargo.lock in programs/sbf

downstream dalekcurve25519-dalek comp error

fmt issues

fmt issues
@@ -2032,7 +2048,7 @@ version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c762bae6dcaf24c4c84667b8579785430908723d5c889f469d76a41d59cc7a9d"
dependencies = [
-"curve25519-dalek 3.2.1",
+"curve25519-dalek 3.2.0",
Author

@lijunwangs lijunwangs Sep 6, 2024

Because of the zeroize dependency: curve25519-dalek 3.2.1 requires zeroize < 1.4, which 3.2.0 does not require. The downgrade was done by cargo.

@lijunwangs lijunwangs marked this pull request as ready for review September 6, 2024 01:02
@lijunwangs
Author

Test results show that with 0.11.x the UDP kernel queue is drained more quickly and the bench-tps results are better than those of master, which is based on Quinn 0.10.x.

Old vs Quinn 0.11.x

New

[2024-09-10T16:53:47.476072852Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T16:53:47.476075960Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 31307.24 | 954052
[2024-09-10T16:53:47.476081810Z INFO solana_bench_tps::bench]
Average max TPS: 31307.24, 0 nodes had 0 TPS
[2024-09-10T16:53:47.476084832Z INFO solana_bench_tps::bench]
Highest TPS: 31307.24 sampling period 1s max transactions: 954052 clients: 1 drop rate: 0.81
[2024-09-10T16:53:47.476088300Z INFO solana_bench_tps::bench] Average TPS: 7792.3936

[2024-09-10T17:16:34.543049632Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 74028.56 | 1073732
[2024-09-10T17:16:34.543060400Z INFO solana_bench_tps::bench]
Average max TPS: 74028.56, 0 nodes had 0 TPS
[2024-09-10T17:16:34.543066989Z INFO solana_bench_tps::bench]
Highest TPS: 74028.56 sampling period 1s max transactions: 1073732 clients: 1 drop rate: 0.79
[2024-09-10T17:16:34.543066568Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=30651782650i
[2024-09-10T17:16:34.543072964Z INFO solana_bench_tps::bench] Average TPS: 8919.59

[2024-09-10T17:19:59.476841034Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-10T17:19:59.476844474Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T17:19:59.476849890Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 52411.65 | 929564
[2024-09-10T17:19:59.476856095Z INFO solana_bench_tps::bench]
Average max TPS: 52411.65, 0 nodes had 0 TPS
[2024-09-10T17:19:59.476869147Z INFO solana_bench_tps::bench]
Highest TPS: 52411.65 sampling period 1s max transactions: 929564 clients: 1 drop rate: 0.82
[2024-09-10T17:19:59.476876500Z INFO solana_bench_tps::bench] Average TPS: 7721.242

Under QUIC connection load tool:

UDP port queue length

UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 25344 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 138240 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 135936 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 184320 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 0 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 99072 0 0.0.0.0:8009 0.0.0.0:*

Bench-tps results:

[2024-09-10T17:25:17.685245874Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-10T17:25:17.685248770Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T17:25:17.685257275Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 127881.71 | 1136500
[2024-09-10T17:25:17.685264284Z INFO solana_bench_tps::bench]
Average max TPS: 127881.71, 0 nodes had 0 TPS
[2024-09-10T17:25:17.685267451Z INFO solana_bench_tps::bench]
Highest TPS: 127881.71 sampling period 1s max transactions: 1136500 clients: 1 drop rate: 0.62
[2024-09-10T17:25:17.685271525Z INFO solana_bench_tps::bench] Average TPS: 9443.683
[2024-09-10T17:25:17.685271310Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=32765540150i

[2024-09-10T17:31:46.298186698Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 76908.80 | 607075
[2024-09-10T17:31:46.298188877Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=33021750150i
[2024-09-10T17:31:46.298191607Z INFO solana_bench_tps::bench]
Average max TPS: 76908.80, 0 nodes had 0 TPS
[2024-09-10T17:31:46.298201498Z INFO solana_bench_tps::bench]
Highest TPS: 76908.80 sampling period 1s max transactions: 607075 clients: 1 drop rate: 0.81
[2024-09-10T17:31:46.298204916Z INFO solana_bench_tps::bench] Average TPS: 5008.901

[2024-09-10T17:38:07.613480246Z INFO solana_bench_tps::bench] Token balance: 33858885150
[2024-09-10T17:38:07.613534312Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-10T17:38:07.613538902Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T17:38:07.613542702Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 51919.68 | 1019270
[2024-09-10T17:38:07.613555178Z INFO solana_bench_tps::bench]
Average max TPS: 51919.68, 0 nodes had 0 TPS
[2024-09-10T17:38:07.613560734Z INFO solana_bench_tps::bench]
Highest TPS: 51919.68 sampling period 1s max transactions: 1019270 clients: 1 drop rate: 0.75
[2024-09-10T17:38:07.613565588Z INFO solana_bench_tps::bench] Average TPS: 8292.879

Before: (98c8853)

Without QUIC connection load test:

[2024-09-10T18:05:20.898492879Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-10T18:05:20.898495407Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T18:05:20.898497692Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 60455.27 | 1024703
[2024-09-10T18:05:20.898503032Z INFO solana_bench_tps::bench]
Average max TPS: 60455.27, 0 nodes had 0 TPS
[2024-09-10T18:05:20.898505976Z INFO solana_bench_tps::bench]
Highest TPS: 60455.27 sampling period 1s max transactions: 1024703 clients: 1 drop rate: 0.82
[2024-09-10T18:05:20.898509919Z INFO solana_bench_tps::bench] Average TPS: 8443.125
[2024-09-10T18:05:20.898538079Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=29720450150i

[2024-09-10T19:04:58.595399026Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-10T19:04:58.595401617Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-10T19:04:58.595404020Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 41560.51 | 1038074
[2024-09-10T19:04:58.595410131Z INFO solana_bench_tps::bench]
Average max TPS: 41560.51, 0 nodes had 0 TPS
[2024-09-10T19:04:58.595414624Z INFO solana_bench_tps::bench]
Highest TPS: 41560.51 sampling period 1s max transactions: 1038074 clients: 1 drop rate: 0.81
[2024-09-10T19:04:58.595418704Z INFO solana_bench_tps::bench] Average TPS: 8603.87

[2024-09-10T19:24:53.173200122Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 66418.59 | 1113740
[2024-09-10T19:24:53.173204832Z INFO solana_bench_tps::bench]
Average max TPS: 66418.59, 0 nodes had 0 TPS
[2024-09-10T19:24:53.173209268Z INFO solana_bench_tps::bench]
Highest TPS: 66418.59 sampling period 1s max transactions: 1113740 clients: 1 drop rate: 0.80
[2024-09-10T19:24:53.173214678Z INFO solana_bench_tps::bench] Average TPS: 9236.512

The queue under attack: see persistent high Recv-Q length

State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 196352 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 164096 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 186880 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 206336 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 187392 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 175104 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 156160 0 0.0.0.0:8009 0.0.0.0:*
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
UNCONN 153088 0 0.0.0.0:8009 0.0.0.0:*

Bench-tps results under attack:

[2024-09-10T19:38:19.596617604Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 40387.21 | 331225
[2024-09-10T19:38:19.596634595Z INFO solana_bench_tps::bench]
Average max TPS: 40387.21, 0 nodes had 0 TPS
[2024-09-10T19:38:19.596638304Z INFO solana_bench_tps::bench]
Highest TPS: 40387.21 sampling period 1s max transactions: 331225 clients: 1 drop rate: 0.34
[2024-09-10T19:38:19.596645234Z INFO solana_bench_tps::bench] Average TPS: 2733.686

[2024-09-10T19:42:07.547202338Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 55878.07 | 487310
[2024-09-10T19:42:07.547208059Z INFO solana_bench_tps::bench]
Average max TPS: 55878.07, 0 nodes had 0 TPS
[2024-09-10T19:42:07.547211239Z INFO solana_bench_tps::bench]
Highest TPS: 55878.07 sampling period 1s max transactions: 487310 clients: 1 drop rate: 0.55
[2024-09-10T19:42:07.547214937Z INFO solana_bench_tps::bench] Average TPS: 3036.5737

[2024-09-10T19:45:23.950704693Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 53402.19 | 413426
[2024-09-10T19:45:23.950712458Z INFO solana_bench_tps::bench]
Average max TPS: 53402.19, 0 nodes had 0 TPS
[2024-09-10T19:45:23.950715941Z INFO solana_bench_tps::bench]
Highest TPS: 53402.19 sampling period 1s max transactions: 413426 clients: 1 drop rate: 0.19
[2024-09-10T19:45:23.950720528Z INFO solana_bench_tps::bench] Average TPS: 3347.0845

[2024-09-10T20:09:48.869452792Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 50056.67 | 489929
[2024-09-10T20:09:48.869461452Z INFO solana_bench_tps::bench]
Average max TPS: 50056.67, 0 nodes had 0 TPS
[2024-09-10T20:09:48.869464725Z INFO solana_bench_tps::bench]
Highest TPS: 50056.67 sampling period 1s max transactions: 489929 clients: 1 drop rate: 0.37
[2024-09-10T20:09:48.869468196Z INFO solana_bench_tps::bench] Average TPS: 2877.4485
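
To put the under-load runs side by side, the sketch below averages the "Average TPS" figures reported in the logs above: three runs on this branch and four runs on 98c8853, all taken while the QUIC connection load tool was running. The sample is small and the runs vary a lot, so treat the result as indicative only. The constant and function names are illustrative.

```rust
/// Arithmetic mean of a slice of samples; `None` for an empty slice.
pub fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// "Average TPS" values copied from the bench-tps logs in this comment,
/// measured on this branch (Quinn 0.11.x) under the connection load tool.
pub const NEW_UNDER_LOAD: [f64; 3] = [9443.683, 5008.901, 8292.879];

/// "Average TPS" values copied from the logs for 98c8853 (Quinn 0.10.x),
/// measured under the same connection load tool.
pub const OLD_UNDER_LOAD: [f64; 4] = [2733.686, 3036.5737, 3347.0845, 2877.4485];

/// Ratio of the two means; `None` if either input is empty.
pub fn speedup(new: &[f64], old: &[f64]) -> Option<f64> {
    Some(mean(new)? / mean(old)?)
}
```

With these inputs the means come out at about 7,582 TPS for this branch and about 2,999 TPS for 98c8853, roughly a 2.5x difference in average TPS under load.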

@behzadnouri behzadnouri left a comment

The change LGTM, but please let @ripatel-fd review it as well.

}
Err(error) => {
debug!(
"Error while accepting incoming connection: {error:?} from {}",

Author

will fix

Comment on lines -1499 to -1501
s2.finish()
.await
.expect_err("shouldn't be able to open 2 connections");

why remove this?

Author

In Quinn 0.11.x, finish is synchronous and no longer returns an error under this condition.

@@ -1712,7 +1720,6 @@ pub mod test {
// Test that more writes to the stream will fail (i.e. the stream is no longer writable
// after the timeouts)
assert!(s1.write_all(&[0u8]).await.is_err());
assert!(s1.finish().await.is_err());

why remove this?

Author

finish is now a sync function; in 0.11.x it no longer returns an error in this case.

@behzadnouri

The change LGTM, but please let @ripatel-fd review it as well.

Also it would be great if you can please run a staked node on testnet and confirm that both the client and the server are compatible.

@lijunwangs
Author

The change LGTM, but please let @ripatel-fd review it as well.

Also it would be great if you can please run a staked node on testnet and confirm that both the client and the server are compatible.

I have tested both mixed server/client combinations: a server running this branch with a client built from master (without the Quinn and rustls update), and a server built from master with a client built from this branch. Both work fine running bench-tps.

@ripatel-fd ripatel-fd left a comment

I read through this PR, here are my thoughts.

There is one high priority issue.

  • The list of allowed signature verification algorithms for rustls is unrestricted. These should be limited to Ed25519 before merging. I will send a PR to this PR with the fix. EDIT: To clarify, this is the same before and after the PR (not a regression).

Other than that, just some general observations:

  • The new cert verification policy otherwise seems fine. SkipServerVerification and SkipClientVerification accept any certificate but still verify the signature regardless of what the pubkey is. (So the pubkey can be trusted later on)
  • In the future, it would be nice to replace the cert verify algorithm with ed25519_dalek. But that's probably out of scope for this version bump PR
  • QUIC retries were disabled. Is this intentional?
  • verify_tls12_signature should always return an error. All TPU peers today support TLS 1.3. Any TLS 1.2 logic is dead code.
  • Using AtomicU64 for high-frequency metrics could cause excessive cache traffic when these counters get updated from multiple CPUs. (But probably out of scope for this PR)
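
On the last point, one common way to reduce that cache traffic is to shard the counter, so that each worker updates its own cache line and the shards are summed only when the metric is reported. A minimal sketch, assuming 64-byte cache lines; `ShardedCounter` is a hypothetical type and not something this PR adds.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One counter per cache line, so writers on different shards do not
/// invalidate each other's lines. 64 bytes is an assumption about the CPU.
#[repr(align(64))]
#[derive(Default)]
struct PaddedCounter(AtomicU64);

/// Hypothetical sharded metric counter.
pub struct ShardedCounter {
    shards: Vec<PaddedCounter>,
}

impl ShardedCounter {
    pub fn new(num_shards: usize) -> Self {
        assert!(num_shards > 0, "need at least one shard");
        Self {
            shards: (0..num_shards).map(|_| PaddedCounter::default()).collect(),
        }
    }

    /// Add to the shard picked by the caller, e.g. a worker or thread index.
    pub fn add(&self, shard_hint: usize, value: u64) {
        let shard = &self.shards[shard_hint % self.shards.len()];
        shard.0.fetch_add(value, Ordering::Relaxed);
    }

    /// Sum all shards; intended for the infrequent reporting path.
    pub fn total(&self) -> u64 {
        self.shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}
```

The trade-off is a slightly more expensive read in exchange for cheaper writes, which suits counters that are updated per packet but reported only once per interval.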

@ripatel-fd ripatel-fd left a comment

LGTM, @alessandrod could you also take a look?

@ripatel-fd

ripatel-fd commented Sep 13, 2024

Confirmed that fd_quic is compatible with this change (both client and server).

The throughput in my simple "Agave client => fd_quic server" spam test over localhost decreased from ~80-120k TPS (Agave 2.0.4) to 70k-85k TPS (this branch). There is definitely a performance regression somewhere. I haven't bisected whether the regression is caused by this PR, something else, or some weird interaction that's fd_quic's fault.

This is the test code:

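    // Imports assumed by this snippet. ConnectionCache comes from the Agave
    // client crates and listen_port is defined elsewhere in the test.
    use rand::Rng;
    use std::net::{IpAddr, Ipv4Addr, SocketAddr};

    // Zero-filled payload buffer; each packet is a random-length prefix of it.
    const BUF: [u8; 1232] = [0u8; 1232];

    let conn_cache = ConnectionCache::new_quic("test", 16);
    let conn = conn_cache.get_connection(&SocketAddr::new(
        IpAddr::V4(Ipv4Addr::new(127, 0, 0, 1)),
        listen_port,
    ));

    let mut batch = Vec::<Vec<u8>>::with_capacity(1024);

    let mut rng = rand::thread_rng();
    // Send randomly sized batches of randomly sized packets forever.
    loop {
        let cnt: usize = rng.gen_range(1..batch.capacity());
        batch.clear();
        for _ in 0..cnt {
            batch.push(BUF[0..rng.gen_range(1..BUF.len())].to_vec());
        }
        // Log send errors and keep going.
        if let Err(err) = conn.send_data_batch(&batch) {
            eprintln!("{:?}", err);
        }
    }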

@lijunwangs
Author

lijunwangs commented Sep 13, 2024

Some results of bench TPS:

New Server: Quinn 0.11.x based server (this branch)
New Client: bench-tps tpu client of Quinn 0.11.x based client (this branch)

Old Server: Quinn 0.10.x based server (2f8f910)
Old client: bench-tps tpu client of Quinn 0.10.x (2f8f910)

Old Server/ New Client

[2024-09-13T04:09:54.104627665Z INFO solana_bench_tps::bench] Token balance: 28871270150
[2024-09-13T04:09:54.104644655Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions
[2024-09-13T04:09:54.104655925Z INFO solana_bench_tps::bench] ---------------------+---------------+--------------------
[2024-09-13T04:09:54.104657563Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 10696.78 | 77621
[2024-09-13T04:09:54.104663689Z INFO solana_bench_tps::bench]
Average max TPS: 10696.78, 0 nodes had 0 TPS
[2024-09-13T04:09:54.104667099Z INFO solana_bench_tps::bench]
Highest TPS: 10696.78 sampling period 1s max transactions: 77621 clients: 1 drop rate: 0.00
[2024-09-13T04:09:54.104668446Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=28871270150i
[2024-09-13T04:09:54.104673823Z INFO solana_bench_tps::bench] Average TPS: 2501.3164

[2024-09-13T04:11:32.688330444Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 10110.44 | 80240
[2024-09-13T04:11:32.688334872Z INFO solana_bench_tps::bench]
Average max TPS: 10110.44, 0 nodes had 0 TPS
[2024-09-13T04:11:32.688337961Z INFO solana_bench_tps::bench]
Highest TPS: 10110.44 sampling period 1s max transactions: 80240 clients: 1 drop rate: 0.00
[2024-09-13T04:11:32.688336691Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=28928285150i
[2024-09-13T04:11:32.688342241Z INFO solana_bench_tps::bench] Average TPS: 2585.821

[2024-09-13T04:12:50.993465813Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 11216.83 | 81288
[2024-09-13T04:12:50.993469640Z INFO solana_bench_tps::bench]
Average max TPS: 11216.83, 0 nodes had 0 TPS
[2024-09-13T04:12:50.993472280Z INFO solana_bench_tps::bench]
Highest TPS: 11216.83 sampling period 1s max transactions: 81288 clients: 1 drop rate: 0.00
[2024-09-13T04:12:50.993475638Z INFO solana_bench_tps::bench] Average TPS: 2619.439

[2024-09-13T04:17:05.360114465Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 7769.16 | 78252
[2024-09-13T04:17:05.360119606Z INFO solana_bench_tps::bench]
Average max TPS: 7769.16, 0 nodes had 0 TPS
[2024-09-13T04:17:05.360123174Z INFO solana_bench_tps::bench]
Highest TPS: 7769.16 sampling period 1s max transactions: 78252 clients: 1 drop rate: 0.00
[2024-09-13T04:17:05.360127601Z INFO solana_bench_tps::bench] Average TPS: 2521.779

Old Server/Old Client

[2024-09-13T04:14:04.560341893Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 14687.67 | 101264
[2024-09-13T04:14:04.560347170Z INFO solana_bench_tps::bench]
Average max TPS: 14687.67, 0 nodes had 0 TPS
[2024-09-13T04:14:04.560354646Z INFO solana_bench_tps::bench]
Highest TPS: 14687.67 sampling period 1s max transactions: 101264 clients: 1 drop rate: 0.00
[2024-09-13T04:14:04.560358051Z INFO solana_bench_tps::bench] Average TPS: 3263.6912

[2024-09-13T04:15:11.696875557Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 8072.19 | 100284
[2024-09-13T04:15:11.696879351Z INFO solana_bench_tps::bench]
Average max TPS: 8072.19, 0 nodes had 0 TPS
[2024-09-13T04:15:11.696885482Z INFO solana_bench_tps::bench]
Highest TPS: 8072.19 sampling period 1s max transactions: 100284 clients: 1 drop rate: 0.00
[2024-09-13T04:15:11.696886084Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=29134105150i
[2024-09-13T04:15:11.696891413Z INFO solana_bench_tps::bench] Average TPS: 3232.214

[2024-09-13T04:16:10.595767307Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=29168640150i
[2024-09-13T04:16:10.595769732Z INFO solana_bench_tps::bench]
Average max TPS: 15582.78, 0 nodes had 0 TPS
[2024-09-13T04:16:10.595773779Z INFO solana_bench_tps::bench]
Highest TPS: 15582.78 sampling period 1s max transactions: 100177 clients: 1 drop rate: 0.01
[2024-09-13T04:16:10.595776615Z INFO solana_bench_tps::bench] Average TPS: 3322.5144

[2024-09-13T04:18:34.354558202Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 13677.37 | 104204
[2024-09-13T04:18:34.354564583Z INFO solana_bench_tps::bench]
Average max TPS: 13677.37, 0 nodes had 0 TPS
[2024-09-13T04:18:34.354567203Z INFO solana_bench_tps::bench]
Highest TPS: 13677.37 sampling period 1s max transactions: 104204 clients: 1 drop rate: 0.00
[2024-09-13T04:18:34.354574067Z INFO solana_bench_tps::bench] Average TPS: 3358.2153

New Server/New Client

[2024-09-13T16:59:48.232051611Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 23692.68 | 139309
[2024-09-13T16:59:48.232055345Z INFO solana_bench_tps::bench]
Average max TPS: 23692.68, 0 nodes had 0 TPS
[2024-09-13T16:59:48.232067582Z INFO solana_bench_tps::bench]
Highest TPS: 23692.68 sampling period 1s max transactions: 139309 clients: 1 drop rate: 0.92
[2024-09-13T16:59:48.232070474Z INFO solana_bench_tps::bench] Average TPS: 4601.6934

[2024-09-13T17:01:12.933942408Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 76041.14 | 269543
[2024-09-13T17:01:12.933946825Z INFO solana_bench_tps::bench]
Average max TPS: 76041.14, 0 nodes had 0 TPS
[2024-09-13T17:01:12.933950228Z INFO solana_bench_tps::bench]
Highest TPS: 76041.14 sampling period 1s max transactions: 269543 clients: 1 drop rate: 0.84
[2024-09-13T17:01:12.933952913Z INFO solana_bench_tps::bench] Average TPS: 8062.4863

[2024-09-13T17:45:48.114628190Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 40676.90 | 165451
[2024-09-13T17:45:48.114632591Z INFO solana_bench_tps::bench]
Average max TPS: 40676.90, 0 nodes had 0 TPS
[2024-09-13T17:45:48.114644285Z INFO solana_bench_tps::bench]
Highest TPS: 40676.90 sampling period 1s max transactions: 165451 clients: 1 drop rate: 0.92
[2024-09-13T17:45:48.114647601Z INFO solana_bench_tps::bench] Average TPS: 5347.7236

[2024-09-13T17:53:34.316832096Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 60015.69 | 248985
[2024-09-13T17:53:34.316841646Z INFO solana_bench_tps::bench]
Average max TPS: 60015.69, 0 nodes had 0 TPS
[2024-09-13T17:53:34.316845338Z INFO solana_bench_tps::bench]
Highest TPS: 60015.69 sampling period 1s max transactions: 248985 clients: 1 drop rate: 0.86
[2024-09-13T17:53:34.316848256Z INFO solana_bench_tps::bench] Average TPS: 8185.4077

[2024-09-13T18:54:22.329161126Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 33567.08 | 166174
[2024-09-13T18:54:22.329164691Z INFO solana_bench_tps::bench]
Average max TPS: 33567.08, 0 nodes had 0 TPS
[2024-09-13T18:54:22.329176666Z INFO solana_bench_tps::bench]
Highest TPS: 33567.08 sampling period 1s max transactions: 166174 clients: 1 drop rate: 0.91
[2024-09-13T18:54:22.329188357Z INFO solana_bench_tps::bench] Average TPS: 5201.1616

New Server/Old Client

[2024-09-13T18:59:06.846784990Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 38924.14 | 150891
[2024-09-13T18:59:06.846793930Z INFO solana_bench_tps::bench]
Average max TPS: 38924.14, 0 nodes had 0 TPS
[2024-09-13T18:59:06.846797431Z INFO solana_bench_tps::bench]
Highest TPS: 38924.14 sampling period 1s max transactions: 150891 clients: 1 drop rate: 0.92
[2024-09-13T18:59:06.846800160Z INFO solana_bench_tps::bench] Average TPS: 4745.193

[2024-09-13T19:07:35.783540809Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 33126.60 | 226337
[2024-09-13T19:07:35.783544342Z INFO solana_bench_tps::bench]
Average max TPS: 33126.60, 0 nodes had 0 TPS
[2024-09-13T19:07:35.783554771Z INFO solana_bench_tps::bench]
Highest TPS: 33126.60 sampling period 1s max transactions: 226337 clients: 1 drop rate: 0.88
[2024-09-13T19:07:35.783557669Z INFO solana_bench_tps::bench] Average TPS: 7407.575

[2024-09-13T19:21:07.516963576Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 46994.12 | 271263
[2024-09-13T19:21:07.516974112Z INFO solana_bench_tps::bench]
Average max TPS: 46994.12, 0 nodes had 0 TPS
[2024-09-13T19:21:07.516979948Z INFO solana_bench_tps::bench]
Highest TPS: 46994.12 sampling period 1s max transactions: 271263 clients: 1 drop rate: 0.85
[2024-09-13T19:21:07.516982819Z INFO solana_bench_tps::bench] Average TPS: 8975.588

[2024-09-13T19:29:15.004030839Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 59456.17 | 330006
[2024-09-13T19:29:15.004036201Z INFO solana_bench_tps::bench]
Average max TPS: 59456.17, 0 nodes had 0 TPS
[2024-09-13T19:29:15.004040233Z INFO solana_bench_tps::bench]
Highest TPS: 59456.17 sampling period 1s max transactions: 330006 clients: 1 drop rate: 0.82
[2024-09-13T19:29:15.004044419Z INFO solana_bench_tps::bench] Average TPS: 10835.156

[2024-09-13T19:30:06.065923915Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 39646.18 | 255931
[2024-09-13T19:30:06.065942360Z INFO solana_bench_tps::bench]
Average max TPS: 39646.18, 0 nodes had 0 TPS
[2024-09-13T19:30:06.065946192Z INFO solana_bench_tps::bench]
Highest TPS: 39646.18 sampling period 1s max transactions: 255931 clients: 1 drop rate: 0.84
[2024-09-13T19:30:06.065950086Z INFO solana_bench_tps::bench] Average TPS: 8511.6875

I am actually seeing better performance with the server running this branch. Interestingly, in the bench-tps results against the old server, the old client seems to do better than the new client.
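To compare configurations on more than an eyeball reading, the per-run "Average TPS" figures in the logs above can be averaged per configuration. The following is only a sketch: the helper names `parse_average_tps` and `mean_average_tps` are hypothetical and not part of bench-tps, and it assumes each log line of interest contains the literal text `Average TPS: ` followed by a number, as in the pasted output.

```rust
/// Extract the value from a bench-tps "Average TPS: <value>" log line, if present.
/// Lines such as "Average max TPS: ..." do not contain the marker and are skipped.
fn parse_average_tps(line: &str) -> Option<f64> {
    let (_, rest) = line.split_once("Average TPS: ")?;
    rest.trim().parse().ok()
}

/// Mean of all "Average TPS" values found in `log`, or None if there are none.
fn mean_average_tps(log: &str) -> Option<f64> {
    let values: Vec<f64> = log.lines().filter_map(parse_average_tps).collect();
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f64>() / values.len() as f64)
}

fn main() {
    // Two lines copied from the "New Server/Old Client" runs above.
    let log = "\
[2024-09-13T18:59:06.846800160Z INFO solana_bench_tps::bench] Average TPS: 4745.193
[2024-09-13T19:07:35.783557669Z INFO solana_bench_tps::bench] Average TPS: 7407.575";
    if let Some(mean) = mean_average_tps(log) {
        println!("mean of per-run Average TPS: {mean:.1}");
    }
}
```

Feeding each configuration's full log through `mean_average_tps` gives one number per configuration. With only a handful of runs each and a wide spread between runs, small differences should be read with caution.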

@lijunwangs
Author

I had a staked node running this branch against testnet; it shows connections working in both directions:
[2024-09-13T21:06:37.963021363Z INFO solana_metrics::metrics] datapoint: quic_streamer_tpu active_connections=2347i active_streams=0i new_connections=1i new_streams=0i evictions=0i connection_added_from_staked_peer=1i connection_added_from_unstaked_peer=0i connection_add_failed=0i connection_add_failed_invalid_stream_count=0i connection_add_failed_staked_node=0i connection_add_failed_unstaked_node=0i connection_add_failed_on_pruning=0i connection_removed=1i connection_remove_failed=0i connection_setup_timeout=0i connection_setup_error=0i connection_setup_error_timed_out=0i connection_setup_error_closed=0i connection_setup_error_transport=0i connection_setup_error_app_closed=0i connection_setup_error_reset=0i connection_setup_error_locally_closed=0i connection_rate_limited_across_all=0i connection_rate_limited_per_ipaddr=0i invalid_chunk=0i invalid_chunk_size=0i packets_allocated=0i packet_batches_allocated=0i packets_sent_for_batching=0i staked_packets_sent_for_batching=0i unstaked_packets_sent_for_batching=0i bytes_sent_for_batching=0i chunks_sent_for_batching=0i packets_sent_to_consumer=0i bytes_sent_to_consumer=0i chunks_processed_by_batcher=0i chunks_received=0i staked_chunks_received=0i unstaked_chunks_received=0i packet_batch_send_error=0i handle_chunk_to_packet_batcher_send_error=0i packet_batches_sent=0i packet_batch_empty=0i stream_read_errors=0i stream_read_timeouts=0i throttled_streams=0i stream_load_ema=0i stream_load_ema_overflow=0i stream_load_capacity_overflow=0i throttled_unstaked_streams=0i throttled_staked_streams=0i process_sampled_packets_us_90pct=0i process_sampled_packets_us_min=0i process_sampled_packets_us_max=0i process_sampled_packets_us_mean=0i process_sampled_packets_count=0i perf_track_overhead_us=0i connection_rate_limiter_length=3338i outstanding_incoming_connection_attempts=0i total_incoming_connection_attempts=43313i quic_endpoints_count=1i
[2024-09-13T21:06:39.381207132Z INFO solana_quic_client::nonblocking::quic_client] Made connection to 149.50.110.202:8009 id 140138543716368 try_count 0, from connection cache warming?: true
[2024-09-13T21:06:43.342887537Z INFO solana_metrics::metrics] datapoint: quic_streamer_tpu active_connections=2346i active_streams=0i new_connections=1i new_streams=0i evictions=0i connection_added_from_staked_peer=1i connection_added_from_unstaked_peer=0i connection_add_failed=0i connection_add_failed_invalid_stream_count=0i connection_add_failed_staked_node=0i connection_add_failed_unstaked_node=0i connection_add_failed_on_pruning=0i connection_removed=2i connection_remove_failed=0i connection_setup_timeout=0i connection_setup_error=0i connection_setup_error_timed_out=0i connection_setup_error_closed=0i connection_setup_error_transport=0i connection_setup_error_app_closed=0i connection_setup_error_reset=0i connection_setup_error_locally_closed=0i connection_rate_limited_across_all=0i connection_rate_limited_per_ipaddr=0i invalid_chunk=0i invalid_chunk_size=0i packets_allocated=0i packet_batches_allocated=0i packets_sent_for_batching=0i staked_packets_sent_for_batching=0i unstaked_packets_sent_for_batching=0i bytes_sent_for_batching=0i chunks_sent_for_batching=0i packets_sent_to_consumer=0i bytes_sent_to_consumer=0i chunks_processed_by_batcher=0i chunks_received=0i staked_chunks_received=0i unstaked_chunks_received=0i packet_batch_send_error=0i handle_chunk_to_packet_batcher_send_error=0i packet_batches_sent=0i packet_batch_empty=0i stream_read_errors=0i stream_read_timeouts=0i throttled_streams=0i stream_load_ema=0i stream_load_ema_overflow=0i stream_load_capacity_overflow=0i throttled_unstaked_streams=0i throttled_staked_streams=0i process_sampled_packets_us_90pct=0i process_sampled_packets_us_min=0i process_sampled_packets_us_max=0i process_sampled_packets_us_mean=0i process_sampled_packets_count=0i perf_track_overhead_us=0i connection_rate_limiter_length=3338i outstanding_incoming_connection_attempts=0i total_incoming_connection_attempts=43314i quic_endpoints_count=1i

@@ -331,6 +331,7 @@ async fn run_server(
stats
.connection_rate_limited_across_all
.fetch_add(1, Ordering::Relaxed);
incoming.ignore();

Are we sure that this should be ignore and not refuse? Won't ignore trigger
an automatic retry from the client making things worse?

Author

Yes, good point. I deliberately used ignore for two reasons: 1. It is less costly for the server: we do not have to schedule outgoing reply packets when it is already under load. In the past we have seen excessive outgoing packets under initial-packet attacks. 2. The timeout will hopefully make the client take longer to retry: it needs to wait for the timeout before another attack. I think a non-cooperative client could retry regardless of whether we refuse or silently drop it. Maybe we could improve the logic in a follow-on PR to refuse initially and, when the client keeps violating the limit, just ignore.

I think this is an issue because it makes well-behaved clients retry, since to them this is effectively packet loss. The default packet loss timeout in quinn is, iirc, 1.3 * RTT. But I'm ok with fixing this in a follow-up!
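The follow-up idea floated in this thread (refuse first, then fall back to silently ignoring a peer that keeps violating the limit) could look roughly like the sketch below. It is not code from this PR: `RateLimitAction`, `EscalatingPolicy`, and `refuse_budget` are hypothetical names, and the sketch only decides which action to take. In the streamer, the result would presumably map onto quinn 0.11's `Incoming::refuse()` and `Incoming::ignore()`.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// What to do with a rate-limited connection attempt (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RateLimitAction {
    /// Send an explicit rejection so a well-behaved client stops waiting.
    Refuse,
    /// Drop silently, scheduling no outgoing packets.
    Ignore,
}

/// Tracks rate-limit violations per source address (hypothetical helper).
struct EscalatingPolicy {
    /// Violations answered with a refusal before switching to silent drops.
    refuse_budget: u32,
    violations: HashMap<IpAddr, u32>,
}

impl EscalatingPolicy {
    fn new(refuse_budget: u32) -> Self {
        Self { refuse_budget, violations: HashMap::new() }
    }

    /// Record one violation from `addr` and choose the response.
    fn on_rate_limited(&mut self, addr: IpAddr) -> RateLimitAction {
        let count = self.violations.entry(addr).or_insert(0);
        *count = count.saturating_add(1);
        if *count <= self.refuse_budget {
            RateLimitAction::Refuse
        } else {
            RateLimitAction::Ignore
        }
    }
}

fn main() {
    let mut policy = EscalatingPolicy::new(2);
    let addr: IpAddr = "192.0.2.1".parse().unwrap();
    for attempt in 1..=4 {
        println!("attempt {attempt}: {:?}", policy.on_rate_limited(addr));
    }
}
```

A real version would need to expire or reset the per-address counters (for example alongside the existing rate limiter's window), otherwise the map grows without bound and a peer is never forgiven. Source addresses on initial packets can also be spoofed, so the refusal budget would have to stay small.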

@@ -349,26 +350,35 @@ async fn run_server(
stats
.connection_rate_limited_per_ipaddr
.fetch_add(1, Ordering::Relaxed);
incoming.ignore();

Same here

@lijunwangs lijunwangs merged commit fc0183d into anza-xyz:master Sep 14, 2024
52 checks passed
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
1. An incoming connection can now be ignored when we do rate limiting, which is more efficient because it does not require queueing outgoing packets.
2. rustls interface changes of ServerCertVerifier applied to SkipServerVerification implementations.
3. Changes in Cargo.toml to handle curve25519-dalek 3.2.1 patching because of the zeroize version constraint due to the newer rustls. The workaround is applied to downstream tests.
4. Quinn 0.11.x introduced new error codes which we need to handle.
5. Stream finish is no longer an async function.
4 participants