
Reduce default from 8 to 1 TVU receive socket/threads #998

Open
wants to merge 1 commit into master

Conversation


@steviez steviez commented Apr 23, 2024

Problem

We currently create 8 sockets/threads to pull turbine shreds off the wire. Using 8 of these sockets is excessive and actually introduces inefficiencies (i.e., less densely packed PacketBatches), as some data below will show. We have a 1ms coalesce duration as part of the receiver, but given that we have 8 receivers all running independently and competing for the same packets, that 1ms coalesce doesn't buy us much. Part of work for #35
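
For reference, a minimal sketch of the coalescing-receiver pattern described above (this is not agave's streamer code; the coalesce duration, buffer size, and error handling are simplified assumptions). Each receiver drains its own socket for up to the coalesce window and hands off whatever it gathered as one batch, so when 8 of these run side by side on a ~3k shreds/s stream, each window only ever sees a handful of packets:

use std::net::UdpSocket;
use std::time::{Duration, Instant};

// Stand-in values; the real coalesce duration and packet sizing live in the
// streamer / shred-fetch code.
const COALESCE: Duration = Duration::from_millis(1);
const PACKET_SIZE: usize = 1232;

fn recv_coalesced(socket: &UdpSocket) -> std::io::Result<Vec<Vec<u8>>> {
    socket.set_read_timeout(Some(COALESCE))?;
    let mut batch = Vec::new();
    let start = Instant::now();
    let mut buf = [0u8; PACKET_SIZE];
    // Keep pulling packets until the coalesce window expires or the socket
    // runs dry; whatever accumulated becomes one batch.
    while start.elapsed() < COALESCE {
        match socket.recv(&mut buf) {
            Ok(n) => batch.push(buf[..n].to_vec()),
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock
                || e.kind() == std::io::ErrorKind::TimedOut => break,
            Err(e) => return Err(e),
        }
    }
    Ok(batch)
}

With a single receiver, the entire inbound stream funnels through one loop like this, so each 1ms window collects noticeably more packets per batch.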

Summary of Changes

Reduce the default value of sockets/threads from 8 to 1.

One potential concern with reducing the number of receivers would be if network load increased and over-burdened the single streamer. However, bench-streamer indicates that the streamer can handle MUCH higher loads than what we're seeing:

$ cargo run --release -- --num-recv-sockets 1
    Finished release [optimized] target(s) in 0.27s
     Running `/home/sol/src/agave/target/release/solana-bench-streamer --num-recv-sockets 1`
performance: 879154.6507920896 (PPS)

This is a micro-benchmark, so its results need to be taken with caution, but this is still well over two orders of magnitude (roughly 300x) greater than the current MNB load of 2500-3000 inbound shreds per second.

Testing / Data

I've been running two nodes in an A/B setup; one node with the change and one without. The nodes were restarted with this change around 18:00 UTC on 2024/04/22. We'll examine shred_sigverify stats, as this is where the receivers all send their packets and where some of the gains can be observed.

For starters, we see similar values for num_packets, ignoring the restart spike around 18:00. This is not surprising and serves as a basic sanity check:
[graph: num_packets over time for both nodes]

However, num_batches is significantly smaller for the node running this branch. To simplify, the graph below shows num_packets / num_batches to yield packets-per-batch:
[graph: num_packets / num_batches (packets per batch) over time for both nodes]

  • The tip-of-master node is seeing something like 1.16 packets per batch, i.e., roughly one 2-packet batch for every five 1-packet batches (see the quick check below)
  • The experiment node is showing ~3.5 packets per batch
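
As a rough check of that ratio: five 1-packet batches plus one 2-packet batch average out to (5·1 + 1·2) / (5 + 1) = 7/6 ≈ 1.17 packets per batch, which lines up with the observed ~1.16.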

More packets per batch means better locality, which shows up as a ~20% drop in total time spent via elapsed_micros:
[graph: shred_sigverify elapsed_micros over time for both nodes]

And less total time with the same number of packets also means a similar ~20% drop in the average time spent per packet.

@steviez steviez changed the title from "Reduce from 8 to 1 TVU receive socket/thread" to "Reduce default from 8 to 1 TVU receive socket/threads" on Apr 23, 2024
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.9%. Comparing base (57ab944) to head (cf6c65f).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #998   +/-   ##
=======================================
  Coverage    81.9%    81.9%           
=======================================
  Files         855      855           
  Lines      232154   232154           
=======================================
+ Hits       190170   190172    +2     
+ Misses      41984    41982    -2     

@behzadnouri

We probably should rename these (or add some comments) because it is pretty confusing which of the many tvu sockets this is referring to. But it looks like this is the shred-fetch-stage sockets (i.e. multiple sockets all bound to the contact_info.tvu() socket). Is that correct?

It seems like the motivation to use multiple sockets (for the same port) is explained here: solana-labs#1228 and in particular to address this issue: solana-labs#1224

So, steady-state performance aside, the issue is that under heavy load the socket queues saturate and packets are dropped.

This commit effectively reverts solana-labs#1228. Wouldn't it be a concern that socket queue saturation and issues like solana-labs#1224 arise again?

@steviez

steviez commented Apr 23, 2024

We probably should rename these (or add some comments) because it is pretty confusing which of the many tvu sockets this is referring to.

I'm open to a wider rename to make things more specific; I opted to match existing nomenclature:

let (tvu_port, tvu_sockets) =
    multi_bind_in_range(bind_ip_addr, port_range, num_tvu_sockets.get())
        .expect("tvu multi_bind");

But it looks like this is the shred-fetch-stage sockets (i.e. multiple sockets all bound to the contact_info.tvu() socket). Is that correct?

Yep, that's right

@steviez

steviez commented Apr 23, 2024

... This commit effectively reverts solana-labs#1228. Wouldn't it be a concern that socket queue saturation and issues like solana-labs#1224 arise again?

Yeah, totally agree that we need to provision things to be able to withstand worst-case / heavy load scenarios. A few points:

  • I have not observed an increase in rcvbuf errors on my test node
    • I know this is steady state; I was planning to test in a private cluster where we can artificially increase shreds per block as well
  • Not that we want any drops in network buffers, but the redundancy of the system has increased dramatically since that PR, i.e., the data/coding ratio from then is seen here
  • I'm a little hesitant to put too much weight on the graph from Occasional UDP packet drop on validator nodes solana-labs/solana#1224 as it relates to today's validator.
    • A Blob was nearly 65k bytes
    • Blob receive was not using recvmmsg, so the solana code was unable to pull packets out as fast as it can now
    • The graph shows ~1k packets per second; current MNB sees ~27k. But it is hard to compare these directly given the previously mentioned packet-size differences
  • Lastly, I'm not super familiar with the Linux socket interface, but if socket queue capacity was the issue, I would expect us to be able to tweak that parameter instead of creating 8 sockets (which I believe must all be read) to spread the load (see the sketch below)
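
To illustrate that last bullet, here is a rough sketch of bumping a single socket's receive queue via SO_RCVBUF, using the socket2 crate (the address and size are made-up values, and this is not agave's actual code). Note the kernel clamps the granted size to net.core.rmem_max, so the sysctl tuning discussed further down still matters:

use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, UdpSocket};

fn bind_with_big_rcvbuf(addr: SocketAddr, bytes: usize) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    // Ask the kernel for a larger SO_RCVBUF up front; the value actually
    // granted is capped by net.core.rmem_max.
    socket.set_recv_buffer_size(bytes)?;
    socket.bind(&addr.into())?;
    println!("granted rcvbuf: {} bytes", socket.recv_buffer_size()?);
    Ok(socket.into())
}

// Hypothetical usage: one socket with a 128 MiB receive queue.
// let sock = bind_with_big_rcvbuf("0.0.0.0:8001".parse().unwrap(), 128 * 1024 * 1024)?;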

@behzadnouri

Yeah, totally agree that we need to provision things to be able to withstand worst-case / heavy load scenarios. A few points:

So how can we ensure this change does not resurface the previous issues?

Lastly, I'm not super familiar with the Linux socket interface, but if socket queue capacity was the issue, I would expect us to be able to tweak that parameter instead of creating 8 sockets (which I believe must all be read) to spread the load

sys-tuner used to do that:
https://github.com/anza-xyz/agave/blob/9b860abfc/sys-tuner/src/main.rs#L103-L107
and the same instructions are part of the current validator docs:
https://docs.solanalabs.com/operations/setup-a-validator#linux

But I do not know whether the multi-socket trick was added before the above instructions were in use, or whether it was still needed in addition to them.
Obviously it would be much better if we can achieve the same effect without overloading the runtime or spawning extra threads.

@bw-solana

FYI - We have an invalidator setup where we can hit the cluster with ~1M TPS of basic transfers in aggregate. This manifests as ~20k TPS at the banking stage/replay level and 3k shreds per slot. In other words, 4-5x the typical mainnet traffic. Is that good enough to evaluate this change?


@steviez

steviez commented Apr 24, 2024

sys-tuner used to do that: https://github.com/anza-xyz/agave/blob/9b860abfc/sys-tuner/src/main.rs#L103-L107 and the same instructions are part of the current validator docs: https://docs.solanalabs.com/operations/setup-a-validator#linux

Good callout, some further reading has me convinced this is the right knob:

rmem_default
The default setting of the socket receive buffer in bytes.

And FWIW, tuning those settings manually is effectively "required":

const INTERESTING_LIMITS: &[(&str, InterestingLimit)] = &[
    ("net.core.rmem_max", InterestingLimit::Recommend(134217728)),
    (
        "net.core.rmem_default",
        InterestingLimit::Recommend(134217728),
    ),
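
(For reference, 134217728 bytes is 128 MiB.)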

Someone can opt out with --no-os-network-limits-test, but this is a scenario where we will not offer support:

agave/validator/src/main.rs

Lines 1749 to 1750 in 550f806

if !matches.is_present("no_os_network_limits_test") {
    if SystemMonitorService::check_os_network_limits() {

But I do not know whether the multi-socket trick was added before the above instructions were in use, or whether it was still needed in addition to them.

Let me see if I can do some digging to figure this out. However, I think my earlier comment is very important to note as well:

Blob receive was not using recvmmsg, so the solana code was unable to pull packets out as fast as it can now

Drops mean that packets were coming in faster than the validator could consume them. Even with a very large buffer, we theoretically would expect some drops on a long enough time scale. More threads were added to allow a higher overall consumption rate. But using recvmmsg() would have optimized the single-threaded case, and it was even listed as a possible solution on solana-labs#1224. At some point after multi-socket went in, this path was changed to use recvmmsg() anyway.
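
For context, recvmmsg(2) lets a single thread pull many datagrams out of a socket per syscall. A rough sketch of that pattern via the libc crate (the batch size and buffer size are assumptions, and this is not agave's actual receive code, which lives elsewhere in the repo):

use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

const BATCH: usize = 64;         // max datagrams pulled per syscall (assumption)
const PACKET_SIZE: usize = 1232; // roughly shred-sized buffers (assumption)

fn recv_batch(socket: &UdpSocket, bufs: &mut [[u8; PACKET_SIZE]; BATCH]) -> std::io::Result<usize> {
    let mut iovs: [libc::iovec; BATCH] = unsafe { std::mem::zeroed() };
    let mut msgs: [libc::mmsghdr; BATCH] = unsafe { std::mem::zeroed() };
    for i in 0..BATCH {
        iovs[i].iov_base = bufs[i].as_mut_ptr() as *mut libc::c_void;
        iovs[i].iov_len = PACKET_SIZE;
        msgs[i].msg_hdr.msg_iov = &mut iovs[i] as *mut libc::iovec;
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    // One syscall returns up to BATCH packets; MSG_WAITFORONE blocks only
    // until at least one datagram is available.
    let n = unsafe {
        libc::recvmmsg(
            socket.as_raw_fd(),
            msgs.as_mut_ptr(),
            BATCH as _,
            libc::MSG_WAITFORONE as _,
            std::ptr::null_mut(),
        )
    };
    if n < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(n as usize)
}

The point being that a single thread can now drain dozens of packets per wakeup, which was not true of the Blob-era receive path referenced above.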

So how can we ensure this change does not resurface the previous issues?

Supposing we run a test like the one Brennan mentioned where we have blocks that contain N more shreds than current MNB load, do you view that as a sufficient worst-case test for a large enough N?

@behzadnouri

Yeah, I think we just need to stress test this with some massive number of shreds at the TVU port.
If we can also figure out how far it can be pushed before we observe rcvbuf errors and packet drops, that would be great.

@steviez

steviez commented Apr 24, 2024

Yeah, I think we just need to stress test this with some massive number of shreds at the TVU port. If we can also figure out how far it can be pushed before we observe rcvbuf errors and packet drops, that would be great.

Cool, I think we're in agreement; I'll report back with some data once I get the chance to set up and run this experiment.

A single streamer is capable of handling the load, and doing so with a
single thread is more efficient.