forked from solana-labs/solana
v2.0: reworks max number of outgoing push messages (backport of #3016) #3038
Merged
+5
−13
Why was 4k chosen? Any idea how close we get to this limit during steady state? Or what we burst up to?
If we were to sustain at this rate, how much egress would that be? Something like 300Mbps?
Mostly based on testing on an unstaked node in East Asia and on mainnet/testnet metrics, plus some margin for spikes. This should leave enough margin during steady state, with the caveat below.
How often this function is invoked partly depends on how often the node receives push messages. So I don't have a mathematical mapping between this limit and the egress rate. But I think we have enough metrics to monitor this on testnet and get some estimation.
For now this seems like a working patch addressing the contact-info propagation issue for unstaked East Asia nodes.
The max possible limit is actually pretty high, which may be a concern. Some napkin math puts our max rate at about 4.4 GB/s:
^ Although in order to be sending that much traffic, the node would have to be receiving, at the lowest, 491 MB/s.
With a staked node on testnet, it is closer to:
With this same staked node on testnet:
Push Burst @ ~192 MB/s on validator startup.
Push Steady State @ ~10 MB/s
So our steady state push bandwidth is pretty low. But, the peak is pretty significant and will likely be higher for higher staked nodes.
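The inputs to the napkin math above did not survive in this thread, so the following is only a sketch of how such an upper bound can be formed: calls per second, times the maximum messages per call, times bytes per message. The 910 calls per second and the 4096 limit are taken from this comment; the 1232 bytes per message is an assumption (Solana's packet data size), and the function name is hypothetical.

```rust
/// Upper bound on push egress in bytes per second, assuming every call
/// emits the maximum number of messages and every message fills a packet.
fn max_push_egress_bytes_per_sec(
    calls_per_sec: u64,
    max_msgs_per_call: u64,
    bytes_per_msg: u64,
) -> u64 {
    calls_per_sec * max_msgs_per_call * bytes_per_msg
}

fn main() {
    // 910 calls/s and the 4096 limit come from the discussion.
    // 1232 bytes per message is an assumption, not a measured value.
    let bound = max_push_egress_bytes_per_sec(910, 4096, 1232);
    println!("upper bound: {:.2} GB/s", bound as f64 / 1e9);
}
```

Under these assumptions the bound lands in the same ballpark as the quoted figure, but it treats every call as emitting a full batch, which the measured steady-state numbers show is far from typical.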
4096 is chosen to ensure an unstaked node, not receiving any push messages, can keep up with the demand of incoming pull requests while also being able to send out its own ContactInfo. Initially an unstaked node only receives data via pull requests, so the data from the pull responses fills up its table quickly. The problem is that the node cannot push out all of the new CrdsValues quickly enough before it refreshes its ContactInfo in the table, at which point the node's ContactInfo gets pushed to the end of the table.
Out of the 910 calls to `new_push_requests` per second, only 10 of them are run within the `run_gossip` set of threads. The other ~900/s are called via the `handle_batch_push_messages` set of threads. But if a node is not receiving any push messages, `handle_batch_push_messages` exits early, so no calls to `new_push_requests` are made. As a result, the node only has 10 threads dedicated to sending push messages, and those 10 threads in their previous state (before this PR) cannot send enough push messages before the node's ContactInfo gets refreshed.
All that is to say that we need a high enough limit in the `new_push_requests` function to send enough data so that the node can send its ContactInfo via push, using just the `run_gossip` threads, before it gets refreshed.
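The starvation effect described above can be illustrated with a toy queue model. This is not the gossip implementation: the tick granularity, the `inflow` and `refresh_every` parameters, and all names below are hypothetical. Each tick, pull responses append new values, the node's own ContactInfo is periodically moved to the back of the pending queue, and at most `limit` entries are pushed from the front.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq)]
enum Entry {
    Other,
    OwnContactInfo,
}

/// Returns the first tick at which the node's own ContactInfo is pushed,
/// or None if it is never reached within `max_ticks`.
fn first_push_tick(limit: usize, inflow: usize, refresh_every: u64, max_ticks: u64) -> Option<u64> {
    let mut pending: VecDeque<Entry> = VecDeque::new();
    for tick in 0..max_ticks {
        // New values arriving via pull responses join the back of the queue.
        pending.extend(std::iter::repeat(Entry::Other).take(inflow));
        // A refresh re-inserts the node's ContactInfo at the back.
        if tick % refresh_every == 0 {
            pending.retain(|e| *e != Entry::OwnContactInfo);
            pending.push_back(Entry::OwnContactInfo);
        }
        // Push at most `limit` entries from the front of the queue.
        for _ in 0..limit {
            match pending.pop_front() {
                Some(Entry::OwnContactInfo) => return Some(tick),
                Some(Entry::Other) => {}
                None => break,
            }
        }
    }
    None
}

fn main() {
    // Limit below the inflow: the backlog grows and every refresh resets progress.
    println!("limit 10:  {:?}", first_push_tick(10, 100, 5, 100));
    // Limit above the inflow: the ContactInfo is reached before the next refresh.
    println!("limit 200: {:?}", first_push_tick(200, 100, 5, 100));
}
```

In this model, a limit below the per-tick inflow never reaches the node's own entry, because each refresh sends it behind a backlog that keeps growing. A limit above the inflow drains the queue before the next refresh.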
I think that max calculation does not take into account that the frequency at which the function is called limits how many push messages are generated in each call.
For example, for the function to be called ~1000 times per second, each call can take only about 1 ms, in which case it cannot generate many push messages in that short period of time.
iow more frequent calls => fewer push messages per call.
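A rough sketch of that trade-off, assuming calls run back to back on a single thread and each message has a fixed generation cost. The 1 µs per-message cost and the function name are hypothetical, chosen only to make the arithmetic visible.

```rust
/// Messages a single call can produce: capped both by the configured limit
/// and by the time budget left when calls run back to back on one thread.
fn msgs_per_call(calls_per_sec: u64, cost_per_msg_ns: u64, max_msgs_per_call: u64) -> u64 {
    let budget_ns = 1_000_000_000 / calls_per_sec;
    (budget_ns / cost_per_msg_ns).min(max_msgs_per_call)
}

fn main() {
    let cost_ns = 1_000; // hypothetical: 1 microsecond to generate one message
    for calls_per_sec in [10, 100, 1_000] {
        let per_call = msgs_per_call(calls_per_sec, cost_ns, 4096);
        println!(
            "{calls_per_sec} calls/s -> {per_call} msgs/call, {} msgs/s",
            per_call * calls_per_sec
        );
    }
}
```

With these made-up numbers, the 4096 cap is only reachable at the lower call rates. At ~1000 calls per second the time budget, not the cap, bounds each call, so multiplying the cap by the call frequency overstates the achievable rate.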
ahh you are right it does not. good point