Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to ContactInfo propagation in PullRequests #2894

Closed
wants to merge 1 commit into from

Conversation

gregcusack
Copy link

@gregcusack gregcusack commented Sep 11, 2024

Problem

  1. unstaked nodes are having a difficult time catching up and acquiring shreds via repair: https://discord.com/channels/428295358100013066/1282690942616211526/1282691964558643221
  2. tvu_peers set of unstaked nodes is unreliable and keeps changing, causing nodes to appear offline or out of gossip

Cause

In v2.0.8, self.nodes in crds only stores ContactInfo, it does not store LegacyContactInfo , but v1.18.23 stores LegacyContactInfo in crds.nodes.
master:

agave/gossip/src/crds.rs

Lines 254 to 255 in 37671df

CrdsData::ContactInfo(node) => {
self.nodes.insert(entry_index);

v1.18.23:

agave/gossip/src/crds.rs

Lines 246 to 247 in 8c42fa8

CrdsData::LegacyContactInfo(node) => {
self.nodes.insert(entry_index);

self.nodes aka crds.nodes:
nodes: IndexSet<usize>, // Indices of nodes' ContactInfo.

In both versions, we send our LegacyContactInfo in every PullRequest sent. However, in v2.0.8, this LegacyContactInfo, no longer gets stored in crds.nodes even though it is sent. So when unstaked nodes are sending pull requests, their LegacyContactInfo is not getting propagated.

It seems that unstaked nodes are relying heavily on pull requests to propagated their ContactInfo. But with the update to v2.0.8, receiving nodes are no longer registering LegacyContactInfo in their set of tvu_peers because crds.nodes doesn't hold them anymore.

Summary of Changes

Since mb has upgraded to v1.18.23 and can handle ContactInfo in a PullRequest, we can switch over to propagating ContactInfo in PullRequest. We can also remove the [ignore] on the previously flaky test caused by this same issue.

Data

@steviez applied this patch to tw2NAkBpTwST2c4NDW9xFmWgqSL5ErPy5QYbfYa3KXi, which was previously experiencing repair issues as mentioned above and got the following result:
From Steve:
pr-contact-info-repair
^ Shred percentages through turbine/recovery/repair over the last hour:
Vertical white line is where process came back online with patch
Orange = turbine
Blue = recovery
Purple = repair

Almost entirely repair previously and now almost no repair / all turbine + recovery (recovery technically under turbine too)

This patch also causes tw2NAkBpTwST2c4NDW9xFmWgqSL5ErPy5QYbfYa3KXi to consistently and immediately appear in tvu_peers.

PR made with @steviez

@@ -5738,7 +5738,6 @@ fn test_randomly_mixed_block_verification_methods_between_bootstrap_and_not() {

/// Forks previous marked invalid should be marked as such in fork choice on restart
#[test]
#[ignore]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change specifically make this test less flaky / passable ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does but we don't know why exactly yet. See: #823 (comment). so i need to debug

@behzadnouri
Copy link

If the issue is that contact-info is not propagated through gossip, we need to first debug why is that.
I would much rather not rely on shoving contact-info into pull requests because that is not how gossip is supposed to work (i.e. messages need to be propagated through push messages and pull responses).

@gregcusack
Copy link
Author

closing. addressing this overarching problem through PR #3016.

@gregcusack gregcusack closed this Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants