Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add in metrics for detecting redundant pulls #113

Closed

Conversation

gregcusack
Copy link

Problem

We had previously added in a metric for tracking gossip push messages through the network in PR: #32725. However, this metric does not account for redundant pull requests.
Redundant Pull: A node receives a message via PullResponse and then receives the same message via Push.
Redundant Pulls prevent us from accurately calculating how well messages are propagating via Push.

Summary of Changes

Add in metric to report when a node receives a NEW message via PullResponse (gossip_crds_sample_pull).
Add in a metric to report when a node receives a message via Push but fails to insert (gossip_crds_sample_fail).

Identifying redundant Pulls:

  1. Get message signatures reported in gossip_crds_sample_pull
  2. Get message signatures reported in gossip_crds_sample_fail
  3. Take the intersection of signatures from (1) and (2)
  4. The intersection of these sets results in all messages that were first received via Pull and then received via Push (aka a Redundant Pull)

Simulation Results

In a 100 node simulation, I saw Redundant Pulls occur somewhat frequently. This indicates Redundant Pulls may be the reason for the discrepancy between the simulated Push coverage and measured Push coverage

Possible Issues

  1. Adding in more metrics to an already heavily used metrics server. Could possibly remove these metrics once we get data we need.

@gregcusack gregcusack requested a review from behzadnouri March 6, 2024 16:59
@gregcusack gregcusack force-pushed the redundant-pull-metrics branch from 298964f to 0f4dcb9 Compare March 6, 2024 18:15
@behzadnouri
Copy link

  • Get message signatures reported in gossip_crds_sample_pull

  • Get message signatures reported in gossip_crds_sample_fail

  • Take the intersection of signatures from (1) and (2)

  • The intersection of these sets results in all messages that were first received via Pull and then received via Push (aka a Redundant Pull)

So this requires a lot of offline processing.
I was thinking of something simpler and not restricted to sampled messages.

Basically, we can change num_push_dups here:
https://github.com/anza-xyz/agave/blob/adefcbbb4/gossip/src/crds.rs#L130-L131
to be just num_push. So

  • a newly inserted value from GossipRoute::PushMessage will initialize this value to 1.
  • a newly inserted value from GossipRoute::PullResponse will initialize this value to 0.
  • (have to think what to do about other GossipRoute::* cases).

(also have to accordingly update a lot of other places which use num_push_dups).

Then if you receive the same value again from GossipRoute::PushMessage:
https://github.com/anza-xyz/agave/blob/adefcbbb4/gossip/src/crds.rs#L304-L308
but num_push is zero then that indicates a redundant pull, and we can either add a metric to CrdsStats to record that or return a new error message to the caller of crds.insert to process that at the call-site.

I would suggest lets hold on to this pr for now, but first implement above simpler approach in a separate pr and lets see how that looks like.

@gregcusack
Copy link
Author

I was thinking of something simpler and not restricted to sampled messages.

My only concern here is that in a simple test, there appeared to be a lot of redundant pulls, so it is possible this would be very heavy on metrics server. But I can give it a shot and run some tests with your suggested changes and see what we see. If it's too much we can then sample.

@gregcusack
Copy link
Author

Closing in favor of PR: #139

@gregcusack gregcusack closed this Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants