Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cluster info on geyser #46

Conversation

godmodegalactus
Copy link

Problem

See issue solana-labs#32934
Summary of Changes

This PR is a POC to add cluster node's info to the Geyser plugin notification.

The LegacyContactInfo received on GOSSIP network are forwarded with the Geyser interface to a client subscriber.

Summary of Changes

the entry point in crds.rs to notify LegacyContactInfo updates and removals,
the Geyser plugin interface update to add LegacyContactInfo notification and notification entry point.
the processing of the updates by the Geyser_plugin_manager.

A GRPC Yellowstone plugin update has been done to test the process and see the notified data volume. You can see the PR here.

An example of client can be found here. It subscribes to the Cluster info node notification and call from time to time the RPC get_cluster_nodes to compare with the notified data.
Early results

The test has been done on Testnet cluster with the Solana version 1.17.5 and Yellowstone plugin 1.11.0+solana.1.17.5.
Comparison with RPC call

It takes a few second for the plugin to get the majority of the Testnet cluster nodes and after 30s there's less than 10 nodes in error when comparing with the RPC call (around 3000 nodes in the cluster).

After a few minutes, Geyser and RPC has the same list of cluster's nodes. From time to time they can have some difference (a few nodes) from one part or the other depending on the notification time propagation of update or removal.
Data volume

I use Tcpdump to monitor the packet send between the validator grpc server and the client to build the get_cluster_nodes RPC call nodes list.

This graphic is an example of network activity and bandwidth use. The volume of data transmitted is between 0,5 Mbits/s and 3 Mbits/s.

I've done some monitoring of the network connection between the grpc server and the client for the cluster node info notification.

This is the graphic of the data transferred between the validator grpc server and the client to build the get_cluster_nodes RPC call nodes list.

Fixes #

@mergify mergify bot requested a review from a team March 4, 2024 14:55
add cluster notifier registration and test

Add cluster node remove geyzer notification

rebase on master and correct PR comments

fix rebase, fmt, clippy
@godmodegalactus godmodegalactus force-pushed the add_cluster_info_on_geyser branch from c15fe52 to 897e0bc Compare March 11, 2024 13:01
@lijunwangs
Copy link

@godmodegalactus -- sorry for getting to this late. Can you address the conflicts?

@lijunwangs lijunwangs self-requested a review April 23, 2024 02:25
@gregcusack
Copy link

why do we need to stream contact info for every gossip update? could we get away with only streaming contact info when an actual ContactInfo CrdsValue is received? notify_clusterinfo_remove() would ensure nodes are removed when they become stale.

also, do we need to stream contact info when a CrdsValue fails to insert into the table? feels like we shouldn't need to report that since it is a failed updated.

@godmodegalactus
Copy link
Author

why do we need to stream contact info for every gossip update? could we get away with only streaming contact info when an actual ContactInfo CrdsValue is received? notify_clusterinfo_remove() would ensure nodes are removed when they become stale.

also, do we need to stream contact info when a CrdsValue fails to insert into the table? feels like we shouldn't need to report that since it is a failed updated.

Yeah i think you are right.

I will pick up this PR again and try to resolve all the issues.

Comment on lines +11 to +13
if let CrdsData::LegacyContactInfo(ref cluster_info) = value.value.data {
let notifier = &clusterinfo_update_notifier.read().unwrap();
notifier.notify_clusterinfo_update(cluster_info);
Copy link

@gregcusack gregcusack Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh jk I see you already set it up to update only on reception of LegacyContactInfo. but ya probably don't need to send an update if the contact info fails to insert

Comment on lines +327 to +329
//notify geyser interface
self.notify_clusterinfo_update(self.table.get(&gossip_label));
ret

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crds::insert requires an exclusive lock on the crds table.
So we shouldn't put the notification code here.

Instead somewhere else, e.g. here:
https://github.com/anza-xyz/agave/blob/7629a152d/gossip/src/cluster_info.rs#L1618
we need to read locks crds table and obtain new entries since last time using a crds::Cursor.
similar to the code here:
https://github.com/anza-xyz/agave/blob/7629a152d/gossip/src/crds_gossip_push.rs#L184-L190

Copy link

@behzadnouri behzadnouri Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively somewhere in geyser can pull new contact-infos every once in a while or when needed.
for example in turbine we cache contact-infos for 5 seconds and if 5 seconds have passed we pull again all contact-infos from gossip:
https://github.com/anza-xyz/agave/blob/7629a152d/turbine/src/cluster_nodes.rs#L287

#[derive(Debug, Clone, PartialEq, Eq)]
#[repr(C)]
/// Information about a node in the cluster.
pub struct ReplicaClusterInfoNode {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to this struct if we change gossip ContactInfo, e.g. add a new socket for a new protocol?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya so that is the goal of doing something with an FFI interface. See PR: #3434. We could modify that PR and choose to only expose a pointer to a ContactInfo struct, making it much easier to update ContactInfo in the future. More details here: #3434 (comment)

@godmodegalactus
Copy link
Author

@gregcusack @lijunwangs @behzadnouri #3434 PR superseeds this PR, or should I start working on this PR ?

@behzadnouri
Copy link

@gregcusack @lijunwangs @behzadnouri #3434 PR superseeds this PR, or should I start working on this PR ?

We can follow up on #3434
Please close this PR if you don't want to work on it anymore. thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants