Flaky Test: solana-local-cluster::local_cluster test_leader_failure_4 #2406

Closed
anza-team opened this issue Aug 2, 2024 · 1 comment · Fixed by #3509

@anza-team

AUTO-GENERATED. DO NOT EDIT.

📝 Buildkite Analytics

@bw-solana
bw-solana commented Nov 6, 2024

I believe this test (and other local cluster tests) is flaky because the tests don't reliably spin up nodes and have them catch up with the rest of the cluster.

The flow is supposed to look like this for the non-bootstrap nodes:

  1. Pull requests sent
  2. Gossip votes observed
  3. Insert repair tree
  4. Request orphan repairs
  5. Replay & freeze blocks
  6. Vote (once staked)
  7. OC blocks & make roots
  8. Generate leader schedule
  9. Keep activating stake

However, there seems to be a race condition where the nodes that start up later never fully observe any gossip votes (step 2) and thus never repair, replay, vote, etc. They throw the votes out during verification for one of two reasons:

  1. At first, we reject the votes because we don't see the vote account key in the epoch's authorized voters. E.g. the voting validator doesn't show up as staked until epoch 3, but we're seeing votes for epoch 0.
  2. Later on, we fail because we don't have epoch stakes for the vote's epoch: the root bank never advances (we're not voting and thus not rooting anything), and we only compute the leader schedule 3 epochs ahead.
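
To make the two rejection cases above concrete, here is a minimal sketch of the check as a toy model. Everything in it (`LocalView`, `VoteCheck`, `check_gossip_vote`, the string account names) is hypothetical and made up for illustration; it is not the validator's actual vote verification code, only the shape of the logic as I understand it.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified stand-ins; not the real validator types.
type Epoch = u64;
type VoteAccount = &'static str;

#[derive(Debug, PartialEq)]
enum VoteCheck {
    Accepted,
    /// Case 1: stakes are known for the epoch, but the voter is not in them.
    UnknownVoter,
    /// Case 2: no stakes are known for the vote's epoch at all.
    MissingEpochStakes,
}

/// What a freshly started node knows. In this model the map only holds
/// epochs the node has computed stakes for, so a root that never advances
/// means later epochs are simply absent.
struct LocalView {
    authorized_voters: HashMap<Epoch, HashSet<VoteAccount>>,
}

impl LocalView {
    fn check_gossip_vote(&self, voter: VoteAccount, vote_epoch: Epoch) -> VoteCheck {
        match self.authorized_voters.get(&vote_epoch) {
            None => VoteCheck::MissingEpochStakes,
            Some(voters) if voters.contains(&voter) => VoteCheck::Accepted,
            Some(_) => VoteCheck::UnknownVoter,
        }
    }
}
```

In this model, a vote from a not-yet-staked node for epoch 0 comes back as `UnknownVoter`, and any vote for an epoch missing from the map comes back as `MissingEpochStakes`, no matter who cast it.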

A simple solution would be to insert the vote/stake accounts for these validators into genesis so they can immediately participate, have their votes observed, etc.
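
As a rough illustration of why genesis accounts would help, the toy model below compares a validator whose stake is present at genesis with one whose stake only becomes effective in a later epoch. The names (`StakeSource`, `first_countable_vote_epoch`, `vote_is_countable`) are hypothetical and exist only for this sketch; real stake activation is more gradual than a single cutoff epoch.

```rust
// Hypothetical model of when a validator's votes can pass verification.
type Epoch = u64;

#[derive(Clone, Copy)]
enum StakeSource {
    /// Vote/stake accounts inserted into genesis: staked from epoch 0.
    Genesis,
    /// Stake delegated after startup: staked only from `effective_epoch` on.
    Delegated { effective_epoch: Epoch },
}

/// First epoch in which this validator shows up as a staked voter.
fn first_countable_vote_epoch(source: StakeSource) -> Epoch {
    match source {
        StakeSource::Genesis => 0,
        StakeSource::Delegated { effective_epoch } => effective_epoch,
    }
}

/// Whether a vote cast in `vote_epoch` would be counted by observers.
fn vote_is_countable(source: StakeSource, vote_epoch: Epoch) -> bool {
    vote_epoch >= first_countable_vote_epoch(source)
}
```

With `StakeSource::Delegated { effective_epoch: 3 }`, votes for epochs 0 through 2 are not countable, which matches the "staked in epoch 3, votes seen for epoch 0" example above. With `StakeSource::Genesis`, votes are countable from the first epoch.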

We just need to be careful not to do this in tests that are explicitly trying to test stake activation.

The only remaining mystery for me is why the nodes spinning up (with the exception of non-bootstrap node 1) don't seem to observe gossip votes from the bootstrap node. The bootstrap node is the one node that is a verified voter from the start, so its votes could be observed and kick off the chain of events outlined above. My suspicion is that validator startup is sometimes so slow that we hit case 2 above: epoch stakes are not populated for the epoch we're observing because the cluster has advanced too far.

bw-solana linked a pull request on Nov 6, 2024 that will close this issue