I believe this test (and other local cluster tests) is flaky because we don't reliably spin up nodes and have them catch up with the rest of the cluster.
The flow is supposed to look like this for the non-bootstrap nodes:
1. Pull requests sent
2. Gossip votes observed
3. Insert repair tree
4. Request orphan repairs
5. Replay & freeze blocks
6. Vote (once staked)
7. OC blocks & make roots
8. Generate leader schedule
9. Keep activating stake
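As a rough illustration of why a failure early in this flow stalls everything after it, here is a minimal sketch that treats the steps as a strictly sequential pipeline. The `StartupStage` enum, `FLOW`, and `completed_before` are hypothetical names invented for this example, and the strict ordering is a simplifying assumption, not a description of the actual validator code.

```rust
/// Hypothetical names mirroring the flow listed above; not real validator types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StartupStage {
    PullRequestsSent,
    GossipVotesObserved,
    RepairTreeInserted,
    OrphanRepairsRequested,
    BlocksReplayedAndFrozen,
    Voted,
    RootsMade,
    LeaderScheduleGenerated,
    StakeActivating,
}

/// The stages in the order given in the issue (assumed strictly sequential here).
const FLOW: [StartupStage; 9] = [
    StartupStage::PullRequestsSent,
    StartupStage::GossipVotesObserved,
    StartupStage::RepairTreeInserted,
    StartupStage::OrphanRepairsRequested,
    StartupStage::BlocksReplayedAndFrozen,
    StartupStage::Voted,
    StartupStage::RootsMade,
    StartupStage::LeaderScheduleGenerated,
    StartupStage::StakeActivating,
];

/// Returns the stages a node completes if `blocked` never succeeds.
fn completed_before(blocked: StartupStage) -> Vec<StartupStage> {
    FLOW.iter()
        .copied()
        .take_while(|stage| *stage != blocked)
        .collect()
}
```

Under this model, a node that never observes gossip votes only ever gets as far as sending pull requests.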
However, there seems to be a race condition where the nodes that start up later never fully observe any gossip votes (the "Gossip votes observed" step) and thus never repair/replay/vote/etc. They chuck the votes out during verify for one of two reasons:
1. At first, we reject the votes because we don't see the vote account key in the epoch's authorized voters. E.g. the voting validator doesn't show up as staked until epoch 3, but we're seeing votes for epoch 0.
2. Later on, verification fails because we don't have epoch stakes for the vote's epoch: the root bank never advances (we're not voting, so nothing gets rooted), and we only compute the leader schedule 3 epochs ahead.
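The two rejection paths above can be sketched with a toy model. Everything here (`EpochStakesCache`, `Rejection`, `verify_gossip_vote`, `example_cache`, and the account names) is hypothetical and only meant to illustrate the logic described; the real verification code is structured differently.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real types; names are illustrative only.
type Epoch = u64;
type VoteAccount = &'static str;

#[derive(Debug, PartialEq)]
enum Rejection {
    /// Reason 1: the epoch is known, but the vote account is not an authorized voter in it.
    UnknownVoter,
    /// Reason 2: no stakes are cached for the vote's epoch at all.
    MissingEpochStakes,
}

/// Toy cache: epoch -> vote accounts that are authorized (staked) voters in that epoch.
struct EpochStakesCache {
    authorized_voters: HashMap<Epoch, HashSet<VoteAccount>>,
}

impl EpochStakesCache {
    fn verify_gossip_vote(&self, vote_epoch: Epoch, voter: VoteAccount) -> Result<(), Rejection> {
        // Reason 2 is checked first: without stakes for the epoch nothing can be verified.
        let voters = self
            .authorized_voters
            .get(&vote_epoch)
            .ok_or(Rejection::MissingEpochStakes)?;
        if voters.contains(&voter) {
            Ok(())
        } else {
            Err(Rejection::UnknownVoter)
        }
    }
}

/// Assumed scenario: the root is stuck, so stakes are only known for epochs 0 through 3.
fn example_cache() -> EpochStakesCache {
    let mut authorized_voters = HashMap::new();
    // Only the bootstrap node is staked in the early epochs.
    for epoch in 0..=2 {
        authorized_voters.insert(epoch, HashSet::from(["bootstrap"]));
    }
    // The late-joining validator only shows up as staked in epoch 3.
    authorized_voters.insert(3, HashSet::from(["bootstrap", "validator_1"]));
    EpochStakesCache { authorized_voters }
}
```

In this model, a vote from `validator_1` for epoch 0 is rejected as `UnknownVoter`, while any vote for an epoch beyond the cached range is rejected as `MissingEpochStakes`, even if it comes from the bootstrap node.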
A simple solution for this would be to insert the vote/stake accounts into genesis for these validators so they can immediately participate, have their votes observed, etc.
We just need to be careful about which tests are explicitly trying to test stake activation.
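A minimal sketch of that proposal, assuming a simplified genesis model: `ToyGenesis`, `ValidatorSpec`, `build_genesis`, `can_vote_in_epoch_zero`, and the `stake_at_genesis` opt-out flag are all hypothetical names made up for illustration, not the real genesis or local-cluster API.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model; not the real genesis or local-cluster API.
#[derive(Debug, Default)]
struct ToyGenesis {
    /// Vote account name -> stake that is active from epoch 0.
    genesis_stakes: HashMap<String, u64>,
}

struct ValidatorSpec {
    name: &'static str,
    stake: u64,
    /// Set to false in tests that deliberately exercise stake activation.
    stake_at_genesis: bool,
}

/// Inserts a genesis stake entry for every validator that opts in.
fn build_genesis(validators: &[ValidatorSpec]) -> ToyGenesis {
    let mut genesis = ToyGenesis::default();
    for v in validators.iter().filter(|v| v.stake_at_genesis) {
        genesis.genesis_stakes.insert(v.name.to_string(), v.stake);
    }
    genesis
}

/// In this model, a node's votes are verifiable in epoch 0 only if it is staked at genesis.
fn can_vote_in_epoch_zero(genesis: &ToyGenesis, name: &str) -> bool {
    genesis
        .genesis_stakes
        .get(name)
        .map_or(false, |stake| *stake > 0)
}
```

Making the behavior opt-out per validator is one way to keep tests that explicitly cover stake activation on the old startup path.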
The only remaining mystery for me is why the nodes spinning up (with the exception of non-bootstrap node 1) don't seem to observe gossip votes from the bootstrap node. This is the one node that is a verified voter from the start, so its votes could be observed and kick off the chain of events outlined above. My suspicion is that validator startup is sometimes so slow that we hit case 2 above: epoch stakes are not populated for the epoch we're observing because the cluster has advanced too far.