---
simd: '0046'
title: Optimistic cluster restart automation
authors:
  - Wen Xu (Solana Labs)
category: Standard
type: Core
status: Draft
created: 2023-04-07
feature: (fill in with feature tracking issues once accepted)
---

## Summary

During a cluster restart following an outage, use gossip to exchange local
status and automatically reach consensus on the block to restart from. Proceed
to restart if validators in the restart can reach agreement, or print debug
information and halt otherwise.

## New Terminology

* "cluster restart": When there is an outage such that the whole cluster
stalls, an operator may need to restart most of the validators with a sane
state so that the cluster can continue to function. This is different from a
sporadic single-validator restart, which does not impact the cluster. See
[cluster restart](https://docs.solana.com/running-validator/restart-cluster)
for details.

* "optimistically confirmed block": a block which gets votes from the
majority of the validators in a cluster (> 2/3 stake). Our algorithm tries to
guarantee that an optimistically confirmed block will never be rolled back.
When we are performing a cluster restart, we normally start from the highest
optimistically confirmed block, but it's also okay to start from a child of
the highest optimistically confirmed block as long as consensus can be
reached.

* "silent repair phase": In the new repair and restart plan, the validators in
restart will first spend some time to exchange information, repair missing
blocks, and finally reach consensus. The validators only continue normal block
production and voting after consensus is reached. We call this preparation
phase, where block production and voting are paused, the silent repair phase.

* "ephemeral shred version": right now we update `shred_version` during a
cluster restart; it is used to verify received shreds and filter gossip peers.
This proposal introduces a new temporary shred version in the silent repair
phase so validators in restart don't interfere with those not in restart.
Currently this ephemeral shred version is calculated as
`(current_shred_version + 1) % 0xffff`.
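
For concreteness, a minimal sketch of that calculation, assuming
`shred_version` remains a `u16` as it is today:

```rust
/// Sketch only: derive the ephemeral shred version used during the silent
/// repair phase from the shred version in use before the outage.
fn ephemeral_shred_version(current_shred_version: u16) -> u16 {
    // Widen before adding so the computation cannot overflow when
    // current_shred_version == u16::MAX.
    ((current_shred_version as u32 + 1) % 0xffff) as u16
}
```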

* `RESTART_STAKE_THRESHOLD`: We need enough validators to participate in a
restart so they can make decisions for the whole cluster. If everything works
perfectly, we only need 2/3 of the total stake. However, validators could die
or perform abnormally, so we currently set `RESTART_STAKE_THRESHOLD` at 80%,
which is the same threshold used in manual restarts today.

## Motivation

Currently during a cluster restart, validator operators need to decide the
highest optimistically confirmed slot, then restart the validators with new
command-line arguments.

The current process involves a lot of human intervention; if people make a
mistake in deciding the highest optimistically confirmed slot, it is
detrimental to the viability of the ecosystem.

We aim to automate the negotiation of the highest optimistically confirmed
slot and the distribution of all blocks on that fork, so that we can lower the
possibility of human mistakes in the cluster restart process. This also
reduces the burden on validator operators, because they don't have to stay
around while the validators automatically try to reach consensus; they only
need to step in if things go wrong.
## Alternatives Considered

### Automatically detect outage and perform cluster restart

The reaction time of a human in case of emergency is measured in minutes,
while a cluster restart where humans initiate validator restarts takes hours.
We considered various approaches to automatically detect an outage and perform
a cluster restart, which could reduce recovery time to minutes or even
seconds.

However, automatically restarting the whole cluster seems risky, because if
the recovery process itself doesn't work, it might be some time before we can
get a human's attention. And it doesn't solve the cases where a new binary is
needed. So for now we still plan to have a human in the loop.

After we gain more experience with the restart approach in this proposal, we
may slowly try to automate more parts to improve cluster reliability.

### Use gossip and consensus to figure out restart slot before the restart

The main difference between the current proposal and this alternative is that
this alternative would enter the restart preparation phase automatically,
without human intervention.

While getting humans out of the loop improves recovery speed, there are
concerns about recovery gossip messages interfering with normal gossip
messages, and automatically starting a new kind of message in gossip seems
risky.

### Automatically reduce block production in an outage

Right now we have vote-only mode: a validator will only pack vote transactions
into new blocks if the tower distance (last_vote - local_root) is greater than
400 slots.

Unfortunately, in previous outages vote-only mode wasn't enough to save the
cluster. There are proposals for more aggressive block production reduction to
save the cluster. For example, a leader could produce only one block in four
consecutive slots allocated to it.

However, this only solves the problem for specific types of outages, and it
seems risky to aggressively reduce block production, so we are not proceeding
with this proposal for now.

## Detailed Design

The new protocol tries to make all restarting validators get the same data
blocks and the same set of last votes, so that they will almost certainly make
the same decision on the canonical fork and proceed.

A new command-line arg will be added. When the cluster is in need
of a restart, we assume validators holding at least `RESTART_STAKE_THRESHOLD`
percent of the stake will restart with this arg. Then the following steps
will happen:

1. The validator boots into the silent repair phase; it will not make new
blocks or change its votes. The validator propagates its locally voted fork
information to all other validators in restart.

2. While aggregating local vote information from all others in restart, the
validator repairs all blocks which could potentially have been optimistically
confirmed.

3. After repair is complete, the validator counts votes on each fork and
sends out its local heaviest fork.

4. Each validator checks whether enough nodes can agree on one block (same
slot and hash) to restart from:

   1. If yes, proceed and restart.

   2. If no, print out what it thinks is wrong, halt, and wait for human
   inspection.

See each step explained in detail below.

### 1. Gossip last vote before the restart and ancestors on that fork

The main goal of this step is to propagate the locally selected fork to all
others in restart.
We use a new gossip message `LastVotedForkSlots`; its fields are:

- `last_voted_slot`: `u64` the slot the validator last voted on; this also
serves as `last_slot` for the bit vector.
- `last_voted_hash`: `Hash` the bank hash of the last voted slot.
- `ancestors`: `BitVec<u8>` compressed bit vector representing the slots on
the sender's last voted fork. The most significant bit is always
`last_voted_slot`, the least significant bit is `last_voted_slot-81000`.

The number of ancestor slots sent is hard coded at 81000, because
400ms * 81000 = 9 hours, and we assume most restart decisions will be made
within 9 hours. If a validator restarts more than 9 hours after the outage, it
cannot join the restart this way. If enough validators fail to restart within
9 hours, we fall back to the manual, interactive cluster restart method.

When a validator enters restart, it uses the ephemeral shred version to avoid
interfering with those outside the restart. There is a slight chance that
the ephemeral shred version collides with the shred version after the
silent repair phase, but even if this rare case occurs, we plan to flush the
CRDS table on a successful restart, so gossip messages used in the restart
will be removed.

### 2. Repair ledgers up to the restart slot

The main goal of this step is to repair all blocks which could potentially be
optimistically confirmed.

We need to prevent false negatives at all costs, because we can't roll back an
optimistically confirmed block. However, false positives are okay: when we
select the heaviest fork in the next step, we will see all the potential
candidates for optimistically confirmed slots, and we can count the votes
there and weed out the false positives.

However, it's also overkill to repair every block presented by others. While
`LastVotedForkSlots` messages are being received and aggregated, a validator
can categorize blocks missing locally into 3 categories: ignored, must-have,
and unsure. Depending on the stake of validators currently in restart, slots
with too little stake can be safely ignored, slots with enough stake should
definitely be repaired, and the rest remain undecided pending more
confirmations.

Assume `RESTART_STAKE_THRESHOLD` is 80% and that 5% of restarted validators
can make mistakes in voting.

When only 5% of validators are in restart, everything is in the "unsure"
category.

When 67% of validators are in restart, any slot with less than
67% - 5% - (100% - 67%) = 29% stake is in the "ignored" category, because even
if all validators join the restart, the slot will not reach 67% stake. While
this threshold is less than 33%, we temporarily put all blocks with more than
33% stake into the "must-have" category to speed up repairing. Any slot with
between 29% and 33% stake is "unsure".

When 80% of validators are in restart, any slot with less than
67% - 5% - (100% - 80%) = 42% stake is in the "ignored" category; the rest are
"must-have".

From the examples above, we can see that the "must-have" threshold changes
dynamically depending on how many validators are in restart. The main benefit
is that a block only moves from "must-have"/"unsure" to "ignored" as more
validators join the restart, never the other way around, so the list of blocks
a validator needs to repair will never grow when more validators join the
restart.
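
A minimal sketch of this categorization rule follows. The 5% vote-mistake
margin and the 33% fast path are taken from the worked examples above, and
stakes are expressed in whole percentage points; none of these constants are
fixed by an implementation yet.

```rust
enum SlotCategory {
    Ignored,
    MustHave,
    Unsure,
}

/// Categorize a locally missing slot given the percentage of total stake that
/// voted for it and the percentage of total stake currently in the restart.
fn categorize_slot(stake_voted_pct: u8, stake_in_restart_pct: u8) -> SlotCategory {
    const VOTE_MISTAKE_MARGIN: u8 = 5;
    // Stake the slot could still gain from validators not yet in restart.
    let not_in_restart = 100 - stake_in_restart_pct;
    // Below this, the slot cannot reach 67% even if everyone joins.
    let ignored_below = 67u8
        .saturating_sub(VOTE_MISTAKE_MARGIN)
        .saturating_sub(not_in_restart);
    if stake_voted_pct < ignored_below {
        return SlotCategory::Ignored;
    }
    // While ignored_below is still under 33%, repair anything above 33% right
    // away; 34 is the first whole-point value above that fast-path threshold.
    let must_have_at = ignored_below.max(34);
    if stake_voted_pct >= must_have_at {
        SlotCategory::MustHave
    } else {
        SlotCategory::Unsure
    }
}
```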

### 3. Gossip current heaviest fork

The main goal of this step is to "vote" on the heaviest fork to restart from.

We use a new gossip message `HeaviestFork`; its fields are:

- `slot`: `u64` slot of the picked block.
- `hash`: `Hash` bank hash of the picked block.
- `received`: `u8` total percentage of the stake of the validators it received
`HeaviestFork` messages from.
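
Again for illustration only; as above, `solana_sdk::hash::Hash` is an assumed
stand-in:

```rust
use solana_sdk::hash::Hash; // existing Solana bank-hash type

// Sketch of the second new gossip message; field names follow the list above.
pub struct HeaviestFork {
    pub slot: u64,    // slot of the picked block
    pub hash: Hash,   // bank hash of the picked block
    pub received: u8, // percent of stake whose HeaviestFork we have seen
}
```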
After receiving `LastVotedForkSlots` from validators holding more than
`RESTART_STAKE_THRESHOLD` of the stake and repairing all slots in the
"must-have" category, replay all blocks and pick the heaviest fork as follows
(see the sketch after this list):

1. All blocks with more than 67% votes must be on the picked fork.

2. If a picked block has more than one child, check whether the votes on the
heaviest child are over the threshold:

   1. If vote_on_child + stake_on_validators_not_in_restart >= 62%, pick the
   child. For example, if 80% of validators are in restart and the child has
   42% votes, then 42% + (100% - 80%) = 62%, so pick the child. 62% is chosen
   instead of 67% because 5% of the stake could have voted wrongly.

   It's okay to use 62% here because the goal is to prevent false negatives
   rather than false positives. If validators pick a child of the
   optimistically confirmed block to start from, that's okay, because if 80%
   of the validators all choose this block, it will be instantly confirmed on
   the chain.

   2. Otherwise stop traversing the tree and use the last picked block.
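
A sketch of that traversal, assuming a hypothetical `ForkVotes` view over the
aggregated `LastVotedForkSlots` data; only the threshold logic mirrors the
rules above, everything else is illustrative.

```rust
use solana_sdk::clock::Slot;

/// Hypothetical read-only view over the aggregated LastVotedForkSlots data.
trait ForkVotes {
    /// The child of `slot` with the most votes, if any.
    fn heaviest_child(&self, slot: Slot) -> Option<Slot>;
    /// Percentage of total stake that voted for `slot`.
    fn vote_pct(&self, slot: Slot) -> u8;
}

/// Walk down from `root`, the highest block with more than 67% votes,
/// following the heaviest child while its votes plus the stake of validators
/// not in restart reach 62% (the 5% slack allows for wrong votes).
fn pick_restart_block(votes: &impl ForkVotes, root: Slot, stake_in_restart_pct: u8) -> Slot {
    let not_in_restart = 100 - stake_in_restart_pct;
    let mut picked = root;
    while let Some(child) = votes.heaviest_child(picked) {
        if votes.vote_pct(child) + not_in_restart >= 62 {
            picked = child;
        } else {
            break;
        }
    }
    picked
}
```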

After deciding the heaviest block, gossip
`HeaviestFork(X, Hash(X), received_heaviest_stake)`, where X is the latest
picked block. We also send out the stake of received `HeaviestFork` messages
so that we can proceed to the next step when enough validators are ready.

### 4. Proceed to restart if everything looks okay, halt otherwise

All validators in restart keep counting `HeaviestFork` messages where
`received_heaviest_stake` is higher than 80%. Once a validator sees that 80%
of the validators by stake have sent out a `HeaviestFork` where
`received_heaviest_stake` is higher than 80%, it starts the following checks:

- Whether all `HeaviestFork` messages have the same slot and the same block
hash. Because validators only send slots, not bank hashes, in
`LastVotedForkSlots`, it's possible that a duplicate block could make the
cluster unable to reach consensus, so the block hash needs to be checked as
well.

- Whether the voted slot is equal to, or a child of, the local optimistically
confirmed slot.

If all checks pass, the validator immediately starts generating a snapshot at
the agreed-upon slot.
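
A sketch of the trigger condition and the first check, over a hypothetical
local aggregation of the `HeaviestFork` messages seen so far; the comparison
against the local optimistically confirmed slot is omitted, and stakes are
again whole percentage points.

```rust
use solana_sdk::{clock::Slot, hash::Hash};

/// One HeaviestFork message this validator has received, plus the sender's
/// stake weight (a hypothetical aggregation, for illustration only).
struct HeaviestForkSeen {
    sender_stake_pct: u8, // sender's stake as a percentage of total stake
    slot: Slot,
    hash: Hash,
    received: u8, // the sender's received_heaviest_stake field
}

/// True once validators holding 80% of the stake have sent HeaviestFork with
/// received_heaviest_stake above 80%, and all senders picked the same block.
fn ready_to_restart(seen: &[HeaviestForkSeen]) -> bool {
    let confirmed_stake: u32 = seen
        .iter()
        .filter(|m| m.received > 80)
        .map(|m| m.sender_stake_pct as u32)
        .sum();
    let all_agree = seen
        .windows(2)
        .all(|w| w[0].slot == w[1].slot && w[0].hash == w[1].hash);
    confirmed_stake >= 80 && all_agree
}
```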

While the snapshot generation is in progress, the validator also checks
whether two minutes have passed since agreement was reached, to guarantee
its `HeaviestFork` message propagates to everyone, then proceeds to restart:

1. Issue a hard fork at the designated slot and change the shred version in
gossip.
2. Execute the current tasks in --wait-for-supermajority and wait for 80%.

Before a validator enters restart, it will still propagate
`LastVotedForkSlots` and `HeaviestFork` messages in gossip. After the restart,
its shred_version will be updated so it will no longer send or propagate
gossip messages for restart.

If any of the checks fails, the validator immediately prints out all debug
info, sends out metrics so that people can be paged, and then halts.

## Impact

This proposal adds a new silent repair mode to validators; during this phase
the validators will not participate in normal cluster activities, which is the
same as today. Compared to today's cluster restart, the new mode may mean more
network bandwidth and memory use on the restarting validators, but it
guarantees the safety of optimistically confirmed user transactions, and
validator admins don't need to manually generate and download snapshots again.

## Security Considerations

The two added gossip messages `LastVotedForkSlots` and `HeaviestFork` will
only be sent and processed when the validator is restarted in RepairAndRestart
mode, so a random validator restarting in the new mode will not bring extra
burden to the system.

Non-conforming validators could send out wrong `LastVotedForkSlots` and
`HeaviestFork` messages to mess with cluster restarts; these should be
included in the slashing rules in the future.

## Backwards Compatibility

This change is backwards compatible with previous versions, because validators
only enter the new mode when restarted with the new command-line argument. All
current restart arguments like --wait-for-supermajority and
--expected-bank-hash will be kept as is for now.