SIMD-0046: Optimistic cluster restart automation #46

Open · wants to merge 127 commits into base: main

Changes from 39 commits

Commits (127):
2e99b5b
Add repair and restart proposal.
wen-coding Apr 10, 2023
03277f1
Update proposals/0024-repair-and-restart.md
wen-coding Apr 12, 2023
7b71cad
Update proposals/0024-repair-and-restart.md
wen-coding Apr 12, 2023
996d1ab
Update proposals/0024-repair-and-restart.md
wen-coding Apr 12, 2023
de72626
Add protocol overview and lint changes.
wen-coding Apr 12, 2023
8f32e6c
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Apr 12, 2023
4feeb64
Change threshold value from 47% to 34%.
wen-coding Apr 12, 2023
0aff4cd
Add introduction, and update default slots to send.
wen-coding Apr 15, 2023
e29d83c
Remove snapshot generation from the new restart protocol and lint cha…
wen-coding Apr 17, 2023
b0b2d47
Change must have block threshold.
wen-coding Apr 18, 2023
838429d
Update the proposal to reflect changes in discussion.
wen-coding Apr 19, 2023
1049463
Add the wait before restart.
wen-coding Apr 19, 2023
6e4a5cd
Change Heaviest selection algorithm.
wen-coding Apr 20, 2023
8cb6ef6
Make linter happy.
wen-coding Apr 26, 2023
df5932d
Shorten title to make linter happy.
wen-coding Apr 26, 2023
5050a7c
Add details of messages and change command line.
wen-coding Apr 26, 2023
90134b5
Fix typos on numbers.
wen-coding Apr 27, 2023
85f62b4
Update proposals/0024-repair-and-restart.md
wen-coding May 1, 2023
eafd745
Make linter happy.
wen-coding May 1, 2023
ecccadf
All messages need to keep flowing before restart.
wen-coding May 2, 2023
e143136
A snapshot should be generated first in a restart.
wen-coding May 4, 2023
57b3b16
Use Gossip instead of direct messaging in restart.
wen-coding May 9, 2023
4b99230
Require 80% of the people receive 80% of Heaviest.
wen-coding May 10, 2023
198e742
Add security check and some other changes.
wen-coding May 11, 2023
7bd9b74
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
a44eeff
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
6147af1
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
bffcd1d
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
ebc0cec
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
dc9209f
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
eefa087
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
4145325
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
6aeda83
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
78c0e5b
Update proposals/0024-repair-and-restart.md
wen-coding May 12, 2023
a938546
Add some terminologies.
wen-coding May 12, 2023
63d3252
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding May 12, 2023
ebd1935
Rewording a few paragraphs to make things clear.
wen-coding May 15, 2023
deee8ec
Fix a few small sentences.
wen-coding May 15, 2023
9f013a0
Remove .bak file.
wen-coding May 15, 2023
5420a17
Update proposals/0024-repair-and-restart.md
wen-coding May 16, 2023
323ee80
Update proposals/0024-repair-and-restart.md
wen-coding May 16, 2023
3b3ef2d
Update proposals/0024-repair-and-restart.md
wen-coding May 16, 2023
87813b8
Fix a few wordings.
wen-coding May 16, 2023
5b69c78
This proposal is actually proposal 46.
wen-coding May 16, 2023
fac8526
Make linter happy.
wen-coding May 16, 2023
351e675
Fixes.
wen-coding May 18, 2023
b130ee3
Add description of when to enter next step.
wen-coding May 19, 2023
3234699
Make linter happy.
wen-coding May 19, 2023
76f63bb
Make linter happy.
wen-coding Jun 6, 2023
18b3d87
Update proposals/0046-optimistic-cluster-restart-automation.md
wen-coding Jul 17, 2023
813c2cf
Try indent some paragraphs.
wen-coding Jul 18, 2023
2c21911
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Jul 18, 2023
3fd02f1
Backtick all new terminologies.
wen-coding Jul 18, 2023
a9447b4
Make linter happy.
wen-coding Jul 18, 2023
e913703
Update proposals/0046-optimistic-cluster-restart-automation.md
wen-coding Jul 18, 2023
eb359ac
Remove unnecessary paragraph.
wen-coding Jul 18, 2023
82fcd22
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Jul 18, 2023
192b01c
Update proposals/0046-optimistic-cluster-restart-automation.md
wen-coding Jul 18, 2023
c4d3e3e
Update proposals/0046-optimistic-cluster-restart-automation.md
wen-coding Jul 18, 2023
8a9990d
Change percent from u8 to u16.
wen-coding Jul 18, 2023
3167a08
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Jul 18, 2023
21878c8
Make linter happy.
wen-coding Jul 18, 2023
879e92d
Remove command line reference.
wen-coding Jul 18, 2023
5b58b8a
Revise the threshold for block repair.
wen-coding Jul 19, 2023
d817520
Make linter happy again.
wen-coding Jul 19, 2023
fdc7534
Remove 80% reference when we mean RESTART_STAKE_THRESHOLD.
wen-coding Jul 19, 2023
6b2a0b2
Rename HeaviestFork to RestartHeaviestFork.
wen-coding Jul 19, 2023
6fcc5cf
Rename LastVotedForkSlots to RestartLastVotedForkSlots.
wen-coding Jul 19, 2023
7614587
Change format of examples.
wen-coding Jul 19, 2023
ba1c9d4
Change format of the bullet list.
wen-coding Jul 19, 2023
2dffa19
Change reasoning of 81000 slots.
wen-coding Jul 19, 2023
5635ad3
Replace silent repair with new name "wen restart".
wen-coding Jul 20, 2023
7adb22b
Try to make linter happy.
wen-coding Jul 20, 2023
16e4ec8
Make linter happy again.
wen-coding Jul 20, 2023
3fa3b9a
Back to the title linter likes.
wen-coding Jul 20, 2023
b6fc273
Add cluster restart slot to the doc.
wen-coding Jul 20, 2023
005caae
Small fixes.
wen-coding Jul 21, 2023
5fcbcd1
Add handling for oscillating info.
wen-coding Jul 24, 2023
8f7f752
Make linter happy.
wen-coding Jul 24, 2023
50d5bcb
Add epoch boundary handling.
wen-coding Jul 26, 2023
397e98b
Add cluster wide threshold calculation across Epoch boundary.
wen-coding Jul 26, 2023
6190f35
Update cross epoch stake selection.
wen-coding Jul 27, 2023
e4e8d84
Correct mistake in description.
wen-coding Jul 27, 2023
51e81d9
Make it clear we are generating incremental snapshot.
wen-coding Aug 1, 2023
ac0940f
Fix typo
wen-coding Aug 2, 2023
3805d7a
Add more reasoning about how HeaviestFork is picked.
wen-coding Aug 4, 2023
0466a12
Make linter happy.
wen-coding Aug 4, 2023
6613293
Change indent.
wen-coding Aug 9, 2023
dd60570
Make linter happy.
wen-coding Aug 9, 2023
b415733
Rework the proof.
wen-coding Aug 9, 2023
acb041f
Update proposals/0046-optimistic-cluster-restart-automation.md
wen-coding Aug 14, 2023
d85ce34
Explain 81000 slots and issue hard fork before snapshot generation.
wen-coding Aug 14, 2023
07cbb0c
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Aug 14, 2023
eb99eac
Use a hard limit for must-have blocks and accept new
wen-coding Aug 14, 2023
30a4d38
Reverse the order of bits to be consistent with EpochSlots.
wen-coding Aug 18, 2023
28373b0
Update restart descriptions.
wen-coding Sep 8, 2023
9558fcb
Update 81k to 64k.
wen-coding Nov 18, 2023
1e9ea74
Update the find heaviest algorithm and proof.
wen-coding Mar 12, 2024
4ceebbb
Update the proof for heaviest fork, we don't need to check stakes.
wen-coding Mar 12, 2024
f0d933c
Update notations in proof.
wen-coding Mar 12, 2024
821456d
Explain the 42% constant.
wen-coding May 23, 2024
891d5ea
Explain 5% as well.
wen-coding May 24, 2024
613384d
Small fixes.
wen-coding May 24, 2024
75306bd
Update stake calculation when crossing Epoch boundaries.
wen-coding Aug 3, 2024
bafaac5
Merge branch 'solana-foundation:main' into smart-restart-proposal
wen-coding Aug 12, 2024
f65c6aa
Update exit criteria when crossing Epoch boundary.
wen-coding Aug 19, 2024
31c358e
Merge branch 'smart-restart-proposal' of github.com:wen-coding/solana…
wen-coding Aug 19, 2024
bf5529f
Add RestartHeaviestFork round 2.
wen-coding Sep 4, 2024
6fb1cf7
Make linter happy.
wen-coding Sep 4, 2024
9baa635
Use round 0 and round 1 instead of round 1 and 2.
wen-coding Sep 4, 2024
4f5469c
Replace previous HeaviestFork stage with a leader based design.
wen-coding Sep 12, 2024
22dfc57
Update the abstract as well.
wen-coding Sep 12, 2024
560dd4d
Update wording.
wen-coding Sep 12, 2024
683a43a
Update company info.
wen-coding Sep 12, 2024
a722bc6
Update the exit condition of step 2.
wen-coding Sep 12, 2024
4e531bb
Clarify step 4.
wen-coding Sep 12, 2024
4ac10f2
Fix typo.
wen-coding Sep 12, 2024
cfbcb5b
Rename the leader to coordinator. Add the final HeaviestFork aggregat…
wen-coding Sep 21, 2024
eb81566
Fix the correctness proof.
wen-coding Oct 5, 2024
a9ac86c
Fix the correctness proof.
wen-coding Oct 5, 2024
b8185eb
Clarify that we pick the slot first then replay to get hash.
wen-coding Oct 22, 2024
fac38b6
Change status to Review
wen-coding Oct 22, 2024
7c9d098
Some small fixes.
wen-coding Nov 2, 2024
79b3be7
Fix typo.
wen-coding Nov 8, 2024
ab33197
Add proof for the 33% limit.
wen-coding Nov 14, 2024
331fe01
Make linter happy.
wen-coding Nov 15, 2024
d22ba4a
Make linter happy.
wen-coding Nov 15, 2024
302 changes: 302 additions & 0 deletions proposals/0024-repair-and-restart.md
@@ -0,0 +1,302 @@
---
simd: '0024'
title: Optimistic cluster restart automation
authors:
- Wen Xu (Solana Labs)
category: Standard
type: Core
status: Draft
created: 2023-04-07
feature: (fill in with feature tracking issues once accepted)
---

## Summary

During a cluster restart following an outage, use gossip to exchange local
status and automatically reach consensus on the block to restart from. Proceed
to restart if validators in the restart can reach agreement, or print debug
information and halt otherwise.

## New Terminology

* "cluster restart": When there is an outage such that the whole cluster
stalls, human may need to restart most of the validators with a sane state so
that the cluster can continue to function. This is different from sporadic
single validator restart which does not impact the cluster. See
[cluster restart](https://docs.solana.com/running-validator/restart-cluster)
for details.

* "optimistically confirmed block": a block which gets the votes from the
majority of the validators in a cluster (> 2/3 stake). Our algorithm tries to
guarantee that an optimistically confirmed will never be rolled back. When we
are performing cluster restart, we normally start from the highest
optimistically confirmed block, but it's also okay to start from a child of the
highest optimistically confirmed block as long as consensus can be reached.

Contributor:
😩 how are these not in https://docs.solana.com/terminology yet! Would you
mind sending a PR into the monorepo to add these terms, they really aren't new
for this SIMD!

Author:
Sure, but can I keep these here so the document is self-contained?

Contributor:
We can get those terms on docs.solana.com pretty quickly too, but meanwhile
sure makes sense!

* "silent repair phase": In the new repair and restart plan, the validators in
restart will first spend some time to exchange information, repair missing
blocks, and finally reach consensus. The validators only continue normal block
production and voting after consensus is reached. We call this preparation
phase where block production and voting are paused the silent repair phase.

* "ephemeral shred version": right now we update `shred_version` during a
mvines marked this conversation as resolved.
Show resolved Hide resolved
cluster restart, it is used to verify received shreds and filter Gossip peers.
In the new repair and restart plan, we introduce a new temporary shred version
Contributor:
Another "new repair and restart plan" ref that probably should be renamed :)

Author:
Fixed, I believe this is the last one.

* `RESTART_STAKE_THRESHOLD`: We need enough validators to participate in a
restart so they can make decisions for the whole cluster. If everything works
perfectly, we only need 2/3 of the total stake. However, validators could die
or perform abnormally, so we currently set the `RESTART_STAKE_THRESHOLD` at
80%, which is the same threshold used today.

## Motivation

Currently during a cluster restart, validator operators need to decide the
highest optimistically confirmed slot, then restart the validators with new
command-line arguments.

The current process involves a lot of human intervention; if people make a
mistake in deciding the highest optimistically confirmed slot, it is
detrimental to the viability of the ecosystem.

We aim to automate the negotiation of the highest optimistically confirmed
slot and the distribution of all blocks on that fork, so that we can lower the
possibility of human mistakes in the cluster restart process. This also reduces
the burden on validator operators, because they don't have to stay around while
the validators automatically try to reach consensus; they will be paged if
things go wrong.
Contributor:
"they will be paged if things go wrong." -- this feels slightly like
editorializing as we don't actually add a paging facility on this proposal,
although we may want to suggest under what condition a paging solution could
be triggered (unexpected validator exit?)

Author:
Changed, does this look better?


## Alternatives Considered

### Automatically detect outage and perform cluster restart
The reaction time of a human in case of emergency is measured in minutes,
while a cluster restart where humans initiate validator restarts takes hours.
We considered various approaches to automatically detect an outage and perform
a cluster restart, which could reduce recovery time to minutes or even seconds.

However, automatically restarting the whole cluster seems risky, because if
the recovery process itself doesn't work, it might be some time before we can
get a human's attention. It also doesn't solve the cases where a new binary is
needed. So for now we still plan to keep a human in the loop.

After we gain more experience with the restart approach in this proposal, we
may slowly try to automate more parts to improve cluster reliability.

### Use gossip and consensus to figure out restart slot before the restart
The main difference between this alternative and the current proposal is that
this alternative would automatically enter the restart preparation phase
without human intervention.

While getting humans out of the loop improves recovery speed, there are
concerns about recovery gossip messages interfering with normal gossip
messages, and automatically starting a new kind of message in gossip seems
risky.

### Automatically reduce block production in an outage
Right now we have vote-only mode: a validator will only pack vote transactions
into new blocks if the tower distance (last_vote - local_root) is greater than
400 slots.

Unfortunately, in previous outages vote-only mode wasn't enough to save the
cluster. There are proposals for more aggressive block production reduction to
save the cluster. For example, a leader could produce only one block in the
four consecutive slots allocated to it.

However, this only solves the problem for specific types of outages, and it
seems risky to aggressively reduce block production, so we are not proceeding
with this alternative for now.

## Detailed Design

The new protocol tries to make all restarting validators get the same data
blocks and the same set of last votes; then they will almost certainly make the
same decision on the canonical fork and proceed.

A new command line argument will be added. When the cluster is in need
of a restart, we assume validators holding at least `RESTART_STAKE_THRESHOLD`
percent of the total stake will restart with this argument. Then the following
steps will happen:

1. The validator boots into the silent repair phase; it will not make new
blocks or change its votes. The validator propagates its local voted fork
information to all other validators in restart.

2. While counting local vote information from all others in restart, the
validator repairs all blocks which could potentially have been optimistically
confirmed.

3. After repair is complete, the validator counts votes on each fork and
sends out its local heaviest fork.

4. Each validator counts if enough nodes can agree on one block (same slot and
hash) to restart from:

1. If yes, proceed and restart

2. If no, print out what it thinks is wrong, halt and wait for human

See each step explained in detail below.

### 1. Gossip last vote before the restart and ancestors on that fork

The main goal of this step is to propagate the locally selected fork to all
others in restart.

We use a new Gossip message `LastVotedForkSlots`; its fields are:

- `last_voted_slot`: `u64` the last voted slot; this also serves as last_slot
for the bit vector.
- `last_voted_hash`: `Hash` the bank hash of the last voted slot.
- `ancestors`: `BitVec<u8>` compressed bit vector representing the slots on
the sender's last voted fork. The most significant bit is always
`last_voted_slot`, the least significant bit is `last_voted_slot - 81000`.

The number of ancestor slots sent is hard coded at 81000, because
400ms * 81000 = 9 hours, and we assume most restart decisions will be made
within 9 hours. If a validator restarts more than 9 hours after the outage, it
cannot join the restart this way. If enough validators fail to restart within
9 hours, fall back to the manual, interactive cluster restart method.
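
For illustration, here is a minimal Rust sketch of the `LastVotedForkSlots`
layout described above. The `Hash` alias, the plain `Vec<u8>` standing in for
the compressed bit vector, and the `MAX_SLOTS_ON_VOTED_FORK` constant name are
simplifying assumptions, not the actual CRDS definition.

```rust
// Illustrative sketch only; the real gossip (CRDS) definition may differ.
type Hash = [u8; 32]; // stand-in for solana_sdk::hash::Hash

/// 81_000 ancestor slots ~= 400 ms/slot * 81_000 slots = 32_400 s = 9 hours.
pub const MAX_SLOTS_ON_VOTED_FORK: usize = 81_000;

pub struct LastVotedForkSlots {
    /// The last voted slot; also the last slot covered by the bit vector.
    pub last_voted_slot: u64,
    /// Bank hash of the last voted slot.
    pub last_voted_hash: Hash,
    /// Compressed bit vector of the slots on the sender's last voted fork:
    /// the most significant bit is `last_voted_slot`, the least significant
    /// bit is `last_voted_slot - 81_000`.
    pub ancestors: Vec<u8>,
}
```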

When a validator enters restart, it uses the ephemeral shred version to avoid
interfering with those outside the restart. There is a slight chance that
the ephemeral shred version would collide with the shred version after the
silent repair phase, but even if this rare case occurs, we plan to flush the
CRDS table on successful restart, so gossip messages used in restart will be
removed.
Contributor:
When in normal gossip mode, I feel like we also shouldn't pull the
LastVotedForkSlots and Heaviest messages and filter them out from a push.
Otherwise these messages could end up floating around in gossip anyway

Author:
Sure we can do that, added.

Contributor (@AshwinSekar, May 17, 2023):
> I feel like we also shouldn't pull the LastVotedForkSlots and Heaviest
> messages and filter them out from a push

In this case is there any benefit in having the ephemeral "+1" shred version?
Sounds like we'll be:

  • Blocking all restart messages from gossip if not participating
  • Extending the check in shred ingestion to allow shreds from the old (-1)
    version

We are overloading shred version here to differentiate those in the repair
part of restart and not, without any shreds to verify against. Seems like it
could be better to not bump shred version and instead use the existence of the
gossip messages for differentiation purposes.

Author:
First of all I feel we should select repair peers only from validators in
restart, because at this point you don't know what the status of non-restarted
validators are, they could be buried in Turbine/Gossip/random traffic so they
can't effectively answer your repair requests.

Chatted a bit with Ashwin, I feel like to not pollute normal Gossip, there are
two options:

  1. Use "silent restart shred version", but this means you have to add in a
     hack where during repairs you accept shreds in original shred version.
  2. Make validators in restart drop normal Gossip messages, and make
     validators not in restart drop restart Gossip messages (keep the
     ClusterInfo messages, of course), then we don't need the "silent restart
     shred version".

Contributor:
2 seems nice if it'll work. Gossip has proven itself to be incredibly
resilient so I think it likely will

Author:
@behzadnouri Do you think option 2 would work?

Author:
Hmm, option 2 might not work too well when the number of restarted validators
is small. Because when a restarted validator try to push LastVotedForkSlots to
others, the peers it selected might all be non-restarted validators which drop
this message. Of course, a newly restarted validator can pull the messages,
but it needs to know which restarted peer to pull from.
Basically with option 2 we still need to know who's restarted when pushing and
pulling to make gossip efficient. But this involves more hacks in gossip code.

Author:
Thought about this more, since this is outage handling, I'd prefer the
solution to be less intrusive so it doesn't interfere with normal logic.
Read the code a bit more, I think manipulating shred_version is still the less
intrusive option. We can use only the silent_repair_shred_version on Gossip
and uses the original shred_version elsewhere so that repair would just work.
This would mean validators not in restart are able to send over shreds, it may
consume more memory but I think it's not worth putting in additional hack to
block, because if any validator is malicious then it's easy to send shreds
belonging to any shred_version.
So I'd still stick with option 1. We can maybe filter out the restart Gossip
messages on validators not in restart if we want, even if we don't, it's just
doubling the resource needed for EpochSlots, so it should be fine as well.

Contributor:
sg, I'm on board with that approach. I agree it should be much less intrusive
to the existing Labs validator code base

### 2. Repair ledgers up to the restart slot

The main goal of this step is to repair all blocks which could potentially be
optimistically confirmed.

We need to prevent false negatives at all costs, because we can't roll back an
optimistically confirmed block. However, false positives are okay: when we
select the heaviest fork in the next step, we will see all the potential
candidates for optimistically confirmed slots, and there we can count the votes
and remove the false positives.

However, it's also overkill to repair every block presented by others. When
`LastVotedForkSlots` messages are being received and aggregated, a validator
can categorize blocks missing locally into 3 categories: ignored, must-have,
and unsure. Depending on the stake of validators currently in restart, some
slots with too little stake can be safely ignored, some have enough stake that
they should definitely be repaired, and the rest are undecided pending more
confirmations.

Assume `RESTART_STAKE_THRESHOLD` is 80% and that up to 5% of the restarted
validators could have made mistakes in voting.

When only 5% of validators are in restart, everything is in the "unsure"
category.

When 67% of validators are in restart, any slot with less than
67% - 5% - (100% - 67%) = 29% stake is in the "ignored" category, because even
if all remaining validators join the restart, the slot will not reach 67%
stake. While this threshold is below 33%, we temporarily put all blocks with
more than 33% stake into the "must-have" category to speed up repairing. Any
slot with between 29% and 33% stake is "unsure".

When 80% of validators are in restart, any slot with less than
67% - 5% - (100% - 80%) = 42% stake is in the "ignored" category; the rest are
"must-have".

From the above examples, we can see that the "must-have" threshold changes
dynamically depending on how many validators are in restart. The main benefit
is that a block will only move from "must-have" or "unsure" to "ignored" as
more validators join the restart, never the other way around. So the list of
blocks a validator needs to repair never grows as more validators join the
restart.
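
The dynamic thresholds above can be summarized in a short sketch. The names
and the exact boundary handling (strict vs. inclusive comparisons) are
illustrative assumptions, not the reference implementation.

```rust
/// Which repair category a missing block falls into, per the thresholds
/// worked out above.
#[derive(Debug, PartialEq)]
pub enum BlockCategory {
    Ignored,
    Unsure,
    MustHave,
}

/// `active_stake_pct`: percentage of total stake currently in the restart.
/// `slot_stake_pct`: percentage of total stake whose `LastVotedForkSlots`
/// contain this slot.
pub fn categorize(active_stake_pct: f64, slot_stake_pct: f64) -> BlockCategory {
    // 67% optimistic-confirmation threshold, minus the 5% allowance for
    // restarted validators that voted incorrectly, minus the stake that has
    // not (yet) joined the restart.
    let ignore_below = 67.0 - 5.0 - (100.0 - active_stake_pct);
    // While that threshold is below 33%, temporarily treat any slot with more
    // than 33% stake as must-have so repair can start early.
    let must_have_above = ignore_below.max(33.0);

    if slot_stake_pct < ignore_below {
        BlockCategory::Ignored
    } else if slot_stake_pct > must_have_above {
        BlockCategory::MustHave
    } else {
        BlockCategory::Unsure
    }
}
```

With 67% of stake in restart this reproduces the 29%/33% boundaries above, and
with 80% it yields the single 42% boundary.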

### 3. Gossip current heaviest fork

The main goal of this step is to "vote" the heaviest fork to restart from.

We use a new Gossip message `HeaviestFork`; its fields are:

- `slot`: `u64` slot of the picked block.
- `hash`: `Hash` bank hash of the picked block.
- `received`: `u8` total percentage of stake of the validators it has received
`HeaviestFork` messages from.
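
The fields above map to a struct like the following sketch; the `Hash` alias
is a simplifying assumption and this is not the actual CRDS definition.

```rust
// Illustrative sketch only; the real gossip (CRDS) definition may differ.
type Hash = [u8; 32]; // stand-in for solana_sdk::hash::Hash

pub struct HeaviestFork {
    /// Slot of the picked block.
    pub slot: u64,
    /// Bank hash of the picked block.
    pub hash: Hash,
    /// Total percentage of stake of the validators this node has received
    /// `HeaviestFork` messages from.
    pub received: u8,
}
```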

After receiving `LastVotedForkSlots` from validators holding more than
`RESTART_STAKE_THRESHOLD` of the stake and repairing all slots in the
"must-have" category, replay all blocks and pick the heaviest fork as follows:

1. All blocks with more than 67% of votes must be on the picked fork.

2. If a picked block has more than one child, check whether the votes on the
heaviest child are over the threshold:

1. If vote_on_child + stake_on_validators_not_in_restart >= 62%, pick the
child. For example, if 80% of validators are in restart and the child has 42%
of votes, then 42% + (100% - 80%) = 62%, so pick the child. 62% is chosen
instead of 67% because up to 5% could have voted incorrectly.

It's okay to use 62% here because the goal is to prevent false negatives rather
than false positives. If validators pick a child of an optimistically confirmed
block to start from, that is okay, because if 80% of the validators all choose
this block, it will be instantly confirmed on the chain.

2. Otherwise stop traversing the tree and use the last picked block.

After deciding the heaviest block, gossip
`HeaviestFork(X, Hash(X), received_heaviest_stake)` out, where X is the latest
picked block. We also send out the stake of received `HeaviestFork` messages so
that we can proceed to the next step when enough validators are ready.
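
To make the fork selection concrete, below is a minimal Rust sketch of the
child walk described above. The `ForkView` type, its `children` and `vote_pct`
fields, and `pick_heaviest_slot` are hypothetical stand-ins for the data a
validator would aggregate from `LastVotedForkSlots`; they are not an existing
validator API.

```rust
use std::collections::HashMap;

type Slot = u64;

/// Hypothetical local view built from the aggregated `LastVotedForkSlots`
/// data: for each replayed slot, its children and the percentage of total
/// stake whose last voted fork contains it.
struct ForkView {
    children: HashMap<Slot, Vec<Slot>>,
    vote_pct: HashMap<Slot, f64>,
}

/// Start from the last block that must be on the picked fork (step 3.1) and
/// keep descending into the heaviest child while it clears the 62% bar.
fn pick_heaviest_slot(view: &ForkView, start: Slot, active_stake_pct: f64) -> Slot {
    let mut picked = start;
    loop {
        // Heaviest child of the currently picked block, if any.
        let heaviest_child = view.children.get(&picked).and_then(|kids| {
            kids.iter().copied().max_by(|a, b| {
                view.vote_pct[a]
                    .partial_cmp(&view.vote_pct[b])
                    .expect("vote percentages are finite")
            })
        });
        match heaviest_child {
            // Step 3.2.1: 62% = the 67% optimistic-confirmation threshold
            // minus the 5% mistake allowance; stake outside the restart is
            // counted optimistically toward the child.
            Some(child)
                if view.vote_pct[&child] + (100.0 - active_stake_pct) >= 62.0 =>
            {
                picked = child;
            }
            // Step 3.2.2: otherwise stop traversing and keep the last picked
            // block.
            _ => return picked,
        }
    }
}
```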

### 4. Proceed to restart if everything looks okay, halt otherwise

All validators in restart keep counting the number of `HeaviestFork` messages
where `received_heaviest_stake` is higher than 80%. Once a validator counts
that 80% of the validators have sent out `HeaviestFork` where
`received_heaviest_stake` is higher than 80%, it starts the following checks:

- Whether all `HeaviestFork` messages have the same slot and the same block
hash. Because validators only send slots, not bank hashes, in
`LastVotedForkSlots`, it's possible that a duplicate block could make the
cluster unable to reach consensus, so the block hash needs to be checked as
well.

- Whether the voted slot is equal to, or a child of, the local optimistically
confirmed slot.

If all checks pass, the validator immediately starts generating a snapshot at
the agreed-upon slot.

While the snapshot generation is in progress, the validator also checks
whether two minutes have passed since agreement was reached, to guarantee that
its `HeaviestFork` message propagates to everyone, then proceeds to restart:

1. Issue a hard fork at the designated slot and change the shred version in
gossip.
2. Execute the current tasks in --wait-for-supermajority and wait for 80% of
the stake.

Before a validator enters restart, it will still propagate `LastVotedForkSlots`
and `HeaviestFork` messages in gossip. After the restart, its shred_version will
be updated so it will no longer send or propagate gossip messages for restart.

If any of the checks fails, the validator immediately prints out all debug info,
sends out metrics so that people can be paged, and then halts.
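
As a rough illustration of this exit condition, here is a Rust sketch under
several stated assumptions: `PeerHeaviestFork`, `RestartDecision`, the
stake-as-percentage model, and the `is_descendant_of` ancestry callback are
all hypothetical names for illustration, not existing validator code.

```rust
type Slot = u64;
type Hash = [u8; 32];

/// One `HeaviestFork` message received from a peer, annotated with the
/// sender's stake.
pub struct PeerHeaviestFork {
    pub sender_stake_pct: f64,
    pub slot: Slot,
    pub hash: Hash,
    pub received_heaviest_stake: u8,
}

pub enum RestartDecision {
    /// Not enough agreement yet; keep waiting and re-checking.
    Wait,
    /// All checks passed; snapshot at this slot, then restart.
    Restart { slot: Slot, hash: Hash },
    /// A check failed; print debug info, emit metrics, and halt.
    Halt(String),
}

/// `is_descendant_of(a, b)`: does slot `a` equal or descend from slot `b`
/// on the local ledger? (hypothetical ancestry check)
pub fn decide(
    peers: &[PeerHeaviestFork],
    local_optimistically_confirmed: Slot,
    is_descendant_of: impl Fn(Slot, Slot) -> bool,
) -> RestartDecision {
    // Stake of validators whose HeaviestFork reports >80% received stake.
    let ready_stake: f64 = peers
        .iter()
        .filter(|p| p.received_heaviest_stake > 80)
        .map(|p| p.sender_stake_pct)
        .sum();
    if ready_stake < 80.0 {
        return RestartDecision::Wait;
    }
    // Every HeaviestFork must agree on the same slot and bank hash.
    let (slot, hash) = (peers[0].slot, peers[0].hash);
    if !peers.iter().all(|p| p.slot == slot && p.hash == hash) {
        return RestartDecision::Halt("HeaviestFork slot/hash mismatch".into());
    }
    // The agreed slot must equal, or descend from, the local optimistically
    // confirmed slot.
    if !is_descendant_of(slot, local_optimistically_confirmed) {
        return RestartDecision::Halt(
            "agreed slot is not on the local optimistically confirmed fork".into(),
        );
    }
    RestartDecision::Restart { slot, hash }
}
```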

## Impact

This proposal adds a new silent repair mode to validators; during this phase
the validators will not participate in normal cluster activities, which is the
same as now. Compared to today's cluster restart, the new mode may require more
network bandwidth and memory on the restarting validators, but it guarantees
the safety of optimistically confirmed user transactions, and validator admins
don't need to manually generate and download snapshots again.

## Security Considerations

The two added gossip messages `LastVotedForkSlots` and `HeaviestFork` will only
be sent and processed when the validator is restarted in RepairAndRestart mode,
so a random validator restarting in the new mode will not bring an extra burden
to the system.

Non-conforming validators could send out wrong `LastVotedForkSlots` and
`HeaviestFork` messages to mess with cluster restarts; these should be covered
by slashing rules in the future.

## Backwards Compatibility

This change is backward compatible with previous versions, because validators
only enter the new mode via the new restart command line argument. All current
restart arguments like --wait-for-supermajority and --expected-bank-hash will
be kept as is for now.