This repository has been archived by the owner on Jan 22, 2025. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
Wen restart aggregate last voted fork slots #33892
Merged
wen-coding
merged 61 commits into
solana-labs:master
from
wen-coding:wen_restart_aggregate_last_voted_fork_slots
Mar 2, 2024
Merged
Changes from 1 commit
Commits
Show all changes
61 commits
Select commit
Hold shift + click to select a range
47f8891
Push and aggregate RestartLastVotedForkSlots.
wen-coding a8d0c08
Fix API and lint errors.
wen-coding f7b8232
Reduce clutter.
wen-coding f5f71b4
Put my own LastVotedForkSlots into the aggregate.
wen-coding 630cc70
Merge branch 'solana-labs:master' into wen_restart_aggregate_last_vot…
wen-coding ce32c03
Write LastVotedForkSlots aggregate progress into local file.
wen-coding b90185d
Fix typo and name constants.
wen-coding 3c819f0
Fix flaky test.
wen-coding e21efe3
Clarify the comments.
wen-coding b24b8db
- Use constant for wait_for_supermajority
wen-coding a2204f3
Fix delay_after_first_shred and remove loop in wen_restart.
wen-coding 0c1ef0f
Read wen_restart slots inside the loop instead.
wen-coding b9324c8
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 122314d
Discard turbine shreds while in wen_restart in windows insert rather …
wen-coding e1252a4
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 31ca285
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding c3ab972
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 8fc2327
Use the new Gossip API.
wen-coding 229f447
Rename slots_to_repair_for_wen_restart and a few others.
wen-coding bc1b4b5
Rename a few more and list all states.
wen-coding 8743b5c
Pipe exit down to aggregate loop so we can exit early.
wen-coding 4ebbde8
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 0d82a7c
Fix import of RestartLastVotedForkSlots.
wen-coding 5e0a5b1
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding ec21ec1
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding c172c26
Use the new method to generate test bank.
wen-coding 08de626
Make linter happy.
wen-coding ea4d800
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 4f91be7
Use new bank constructor for tests.
wen-coding de89a4e
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 1e98324
Fix a bad merge.
wen-coding 777523f
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 1e478e8
- add new const for wen_restart
wen-coding b0980e4
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding f4acd69
Add initialize and put the main logic into a loop.
wen-coding bb471c1
Merge branch 'wen_restart_aggregate_last_voted_fork_slots' of https:/…
wen-coding c45a29b
Change aggregate interface and other fixes.
wen-coding 167b790
Add failure tests and tests for state transition.
wen-coding e0a070f
Add more tests and add ability to recover from written records in
wen-coding 8be5cd0
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding ddd144e
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 1cfc510
Various name changes.
wen-coding 5b10c6e
We don't really care what type of error is returned.
wen-coding 0620aaf
Wait on expected progress message in proto file instead of sleep.
wen-coding 1ceda56
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 93abe45
Code reorganization and cleanup.
wen-coding cb1788e
Make linter happy.
wen-coding 72a732e
Add WenRestartError.
wen-coding 4c920cb
Split WenRestartErrors into separate erros per state.
wen-coding bf71c9b
Revert "Split WenRestartErrors into separate erros per state."
wen-coding 056aef7
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 645452f
Use individual functions when testing for failures.
wen-coding f46e62a
Move initialization errors into initialize().
wen-coding e3d0194
Use anyhow instead of thiserror to generate backtrace for error.
wen-coding ffbb20c
Add missing Cargo.lock.
wen-coding 3b50964
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 59fd5ff
Add error log when last_vote is missing in the tower storage.
wen-coding e7c320c
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 9ccf5ca
Merge branch 'master' into wen_restart_aggregate_last_voted_fork_slots
wen-coding 021dbe9
Change error log info.
wen-coding 40c0fb6
Change test to match exact error.
wen-coding File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to organize these by state transition:
wait_for_wen_restart()
still returnsBox<dyn std::error::Error>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, when you say "write properly to the record", do you mean writing to the logs or to the protobuf as well? Since some of the errors might be transient, I didn't plan to write the errors to the protobuf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't mean write the error out. I mean the
progress
data structure you write is what you expect after the error happens.For every single error, for every single state transition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the change but it actually felt wrong IMHO, several reasons:
pub enum xxxError is public interface, if you think about how it's used, all people see from outside the module is wait_for_wen_restart, so it feels natural to have WenRestartError returned, it feels unnatural to have several Error enum returned, leaking our implementation details (which they shouldn't care about).
I could of course test each function individually, but I do have to test they work in wait_for_wen_restart. If I only need to test is_err() then it's all fine, but if I want to test for specific error (e.g. file permission wrong), I had to do a downcast on dyn error, I had to constantly look back to error definitions to see what type of error I'm now expecting
The fact that we do need to write the protobuf means I need to call write_wen_restart() from several places, but they will return different xxxError types, so I need to create yet another WriteWenRestartError and convert to other Error types, this is code duplication and unnecessary conversion. If we move part of the function to another internal method, we need to create yet another Error struct?
All in all, this feels wrong, maybe I'm misunderstanding what you described.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're not writing a standalone library so I think a boxed error is fine. The purpose of this error is to make attribution of errors/debugging easier, which categorizing the errors does. We can now trace exactly which step in the restart went wrong. I think we could also introduce a top level
WenRestartError
that wraps the errors like so:Hmm I don't think you have to test every error case in
wait_for_wen_restart()
. I think if the state transition functions area-> b -> c
, then:a
,b
, andc
individually will run successfully to "completion" across all cases. "Completion" in this case means the ending output passes the verification for the initial input of the next state transition. Concretely this means testing in the happy case the state is correct, and that on all declared errors, they can recover and then run to completion.wait_for_wen_restart()
is that the order of callsa -> b -> c
is correct and the arguments are being passed correctly. Concretely this means just testing one happy path.I think a unique error for
write_wen_restart
is fine, it is just a suberror of each state transition. This way we know which step the write failed in.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more, I think I agree that failures should be tested on individual functions rather than wait_on_wen_restart(), so I changed accordingly.
The only thing I'm hesitating is that I still don't see how splitting WenRestartError into individual errors helps us. If the proto file is suddenly not writable, we need to solve that problem no matter what stage WenRestart is in, so I don't think replicating file IO errors everywhere really helps the user, it just complicates the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this, we guarantee no errors on the state transitions 😄 . I.e. we guarantee the state going in and the state coming out of the records are always valid.
solana/wen-restart/src/wen_restart.rs
Lines 88 to 92 in 645452f
https://github.com/solana-labs/solana/pull/33892/files#diff-c32a5c15bdd23820489e6f8af8b305aee344228d0cb6789e554a988d57d54bd7R70-R71. These validations can be done in the beginning on entry into the state machine in
wait_for_wen_restart
, as avalidate_state_1()
,validate_state_2()
, etc. These can then be unit tested. Ideally I think we would changeWenRestartProgress
to be an enum with a separate state for each step in this state machine, i.e.and then at the end of each state transition we just spit out the next iteration of the enum
write_wen_restart_records()
either writes a whole valid record OR doesn't write a record/i.e. errors, then we know the state must always be valid.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can move the initialization errors into initialize(), but I can't guarantee there are no other errors after initialize(). For example, push_restart_last_voted_fork_slots() could return error. And we might want to throw out error in the future while running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right on this 😃
If we don't want a separate enum per state transition, then at minimum I would like to be able to print out the backtrace of where that error occurred. Maybe this: https://stackoverflow.com/questions/75060346/how-can-i-get-file-and-line-number-where-an-error-was-returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to anyhow instead of thiserror for backtrace, also moved initialization errors into initialize().