fix(state-dumper): dump state for post-resharding shards #12491
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #12491    +/-   ##
=========================================
  Coverage   69.84%   69.84%
=========================================
  Files         838      838
  Lines      169410   169424    +14
  Branches   169410   169424    +14
=========================================
+ Hits       118323   118341    +18
+ Misses      45840    45834     -6
- Partials     5247     5249     +2
LGTM
feel free to merge first
fn get_shard_layout_from_protocol_version(
    &self,
    _protocol_version: ProtocolVersion,
) -> Result<ShardLayout, EpochError> {
In the other PR, I removed the Result since it never actually fails.
nearcore/src/state_sync.rs
Outdated
@@ -47,6 +50,37 @@ pub struct StateSyncDumper {
}

impl StateSyncDumper {
    /// Returns all current ShardIDs, plus any that may belong to a future epoch after a protocol upgrade
    /// For now we start a thread for each shard ID even if it won't be needed for a long time.
    /// TODO: fix that, and handle the dynamic resharding case.
nit: TODO(resharding)
@@ -276,28 +296,20 @@ fn get_current_state(
    };

    if Some(&new_epoch_id) == was_last_epoch_done.as_ref() {
        tracing::debug!(target: "state_sync_dump", ?shard_id, ?was_last_epoch_done, ?new_epoch_id, new_epoch_height, ?new_sync_hash, "latest epoch is done. No new epoch to dump. Idle");
Why remove the comment?
I guess it's a sort of unrelated change yeah... But in my opinion this file is just way too verbose in general, and this log line in particular feels kind of unnecessary, since we're printing it on a loop constantly after the state dump is done. If you take a look at logs generated by the state dumper by grepping for "state_sync_dump", it's just kind of unreadable. This deletion is one small step to chip away at that I guess.
let shard_layout = epoch_manager.get_shard_layout(&new_epoch_id)?;

if shard_layout.shard_ids().contains(shard_id)
    && cares_about_shard(chain, shard_id, &new_sync_hash, &shard_tracker, &account_id)?
For my education, why does it check cares_about_shard for an account_id? I thought each thread just dumps for one shard id and that's it.
I think it's just so that the thread dumping this shard ID can tell whether the state should be there. I think most of the time this account ID is None because nobody is running the state dumper on a validator, but it's Some in tests, where this cares_about_shard() will be the right way of telling whether this shard_id is going to be in the state snapshot.
Currently, the state dumper starts one thread for each ShardId in the current epoch, and each of those is responsible for dumping headers and parts just for that ShardId. But after a resharding, there's no thread that's aware that it should dump state for any of the new ShardIds. Here we fix it by iterating not just over the ShardIds in the current epoch when we start the threads, but also over any future ShardIds that might belong to post-protocol upgrade epochs.
This is not great: we're starting threads that won't do anything useful for quite some time (while still doing work in a loop, which in tests can be nontrivial since we set the "iteration_delay" config value to 100ms), and we don't stop old threads after the shard ID they correspond to is no longer a valid shard ID in the current epoch. But it's not horrible, and this is an easy first fix.
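For readers who want the idea in isolation, here is a minimal sketch of the approach described above. It is not the code in this PR: `ShardId`, `ProtocolVersion`, and `ShardLayout` are simplified stand-ins for the nearcore types, and `layout_for_version`, `all_shard_ids`, and `has_work` are hypothetical helpers with made-up layouts. It only illustrates taking the union of shard IDs over the current and later protocol versions, and the per-iteration check that lets a thread idle while its shard is not in the epoch's layout.

```rust
use std::collections::BTreeSet;

// Simplified stand-ins for the nearcore types (assumptions, not the real definitions).
type ShardId = u64;
type ProtocolVersion = u32;

struct ShardLayout {
    shard_ids: Vec<ShardId>,
}

/// Hypothetical lookup of the shard layout in force at a protocol version.
/// Made-up example: version 2 reshards shard 1 into shards 2 and 3.
fn layout_for_version(version: ProtocolVersion) -> ShardLayout {
    if version >= 2 {
        ShardLayout { shard_ids: vec![0, 2, 3] }
    } else {
        ShardLayout { shard_ids: vec![0, 1] }
    }
}

/// Union of the shard IDs of the current version and every later known version.
/// One dumper thread would be started for each ID in this set.
fn all_shard_ids(current: ProtocolVersion, latest: ProtocolVersion) -> BTreeSet<ShardId> {
    (current..=latest)
        .flat_map(|version| layout_for_version(version).shard_ids)
        .collect()
}

/// Per-iteration check: a thread has something to dump only if its shard
/// belongs to the layout of the epoch being dumped; otherwise it idles.
fn has_work(shard_id: ShardId, epoch_layout: &ShardLayout) -> bool {
    epoch_layout.shard_ids.contains(&shard_id)
}

fn main() {
    let current_layout = layout_for_version(1);
    for shard_id in all_shard_ids(1, 2) {
        let status = if has_work(shard_id, &current_layout) { "dumping" } else { "idle" };
        println!("thread for shard {shard_id}: {status}");
    }
}
```

In this sketch the threads for shards 2 and 3 exist from the start but stay idle until the post-resharding layout takes effect, and the thread for shard 1 is never stopped, which mirrors the trade-off described above.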