fix(state-dumper): dump state for post-resharding shards #12491

Merged
merged 5 commits into from
Nov 21, 2024

marcelo-gonzalez
Contributor

Currently, the state dumper starts one thread for each ShardId in the current epoch, and each of those is responsible for dumping headers and parts just for that ShardId. But after a resharding, there's no thread that's aware that it should dump state for any of the new ShardIds. Here we fix it by iterating not just over the ShardIds in the current epoch when we start the threads, but also over any future ShardIds that might belong to post-protocol upgrade epochs.

This is not great: we start threads that won't do anything useful for quite some time, yet they still do work in a loop (which can be nontrivial in tests, since we set the "iteration_delay" config value to 100ms), and we don't stop old threads once the shard ID they correspond to is no longer valid in the current epoch. But it's not horrible, and this is an easy first fix.
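
For illustration only, here is a minimal sketch of the idea described above; it is not the code from this PR. It assumes shard layouts can be looked up per protocol version and takes the union of shard IDs from the current version up to the latest known one. The type aliases, `shard_ids_to_dump`, `example_layout`, and the version numbers are all hypothetical stand-ins.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for nearcore's `ShardId` and `ProtocolVersion`.
type ShardId = u64;
type ProtocolVersion = u32;

/// Collects every shard ID that appears in the layout of any protocol version
/// from `current` through `latest_known` (inclusive). One dumper thread would
/// then be started per returned ID.
///
/// `layout_for_version` is an assumed lookup; in nearcore the epoch manager
/// plays this role, but its real API is not reproduced here.
fn shard_ids_to_dump<F>(
    current: ProtocolVersion,
    latest_known: ProtocolVersion,
    layout_for_version: F,
) -> BTreeSet<ShardId>
where
    F: Fn(ProtocolVersion) -> Vec<ShardId>,
{
    let mut ids = BTreeSet::new();
    for version in current..=latest_known {
        // A set deduplicates shard IDs shared between layouts.
        ids.extend(layout_for_version(version));
    }
    ids
}

/// Toy layout table with made-up numbers: at version 75, shard 2 is split
/// into shards 3 and 4.
fn example_layout(version: ProtocolVersion) -> Vec<ShardId> {
    if version < 75 {
        vec![0, 1, 2]
    } else {
        vec![0, 1, 3, 4]
    }
}
```

With this toy table, calling `shard_ids_to_dump(74, 76, example_layout)` yields shards 0 through 4, so threads for shards 3 and 4 exist before the resharding takes effect. That is the trade-off noted above: those threads idle until their shard becomes valid.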

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 73.33333% with 12 lines in your changes missing coverage. Please review.

Project coverage is 69.84%. Comparing base (9e4933b) to head (121e8ba).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/test_utils/kv_runtime.rs 0.00% 6 Missing ⚠️
nearcore/src/state_sync.rs 81.25% 0 Missing and 6 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12491   +/-   ##
=======================================
  Coverage   69.84%   69.84%           
=======================================
  Files         838      838           
  Lines      169410   169424   +14     
  Branches   169410   169424   +14     
=======================================
+ Hits       118323   118341   +18     
+ Misses      45840    45834    -6     
- Partials     5247     5249    +2     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.29% <0.00%> (-0.01%) ⬇️
linux 69.17% <73.33%> (+<0.01%) ⬆️
linux-nightly 69.42% <73.33%> (-0.01%) ⬇️
macos 51.01% <0.00%> (+<0.01%) ⬆️
pytests 1.60% <0.00%> (-0.01%) ⬇️
sanity-checks 1.40% <0.00%> (-0.01%) ⬇️
unittests 69.67% <73.33%> (+<0.01%) ⬆️
upgradability 0.21% <0.00%> (-0.01%) ⬇️

Contributor

@wacban wacban left a comment

LGTM
feel free to merge first

fn get_shard_layout_from_protocol_version(
    &self,
    _protocol_version: ProtocolVersion,
) -> Result<ShardLayout, EpochError> {
Contributor

In the other PR, I removed the Result since it never actually fails.

@@ -47,6 +50,37 @@ pub struct StateSyncDumper {
}

impl StateSyncDumper {
    /// Returns all current ShardIDs, plus any that may belong to a future epoch after a protocol upgrade
    /// For now we start a thread for each shard ID even if it won't be needed for a long time.
    /// TODO: fix that, and handle the dynamic resharding case.
Contributor

nit: TODO(resharding)

@@ -276,28 +296,20 @@ fn get_current_state(
};

if Some(&new_epoch_id) == was_last_epoch_done.as_ref() {
    tracing::debug!(target: "state_sync_dump", ?shard_id, ?was_last_epoch_done, ?new_epoch_id, new_epoch_height, ?new_sync_hash, "latest epoch is done. No new epoch to dump. Idle");
Contributor

Why remove the comment?

Contributor Author

I guess it's a sort of unrelated change yeah... But in my opinion this file is just way too verbose in general, and this log line in particular feels kind of unnecessary, since we're printing it on a loop constantly after the state dump is done. If you take a look at logs generated by the state dumper by grepping for "state_sync_dump", it's just kind of unreadable. This deletion is one small step to chip away at that I guess.

let shard_layout = epoch_manager.get_shard_layout(&new_epoch_id)?;

if shard_layout.shard_ids().contains(shard_id)
    && cares_about_shard(chain, shard_id, &new_sync_hash, &shard_tracker, &account_id)?
Contributor

For my education, why does it check cares_about_shard for an account_id? I thought each thread just dumps for one shard id and that's it.

Contributor Author

I think it's just so that the thread dumping this shard ID can tell whether the state should be there. I think most of the time this account ID is None because nobody is running the state dumper on a validator, but it's Some in tests, where this cares_about_shard() will be the right way of telling whether this shard_id is going to be in the state snapshot.
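
To make the guard discussed in this thread concrete, here is a hedged sketch of the per-thread decision. It only mirrors the shape of the condition in the diff above. `ShardLayout`, `Action`, and `next_action` are simplified, hypothetical stand-ins, and the result of `cares_about_shard` is passed in as a plain boolean rather than computed.

```rust
// Hypothetical stand-in for nearcore's `ShardId`.
type ShardId = u64;

/// Hypothetical, simplified view of one epoch's shard layout.
struct ShardLayout {
    shard_ids: Vec<ShardId>,
}

/// What a dumper thread does on one loop iteration.
#[derive(Debug, PartialEq)]
enum Action {
    Dump,
    Idle,
}

/// A thread dumps only if its shard exists in the new epoch's layout AND the
/// node is expected to hold that shard's state. `cares_about_shard` stands in
/// for the outcome of the real check, whose logic is not modeled here.
fn next_action(layout: &ShardLayout, shard_id: ShardId, cares_about_shard: bool) -> Action {
    if layout.shard_ids.contains(&shard_id) && cares_about_shard {
        Action::Dump
    } else {
        Action::Idle
    }
}
```

Under these assumptions, a thread started for a pre-resharding shard that is absent from the new layout keeps returning `Action::Idle`, as does a thread for a shard whose state the node does not hold.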

@marcelo-gonzalez marcelo-gonzalez added this pull request to the merge queue Nov 21, 2024
Merged via the queue into near:master with commit 4e87bae Nov 21, 2024
26 of 28 checks passed
@marcelo-gonzalez marcelo-gonzalez deleted the state-dumper-shards branch November 21, 2024 17:00