fix(state-dumper): dump state for post-resharding shards #12491

marcelo-gonzalez · 2024-11-20T17:35:35Z

Currently, the state dumper starts one thread for each ShardId in the current epoch, and each of those is responsible for dumping headers and parts just for that ShardId. But after a resharding, no thread is aware that it should dump state for any of the new ShardIds. Here we fix it by iterating not just over the ShardIds in the current epoch when we start the threads, but also over any future ShardIds that might belong to epochs after a protocol upgrade.

This is not great for two reasons. First, we're starting threads that won't be doing anything useful for quite some time, yet they still do work in a loop, which can be nontrivial in tests since we set the "iteration_delay" config value to 100ms. Second, we don't stop old threads after the shard ID they correspond to is no longer a valid shard ID in the current epoch. But it's not horrible, and this is an easy first fix.
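
For readers skimming the description, here is a minimal sketch of the idea, not the code in this PR. The types, the `layout_for_version` helper, and the example split (shard 1 becoming shards 2 and 3 at protocol version 2) are all hypothetical stand-ins chosen for illustration; the real change lives in nearcore/src/state_sync.rs and goes through the EpochManager.

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for nearcore's types; the real definitions differ.
type ShardId = u64;
type ProtocolVersion = u32;

struct ShardLayout {
    shard_ids: Vec<ShardId>,
}

/// Hypothetical lookup of the shard layout in effect at a protocol version.
/// Assumed for illustration: version 2 splits shard 1 into shards 2 and 3.
fn layout_for_version(version: ProtocolVersion) -> ShardLayout {
    if version < 2 {
        ShardLayout { shard_ids: vec![0, 1] }
    } else {
        ShardLayout { shard_ids: vec![0, 2, 3] }
    }
}

/// Union of the shard IDs in the current layout and in every layout up to
/// the latest known protocol version. One dumper thread would be started
/// per returned ID, including IDs that are not valid yet.
fn all_shard_ids(current: ProtocolVersion, latest: ProtocolVersion) -> BTreeSet<ShardId> {
    let mut ids = BTreeSet::new();
    for version in current..=latest {
        ids.extend(layout_for_version(version).shard_ids);
    }
    ids
}
```

In this sketch a node at version 1 that already knows about version 2 would start threads for shards 0, 1, 2 and 3, which also shows the downside noted above: the threads for 2 and 3 idle until the upgrade, and the thread for 1 is never stopped.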

codecov · 2024-11-20T18:36:42Z

Codecov Report

Attention: Patch coverage is 73.33333% with 12 lines in your changes missing coverage. Please review.

Project coverage is 69.84%. Comparing base (9e4933b) to head (121e8ba).
Report is 1 commit behind head on master.

Files with missing lines	Patch %	Lines
chain/chain/src/test_utils/kv_runtime.rs	0.00%	6 Missing ⚠️
nearcore/src/state_sync.rs	81.25%	0 Missing and 6 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #12491   +/-   ##
=======================================
  Coverage   69.84%   69.84%           
=======================================
  Files         838      838           
  Lines      169410   169424   +14     
  Branches   169410   169424   +14     
=======================================
+ Hits       118323   118341   +18     
+ Misses      45840    45834    -6     
- Partials     5247     5249    +2

Flag	Coverage Δ
backward-compatibility	`0.16% <0.00%> (-0.01%)`	⬇️
db-migration	`0.16% <0.00%> (-0.01%)`	⬇️
genesis-check	`1.29% <0.00%> (-0.01%)`	⬇️
linux	`69.17% <73.33%> (+<0.01%)`	⬆️
linux-nightly	`69.42% <73.33%> (-0.01%)`	⬇️
macos	`51.01% <0.00%> (+<0.01%)`	⬆️
pytests	`1.60% <0.00%> (-0.01%)`	⬇️
sanity-checks	`1.40% <0.00%> (-0.01%)`	⬇️
unittests	`69.67% <73.33%> (+<0.01%)`	⬆️
upgradability	`0.21% <0.00%> (-0.01%)`	⬇️


wacban

LGTM
feel free to merge first

wacban · 2024-11-21T11:09:14Z

chain/chain/src/test_utils/kv_runtime.rs

+    fn get_shard_layout_from_protocol_version(
+        &self,
+        _protocol_version: ProtocolVersion,
+    ) -> Result<ShardLayout, EpochError> {


In the other PR, I removed the Result since it never actually fails.

wacban · 2024-11-21T11:09:46Z

nearcore/src/state_sync.rs

@@ -47,6 +50,37 @@ pub struct StateSyncDumper {
 }

 impl StateSyncDumper {
+    /// Returns all current ShardIDs, plus any that may belong to a future epoch after a protocol upgrade
+    /// For now we start a thread for each shard ID even if it won't be needed for a long time.
+    /// TODO: fix that, and handle the dynamic resharding case.


nit: TODO(resharding)

wacban · 2024-11-21T11:11:31Z

nearcore/src/state_sync.rs

@@ -276,28 +296,20 @@ fn get_current_state(
    };

    if Some(&new_epoch_id) == was_last_epoch_done.as_ref() {
-        tracing::debug!(target: "state_sync_dump", ?shard_id, ?was_last_epoch_done, ?new_epoch_id, new_epoch_height, ?new_sync_hash, "latest epoch is done. No new epoch to dump. Idle");


Why remove the comment?

I guess it's a sort of unrelated change yeah... But in my opinion this file is just way too verbose in general, and this log line in particular feels kind of unnecessary, since we're printing it on a loop constantly after the state dump is done. If you take a look at logs generated by the state dumper by grepping for "state_sync_dump", it's just kind of unreadable. This deletion is one small step to chip away at that I guess.

wacban · 2024-11-21T11:13:27Z

nearcore/src/state_sync.rs

+    let shard_layout = epoch_manager.get_shard_layout(&new_epoch_id)?;
+
+    if shard_layout.shard_ids().contains(shard_id)
+        && cares_about_shard(chain, shard_id, &new_sync_hash, &shard_tracker, &account_id)?


For my education, why does it check cares_about_shard for an account_id? I thought each thread just dumps for one shard id and that's it.

I think it's just so that the thread dumping this shard ID can tell whether the state should be there. I think most of the time this account ID is None because nobody is running the state dumper on a validator, but it's Some in tests, where cares_about_shard() is the right way of telling whether this shard_id is going to be in the state snapshot.
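
To make the two-part condition concrete, here is a rough sketch of its shape. It is an illustration only: `ShardTracker` below is a hypothetical simplification that just holds a list of tracked shards, and this `cares_about_shard` takes far fewer arguments than the real function in the diff.

```rust
// Hypothetical stand-in for nearcore's ShardId.
type ShardId = u64;

/// Hypothetical simplification: the set of shards this node tracks.
struct ShardTracker {
    tracked: Vec<ShardId>,
}

/// Simplified version of the tracking check (assumed behavior).
fn cares_about_shard(tracker: &ShardTracker, shard_id: ShardId) -> bool {
    tracker.tracked.contains(&shard_id)
}

/// A thread dumps only if its shard exists in the new epoch's layout
/// and the node actually tracks that shard.
fn should_dump(layout_shard_ids: &[ShardId], tracker: &ShardTracker, shard_id: ShardId) -> bool {
    layout_shard_ids.contains(&shard_id) && cares_about_shard(tracker, shard_id)
}
```

The layout check is what keeps a thread started early for a future shard ID idle until that ID appears in an epoch's layout; the tracking check covers the case where the state for a valid shard is not expected to be in the snapshot.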

marcelo-gonzalez added 2 commits November 20, 2024 11:45

cherry pick EpochManager change from near#12484

7dc3716

dump state for post-resharding shards

ff9265d

marcelo-gonzalez requested a review from wacban November 20, 2024 17:35

marcelo-gonzalez requested a review from a team as a code owner November 20, 2024 17:35

wacban approved these changes Nov 21, 2024

marcelo-gonzalez added 3 commits November 21, 2024 11:05

TODO(resharding)

d2a92ad

cherry pick update from near#12484

1faf4de

Merge remote-tracking branch 'origin/master' into state-dumper-shards

121e8ba

marcelo-gonzalez enabled auto-merge November 21, 2024 16:16

marcelo-gonzalez added this pull request to the merge queue Nov 21, 2024

Merged via the queue into near:master with commit 4e87bae Nov 21, 2024
26 of 28 checks passed

marcelo-gonzalez deleted the state-dumper-shards branch November 21, 2024 17:00
