Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(resharding): Skip old ShardUIds in get_new_flat_storage_head() #12505

Merged
merged 3 commits into from
Nov 27, 2024

Conversation

marcelo-gonzalez
Copy link
Contributor

This function finds the chunk in the last final block by looking for the one with the same shard ID as one of the shards in the current block's epoch. But if there is a skipped chunk in a child shard in the first block after a resharding, then the old chunk in the child's shard_index will be the parent's chunk from the previous epoch. Then this shard ID comparison doesn't work and we get an error. Fix it by comparing the current ShardUID to the ShardUID of the chunk, and returning None if they don't match.

This function finds the chunk in the last final block by looking for
the one with the same shard ID as one of the shards in the current
block's epoch. But if there is a skipped chunk in a child shard in the
first block after a resharding, then the old chunk in the child's
shard_index will be the parent's chunk from the previous epoch.  Then
this shard ID comparison doesn't work and we get an error. Fix it by
comparing the current ShardUID to the ShardUID of the chunk, and
returning None if they don't match.
@marcelo-gonzalez marcelo-gonzalez requested a review from a team as a code owner November 22, 2024 22:21
@marcelo-gonzalez
Copy link
Contributor Author

This can be triggered by cargo test -p integration-tests --features nightly -- --exact --nocapture test_loop::tests::resharding_v3::test_resharding_v3_shard_shuffling if you run it with this diff:

diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index e949fdbb5..b62312fd2 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -791,27 +791,7 @@ impl Chain {
             me,
             &prev_hash,
         )?;
-        let prev_block = self.get_block(&prev_hash)?;
 
-        if prev_block.chunks().len() != epoch_first_block.chunks().len()
-            && !shards_to_state_sync.is_empty()
-        {
-            // Currently, the state sync algorithm assumes that the number of chunks do not change
-            // between the epoch being synced to and the last epoch.
-            // For example, if shard layout changes at the beginning of epoch T, validators
-            // will not be able to sync states at epoch T for epoch T+1
-            // Fortunately, since all validators track all shards for now, this error will not be
-            // triggered in live yet
-            // Instead of propagating the error, we simply log the error here because the error
-            // do not affect processing blocks for this epoch. However, when the next epoch comes,
-            // the validator will not have the states ready so it will halt.
-            error!(
-                "Cannot download states for epoch {:?} because sharding just changed. I'm {:?}",
-                epoch_first_block.header().epoch_id(),
-                me
-            );
-            debug_assert!(false);
-        }
         if shards_to_state_sync.is_empty() {
             Ok(None)
         } else {
diff --git a/integration-tests/src/test_loop/tests/resharding_v3.rs b/integration-tests/src/test_loop/tests/resharding_v3.rs
index 537672fc0..1376f3ca0 100644
--- a/integration-tests/src/test_loop/tests/resharding_v3.rs
+++ b/integration-tests/src/test_loop/tests/resharding_v3.rs
@@ -604,7 +604,6 @@ fn test_resharding_v3_double_sign_resharding_block() {
 
 // TODO(resharding): fix nearcore and un-ignore this test
 #[test]
-#[ignore]
 fn test_resharding_v3_shard_shuffling() {
     let params = TestReshardingParameters::new()
         .shuffle_shard_assignment()

Then if you grep for ERROR in the logs you'll see a lot of this:

ERROR catchup:run_catchup: near_chain::chain: Error processing block during catch up, retrying

The error there isn't printed, but this is what's causing it. Then if you run the same test with the same diff on top of this PR, the test will pass (although it still fails with some scary Transaction overwrites itself: Store Update error if you change the run_until success_condition to return true after the epoch height == 6 :/)

Copy link
Member

@Longarithm Longarithm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good one, thank you!
It seemed overly complicated at first but then I got it. It is worth thinking how we could hide all that complexity of missing-chunks-with-old-layout in blocks/chunks/shard layout API.

return Ok(None);
}

let shard_index = shard_layout.get_shard_index(shard_uid.shard_id())?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add short comment why we need it. I'd be tempted to use only shard id check but don't want to rely on shard id uniqueness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I was thinking of just comparing the ShardIds since we're going with unique ones anyway, but I got paranoid and went with the full shard UID check

@Longarithm
Copy link
Member

Well, alternatively, we could just stop "proposing" new heads until all chunks in new shard layout will be "finalised" by blocks. But this is still a hack.

@marcelo-gonzalez
Copy link
Contributor Author

Well, alternatively, we could just stop "proposing" new heads until all chunks in new shard layout will be "finalised" by blocks. But this is still a hack.

Also another thing I was thinking while looking at this was what if we only return Some(chunk_header.prev_block_hash()) from get_new_flat_storage_head() if chunk_header is a new chunk? Then you know this mismatch in ShardUIds cannot happen because of the shard layout check earlier in the function. But I wasnt absolutely 100% sure that it is true that if get_new_flat_storage_head() returns the prev_block_hash of an old chunk, then it returned that same hash in the past when it was a new chunk (feels like that should be true? But im not 100% familiar w exactly what edge cases can happen) so I just went with this to not have to think about it.

@marcelo-gonzalez marcelo-gonzalez added this pull request to the merge queue Nov 27, 2024
Merged via the queue into near:master with commit c28ea57 Nov 27, 2024
21 of 22 checks passed
@marcelo-gonzalez marcelo-gonzalez deleted the resharding-flat-update branch November 27, 2024 04:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants