-
Notifications
You must be signed in to change notification settings - Fork 684
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(resharding): Skip old ShardUIds in get_new_flat_storage_head() #12505
fix(resharding): Skip old ShardUIds in get_new_flat_storage_head() #12505
Conversation
This function finds the chunk in the last final block by looking for the one with the same shard ID as one of the shards in the current block's epoch. But if there is a skipped chunk in a child shard in the first block after a resharding, then the old chunk in the child's shard_index will be the parent's chunk from the previous epoch. Then this shard ID comparison doesn't work and we get an error. Fix it by comparing the current ShardUID to the ShardUID of the chunk, and returning None if they don't match.
This can be triggered by diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index e949fdbb5..b62312fd2 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -791,27 +791,7 @@ impl Chain {
me,
&prev_hash,
)?;
- let prev_block = self.get_block(&prev_hash)?;
- if prev_block.chunks().len() != epoch_first_block.chunks().len()
- && !shards_to_state_sync.is_empty()
- {
- // Currently, the state sync algorithm assumes that the number of chunks do not change
- // between the epoch being synced to and the last epoch.
- // For example, if shard layout changes at the beginning of epoch T, validators
- // will not be able to sync states at epoch T for epoch T+1
- // Fortunately, since all validators track all shards for now, this error will not be
- // triggered in live yet
- // Instead of propagating the error, we simply log the error here because the error
- // do not affect processing blocks for this epoch. However, when the next epoch comes,
- // the validator will not have the states ready so it will halt.
- error!(
- "Cannot download states for epoch {:?} because sharding just changed. I'm {:?}",
- epoch_first_block.header().epoch_id(),
- me
- );
- debug_assert!(false);
- }
if shards_to_state_sync.is_empty() {
Ok(None)
} else {
diff --git a/integration-tests/src/test_loop/tests/resharding_v3.rs b/integration-tests/src/test_loop/tests/resharding_v3.rs
index 537672fc0..1376f3ca0 100644
--- a/integration-tests/src/test_loop/tests/resharding_v3.rs
+++ b/integration-tests/src/test_loop/tests/resharding_v3.rs
@@ -604,7 +604,6 @@ fn test_resharding_v3_double_sign_resharding_block() {
// TODO(resharding): fix nearcore and un-ignore this test
#[test]
-#[ignore]
fn test_resharding_v3_shard_shuffling() {
let params = TestReshardingParameters::new()
.shuffle_shard_assignment() Then if you grep for ERROR in the logs you'll see a lot of this:
The error there isn't printed, but this is what's causing it. Then if you run the same test with the same diff on top of this PR, the test will pass (although it still fails with some scary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one, thank you!
It seemed overly complicated at first but then I got it. It is worth thinking how we could hide all that complexity of missing-chunks-with-old-layout in blocks/chunks/shard layout API.
return Ok(None); | ||
} | ||
|
||
let shard_index = shard_layout.get_shard_index(shard_uid.shard_id())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add short comment why we need it. I'd be tempted to use only shard id check but don't want to rely on shard id uniqueness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I was thinking of just comparing the ShardIds since we're going with unique ones anyway, but I got paranoid and went with the full shard UID check
Well, alternatively, we could just stop "proposing" new heads until all chunks in new shard layout will be "finalised" by blocks. But this is still a hack. |
Also another thing I was thinking while looking at this was what if we only return |
This function finds the chunk in the last final block by looking for the one with the same shard ID as one of the shards in the current block's epoch. But if there is a skipped chunk in a child shard in the first block after a resharding, then the old chunk in the child's shard_index will be the parent's chunk from the previous epoch. Then this shard ID comparison doesn't work and we get an error. Fix it by comparing the current ShardUID to the ShardUID of the chunk, and returning None if they don't match.