-
Notifications
You must be signed in to change notification settings - Fork 764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor data column reconstruction and avoid blocking processing #6403
Refactor data column reconstruction and avoid blocking processing #6403
Conversation
… code and avoids having to pass `DataColumnsToPublish` around and blocking other processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Just needs some conflicts fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to go modulo conflicts
# Conflicts: # beacon_node/beacon_chain/src/beacon_chain.rs # beacon_node/beacon_chain/src/data_availability_checker.rs # beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs # beacon_node/network/src/network_beacon_processor/sync_methods.rs
Thanks @michaelsproul and @dapplion ! I've resolved conflicts. |
# Conflicts: # beacon_node/network/src/network_beacon_processor/mod.rs
let Some(pending_components) = self | ||
.availability_cache | ||
.peek_pending_components(block_root, |pending_components_opt| { | ||
pending_components_opt.cloned() | ||
}) | ||
else { | ||
// Block may have been imported as it does not exist in availability cache. | ||
return Ok(None); | ||
}; | ||
|
||
if !self.should_reconstruct(&pending_components) { | ||
return Ok(None); | ||
} | ||
|
||
self.availability_cache | ||
.set_reconstruction_started(block_root); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition here. We clone an old view of the pending components, check if reconstruction_started
, then acquire the lock again in set_reconstruction_started
and set reconstruction_started
to true. We could have N parallel reconstruction attempts happening:
- thread 1: peek_pending_components + clone + drop lock
- thread 2: peek_pending_components + clone + drop lock
- thread 1:
should_reconstruct
returns true - thread 2:
should_reconstruct
returns true
So either hold the lock until you set_reconstruction_started or restructure the check into being atomic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fixed in b0c3379.
I've also added error handling to reconstruction failure, so we can recover from potentially bad columns that somehow made it through to DA checker.
); | ||
|
||
self.availability_cache | ||
.put_kzg_verified_data_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the same issue here, we take a snapshot of an old view of the shared state in imported_custody_column_indexes
and then later acquire the write lock again in put_kzg_verified_data_columns
. However, if the bug above is fixed and reconstruction strictly happens once we should be fine? Also worst case we publish the same thing twice. However, now that the entry into the da_checker is not dropped until after import this can still be problematic and lead to double imports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, if the bug above is fixed and reconstruction strictly happens once we should be fine
Yep I think so. The double import issue has been raised here, I think it might be best to address it separately:
#6439
# Conflicts: # beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
…y Lion. Also added error handling to reconstruction failure.
pub fn handle_reconstruction_failure(&self, block_root: &Hash256) { | ||
if let Some(pending_components_mut) = self.critical.write().get_mut(block_root) { | ||
pending_components_mut.verified_data_columns = vec![]; | ||
pending_components_mut.reconstruction_started = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Will sleep better with this fallback :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the iterations looks solid now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked out all the DA checker / overflow_lru_cache changes. It looks good to me although I can't really comment on race conditions with block import until I have a conversation with you guys about it.
{ | ||
ReconstructColumnsDecision::Yes(pending_components) => pending_components, | ||
ReconstructColumnsDecision::No(reason) => { | ||
return Ok(DataColumnReconstructionResult::NotRequired(reason)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to point out that the Error variant here is called NotRequired
but ReconstructionDecision
includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet. I thought you might instead change things to:
#[derive(Debug)]
pub enum RejectionCriteria {
BlockAlreadyImported,
AlreadyStarted,
NotRequiredForFullNode,
AllColumnsReceived,
NotEnoughColumns,
}
pub enum DataColumnReconstructionResult<E: EthSpec> {
Success(AvailabilityAndReconstructedColumns<E>),
NotRequired(RejectionCriteria),
Pending(RejectionCriteria),
}
pub(crate) enum ReconstructColumnsDecision<E: EthSpec> {
Yes(PendingComponents<E>),
No(RejectionCriteria),
}
impl<E: EthSpec> From<RejectionCriteria> for DataColumnReconstructionResult<E> {
fn from(criteria: RejectionCriteria) -> Self {
match criteria {
RejectionCriteria::NotEnoughColumns => DataColumnReconstructionResult::Pending(criteria),
_ => DataColumnReconstructionResult::NotRequired(criteria),
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah I thought about having an enum for all the reasons, and I decided to go with string instead because we don't really need to handle each variant differently, the main usage of the reason
is for metric label, and it requires a string, hence I went with it for simplicity and efficiency:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 3199 to 3202 in 0e6eaa2
metrics::inc_counter_vec( | |
&metrics::KZG_DATA_COLUMN_RECONSTRUCTION_INCOMPLETE_TOTAL, | |
&[reason], | |
); |
the Error variant here is called NotRequired but ReconstructionDecision includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet.
Yes that's right, if this is confusing I could rename DataColumnReconstructionResult::NotRequired
to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed NotRequired
to NotStarted
.
|
||
// Check indices from cache again to make sure we don't publish components we've already received. | ||
let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else { | ||
return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking my understanding here.. but shouldn't this never happen due to the reconstruction_started
gate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good question - this scenario can happen if we receive the columns via gossip, and we no longer need to import/publish them.
@mergify queue |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at ee7fca3 |
@mergify requeue |
❌ This pull request head commit has not been previously disembarked from queue. |
@mergify refresh |
✅ Pull request refreshed |
…gp#6403) * Move reconstruction logic out of `overflow_lru_cache` to simplify the code and avoids having to pass `DataColumnsToPublish` around and blocking other processing. * Publish reconstructed cells before recomputing head. Remove duplicate functions. * Merge branch 'unstable' into non-blocking-reconstruction * Merge branch 'unstable' into non-blocking-reconstruction # Conflicts: # beacon_node/beacon_chain/src/beacon_chain.rs # beacon_node/beacon_chain/src/data_availability_checker.rs # beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs # beacon_node/network/src/network_beacon_processor/sync_methods.rs * Spawn a blocking task for reconstruction. * Merge branch 'unstable' into non-blocking-reconstruction # Conflicts: # beacon_node/network/src/network_beacon_processor/mod.rs * Fix fmt * Merge branch 'unstable' into non-blocking-reconstruction # Conflicts: # beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs * Fix race condition by making check and mutation atomic as suggested by Lion. Also added error handling to reconstruction failure. * Add reconstruction reason metric and more debug logging to da checker. * Add comment and logging. * Rename `NotRequired` to `NotStarted`. * Remove extra character added.
Issue Addressed
Continuation of #5990.
I've taken the changes from #5990 with some cleanups. This should simplify the code a bit and reduce supernode bandwidth and performance (reconstruction no longer blocks DA processing for the block).
Proposed Changes
overflow_lru_cache
, this simplifies the code and avoids having to passDataColumnsToPublish
around.