
Refactor data column reconstruction and avoid blocking processing #6403

Merged · 13 commits merged into sigp:unstable on Oct 17, 2024

Conversation

@jimmygchen (Member) commented Sep 17, 2024

Issue Addressed

Continuation of #5990.

I've taken the changes from #5990 with some cleanups. This should simplify the code, reduce supernode bandwidth, and improve performance (reconstruction no longer blocks DA processing for the block).

Proposed Changes

  • Move reconstruction logic out of overflow_lru_cache; this simplifies the code and avoids having to pass DataColumnsToPublish around.
  • Changes to data column reconstruction (a rough sketch of the resulting flow follows this list):
    • Attempt reconstruction without holding the availability cache lock, so we can process other gossip / rpc data columns simultaneously.
    • Check the availability cache again before publishing reconstructed columns, to avoid publishing excess duplicates.
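
A rough sketch of that flow (hypothetical glue code; names like check_and_start_reconstruction, reconstruct_columns and publish_data_columns are illustrative, not the exact Lighthouse API):

// Hypothetical outline of the non-blocking reconstruction flow.
fn attempt_reconstruction(&self, block_root: Hash256) -> Result<(), Error> {
    // 1. Atomically decide whether to reconstruct and mark it as started,
    //    cloning the pending columns so the cache lock can be released.
    let Some(pending_components) = self.check_and_start_reconstruction(&block_root) else {
        return Ok(());
    };

    // 2. Reconstruct without holding the availability cache lock; other
    //    gossip / rpc data columns for this block keep being processed.
    let all_columns = reconstruct_columns(pending_components)?;

    // 3. Re-check the cache and publish only the columns we have not already
    //    received via gossip in the meantime, to avoid excess duplicates.
    let existing = self.cached_data_column_indexes(&block_root).unwrap_or_default();
    let to_publish: Vec<_> = all_columns
        .into_iter()
        .filter(|column| !existing.contains(&column.index))
        .collect();
    self.publish_data_columns(block_root, to_publish);
    Ok(())
}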

@jimmygchen added the ready-for-review and das (Data Availability Sampling) labels on Sep 17, 2024
@michaelsproul (Member) left a comment


LGTM.

Just needs some conflicts fixed.

@michaelsproul added the waiting-on-author label and removed ready-for-review on Sep 23, 2024
@dapplion (Collaborator) left a comment


Good to go, modulo conflicts.

@jimmygchen added the ready-for-review label and removed waiting-on-author on Sep 30, 2024
@jimmygchen (Member Author) commented
Thanks @michaelsproul and @dapplion! I've resolved conflicts.
Also noticed that I was doing reconstruction in an async method, so I've moved that into a blocking task here: d3c84e8
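
For context, the usual tokio pattern here (a generic sketch, not necessarily the commit's exact code) is to hand CPU-bound work to the blocking thread pool:

// Generic tokio pattern: CPU-bound work runs on the blocking pool so the
// async executor threads stay free for I/O-driven tasks.
let result = tokio::task::spawn_blocking(move || {
    heavy_kzg_reconstruction(columns) // hypothetical CPU-heavy function
})
.await; // yields Result<T, tokio::task::JoinError>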

Comment on lines 522 to 537
let Some(pending_components) = self
.availability_cache
.peek_pending_components(block_root, |pending_components_opt| {
pending_components_opt.cloned()
})
else {
// Block may have been imported as it does not exist in availability cache.
return Ok(None);
};

if !self.should_reconstruct(&pending_components) {
return Ok(None);
}

self.availability_cache
.set_reconstruction_started(block_root);
@dapplion (Collaborator) commented

There's a potential race condition here: we clone an old view of the pending components, check reconstruction_started, then acquire the lock again in set_reconstruction_started and set reconstruction_started to true. We could have N parallel reconstruction attempts:

  • thread 1: peek_pending_components + clone + drop lock
  • thread 2: peek_pending_components + clone + drop lock
  • thread 1: should_reconstruct returns true
  • thread 2: should_reconstruct returns true

So either hold the lock until you set_reconstruction_started or restructure the check into being atomic.
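
A minimal sketch of the atomic version, based on the snippets in this thread (check_and_start_reconstruction is a hypothetical name, and this assumes should_reconstruct does not itself take the lock):

// Check `should_reconstruct` and flip `reconstruction_started` inside a
// single write-lock scope, so two threads cannot both pass the check.
fn check_and_start_reconstruction(
    &self,
    block_root: &Hash256,
) -> Option<PendingComponents<E>> {
    let mut cache = self.critical.write(); // one lock acquisition for check + set
    let pending_components = cache.get_mut(block_root)?;
    if pending_components.reconstruction_started
        || !self.should_reconstruct(pending_components)
    {
        return None;
    }
    pending_components.reconstruction_started = true;
    // Clone for use outside the lock, only after marking as started.
    Some(pending_components.clone())
}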

@jimmygchen (Member Author) commented

Good catch!

@jimmygchen (Member Author) commented

I think this is fixed in b0c3379.

I've also added error handling for reconstruction failure, so we can recover from potentially bad columns that somehow made it through to the DA checker.

);

self.availability_cache
.put_kzg_verified_data_columns(
@dapplion (Collaborator) commented

We have the same issue here: we take a snapshot of an old view of the shared state in imported_custody_column_indexes and then later acquire the write lock again in put_kzg_verified_data_columns. However, if the bug above is fixed and reconstruction strictly happens once, we should be fine? Worst case, we publish the same thing twice. However, now that the entry in the da_checker is not dropped until after import, this can still be problematic and lead to double imports?

@jimmygchen (Member Author) commented

However, if the bug above is fixed and reconstruction strictly happens once, we should be fine

Yep, I think so. The double import issue has been raised here; I think it might be best to address it separately: #6439

@jimmygchen added the waiting-on-author label and removed ready-for-review on Oct 4, 2024
@jimmygchen added the ready-for-review label and removed waiting-on-author on Oct 11, 2024
Comment on lines +612 to +615
pub fn handle_reconstruction_failure(&self, block_root: &Hash256) {
if let Some(pending_components_mut) = self.critical.write().get_mut(block_root) {
pending_components_mut.verified_data_columns = vec![];
pending_components_mut.reconstruction_started = false;
@dapplion (Collaborator) commented

Nice! Will sleep better with this fallback :)

@dapplion (Collaborator) left a comment

LGTM! Thanks for the iterations, it looks solid now.

@ethDreamer (Member) left a comment

I checked out all the DA checker / overflow_lru_cache changes. It looks good to me, although I can't really comment on race conditions with block import until I have a conversation with you guys about it.

{
    ReconstructColumnsDecision::Yes(pending_components) => pending_components,
    ReconstructColumnsDecision::No(reason) => {
        return Ok(DataColumnReconstructionResult::NotRequired(reason));
@ethDreamer (Member) commented

I wanted to point out that the Error variant here is called NotRequired, but ReconstructColumnsDecision includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet. I thought you might instead change things to:

#[derive(Debug)]
pub enum RejectionCriteria {
    BlockAlreadyImported,
    AlreadyStarted,
    NotRequiredForFullNode,
    AllColumnsReceived,
    NotEnoughColumns,
}

pub enum DataColumnReconstructionResult<E: EthSpec> {
    Success(AvailabilityAndReconstructedColumns<E>),
    NotRequired(RejectionCriteria),
    Pending(RejectionCriteria),
}

pub(crate) enum ReconstructColumnsDecision<E: EthSpec> {
    Yes(PendingComponents<E>),
    No(RejectionCriteria),
}

impl<E: EthSpec> From<RejectionCriteria> for DataColumnReconstructionResult<E> {
    fn from(criteria: RejectionCriteria) -> Self {
        match criteria {
            RejectionCriteria::NotEnoughColumns => DataColumnReconstructionResult::Pending(criteria),
            _ => DataColumnReconstructionResult::NotRequired(criteria),
        }
    }
}
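
With that From impl in place, the match arm quoted above could (hypothetically) collapse to:

ReconstructColumnsDecision::No(criteria) => return Ok(criteria.into()),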

@jimmygchen (Member Author) commented

Ah yeah, I thought about having an enum for all the reasons, but decided to go with a string instead because we don't really need to handle each variant differently. The main usage of the reason is as a metric label, which requires a string, so I went with a string for simplicity and efficiency:

metrics::inc_counter_vec(
    &metrics::KZG_DATA_COLUMN_RECONSTRUCTION_INCOMPLETE_TOTAL,
    &[reason],
);

the Error variant here is called NotRequired, but ReconstructColumnsDecision includes "not enough columns" which (unlike every other variant) doesn't actually mean it's not required, only that it can't be done yet.

Yes, that's right. If this is confusing I could rename DataColumnReconstructionResult::NotRequired to something else?

@jimmygchen (Member Author) commented

I've renamed NotRequired to NotStarted.


// Check indices from cache again to make sure we don't publish components we've already received.
let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else {
    return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported(
@ethDreamer (Member) commented

Just checking my understanding here, but shouldn't this never happen due to the reconstruction_started gate?

@jimmygchen (Member Author) commented

Yeah, good question - this scenario can happen if we receive the columns via gossip in the meantime, in which case we no longer need to import/publish them.

@michaelsproul (Member) commented

@mergify queue

mergify bot commented on Oct 17, 2024

✅ The pull request has been merged automatically at ee7fca3

@michaelsproul added the ready-for-merge label and removed ready-for-review on Oct 17, 2024
@michaelsproul (Member) commented

@mergify requeue

mergify bot commented on Oct 17, 2024

❌ This pull request head commit has not been previously disembarked from queue.

@michaelsproul (Member) commented

@mergify refresh

mergify bot commented on Oct 17, 2024

✅ Pull request refreshed

mergify bot added a commit that referenced this pull request Oct 17, 2024
mergify bot merged commit ee7fca3 into sigp:unstable on Oct 17, 2024 (28 of 29 checks passed)
chong-he pushed a commit to chong-he/lighthouse that referenced this pull request on Nov 26, 2024:

Refactor data column reconstruction and avoid blocking processing (sigp#6403)

* Move reconstruction logic out of `overflow_lru_cache` to simplify the code and avoid having to pass `DataColumnsToPublish` around and blocking other processing.

* Publish reconstructed cells before recomputing head. Remove duplicate functions.

* Merge branch 'unstable' into non-blocking-reconstruction

* Merge branch 'unstable' into non-blocking-reconstruction

# Conflicts:
#	beacon_node/beacon_chain/src/beacon_chain.rs
#	beacon_node/beacon_chain/src/data_availability_checker.rs
#	beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
#	beacon_node/network/src/network_beacon_processor/sync_methods.rs

* Spawn a blocking task for reconstruction.

* Merge branch 'unstable' into non-blocking-reconstruction

# Conflicts:
#	beacon_node/network/src/network_beacon_processor/mod.rs

* Fix fmt

* Merge branch 'unstable' into non-blocking-reconstruction

# Conflicts:
#	beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs

* Fix race condition by making check and mutation atomic as suggested by Lion. Also added error handling to reconstruction failure.

* Add reconstruction reason metric and more debug logging to da checker.

* Add comment and logging.

* Rename `NotRequired` to `NotStarted`.

* Remove extra character added.
Labels: das (Data Availability Sampling), ready-for-merge
5 participants