Support tx poh recording in unified scheduler #4150

Open
wants to merge 9 commits into base: master

@ryoqun (Member) commented Dec 17, 2024

Problem

Currently, the unified scheduler can't record transactions into the poh recorder because its subsystem, TaskHandler, and the underlying code path (solana-ledger => solana-runtime) don't support such a committing operation.

Summary of Changes

Support it in a relatively unintrusive way by introducing a callback mechanism along the code path.

Extracted from #3946; see that PR for the general overview.
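
For readers skimming the thread, here is a minimal sketch of what such a pre-commit callback could look like. Only the `Option<impl FnOnce() -> bool>` parameter shape is taken from the diff under review; `ToyRecorder`, `CommitOutcome` and `execute_and_commit` are hypothetical stand-ins, and the real code path is considerably more involved.

```rust
// All names here are hypothetical stand-ins, not the PR's actual types.
#[derive(Debug, PartialEq)]
enum CommitOutcome {
    Committed(usize),
    PreCommitFailed,
}

/// Toy recorder that refuses new entries once it is full.
struct ToyRecorder {
    entries: Vec<u64>,
    capacity: usize,
}

impl ToyRecorder {
    fn record(&mut self, txs: &[u64]) -> bool {
        if self.entries.len() + txs.len() > self.capacity {
            return false;
        }
        self.entries.extend_from_slice(txs);
        true
    }
}

fn execute_and_commit(
    txs: &[u64],
    pre_commit_callback: Option<impl FnOnce() -> bool>,
) -> CommitOutcome {
    // "Execution" is elided; only the ordering of the steps matters here.
    // The callback runs after execution and before commit, and can veto it.
    if let Some(callback) = pre_commit_callback {
        if !callback() {
            return CommitOutcome::PreCommitFailed;
        }
    }
    CommitOutcome::Committed(txs.len())
}

fn main() {
    let mut recorder = ToyRecorder { entries: Vec::new(), capacity: 2 };
    let txs: [u64; 2] = [10, 11];
    // Block production: record first, commit only if recording succeeded.
    let first = execute_and_commit(&txs, Some(|| recorder.record(&txs)));
    let second = execute_and_commit(&txs, Some(|| recorder.record(&txs)));
    // Block verification: no callback at all.
    let verify = execute_and_commit(&txs, None::<fn() -> bool>);
    println!("{first:?} {second:?} {verify:?}");
}
```

In this sketch the second call fails because the toy recorder is full, so nothing is committed for it, while the callback-free call is unaffected.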

@@ -140,7 +140,7 @@ pub struct RecordTransactionsSummary {
     pub starting_transaction_index: Option<usize>,
 }

-#[derive(Clone)]
+#[derive(Clone, Debug)]

Do we really need Debug on all of this? I can't imagine anyone has ever used it for the massive scheduling and bank structures. I'd sooner remove it from all of that than let it spread further, wdyt?

Member Author

hmm, i think Debug is handy for quick debugging and actually required by assert_matches!()...
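
To illustrate the `Debug` requirement, here is a simplified, hypothetical stand-in for an `assert_matches!`-style macro. It is not the real macro's implementation; it only shows why the asserted value has to be printable with `{:?}` when the assertion fails. `RecordSummarySketch` is an invented type.

```rust
// Hypothetical, simplified stand-in for an `assert_matches!`-style macro.
// On mismatch it prints the value with `{:?}`, hence the `Debug` bound.
macro_rules! assert_matches_sketch {
    ($value:expr, $pattern:pat) => {
        match $value {
            $pattern => {}
            ref other => panic!(
                "assertion failed: {:?} does not match {}",
                other,
                stringify!($pattern)
            ),
        }
    };
}

// Invented type; dropping `Debug` from the derive makes the macro call
// below fail to compile, because the panic branch formats the value.
#[derive(Clone, Debug)]
struct RecordSummarySketch {
    starting_transaction_index: Option<usize>,
}

fn main() {
    let summary = RecordSummarySketch {
        starting_transaction_index: Some(3),
    };
    assert_matches_sketch!(
        summary,
        RecordSummarySketch { starting_transaction_index: Some(_) }
    );
    println!("matched: {:?}", summary);
}
```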

Comment on lines 141 to 142
/// Commit failed internally.
CommitFailed,

Why are we now able to fail on commit while it was not previously possible?

It seems like we are failing the pre-commit check (recording), not commit itself.

If that's the case, why can we not eat the error in the same way we do in normal block-production, without adding a new variant here?

Member Author

@ryoqun Dec 18, 2024

> Why are we now able to fail on commit while it was not previously possible?
> ...
> It seems like we are failing the pre-commit check (recording), not commit itself.

hope this rename helps to reduce this confusion: 23159ff

> why can we not eat the error in the same way we do in normal block-production, without adding a new variant here?

The normal block-production code path can handle the error condition without being confined to TransactionError: it decides directly what to commit by looking at RecordTransactionsSummary in core/src/banking_stage/consumer.rs.

However, I opted not to use the block-production code path for SchedulingMode::BlockProduction because it has some functionality that is unwanted for the unified scheduler. Also, I didn't want to introduce yet another code path with copied code for this transaction execution, because these code paths are important to maintain well.

While it's not ideal to add a variant here, there's some precedent for internal-use variants: ResanitizationNeeded and ProgramCacheHitMaxLimit. They were added for ease of introduction rather than by correctly adjusting all the types across the code base.

Comment on lines 449 to 451
BlockProduction => {
let mut vec = vec![];
if handler_context.transaction_status_sender.is_some() {

I don't understand this part, could you explain what the thought process here is?

What does block-production have to do with transaction_status_sender; in nearly all cases it should be None for block-producers since it is only used for RPCs.

Given below code, I'm thinking it may have been a typo and expected to be transaction_recorder?
Even so, it seems better to guarantee the recorder is present if we have block-production mode selected?

Okay; I see now it's RPC-only stuff for the status batches.

Member Author

thanks for pointing out unclear code...: a59e39a

) -> Result<()> {
let TransactionBatchWithIndexes {
batch,
transaction_indexes,
} = batch;
let record_token_balances = transaction_status_sender.is_some();
let mut transaction_indexes = transaction_indexes.to_vec();

nit: adding a clone here

Wondering if we can remove this clone and simplify the callback interface.

What if the callback itself did an allocation (only if necessary) returning the transaction index in that case.

I also really don't like the Option<Option<usize>> that is there now because it hides the meaning. It's not clear from just this code that the outer option means that recording/pre-commit failed.

If I were to do this, I'd take one of these two approaches:

  1. Make the outer option of return value a Result<...,()>
  2. Simple enum type to more clearly represent meanings

Lean towards 1 since it's simpler, and not sure an additional enum would benefit too much here.
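
A small sketch of the difference between the two return types, using hypothetical function names and a `bool` flag in place of the real recording logic:

```rust
// Hypothetical signatures for illustration; not copied from the PR.

/// Before: the outer `None` means "pre-commit failed", the inner `None`
/// means "succeeded, but no transaction index was assigned".
fn pre_commit_nested(recorded: bool, index: Option<usize>) -> Option<Option<usize>> {
    if recorded { Some(index) } else { None }
}

/// After (approach 1): failure is spelled out as `Err(())`.
fn pre_commit_result(recorded: bool, index: Option<usize>) -> Result<Option<usize>, ()> {
    if recorded { Ok(index) } else { Err(()) }
}

fn describe(outcome: Result<Option<usize>, ()>) -> &'static str {
    match outcome {
        Ok(Some(_)) => "recorded with index",
        Ok(None) => "recorded without index",
        Err(()) => "pre-commit failed",
    }
}

fn main() {
    // The nested form works, but `Some(None)` vs `None` is easy to misread.
    assert_eq!(pre_commit_nested(true, None), Some(None));
    assert_eq!(pre_commit_nested(false, None), None);

    println!("{}", describe(pre_commit_result(true, Some(7))));
    println!("{}", describe(pre_commit_result(true, None)));
    println!("{}", describe(pre_commit_result(false, None)));
}
```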

Member Author

> nit: adding a clone here

> Wondering if we can remove this clone and simplify the callback interface.
>
> What if the callback itself did an allocation (only if necessary) returning the transaction index in that case.

nice catch.. done: 08536f0
I think Cow is enough. I'd like to keep the closure entirely agnostic of allocation, for separation of concerns.
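
As a rough illustration of the Cow idea (not the code in 08536f0), the sketch below borrows the index slice when nothing changes and allocates only when new indexes have to be produced. `resolve_indexes` and its parameters are hypothetical.

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns the indexes unchanged (borrowed) unless a
/// freshly assigned starting index requires rewriting them (owned).
fn resolve_indexes(indexes: &[usize], starting_index: Option<usize>) -> Cow<'_, [usize]> {
    match starting_index {
        // No rewrite needed: no allocation, just a borrow.
        None => Cow::Borrowed(indexes),
        // Rewrite needed: this is the only path that allocates.
        Some(start) => Cow::Owned((start..start + indexes.len()).collect()),
    }
}

fn main() {
    let original = [4, 5, 6];
    let untouched = resolve_indexes(&original, None);
    let rewritten = resolve_indexes(&original, Some(100));
    println!("{:?} {:?}", untouched, rewritten);
}
```

The caller decides whether an allocation happens, so a closure passed alongside never needs to know about it.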

> I also really don't like the Option<Option<usize>> that is there now because it hides the meaning. It's not clear from just this code that the outer option means that recording/pre-commit failed.
>
> If I were to do this, I'd take one of these two approaches:
>
>   1. Make the outer option of return value a Result<...,()>
>   2. Simple enum type to more clearly represent meanings
>
> Lean towards 1 since it's simpler, and not sure an additional enum would benefit too much here.

this is done: 3b852f6

recording_config: ExecutionRecordingConfig,
timings: &mut ExecuteTimings,
log_messages_bytes_limit: Option<usize>,
pre_commit_callback: Option<impl FnOnce() -> bool>,

Is it simpler to just not make this optional?
Just have impl FnOnce() -> bool. The compiler will likely optimize the simple case of || true out completely, and it lets us simplify the code by not having conditional calls.

Member Author

@ryoqun Dec 18, 2024

hehe, i made them Option-ed on purpose. wrote about it a bit: 36f8537

indeed, || true will be completely optimized out. but I prefer to retain Option here: this is very security-sensitive code, and the Option gives extra safety by ensuring the block-verification code path isn't affected.
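
The two shapes being discussed can be compared side by side. This is a sketch with hypothetical function names; only the parameter types come from the diff and the review comment.

```rust
/// Optional form, as in the diff: the `None` path never calls anything.
fn commit_optional(pre_commit_callback: Option<impl FnOnce() -> bool>) -> bool {
    match pre_commit_callback {
        Some(callback) => callback(),
        None => true,
    }
}

/// Mandatory form, as suggested in review: callers pass `|| true`.
fn commit_mandatory(pre_commit_callback: impl FnOnce() -> bool) -> bool {
    pre_commit_callback()
}

fn main() {
    // A bare `None` carries no closure type, so the caller has to name one.
    assert!(commit_optional(None::<fn() -> bool>));
    assert!(!commit_optional(Some(|| false)));

    // The mandatory form always invokes a callback, even a trivial one.
    assert!(commit_mandatory(|| true));
    assert!(!commit_mandatory(|| false));
}
```

The optional form costs a type annotation at callback-free call sites, but it makes "no callback on this path" visible in the call itself.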

timings: &mut ExecuteTimings,
log_messages_bytes_limit: Option<usize>,
pre_commit_callback: Option<impl FnOnce() -> bool>,
) -> Option<(Vec<TransactionCommitResult>, TransactionBalancesSet)> {

This also seems like it should be a Result instead of an Option, because this is expected to succeed and fails only when something goes wrong.

Member Author

this is done: 3b852f6

@ryoqun ryoqun force-pushed the unified-scheduler-poh branch from bb442fa to 77136d3 Compare December 18, 2024 15:54
@ryoqun ryoqun requested a review from apfitzge December 18, 2024 15:58
@ryoqun ryoqun force-pushed the unified-scheduler-poh branch 2 times, most recently from b4a71e2 to f715f8c Compare December 19, 2024 04:36
@ryoqun ryoqun force-pushed the unified-scheduler-poh branch from f715f8c to cd60f32 Compare December 19, 2024 04:40
@ryoqun ryoqun force-pushed the unified-scheduler-poh branch 2 times, most recently from 75faba8 to f290620 Compare December 19, 2024 06:00
@ryoqun ryoqun force-pushed the unified-scheduler-poh branch from f290620 to 447bdb6 Compare December 19, 2024 06:06
2 participants