feat: collect and cache builtin instructions cost and count per transaction #2692

Closed

Conversation

tao-stones

Problem

#2561

Summary of Changes

  • Add a new bench case whose test tx has 355 instructions, all of them builtin instructions (including compute-budget ixs). This is the worst case, since every instruction needs its cost resolved.
  • Collect the tx's builtin instruction count and cost; remove compute_budget_ from the instruction_details name to reflect that the struct caches more than just compute-budget details.
  • Update the filter to cache the resolved builtin instruction cost, to avoid repeated hashing and lookups in BUILTIN_INSTRUCTION_COSTS.
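
The caching idea in the last bullet can be sketched roughly as below. This is a minimal illustration, not the PR's code: `ProgramId`, `CheckStatus`, `BuiltinDetails`, `collect_builtin_details`, and the cost value 150 are hypothetical names and figures, and the `HashMap` only stands in for BUILTIN_INSTRUCTION_COSTS. Each account key is looked up at most once per tx; later instructions that use the same program id reuse the cached status.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a 32-byte program id.
type ProgramId = [u8; 32];

// Per-account-key lookup state, so each program id is hashed at most once per tx.
#[derive(Clone, Copy, Default, PartialEq, Debug)]
enum CheckStatus {
    #[default]
    Unchecked,
    NotBuiltin,
    Builtin { default_cost: u32 },
}

// Totals collected for one transaction.
#[derive(Default, Debug, PartialEq)]
struct BuiltinDetails {
    builtin_count: u32,
    builtin_cost: u64,
    lookups: u32, // number of hash-map lookups actually performed
}

/// `instructions` holds, per instruction, the index of its program id in
/// `account_keys`. Panics if an index is out of range (kept simple for the sketch).
fn collect_builtin_details(
    costs: &HashMap<ProgramId, u32>,
    account_keys: &[ProgramId],
    instructions: &[u8],
) -> BuiltinDetails {
    let mut cache = vec![CheckStatus::Unchecked; account_keys.len()];
    let mut details = BuiltinDetails::default();
    for &index in instructions {
        let slot = &mut cache[index as usize];
        if *slot == CheckStatus::Unchecked {
            // First time this program id is seen: do the one hash lookup.
            details.lookups += 1;
            *slot = match costs.get(&account_keys[index as usize]) {
                Some(&default_cost) => CheckStatus::Builtin { default_cost },
                None => CheckStatus::NotBuiltin,
            };
        }
        if let CheckStatus::Builtin { default_cost } = *slot {
            details.builtin_count += 1;
            details.builtin_cost += u64::from(default_cost);
        }
    }
    details
}

fn main() {
    let system: ProgramId = [1; 32];
    let custom: ProgramId = [9; 32];
    let costs = HashMap::from([(system, 150)]); // illustrative cost, an assumption
    let keys = [system, custom];
    let details = collect_builtin_details(&costs, &keys, &[0, 0, 1, 0]);
    println!("{details:?}");
}
```

The trade-off is an extra per-tx array sized by the number of account keys, in exchange for doing one lookup per distinct program id instead of one per instruction.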

Fixes #2561

rename compute_budget_instruction_details to instruction_details as it contains more than just compute-budget ix info;
@tao-stones
Author

tao-stones commented Aug 22, 2024

The three commits are organized as incremental bench steps: the original, a simple addition of the needed function, and the perf-optimized version:

Commit 1, bench before change
     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-c23596bf6c26a34b)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [6.9498 µs 6.9685 µs 6.9868 µs]
                        thrpt:  [146.56 Melem/s 146.95 Melem/s 147.34 Melem/s]
                 change:
                        time:   [+3.1493% +3.4355% +3.7214%] (p = 0.00 < 0.05)
                        thrpt:  [-3.5879% -3.3214% -3.0531%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [12.728 µs 12.744 µs 12.759 µs]
                        thrpt:  [80.259 Melem/s 80.354 Melem/s 80.450 Melem/s]
                 change:
                        time:   [+1.8458% +2.0980% +2.3406%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2870% -2.0549% -1.8124%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  9 (9.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [26.258 µs 26.293 µs 26.332 µs]
                        thrpt:  [38.888 Melem/s 38.945 Melem/s 38.998 Melem/s]
                 change:
                        time:   [+0.8114% +1.0765% +1.3665%] (p = 0.00 < 0.05)
                        thrpt:  [-1.3481% -1.0651% -0.8049%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [16.261 µs 16.285 µs 16.310 µs]
                        thrpt:  [62.782 Melem/s 62.882 Melem/s 62.973 Melem/s]
                 change:
                        time:   [-0.3143% -0.1345% +0.0376%] (p = 0.13 > 0.05)
                        thrpt:  [-0.0376% +0.1347% +0.3153%]
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [496.86 µs 497.54 µs 498.27 µs]
                        thrpt:  [2.0551 Melem/s 2.0581 Melem/s 2.0609 Melem/s]
                 change:
                        time:   [-0.1292% +0.0754% +0.2721%] (p = 0.47 > 0.05)
                        thrpt:  [-0.2714% -0.0753% +0.1293%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [490.41 µs 490.74 µs 491.12 µs]
                        thrpt:  [2.0850 Melem/s 2.0867 Melem/s 2.0880 Melem/s]
                 change:
                        time:   [+0.2029% +0.3755% +0.5466%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5436% -0.3741% -0.2025%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Commit 2: add a function to collect builtin cost. It adds hashing for every program_id. Results: the 0-ix bench regresses a bit due to the added math to calculate the non-cb-ix count, but hashing makes the 4-ix benches worse, and the many-ix benches much worse.
     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-9affcf53954823c1)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [7.8715 µs 7.8805 µs 7.8878 µs]
                        thrpt:  [129.82 Melem/s 129.94 Melem/s 130.09 Melem/s]
                 change:
                        time:   [+17.913% +18.243% +18.527%] (p = 0.00 < 0.05)
                        thrpt:  [-15.631% -15.428% -15.192%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [24.825 µs 24.861 µs 24.901 µs]
                        thrpt:  [41.122 Melem/s 41.189 Melem/s 41.249 Melem/s]
                 change:
                        time:   [+100.37% +100.68% +101.03%] (p = 0.00 < 0.05)
                        thrpt:  [-50.257% -50.170% -50.092%]
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  14 (14.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [41.173 µs 41.188 µs 41.205 µs]
                        thrpt:  [24.851 Melem/s 24.862 Melem/s 24.871 Melem/s]
                 change:
                        time:   [+59.869% +60.266% +60.630%] (p = 0.00 < 0.05)
                        thrpt:  [-37.745% -37.604% -37.449%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [34.823 µs 34.838 µs 34.857 µs]
                        thrpt:  [29.377 Melem/s 29.393 Melem/s 29.406 Melem/s]
                 change:
                        time:   [+112.60% +113.18% +113.73%] (p = 0.00 < 0.05)
                        thrpt:  [-53.211% -53.091% -52.964%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking bench_process_compute_budget_instructions_mixed/355 mixed instructions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.1s, enable flat sampling, or reduce sample count to 50.
bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [1.8023 ms 1.8044 ms 1.8067 ms]
                        thrpt:  [566.78 Kelem/s 567.51 Kelem/s 568.15 Kelem/s]
                 change:
                        time:   [+258.39% +262.14% +264.37%] (p = 0.00 < 0.05)
                        thrpt:  [-72.555% -72.386% -72.097%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

Benchmarking bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [1.8198 ms 1.8211 ms 1.8227 ms]
                        thrpt:  [561.81 Kelem/s 562.31 Kelem/s 562.69 Kelem/s]
                 change:
                        time:   [+272.41% +272.99% +273.55%] (p = 0.00 < 0.05)
                        thrpt:  [-73.230% -73.190% -73.148%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Commit 3: update the filter to cache the resolved builtin ix cost. It adds the cost of allocating a larger array per tx, but removes all repeated hashing. Results: the 0-ix bench regressed, the 4-ix benches show small changes, and the many-ix benches improved significantly.
     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-b53a7f97abe58117)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [16.950 µs 16.990 µs 17.026 µs]
                        thrpt:  [60.144 Melem/s 60.271 Melem/s 60.412 Melem/s]
                 change:
                        time:   [+114.95% +115.54% +116.15%] (p = 0.00 < 0.05)
                        thrpt:  [-53.735% -53.606% -53.478%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [22.542 µs 22.604 µs 22.656 µs]
                        thrpt:  [45.198 Melem/s 45.302 Melem/s 45.427 Melem/s]
                 change:
                        time:   [-9.8919% -9.6990% -9.4811%] (p = 0.00 < 0.05)
                        thrpt:  [+10.474% +10.741% +10.978%]
                        Performance has improved.

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [48.071 µs 48.132 µs 48.174 µs]
                        thrpt:  [21.256 Melem/s 21.275 Melem/s 21.302 Melem/s]
                 change:
                        time:   [+16.602% +16.779% +16.923%] (p = 0.00 < 0.05)
                        thrpt:  [-14.474% -14.368% -14.238%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [48.941 µs 48.973 µs 49.008 µs]
                        thrpt:  [20.894 Melem/s 20.909 Melem/s 20.923 Melem/s]
                 change:
                        time:   [+40.020% +40.242% +40.451%] (p = 0.00 < 0.05)
                        thrpt:  [-28.801% -28.695% -28.582%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [556.23 µs 557.12 µs 558.17 µs]
                        thrpt:  [1.8346 Melem/s 1.8380 Melem/s 1.8410 Melem/s]
                 change:
                        time:   [-69.185% -69.137% -69.087%] (p = 0.00 < 0.05)
                        thrpt:  [+223.49% +224.01% +224.52%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [639.73 µs 641.01 µs 642.66 µs]
                        thrpt:  [1.5934 Melem/s 1.5975 Melem/s 1.6007 Melem/s]
                 change:
                        time:   [-64.939% -64.876% -64.810%] (p = 0.00 < 0.05)
                        thrpt:  [+184.17% +184.71% +185.22%]
                        Performance has improved.

@tao-stones tao-stones requested review from apfitzge and jstarry August 22, 2024 15:48

@apfitzge apfitzge left a comment

Mostly looks good to me, small preference on the nested-enum choice

Comment on lines 11 to 13
// None - un-checked
// Some<None> - checked, not builtin
// Some<Some<(bool, u32)>> - checked, is builtin and (is-compute-budget, default-cost)

It seems to me this would be better represented by a new enum:

    #[derive(Default)]
    enum BuiltinCheckStatus {
        #[default]
        Unchecked,
        NotBuiltin,
        Builtin {
            is_compute_budget: bool,
            default_cost: u32,
        },
    }

I think this saves a byte as well, since we don't need two Option discriminants.
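
Whether the enum actually ends up smaller depends on how rustc lays out both types: the default representation is unspecified, and niche optimization can shrink nested Options as well. A quick way to check on a given toolchain is to print both sizes. The sketch below assumes the nested type is `Option<Option<(bool, u32)>>`, as described in the code comment under review; `Nested` is just a local alias.

```rust
use std::mem::size_of;

// The nested-Option encoding described in the code comment under review.
type Nested = Option<Option<(bool, u32)>>;

// The enum proposed in the review comment.
#[allow(dead_code)]
#[derive(Default)]
enum BuiltinCheckStatus {
    #[default]
    Unchecked,
    NotBuiltin,
    Builtin {
        is_compute_budget: bool,
        default_cost: u32,
    },
}

fn main() {
    // Sizes are toolchain-dependent, so print them rather than assume a value.
    println!("nested Option: {} bytes", size_of::<Nested>());
    println!("enum:          {} bytes", size_of::<BuiltinCheckStatus>());
}
```

Either way, the enum names its three states explicitly, which is a readability gain independent of any size difference.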

Author

An enum is the way to go.

Author

d76291d

Savings across all benches, due to the smaller memory footprint:
     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-b53a7f97abe58117)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [11.957 µs 11.975 µs 11.995 µs]
                        thrpt:  [85.368 Melem/s 85.513 Melem/s 85.641 Melem/s]
                 change:
                        time:   [-29.911% -29.703% -29.488%] (p = 0.00 < 0.05)
                        thrpt:  [+41.820% +42.253% +42.676%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [18.924 µs 18.941 µs 18.959 µs]
                        thrpt:  [54.011 Melem/s 54.062 Melem/s 54.112 Melem/s]
                 change:
                        time:   [-15.416% -15.210% -15.005%] (p = 0.00 < 0.05)
                        thrpt:  [+17.654% +17.938% +18.225%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [37.252 µs 37.332 µs 37.424 µs]
                        thrpt:  [27.362 Melem/s 27.430 Melem/s 27.488 Melem/s]
                 change:
                        time:   [-21.462% -21.177% -20.898%] (p = 0.00 < 0.05)
                        thrpt:  [+26.419% +26.867% +27.328%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [41.314 µs 41.521 µs 41.730 µs]
                        thrpt:  [24.539 Melem/s 24.662 Melem/s 24.786 Melem/s]
                 change:
                        time:   [-8.4225% -8.0835% -7.7755%] (p = 0.00 < 0.05)
                        thrpt:  [+8.4311% +8.7944% +9.1972%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high mild

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [539.98 µs 540.43 µs 540.90 µs]
                        thrpt:  [1.8931 Melem/s 1.8948 Melem/s 1.8964 Melem/s]
                 change:
                        time:   [-2.9610% -2.8023% -2.6445%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7164% +2.8831% +3.0513%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [607.07 µs 607.59 µs 608.19 µs]
                        thrpt:  [1.6837 Melem/s 1.6853 Melem/s 1.6868 Melem/s]
                 change:
                        time:   [-4.9817% -4.8051% -4.6671%] (p = 0.00 < 0.05)
                        thrpt:  [+4.8956% +5.0477% +5.2429%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

@tao-stones tao-stones requested a review from apfitzge August 22, 2024 18:49
@jstarry

jstarry commented Aug 23, 2024

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

@apfitzge

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

@tao-stones
Author

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

Yes. There is no behavior change in this PR; it just adds the "collect and cache" function. The follow-up PR is going to use the cached builtin cost, which will change the behavior. Its feature gate: #2562

I didn't create a SIMD because this feature gate isn't meant to change the protocol but to fix a bug, the bug being: "the compute budget allocates 200K per builtin, yet a builtin only consumes its default cost; the exception is compute-budget instructions, for which no units are allocated but the default cost is still consumed".
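
To make the mismatch concrete, here is a small arithmetic sketch. It assumes a tx made only of builtin instructions that requests no explicit compute unit limit; the 200K figure comes from the description above, while the default cost of 150 and all names are hypothetical values chosen for illustration.

```rust
// Figures below are illustrative assumptions, not values read from the agave source.
const UNITS_ALLOCATED_PER_BUILTIN: u64 = 200_000; // "200K per builtin" from the comment
const ASSUMED_BUILTIN_DEFAULT_COST: u64 = 150; // hypothetical default cost

/// Returns (units allocated, units consumed) for a tx made only of builtins,
/// assuming no explicit compute unit limit is requested.
fn allocated_vs_consumed(non_cb_builtins: u64, compute_budget_ixs: u64) -> (u64, u64) {
    // Compute-budget instructions get no allocation...
    let allocated = non_cb_builtins * UNITS_ALLOCATED_PER_BUILTIN;
    // ...but every builtin, compute-budget ones included, consumes its default cost.
    let consumed = (non_cb_builtins + compute_budget_ixs) * ASSUMED_BUILTIN_DEFAULT_COST;
    (allocated, consumed)
}

fn main() {
    let (allocated, consumed) = allocated_vs_consumed(3, 1);
    println!("allocated {allocated} units, consumed {consumed} units");
}
```

With three non-compute-budget builtins and one compute-budget ix, the allocation is three orders of magnitude larger than what is consumed under these assumed costs.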

@jstarry

jstarry commented Aug 24, 2024

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

There's a non-zero perf hit, which I see as a behavior change. There's no reason to cache this data when the feature isn't enabled right? But if it's too difficult to put this new caching behind a feature gate, maybe it's fine to keep it as is.

I didn't create a SIMD because this feature gate isn't meant to change the protocol but to fix a bug, the bug being: "the compute budget allocates 200K per builtin, yet a builtin only consumes its default cost; the exception is compute-budget instructions, for which no units are allocated but the default cost is still consumed".

This is a protocol change to fix a bug. If Firedancer isn't aware of this protocol change, they could process transactions differently. Imagine a transaction that doesn't set a compute limit but relies on the fact that adding a few builtin instructions to the transaction increases its tx compute limit, which is then used fully by an invocation of a custom program. The transaction would succeed before the feature gate and fail after it, leading to a divergence if not all clients are in sync on the implementation. Given that this feature needs coordination between client teams and that it could break downstream users, I think we should have a SIMD to discuss it.
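
The scenario can be sketched with a toy model. Everything here is an assumption made for illustration: the post-feature-gate rule (builtins are allocated only their default cost), the default cost of 150, the function names, and the omission of any overall per-tx cap.

```rust
const PER_INSTRUCTION_LIMIT: u64 = 200_000; // the "200K" figure discussed above
const ASSUMED_BUILTIN_COST: u64 = 150; // hypothetical default cost

// Before the feature gate: every instruction, builtin or not, adds 200K to the limit.
fn limit_before(builtins: u64, others: u64) -> u64 {
    (builtins + others) * PER_INSTRUCTION_LIMIT
}

// ASSUMPTION: after the feature gate, a builtin adds only its default cost.
fn limit_after(builtins: u64, others: u64) -> u64 {
    builtins * ASSUMED_BUILTIN_COST + others * PER_INSTRUCTION_LIMIT
}

// A tx succeeds only if the units it needs fit within its limit.
fn succeeds(limit: u64, units_needed: u64) -> bool {
    units_needed <= limit
}

fn main() {
    // Three builtins plus one custom program invocation that needs 500K units.
    let (builtins, others) = (3, 1);
    let units_needed = 500_000 + builtins * ASSUMED_BUILTIN_COST;
    println!("before: {}", succeeds(limit_before(builtins, others), units_needed));
    println!("after:  {}", succeeds(limit_after(builtins, others), units_needed));
}
```

A client that still applies the old rule would accept this tx while a client on the new rule would reject it, which is the divergence described above.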

@tao-stones
Author

I synced with FD previously; FD planned to rebase their cost model implementation after this fix is in (they currently use the agave runtime and its cost model implementation but do not handle adjust-up, so they over-pack in some cases).

I just chatted with Philip; considering the current schedule, it seems to make more sense to do all that after Breakpoint. In that case, a SIMD would be very helpful to document the change, and perhaps to discuss other possible solutions. I'll open one and then link it to feature gate issue #2562.

As for this PR, wdyt about merging it if it has no other open issues of its own?

@jstarry

jstarry commented Aug 27, 2024

I really don't think it makes sense to merge yet, what's the rush?

@tao-stones
Author

I really don't think it makes sense to merge yet, what's the rush?

I have a few PRs after this, but I can reorg my pipeline. Let's keep this open while SIMD solana-foundation/solana-improvement-documents#170 is being discussed.

@apfitzge

apfitzge commented Sep 9, 2024

I think rather than waiting on the SIMD process, which could take a while, how about we split this up?

Keep the old version, so that we can get the compute budget without the builtin-check overhead, and also add this new version, which has the builtins checked.

Ideally we'd have a cost-model fn that we could pass this new struct into (with feature_set) so we can get the cost without doing separate scans for compute-budget AND builtins - having that would also help the transition to new tx type and runtime-transaction.
Once the runtime transaction is used, we can remove the old version since it's no longer accessed, and this more detailed meta info will be cached so we only calculate it once.

@jstarry @tao-stones does that seem reasonable to you?

@tao-stones
Author

#3799 did it

@tao-stones tao-stones closed this Dec 6, 2024
Development

Successfully merging this pull request may close these issues.

Collect builtin instructions cost details
3 participants