[reconfigurator] Executor: only do cleanup for zones that are ready_for_cleanup #7713

Merged: 13 commits merged into main from john/execution-ready-for-cleanup on Mar 4, 2025

Conversation

jgallagher (Contributor)

Changes the three steps that required the earlier PUT /omicron-zones execution step to succeed (because they needed to clean up after zones that were definitely shut down): they now read the ready_for_cleanup blueprint disposition instead (which is set by the planner after it confirms via inventory that the zones are dead).
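
A minimal sketch of the check this describes (the ZoneConfig/Disposition types below are hypothetical stand-ins, not the real blueprint types; only the is_ready_for_cleanup call mirrors the code in this PR):

// Hypothetical stand-ins for the blueprint zone config and its disposition.
struct ZoneConfig {
    disposition: Disposition,
}

enum Disposition {
    InService,
    Expunged { ready_for_cleanup: bool },
}

impl Disposition {
    fn is_ready_for_cleanup(&self) -> bool {
        matches!(self, Disposition::Expunged { ready_for_cleanup: true })
    }
}

// Cleanup steps only act on zones the planner has already confirmed
// (via inventory) to be shut down; everything else is skipped.
fn zones_to_clean_up<'a>(
    zones: &'a [ZoneConfig],
) -> impl Iterator<Item = &'a ZoneConfig> + 'a {
    zones.iter().filter(|z| z.disposition.is_ready_for_cleanup())
}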

While I was here, I removed both the special-case error-to-warning implementation in a couple of steps and the finalize step that compiled those warnings. I believe all of that predates omdb's ability to interpret the update-engine event report; now that it can, emitting step warnings directly is fine. I spun this branch up in a4x2 and then shut down one of the sled-agents; that results in this report from the executor:

  last completed activation: iter 47, triggered by a periodic timer firing
    started at 2025-03-03T19:49:51.027Z (23s ago) and ran for 11711ms
    target blueprint: 3680509e-77c3-4d98-a8a4-276de89b2c53                                                                                                                                                                                                       
    execution:        enabled                                                                                                                                                                                                                                    
    status:           completed (15 steps)                                                                                                                                                                                                                       
    warning:          at: Plumb service firewall rules: failed to plumb service firewall rules to sleds: Internal Error: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/vpc/001de000-074c-4000-8000-000000000000/firewall/rules)
    warning:          at: Deploy Omicron zones: Failed to put OmicronZonesConfig {                                                                                                                                             

... snip the OmicronZonesConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-zones): error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-zones): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    warning:          at: Deploy datasets: Failed to put DatasetsConfig {                                                                                                                                              

... snip the DatasetConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/datasets): error sending request for url (http://[fd00:1122:3344:102::1]:12345/datasets): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    warning:          at: Deploy physical disks: Failed to put BlueprintPhysicalDisksConfig {                                                                                                                                              

... snip the BlueprintPhysicalDisksConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-physical-disks): error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-physical-disks): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error:            (none)                                                                                                                                                                                                                                     

Closes #6999, under the assumption that as of this PR we've identified all the inter-step dependencies and broken them.

) {
    // We expect to only be called with expunged zones that are ready
    // for cleanup; skip any with a different disposition.
    if !config.disposition.is_ready_for_cleanup() {
Collaborator

Would it indicate a bug if this conditional were ever hit? Should we at least be logging something here?

jgallagher (Contributor Author)

Hmm, I could see arguments for any of these options:

  • Document this function as "panics if called with a zone that isn't ready for cleanup" and assert.
  • Document this function as "returns an error for any zone that isn't ready for cleanup".
  • Keep this check and log a warning.
  • Keep this check and don't log anything.

I prefer option 4, I think? "Call this method with whatever set of zones you want, and I locally choose to act on the ones I know it's safe to act on". Honestly I'd kind of prefer this take a &Blueprint instead of an iterator of zones to avoid this problem altogether, but that makes the tests more awkward. Maybe this should be split into a pub(crate) version that takes a &Blueprint and a private version that takes this iterator; the &Blueprint one could do the filtering, and tests could call the private one?
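
For illustration, options 1 and 4 might look roughly like this (reusing the hypothetical ZoneConfig from the sketch above; the function names are made up):

// Option 1: treat a not-ready zone as a caller bug and panic.
fn clean_up_zone_strict(config: &ZoneConfig) {
    assert!(
        config.disposition.is_ready_for_cleanup(),
        "called with a zone that is not ready for cleanup"
    );
    // ... perform cleanup ...
}

// Option 4 (preferred above): accept any zone and quietly skip the ones
// that are not yet safe to act on.
fn clean_up_zone_lenient(config: &ZoneConfig) {
    if !config.disposition.is_ready_for_cleanup() {
        return;
    }
    // ... perform cleanup ...
}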

jgallagher (Contributor Author)

I'm going to take a stab at cleaning this up in a small followup PR.

@andrewjstone (Contributor) left a comment

Great cleanup. Love to see the progress here!

@davepacheco (Collaborator) left a comment

Nice!

jgallagher changed the title from "[reconfigurator] Executor: only do cleanup zones zones that are ready_for_cleanup" to "[reconfigurator] Executor: only do cleanup for zones that are ready_for_cleanup" on Mar 4, 2025
jgallagher merged commit 2a2c407 into main on Mar 4, 2025
16 checks passed
jgallagher deleted the john/execution-ready-for-cleanup branch on March 4, 2025 at 18:12
jgallagher added a commit that referenced this pull request on Mar 4, 2025:
Builds on #7713, and is a followup from
#7713 (comment).

In #7652 I changed all the executor substeps to take iterators instead
of the `&BTreeMap` references that no longer existed, but that introduced
a weird split where the top-level caller had to filter the blueprint down
to just the items that the inner functions expected. @smklein pointed out
one place where the inner code was being extra defensive in a way that was
more confusing than helpful.

This PR removes that split: the top-level executor now always passes a
full `&Blueprint` down, and the inner modules are responsible for doing
their own filtering as appropriate. To ease testing, I kept the versions
that take an iterator of already-filtered items as private `*_impl`
functions, which the new full-`Blueprint` functions call internally.
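
A rough sketch of the shape this describes, again with hypothetical types and names (the real executor functions and the real Blueprint API differ):

// Hypothetical stand-in for the real Blueprint.
struct Blueprint {
    zones: Vec<ZoneConfig>,
}

// Public entry point: takes the whole blueprint and does its own
// filtering, so callers can't pass the wrong subset of zones.
pub(crate) fn clean_up_expunged_zones(blueprint: &Blueprint) {
    clean_up_expunged_zones_impl(
        blueprint
            .zones
            .iter()
            .filter(|z| z.disposition.is_ready_for_cleanup()),
    );
}

// Private `*_impl` worker: takes an iterator of already-filtered zones
// so tests can hand it exactly the zones they want to exercise.
fn clean_up_expunged_zones_impl<'a>(
    zones: impl Iterator<Item = &'a ZoneConfig>,
) {
    for _zone in zones {
        // ... perform per-zone cleanup ...
    }
}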
Successfully merging this pull request may close these issues.

many Reconfigurator execution steps are fatal that shouldn't be