[reconfigurator] Executor: only do cleanup for zones that are ready_for_cleanup #7713

Merged: 13 commits merged into main from john/execution-ready-for-cleanup on Mar 4, 2025

Conversation

jgallagher (Contributor)

Changes the three steps that required the earlier PUT /omicron-zones execution step to succeed (because they needed to clean up after zones that were definitely shut down): they now read the ready_for_cleanup blueprint disposition instead (which is set by the planner after it confirms via inventory that the zones are dead).
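
A minimal sketch of the check this describes (the ZoneConfig/Disposition types below are hypothetical stand-ins, not the real blueprint types; only the is_ready_for_cleanup call mirrors the code in this PR):

// Hypothetical stand-ins for the blueprint zone config and its disposition.
struct ZoneConfig {
    disposition: Disposition,
}

enum Disposition {
    InService,
    Expunged { ready_for_cleanup: bool },
}

impl Disposition {
    fn is_ready_for_cleanup(&self) -> bool {
        matches!(self, Disposition::Expunged { ready_for_cleanup: true })
    }
}

// Cleanup steps only act on zones the planner has already confirmed
// (via inventory) to be shut down; everything else is skipped.
fn zones_to_clean_up<'a>(
    zones: &'a [ZoneConfig],
) -> impl Iterator<Item = &'a ZoneConfig> + 'a {
    zones.iter().filter(|z| z.disposition.is_ready_for_cleanup())
}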

While I was here, I removed both the special-case error-to-warning implementation in a couple of steps and the finalize step that compiled those warnings. I believe all of that predates omdb's ability to interpret the update-engine event report; now that it can, emitting step warnings directly is fine. I spun this branch up in a4x2 and then shut down one of the sled-agents; that results in this report from the executor:

  last completed activation: iter 47, triggered by a periodic timer firing
    started at 2025-03-03T19:49:51.027Z (23s ago) and ran for 11711ms
    target blueprint: 3680509e-77c3-4d98-a8a4-276de89b2c53                                                                                                                                                                                                       
    execution:        enabled                                                                                                                                                                                                                                    
    status:           completed (15 steps)                                                                                                                                                                                                                       
    warning:          at: Plumb service firewall rules: failed to plumb service firewall rules to sleds: Internal Error: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/vpc/001de000-074c-4000-8000-000000000000/firewall/rules)
    warning:          at: Deploy Omicron zones: Failed to put OmicronZonesConfig {                                                                                                                                             

... snip the OmicronZonesConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-zones): error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-zones): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    warning:          at: Deploy datasets: Failed to put DatasetsConfig {                                                                                                                                              

... snip the DatasetConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/datasets): error sending request for url (http://[fd00:1122:3344:102::1]:12345/datasets): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    warning:          at: Deploy physical disks: Failed to put BlueprintPhysicalDisksConfig {                                                                                                                                              

... snip the BlueprintPhysicalDisksConfig Debug output ...

} to sled e2389240-03a7-4a96-a04a-aa4ee3a38381: Communication Error: error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-physical-disks): error sending request for url (http://[fd00:1122:3344:102::1]:12345/omicron-physical-disks): client error (Connect): tcp connect error: Connection refused (os error 146): Connection refused (os error 146)
    error:            (none)                                                                                                                                                                                                                                     

Closes #6999, under the assumption that as of this PR we've identified all the inter-step dependencies and broken them.

) {
    // We expect to only be called with expunged zones that are ready
    // for cleanup; skip any with a different disposition.
    if !config.disposition.is_ready_for_cleanup() {
Collaborator

Would it indicate a bug if this conditional were ever hit? Should we at least be logging something here?

jgallagher (Contributor Author)

Hmm, I could see arguments for any of these options:

  • Document this function as "panics if called with a zone that isn't ready for cleanup" and assert.
  • Document this function as "returns an error for any zone that isn't ready for cleanup".
  • Keep this check and log a warning.
  • Keep this check and don't log anything.

I prefer option 4, I think? "Call this method with whatever set of zones you want, and I locally choose to act on the ones I know it's safe to act on". Honestly I'd kind of prefer this take a &Blueprint instead of an iterator of zones to avoid this problem altogether, but that makes the tests more awkward. Maybe this should be split into a pub(crate) version that takes a &Blueprint and a private version that takes this iterator; the &Blueprint one could do the filtering, and tests could call the private one?
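
For illustration, options 1 and 4 might look roughly like this (reusing the hypothetical ZoneConfig from the sketch above; the function names are made up):

// Option 1: treat a not-ready zone as a caller bug and panic.
fn clean_up_zone_strict(config: &ZoneConfig) {
    assert!(
        config.disposition.is_ready_for_cleanup(),
        "called with a zone that is not ready for cleanup"
    );
    // ... perform cleanup ...
}

// Option 4 (preferred above): accept any zone and quietly skip the ones
// that are not yet safe to act on.
fn clean_up_zone_lenient(config: &ZoneConfig) {
    if !config.disposition.is_ready_for_cleanup() {
        return;
    }
    // ... perform cleanup ...
}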

jgallagher (Contributor Author)

I'm going to take a stab at cleaning this up in a small followup PR.

@andrewjstone (Contributor) left a comment

Great cleanup. Love to see the progress here!

@davepacheco (Collaborator) left a comment

Nice!

jgallagher changed the title from "[reconfigurator] Executor: only do cleanup zones zones that are ready_for_cleanup" to "[reconfigurator] Executor: only do cleanup for zones that are ready_for_cleanup" on Mar 4, 2025
jgallagher merged commit 2a2c407 into main on Mar 4, 2025
16 checks passed
jgallagher deleted the john/execution-ready-for-cleanup branch on March 4, 2025 at 18:12
jgallagher added a commit that referenced this pull request on Mar 4, 2025:
Builds on #7713, and is a followup from
#7713 (comment).

In #7652 I changed all the executor substeps to take iterators instead
of the `&BTreeMap` references that no longer existed, but that introduced
a weird split where the top-level caller had to filter the blueprint down
to just the items that the inner functions expected. @smklein pointed out
one place where the inner code was being extra defensive in a way that was
more confusing than helpful.

This PR removes that split: the top-level executor now always passes a
full `&Blueprint` down, and the inner modules are responsible for doing
their own filtering as appropriate. To ease testing, I kept the versions
that take an iterator of already-filtered items as private `*_impl`
functions, which the new full-`Blueprint` functions call internally.
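
A rough sketch of the shape this describes, again with hypothetical types and names (the real executor functions and the real Blueprint API differ):

// Hypothetical stand-in for the real Blueprint.
struct Blueprint {
    zones: Vec<ZoneConfig>,
}

// Public entry point: takes the whole blueprint and does its own
// filtering, so callers can't pass the wrong subset of zones.
pub(crate) fn clean_up_expunged_zones(blueprint: &Blueprint) {
    clean_up_expunged_zones_impl(
        blueprint
            .zones
            .iter()
            .filter(|z| z.disposition.is_ready_for_cleanup()),
    );
}

// Private `*_impl` worker: takes an iterator of already-filtered zones
// so tests can hand it exactly the zones they want to exercise.
fn clean_up_expunged_zones_impl<'a>(
    zones: impl Iterator<Item = &'a ZoneConfig>,
) {
    for _zone in zones {
        // ... perform per-zone cleanup ...
    }
}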
Successfully merging this pull request may close these issues.

many Reconfigurator execution steps are fatal that shouldn't be