Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reconfigurator] Avoid deploying expunged datasets to sled agents #7308

Merged
merged 4 commits into from
Feb 7, 2025

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jan 6, 2025

If a blueprint has marked a dataset as expunged, stop telling sled agents to deploy it.

Fixes #7304

@smklein
Copy link
Collaborator Author

smklein commented Jan 6, 2025

This does not resolve #6177 , and the issues raised by @davepacheco there are still open (namely, the ordering between {dataset,zone} {add,remove} which could cause issues during zone expungement). However, this solves the more scope-limited issue raised by #7304 : If a blueprint thinks a dataset is expunged, we stop telling the sled agents to instantiate it.

@smklein smklein requested a review from davepacheco January 6, 2025 19:06
@@ -969,7 +969,14 @@ impl From<BlueprintDatasetsConfig> for DatasetsConfig {
datasets: config
.datasets
.into_iter()
.map(|(id, d)| (id, d.into()))
.filter_map(|(id, d)| {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual fix - basically, converting from "Blueprint dataset" -> "Deployable dataset" format will filter out expunged datasets.

If this feels too magical, I can remove impl From here and instead create:

impl BlueprintDatasetsConfig {
  fn into_in_service_datasets(self) -> DatasetsConfig {
     ...
  }
}

(the implementation would be the same, but this could make it more explicit that we're filtering)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your suggestion, seems like a good refactor :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0bca90f

@davepacheco
Copy link
Collaborator

Given the stuff mentioned in #7313: are we comfortable landing this before having fixed #7309? I think it should be fine. But I don't think it improves things without also doing #6177. And I think it makes the system more complicated to reason about in the meantime.

@smklein
Copy link
Collaborator Author

smklein commented Jan 8, 2025

Given the stuff mentioned in #7313: are we comfortable landing this before having fixed #7309? I think it should be fine. But I don't think it improves things without also doing #6177.

I agree that this shouldn't change the net effect of the system, without subsequent PRs.

And I think it makes the system more complicated to reason about in the meantime.

Not sure I agree with this one. IMO this PR inches us closer to "more correct" behavior because it makes the reconfigurator do the right thing with the APIs it has. I think it'll have the same impact on the Sled Agent as "not requesting expunged datasets" because they aren't getting deleted regardless.

I think it would make sense to do the following:

At this point: The reconfigurator would be sending a set of info to the sled agents that's correct, but not being carried out correctly, because things aren't being deleted.

Then we could:

@davepacheco
Copy link
Collaborator

IMO this PR inches us closer to "more correct" behavior because it makes the reconfigurator do the right thing with the APIs it has. I think it'll have the same impact on the Sled Agent as "not requesting expunged datasets" because they aren't getting deleted regardless.

Agreed!

I think it would make sense to do the following:

I agree that sequence should work. My worry here stems mainly from various recent discussions (like the R12/#7265 one) where I feel like we struggled to accurately predict how the system would behave in this area, particularly across various releases. #7309 itself is an example of a non-obvious consequence of a bunch of other reasonable-seeming things. I guess another way to say that is that if we land this fix, then a bug (6177) becomes load-bearing for zone expungement to keep working, which feels like a weird place to be.

@smklein
Copy link
Collaborator Author

smklein commented Feb 7, 2025

So, given the extra context in #7500 , I was wrong about this not impacting the underlying system.

In cases where a dataset is expunged, but the disk remains, then yes, this has no effect.

However, in cases where a dataset is expunged from a disk which is also expunged, then it's critical to omit these datasets when handing them to a sled agent. Otherwise, the reconfigurator can ask the sled agent to deploy a dataset that should be expunged, on a disk that should be expunged. This appears to hang reconfigurator execution.

@jgallagher jgallagher merged commit 3ae42bf into main Feb 7, 2025
16 checks passed
@jgallagher jgallagher deleted the dont_deploy_expunged_datasets branch February 7, 2025 16:32
jgallagher added a commit that referenced this pull request Feb 10, 2025
This is followup from #7308 and
#7500 (comment):

* Remove `From<BlueprintZonesConfig> for OmicronZonesConfig`. This
included expunged zones, but thankfully had no callers.
* Change `BlueprintZonesConfig::to_omicron_zones_config(filter)` to
`BlueprintZonesConfig::into_running_omicron_zones_config()` (no `filter`
argument). All the callers of this method were passing
`BlueprintZoneFilter::ShouldBeRunning`, and I don't think there's a
reason to use any other filter?
* Remove `From<BlueprintPhysicalDisksConfig> for
OmicronPhysicalDisksConfig` (which included expunged disks), and replace
it with `BlueprintPhysicalDisksConfig::into_in_service_disks()`. This
one _did_ have callers, including the blueprint executor, but I think
we've avoided problems because the planner current [drops disks if the
incoming planning input says they're not in
service](https://github.com/oxidecomputer/omicron/blob/3ae42bf76cb9b55993705b817157e4f60935b6dd/nexus/reconfigurator/planning/src/blueprint_builder/builder.rs#L1090-L1097).
I'm not sure that planner behavior is right, and might change with
#7286, so it seemed safer to go ahead and fix this now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

reconfigurator does not try to remove expunged datasets
4 participants