Nexus overprovisions disks with control plane zones on them #7225

davepacheco opened this issue Dec 10, 2024 · 4 comments

davepacheco commented Dec 10, 2024

This appears to be the root cause of #7221: Nexus provisioned so many Crucible regions on one disk that the CockroachDB zone on the same pool (which held both its root filesystem and its database filesystem) ran out of space.

There are many ways to deal with this, with different tradeoffs. I'll post a few notes from the earlier control plane discussion.

davepacheco added this to the 13 milestone Dec 10, 2024

davepacheco commented Dec 10, 2024

Tagging this for R13 because I don't think it's a blocker (not a regression, right?), but it's important to keep on our radar. (Maybe it's better not to have a milestone and instead discuss this at the product roundtable? CC @askfongjojo @morlandi7.)

davepacheco commented Dec 10, 2024

Some ideas discussed so far:

  1. The Crucible region allocation CTE could consult the current target blueprint to see which control plane zones are on the same disk and account for their disk space, so that it doesn't overprovision the disk. I'm not familiar with this area, but it sounds complicated and would potentially couple this code tightly to the blueprint structure.
  2. A proposed simplification: reserve a fixed amount of space for the control plane on each disk (see the sketch after this list). I expect this would be on the order of 100 GiB (~3-4% of our current disk size), but I think we'd want to do a little more analysis to figure out the right number. This would probably have to be somewhat deployment-specific -- a fixed production-sized reservation wouldn't work on a4x2 or other non-production hardware.
  3. An in-between approach: Reconfigurator could manage a dynamic tunable, either system-wide or per-disk, saying how much space to reserve for the control plane. This would give us a little more runtime flexibility to claw back space for the control plane. (Personally I don't think this is worth the complexity right now. Even if we wind up reserving 5-10% of each disk for the control plane, the worst case is an efficiency problem, and when that becomes one of our bigger problems we can invest in making this bound tighter.)
  4. Bias the selection of regions towards disks with more free space. There are clearly still availability considerations (i.e., don't put them all on the same disk), and I'm sure there are others as well. But if we always chose the disks with the most space available, we wouldn't hit this problem until the whole system was almost completely full.
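
To make (2) concrete, here's a minimal sketch of what a fixed per-disk reservation could look like. It's illustrative only: the constant, the helper name, and the exact figure are my assumptions, not actual omicron code; the 100 GiB value is just the rough estimate above.

/// Space set aside on each disk for control plane zones (zone root
/// filesystems, CockroachDB data, logs, etc.). 100 GiB is the rough
/// estimate from option (2); the right number needs more analysis.
/// (Hypothetical constant -- not an actual omicron identifier.)
const CONTROL_PLANE_RESERVATION_BYTES: u64 = 100 * (1 << 30);

/// Space on a pool that region allocation may consume, or `None` when the
/// reservation alone exceeds the pool size -- which is exactly the
/// deployment-specific problem noted in option (2): a fixed
/// production-sized reservation doesn't fit on a4x2-sized disks.
fn usable_region_space(pool_total_bytes: u64) -> Option<u64> {
    pool_total_bytes.checked_sub(CONTROL_PLANE_RESERVATION_BYTES)
}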

I haven't spent much time in this area of code and I apologize if I've got stuff wrong!

Options 1-3 are all pretty related and they all assume:

  • That we want the region allocation CTE to fail with some kind of database constraint violation if anything would cause it to try to overprovision the disk.
  • That we can assign a number to the amount of space the control plane can use on each disk.

So far my vote would be to go with (2), using a hardcoded limit on the disk space used by the control plane. Then we'd enforce that at allocation time with a database constraint and at runtime with quotas/reservations (#7227).
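
Roughly, I'd expect the allocation-time check to be a predicate like the one below over candidate zpools. This is a sketch under assumed names: `zpool.total_size`, the literal reservation value, and the `$1` parameter are placeholders, not necessarily the real schema.

// Hypothetical SQL filter for option (2): a candidate zpool only
// qualifies if the requested region still fits after subtracting the
// fixed control plane reservation. $1 stands in for the bound region
// size; 107374182400 bytes = 100 GiB.
const CANDIDATE_ZPOOL_FILTER: &str = "
    (zpool.total_size - dataset.size_used - 107374182400) >= $1
";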

davepacheco commented Dec 10, 2024

Maybe related? #6110.

jmpesp commented Dec 10, 2024

For what it's worth, option 4 may not be that complicated. The following region allocation CTE change should do it:

diff --git a/nexus/db-queries/src/db/queries/region_allocation.rs b/nexus/db-queries/src/db/queries/region_allocation.rs
index a9130d87f..239014303 100644
--- a/nexus/db-queries/src/db/queries/region_allocation.rs
+++ b/nexus/db-queries/src/db/queries/region_allocation.rs
@@ -218,7 +218,8 @@ pub fn allocation_query(
   candidate_datasets AS (
     SELECT DISTINCT ON (dataset.pool_id)
       dataset.id,
-      dataset.pool_id
+      dataset.pool_id,
+      dataset.size_used
     FROM (dataset INNER JOIN candidate_zpools ON (dataset.pool_id = candidate_zpools.pool_id))
     WHERE (
       ((dataset.time_deleted IS NULL) AND
@@ -235,7 +236,8 @@ pub fn allocation_query(
   shuffled_candidate_datasets AS (
     SELECT
       candidate_datasets.id,
-      candidate_datasets.pool_id
+      candidate_datasets.pool_id,
+      candidate_datasets.size_used
     FROM candidate_datasets
     ORDER BY md5((CAST(candidate_datasets.id as BYTEA) || ").param().sql(")) LIMIT ").param().sql("
   ),")
@@ -257,7 +259,8 @@ pub fn allocation_query(
       NULL AS port,
       ").param().sql(" AS read_only,
       FALSE as deleting
-    FROM shuffled_candidate_datasets")
+    FROM shuffled_candidate_datasets
+    ORDER BY shuffled_candidate_datasets.size_used ASC")
   // Only select the *additional* number of candidate regions for the required
   // redundancy level
   .sql("

Though there's some interaction here with the supplied seed that I'm kinda ignoring in that diff.
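
One untested thought on that: keep the seeded md5 expression as a tiebreaker after size_used, so datasets with identical usage still come out in a pseudo-random order. Sketch only, reusing the builder pattern from the diff above:

// Untested sketch: sort primarily by size_used, then fall back to the
// seeded md5 shuffle from shuffled_candidate_datasets so equally-full
// datasets are still picked pseudo-randomly.
.sql("
    FROM shuffled_candidate_datasets
    ORDER BY
      shuffled_candidate_datasets.size_used ASC,
      md5((CAST(shuffled_candidate_datasets.id as BYTEA) || ").param().sql("))")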
