Nexus overprovisions disks with control plane zones on them #7225

davepacheco opened this issue Dec 10, 2024 · 4 comments

davepacheco commented Dec 10, 2024

This appears to be the root cause of #7221: Nexus provisioned so many Crucible regions on one disk that the CockroachDB zone on the same pool (which held both its root filesystem and its database filesystem) ran out of space.

There are many ways to deal with this, with different tradeoffs. I'll post a few notes from the earlier control plane discussion.

davepacheco added this to the 13 milestone Dec 10, 2024

davepacheco commented Dec 10, 2024

Tagging this for R13 because I don't think it's a blocker (not a regression, right?), but it's important to keep on our radar. (Maybe it's better not to have a milestone and instead discuss this at the product roundtable? CC @askfongjojo @morlandi7.)

davepacheco commented Dec 10, 2024

Some ideas discussed so far:

  1. The Crucible region allocation CTE could consult the current target blueprint to see which control plane zones are on the same disk and account for their disk space, so that it doesn't overprovision the disk. I'm not familiar with this area, but it sounds complicated and would potentially couple this code tightly to the blueprint structure.
  2. A proposed simplification: reserve a fixed amount of space for the control plane on each disk (see the sketch after this list). I expect this would be on the order of 100 GiB (~3-4% of our current disk size), but I think we'd want to do a little more analysis to figure out the right number. This would probably have to be somewhat deployment-specific -- a fixed production-sized reservation wouldn't work on a4x2 or other non-production hardware.
  3. An in-between approach: Reconfigurator could manage a dynamic tunable, either system-wide or per-disk, saying how much space to reserve for the control plane. This would give us a little more runtime flexibility to claw back space for the control plane. (Personally I don't think this is worth the complexity right now. Even if we wind up reserving 5-10% of each disk for the control plane, the worst case is an efficiency problem, and when that becomes one of our bigger problems we can invest in making this bound tighter.)
  4. Bias the selection of regions towards disks with more free space. There are clearly still availability considerations (i.e., don't put them all on the same disk), and I'm sure there are others as well. But if we always chose the disks with the most space available, we wouldn't hit this problem until the whole system was almost completely full.
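
To make (2) concrete, here's a minimal sketch of what a fixed per-disk reservation could look like. It's illustrative only: the constant, the helper name, and the exact figure are my assumptions, not actual omicron code; the 100 GiB value is just the rough estimate above.

/// Space set aside on each disk for control plane zones (zone root
/// filesystems, CockroachDB data, logs, etc.). 100 GiB is the rough
/// estimate from option (2); the right number needs more analysis.
/// (Hypothetical constant -- not an actual omicron identifier.)
const CONTROL_PLANE_RESERVATION_BYTES: u64 = 100 * (1 << 30);

/// Space on a pool that region allocation may consume, or `None` when the
/// reservation alone exceeds the pool size -- which is exactly the
/// deployment-specific problem noted in option (2): a fixed
/// production-sized reservation doesn't fit on a4x2-sized disks.
fn usable_region_space(pool_total_bytes: u64) -> Option<u64> {
    pool_total_bytes.checked_sub(CONTROL_PLANE_RESERVATION_BYTES)
}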

I haven't spent much time in this area of code and I apologize if I've got stuff wrong!

Options 1-3 are all pretty related and they all assume:

  • That we want the region allocation CTE to fail with some kind of database constraint violation if anything would cause it to try to overprovision the disk.
  • That we can assign a number to the amount of space the control plane can use on each disk.

So far my vote would be to go with (2), using a hardcoded limit on the disk space used by the control plane. Then we'd enforce that at allocation time with a database constraint and at runtime with quotas/reservations (#7227).
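
Roughly, I'd expect the allocation-time check to be a predicate like the one below over candidate zpools. This is a sketch under assumed names: `zpool.total_size`, the literal reservation value, and the `$1` parameter are placeholders, not necessarily the real schema.

// Hypothetical SQL filter for option (2): a candidate zpool only
// qualifies if the requested region still fits after subtracting the
// fixed control plane reservation. $1 stands in for the bound region
// size; 107374182400 bytes = 100 GiB.
const CANDIDATE_ZPOOL_FILTER: &str = "
    (zpool.total_size - dataset.size_used - 107374182400) >= $1
";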

davepacheco commented Dec 10, 2024

Maybe related? #6110.

jmpesp commented Dec 10, 2024

For what it's worth, option 4 may not be that complicated. The following region allocation CTE change should do it:

diff --git a/nexus/db-queries/src/db/queries/region_allocation.rs b/nexus/db-queries/src/db/queries/region_allocation.rs
index a9130d87f..239014303 100644
--- a/nexus/db-queries/src/db/queries/region_allocation.rs
+++ b/nexus/db-queries/src/db/queries/region_allocation.rs
@@ -218,7 +218,8 @@ pub fn allocation_query(
   candidate_datasets AS (
     SELECT DISTINCT ON (dataset.pool_id)
       dataset.id,
-      dataset.pool_id
+      dataset.pool_id,
+      dataset.size_used
     FROM (dataset INNER JOIN candidate_zpools ON (dataset.pool_id = candidate_zpools.pool_id))
     WHERE (
       ((dataset.time_deleted IS NULL) AND
@@ -235,7 +236,8 @@ pub fn allocation_query(
   shuffled_candidate_datasets AS (
     SELECT
       candidate_datasets.id,
-      candidate_datasets.pool_id
+      candidate_datasets.pool_id,
+      candidate_datasets.size_used
     FROM candidate_datasets
     ORDER BY md5((CAST(candidate_datasets.id as BYTEA) || ").param().sql(")) LIMIT ").param().sql("
   ),")
@@ -257,7 +259,8 @@ pub fn allocation_query(
       NULL AS port,
       ").param().sql(" AS read_only,
       FALSE as deleting
-    FROM shuffled_candidate_datasets")
+    FROM shuffled_candidate_datasets
+    ORDER BY shuffled_candidate_datasets.size_used ASC")
   // Only select the *additional* number of candidate regions for the required
   // redundancy level
   .sql("

Though there's some interaction here with the supplied seed that I'm kinda ignoring in that diff.
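
One untested thought on that: keep the seeded md5 expression as a tiebreaker after size_used, so datasets with identical usage still come out in a pseudo-random order. Sketch only, reusing the builder pattern from the diff above:

// Untested sketch: sort primarily by size_used, then fall back to the
// seeded md5 shuffle from shuffled_candidate_datasets so equally-full
// datasets are still picked pseudo-randomly.
.sql("
    FROM shuffled_candidate_datasets
    ORDER BY
      shuffled_candidate_datasets.size_used ASC,
      md5((CAST(shuffled_candidate_datasets.id as BYTEA) || ").param().sql("))")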
