(2/5) [nexus] Add Affinity/Anti-Affinity groups to database #7444

Open

wants to merge 7 commits into base: affinity-api
Conversation

@smklein (Collaborator) commented Jan 30, 2025

Pulled out of #7076

Updates the schema to include Affinity/Anti-Affinity groups, but does not use the new tables yet.
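
For readers skimming the thread, here is a rough sketch of the shape such tables typically take. The type, table, and column names below are illustrative guesses, not the exact contents of dbinit.sql in this PR.

```sql
-- Illustrative sketch only; not the exact schema added by this PR.
CREATE TYPE IF NOT EXISTS affinity_policy AS ENUM ('allow', 'fail');

CREATE TABLE IF NOT EXISTS anti_affinity_group (
    id UUID PRIMARY KEY,
    name STRING(63) NOT NULL,
    description STRING(512) NOT NULL,
    project_id UUID NOT NULL,
    policy affinity_policy NOT NULL,
    time_created TIMESTAMPTZ NOT NULL,
    time_modified TIMESTAMPTZ NOT NULL,
    time_deleted TIMESTAMPTZ
);

-- An affinity_group table would look much the same; membership is a join
-- table between a group and the instances it constrains.
CREATE TABLE IF NOT EXISTS anti_affinity_group_instance_membership (
    group_id UUID NOT NULL,
    instance_id UUID NOT NULL,
    PRIMARY KEY (group_id, instance_id)
);
```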

@hawkw (Member) left a comment

A few questions, but overall, this looks good.

Resolved review threads on:
  • schema/crdb/dbinit.sql (3 threads)
  • schema/crdb/affinity/up11.sql
  • nexus/db-model/src/affinity.rs

@gjcolombo (Contributor) left a comment

This generally LGTM.

One high-level note, which you can take or leave (and which I hope I'm not duplicating from someone else!): There are a handful of comments both here and in #7445 about the non-atomicity of the "reserve resources" and "create VMM record" steps of the instance start saga. I think those steps could be made atomic pretty easily: the query to reserve resources already operates in a large transaction, and the two subsequent steps of the start saga (bump the next-available IP address on the selected sled and create a new VMM record) don't seem like they would make it that much more expensive. (Perhaps trying to contend on the sled record would increase the chances of transaction conflicts?)

It seems like this is inconvenient enough that it might be worthwhile to try changing it. But I'm leery of restructuring an existing saga like this without the ability to run down existing sagas while preparing to update... and, at least in this PR, it sounds like having that would mitigate some of our concerns about not creating reservations and VMM records atomically. So this might not be worth pursuing right now, but I did at least want to note that if the start saga's existing behavior in this respect is a humongous pain, we can probably change it easily.

@smklein (Collaborator, Author) commented Feb 3, 2025

> There are a handful of comments both here and in #7445 about the non-atomicity of the "reserve resources" and "create VMM record" steps of the instance start saga. I think those steps could be made atomic pretty easily: the query to reserve resources already operates in a large transaction, and the two subsequent steps of the start saga (bump the next-available IP address on the selected sled and create a new VMM record) don't seem like they would make it that much more expensive. (Perhaps trying to contend on the sled record would increase the chances of transaction conflicts?)

I acknowledge that this is possible, but I have a couple of concerns I want to address before considering this route.

  1. Do we want this transaction to get larger? I'm concerned that the "reserve resources" transaction is already getting too large, and even before these affinity PRs, it's a potential source of contention. I was considering converting more of the transaction into a CTE as a follow-up to this PR (see the sketch after this list), and continuing to extend it makes that more difficult.

  2. Do we think sled reservation should be so tightly coupled with VMM record creation? One of the reasons that the "sled reservation" table has a "kind" field -- even though it's "instances only" right now -- was so that we could, one day, use it when making placement decisions about control-plane services too. Right now, we don't, because we're kinda ignoring everything that isn't an instance when we perform these reservations. But that model might break down in the future, if we want to avoid overprovisioning instances while considering control plane usage?
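
As a concrete illustration of the CTE direction mentioned in point 1, the reservation could in principle become a single statement. Every table and column name here is made up for the sketch; a real query would also have to account for capacity, policy, and (eventually) affinity rules.

```sql
WITH candidate_sleds AS (
    -- Sleds that are still in service.
    SELECT id FROM sled WHERE time_deleted IS NULL
),
chosen_sled AS (
    -- Pick one candidate; real selection would weigh capacity and policy.
    SELECT id FROM candidate_sleds ORDER BY random() LIMIT 1
)
INSERT INTO sled_resource (id, sled_id, instance_id, hardware_threads, rss_ram)
SELECT gen_random_uuid(), chosen_sled.id, $1, $2, $3
FROM chosen_sled
RETURNING sled_id;
```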

I filed #7468 to summarize some of my thoughts here.

@gjcolombo (Contributor) commented

> Do we want this transaction to get larger? I'm concerned that the "reserve resources" transaction is already getting too large, and even before these affinity PRs, it's a potential source of contention. I was considering converting more of the transaction into a CTE as a follow-up to this PR, and continuing to extend it makes that more difficult.

Agreed, this is admittedly already a big transaction. I'd definitely be in favor of breaking it up some/trying to convert some of it to a CTE.

To step back a bit, my thought process here is:

  • There's a slight conceptual difference between "this instance has resources allocated on this sled" and "this instance is incarnate in a VMM on this sled" (e.g. I believe we can have a Destroyed VMM whose resource allocation hasn't yet been cleaned up by an update saga)
  • When we make affinity decisions, it's "is this instance incarnate here?" that really matters, not "does this instance have a resource reservation here?"
  • But the current construction of the start saga means that if we consider instances' VMM fields directly, we can violate the affinity rules (by selecting the same sled for two anti-affine instances that are starting simultaneously and that don't have VMM pointers yet)

The two concepts at issue are more than close enough for what we're trying to do at this stage, but in the long run I'd love to see if we can smooth out this rough spot by making it possible to ask "where are the VMMs?" instead of "where are the reservations?". Certainly not a blocker, though; just something I wanted to raise as a possibility.
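
One way to picture the "where are the VMMs?" question is a placement filter that excludes any sled already hosting a live VMM for an instance sharing an anti-affinity group with the instance being placed ($1 below). The names are placeholders, not the real schema.

```sql
-- Sleds to avoid when placing instance $1 under a strict anti-affinity policy.
SELECT DISTINCT vmm.sled_id
FROM anti_affinity_group_instance_membership AS mine
JOIN anti_affinity_group_instance_membership AS peer
  ON peer.group_id = mine.group_id
 AND peer.instance_id != mine.instance_id
JOIN vmm ON vmm.instance_id = peer.instance_id
WHERE mine.instance_id = $1
  AND vmm.state != 'destroyed';
```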

@andrewjstone (Contributor) commented

> Do we think sled reservation should be so tightly coupled with VMM record creation? One of the reasons that the "sled reservation" table has a "kind" field -- even though it's "instances only" right now -- was so that we could, one day, use it when making placement decisions about control-plane services too. Right now, we don't, because we're kinda ignoring everything that isn't an instance when we perform these reservations. But that model might break down in the future, if we want to avoid overprovisioning instances while considering control plane usage?

I thought we agreed to flatten that table down and remove the "kind" field. I think that if we don't do that, we'll end up with multiple columns that are optional and only applicable to some rows. I think it would be better to instead use multiple resource tables and join them appropriately when making allocation decisions, or come up with a new schema when we actually get to the point of implementing this stuff.

@smklein (Collaborator, Author) commented Feb 3, 2025

> I thought we agreed to flatten that table down and remove the "kind" field. I think that if we don't do that, we'll end up with multiple columns that are optional and only applicable to some rows. I think it would be better to instead use multiple resource tables and join them appropriately when making allocation decisions, or come up with a new schema when we actually get to the point of implementing this stuff.

Yeah, I still plan to do that! I think I'm just speculating on "where are we going to have coupling".

Let's suppose we have a sled_instance_resource and a sled_service_resource table, with similar columns. When we're making a new allocation on a sled, we would want to consider both, to be aware of "total usage on a sled".
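
For example, "total usage on a sled" could be computed by unioning the two hypothetical tables; neither table exists today, so this is only a sketch of the query pattern.

```sql
SELECT sled_id,
       SUM(hardware_threads) AS threads_used,
       SUM(rss_ram)          AS ram_used
FROM (
    SELECT sled_id, hardware_threads, rss_ram FROM sled_instance_resource
    UNION ALL
    SELECT sled_id, hardware_threads, rss_ram FROM sled_service_resource
) AS all_reservations
GROUP BY sled_id;
```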

For instances specifically, do we want this to happen in a "reservation" stage prior to VMM creation? Or is the creation of the reservation + VMM record going to happen in the same transaction?

I do think either option could work. I just want to make sure we think about future patterns here so we're careful about contention on these DB rows.

@andrewjstone (Contributor) commented

Thanks @smklein. That makes sense. I kinda assumed you were using "kind" as shorthand, and was just making sure. I don't actually know the right direction for allocation decisions, but I am interested in the problem, although honestly I haven't thought much about it.
