[nexus] Don't fail instances in create saga unwind (#7437)

When the `instance_create` saga's `sic_create_instance_record` action unwinds, it executes the compensating action [`sic_delete_instance_record`][1]. This action moves the instance's state to `Failed` prior to actually calling into `project_delete_instance` to delete it, [here][2]. This is because we presently only allow instances to be deleted when they are in the `Stopped` or `Failed` states, as noted [here][3]. Because we must first transition the instance to `Failed` in order to delete it, there is an intermediate state when the instance record created by an unwinding saga exists but is in the `Failed` state. Instances in the `Failed` state are eligible to be restarted by the `instance_reincarnation` background task. That task queries for all `Failed`` instances and creates `instance_start` sagas to attempt to restart them. Therefore, if the `instance_reincarnation` background task runs during the window of time between when an unwinding `instance_create` saga marks the instance record as `Failed` and when it actually attempts to call `project_delete_instance`, the instance can transition to `Starting` by the new instance-start saga. This results in the attempt to delete it failing, causing the unwinding saga to get stuck. This is not great --- it causes a test flake (see #7326), but it's actually a real bug, as it can result in a saga unable to unwind. This commit fixes this by moving most of `project_delete_instance` to a new function, `project_delete_instance_in_state`, which accepts a list of states in which the instance may be deleted as an argument. `project_delete_instance` now calls that function with the "normal" list of states, but unwinding instance-create sagas are additionally able to allow the instance record to be deleted while it's `Creating` in a single atomic database operation, avoiding the transient `Failed` state. This fixes #7326. Unfortunately, it's quite challenging to have a regression test for this, because it would require interrupting the unwinding saga's `sic_delete_instance_record` mid-activation, which we don't really have a nice mechanism for. Also, there was a comment in the `sic_delete_instance_record` action that stated we should be looking up the instance record to delete by ID rather than by project and name. The comment references issue #1536. While I was here changing this action, I figured I'd go ahead and change this as well. My assumption is that the previous thing predates the `LookupPath::instance_id` method? [1]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L970 [2]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L1010-L1033 [3]: https://github.com/oxidecomputer/omicron/blob/ec4b5dc3c0c45b667e57a52389d82382b0b59112/nexus/src/app/sagas/instance_create.rs#L987-L988
oxidecomputer · Jan 30, 2025 · d062c60 · d062c60
1 parent e5bf48f
commit d062c60
Show file tree

Hide file tree

Showing 2 changed files with 67 additions and 49 deletions.
diff --git a/nexus/db-queries/src/db/datastore/instance.rs b/nexus/db-queries/src/db/datastore/instance.rs
@@ -1309,18 +1309,44 @@ impl DataStore {
         opctx: &OpContext,
         authz_instance: &authz::Instance,
     ) -> DeleteResult {
-        opctx.authorize(authz::Action::Delete, authz_instance).await?;
-
         // This is subject to change, but for now we're going to say that an
         // instance must be "stopped" or "failed" in order to delete it.  The
         // delete operation sets "time_deleted" (just like with other objects)
         // and also sets the state to "destroyed".
+        self.project_delete_instance_in_states(
+            &opctx,
+            authz_instance,
+            &[InstanceState::NoVmm, InstanceState::Failed],
+        )
+        .await
+    }
+
+    /// Delete the provided `authz_instance`, as long as it is in one of the
+    /// provided set of [`InstanceState`]s.
+    ///
+    /// A.K.A. "[`project_delete_instance`], hard mode". Typically, instances
+    /// may only be deleted if they are in the [`InstanceState::NoVmm`] or
+    /// `[`InstanceState::Failed`] states. This method exists separately in
+    /// order to allow an unwinding `instance-create` saga to delete the
+    /// instance record without transiently moving it to
+    /// [`InstanceState::Failed`], in which it becomes eligible for automatic
+    /// restart.
+    ///
+    /// In general, callers should use [`project_delete_instance`] instead,
+    /// unless you really know what you're doing.
+    ///
+    /// [`project_delete_instance`]: Self::project_delete_instance
+    pub async fn project_delete_instance_in_states(
+        &self,
+        opctx: &OpContext,
+        authz_instance: &authz::Instance,
+        ok_to_delete_instance_states: &'static [InstanceState],
+    ) -> DeleteResult {
         use db::schema::{disk, instance};
 
-        let stopped = InstanceState::NoVmm;
-        let failed = InstanceState::Failed;
+        opctx.authorize(authz::Action::Delete, authz_instance).await?;
+
         let destroyed = InstanceState::Destroyed;
-        let ok_to_delete_instance_states = vec![stopped, failed];
 
         let detached_label = api::external::DiskState::Detached.label();
         let ok_to_detach_disk_states =

diff --git a/nexus/src/app/sagas/instance_create.rs b/nexus/src/app/sagas/instance_create.rs
@@ -11,7 +11,6 @@ use crate::app::{
 };
 use crate::external_api::params;
 use nexus_db_model::{ExternalIp, NetworkInterfaceKind};
-use nexus_db_queries::db::identity::Resource;
 use nexus_db_queries::db::lookup::LookupPath;
 use nexus_db_queries::db::queries::network_interface::InsertError as InsertNicError;
 use nexus_db_queries::{authn, authz, db};
@@ -977,60 +976,53 @@ async fn sic_delete_instance_record(
         &sagactx,
         &params.serialized_authn,
     );
+
     let instance_id = sagactx.lookup::<InstanceUuid>("instance_id")?;
-    let instance_name = sagactx
-        .lookup::<db::model::Instance>("instance_record")?
-        .name()
-        .clone()
-        .into();
-
-    // We currently only support deleting an instance if it is stopped or
-    // failed, so update the state accordingly to allow deletion.
-    // TODO-correctness TODO-security It's not correct to re-resolve the
-    // instance name now.  See oxidecomputer/omicron#1536.
+
     let result = LookupPath::new(&opctx, &datastore)
-        .project_id(params.project_id)
-        .instance_name(&instance_name)
+        .instance_id(instance_id.into_untyped_uuid())
         .fetch()
         .await;
-
-    // Although, as mentioned in the comment above, we should not be doing the
-    // lookup by name here, we do want this operation to be idempotent.
-    //
-    // As such, if the instance has already been deleted, we should return with
-    // a no-op.
-    let (authz_instance, db_instance) = match result {
-        Ok((.., authz_instance, db_instance)) => (authz_instance, db_instance),
+    // This action must be idempotent, so that an unwinding saga doesn't get
+    // stuck if it executes twice. Therefore, if the instance has already been
+    // deleted, we should return with a no-op, rather than an error.
+    let authz_instance = match result {
+        Ok((.., authz_instance, _)) => authz_instance,
         Err(err) => match err {
             Error::ObjectNotFound { .. } => return Ok(()),
             _ => return Err(err.into()),
         },
     };
 
-    let runtime_state = db::model::InstanceRuntimeState {
-        nexus_state: db::model::InstanceState::Failed,
-        // Must update the generation, or the database query will fail.
-        //
-        // The runtime state of the instance record is only changed as a result
-        // of the successful completion of the saga, or in this action during
-        // saga unwinding. So we're guaranteed that the cached generation in the
-        // saga log is the most recent in the database.
-        gen: db::model::Generation::from(db_instance.runtime_state.gen.next()),
-        ..db_instance.runtime_state
-    };
-
-    let updated =
-        datastore.instance_update_runtime(&instance_id, &runtime_state).await?;
-
-    if !updated {
-        warn!(
-            osagactx.log(),
-            "failed to update instance runtime state from creating to failed",
-        );
-    }
-
     // Actually delete the record.
-    datastore.project_delete_instance(&opctx, &authz_instance).await?;
+    //
+    // Note that we override the list of states in which deleting the instance
+    // is allowed. Typically, instances may only be deleted when they are
+    // `NoVmm` or `Failed`. Here, however, the instance is in the `Creating`
+    // state, as we are in the process of creating it.
+    //
+    // We would like to avoid transiently changing it to `NoVmm` or `Failed`,
+    // as this would create a temporary period of time during which an instance
+    // whose instance-create saga failed could be restarted. In particular,
+    // marking the instance as `Failed` makes it eligible for auto-restart, so
+    // if the `instance_reincarnation` background task queries for reincarnable
+    // instances during that period, it could attempt to restart the instance,
+    // preventing this saga from unwinding successfully.
+    //
+    // Therefore, we override the set of states in which the instance it is
+    // deleted, which is okay to do because we will only attempt to delete a
+    // `Creating` instance when unwinding an instance-create saga.
+    datastore
+        .project_delete_instance_in_states(
+            &opctx,
+            &authz_instance,
+            &[
+                db::model::InstanceState::NoVmm,
+                db::model::InstanceState::Failed,
+                db::model::InstanceState::Creating,
+            ],
+        )
+        .await?;
 
     Ok(())
 }