Graceful shutdown of an OnDiskStateMachine #381

i512 · 2025-02-04T16:00:04Z

Using dragonboat v4.

Is there a way to exit from a StateMachine Update with an error without causing a panic? Is this something you would consider adding? It could be triggered by returning a special kind of error from the Update, for example.

As I understand it, the assumption is that a state machine error always indicates a logic error, so the process should not continue. I would argue that this does not hold for OnDiskStateMachines: in our project, we have multiple replicas per NodeHost, and each replica's SM stores data on its own HDD. The HDDs can enter RO mode or fail unexpectedly. That is a partial hardware fault, not a logic error. We would like to handle such cases gracefully by stopping the failed replica without affecting the other replicas on the same NodeHost.

We have to handle two disk failure scenarios:

1. RO->RW: the FS enters RO mode, an operator fixes it and makes it RW again. In this case, the SM can continue applying updates after a long pause.
2. Disk fails completely: we wait for a new replica to be created and to catch up. After that, we can safely discard the failed replica.

The simplest way to handle these would be to just stop the replica without applying the last update.
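
To make the distinction between a disk fault and a logic error concrete, here is a minimal sketch of how the SM could classify an error before deciding what to do. It is not dragonboat API. The names `faultKind`, `classify` and `sampleErrors` are made up for illustration, and mapping `EROFS` to the RO case and `EIO` to the failed-disk case is an assumption. The errors that actually surface depend on the file system and the storage engine in use.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// faultKind is a hypothetical classification of errors seen by the SM.
type faultKind int

const (
	logicError   faultKind = iota // anything else: keep the fail-fast behaviour
	diskReadOnly                  // assumed to surface as EROFS
	diskFailed                    // assumed to surface as EIO
)

// classify walks the error chain and looks for the errno values that
// this sketch treats as hardware faults.
func classify(err error) faultKind {
	switch {
	case errors.Is(err, syscall.EROFS):
		return diskReadOnly
	case errors.Is(err, syscall.EIO):
		return diskFailed
	default:
		return logicError
	}
}

// sampleErrors builds illustrative errors; the path is made up.
func sampleErrors() (ro, ioErr, other error) {
	ro = fmt.Errorf("apply entry: %w",
		&os.PathError{Op: "write", Path: "/hypothetical/replica/db", Err: syscall.EROFS})
	ioErr = &os.PathError{Op: "write", Path: "/hypothetical/replica/db", Err: syscall.EIO}
	other = errors.New("unexpected command type")
	return ro, ioErr, other
}

func main() {
	ro, ioErr, other := sampleErrors()
	fmt.Println(classify(ro), classify(ioErr), classify(other))
}
```

Only the two disk fault kinds would lead to the graceful stop requested here. Everything else would keep the current behaviour.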

i512 · 2025-02-04T16:07:56Z

We can probably handle the first case by stalling in the Update until the disk becomes RW again. The stall could last a couple of hours. This is not something you would expect an SM to do, so I am not sure whether it would cause other problems.
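
A minimal sketch of the stalling idea, for reference. It is a generic retry helper, not dragonboat API. `stallUntilApplied`, `ErrStopped` and `simulate` are hypothetical names, and the `apply` closure stands in for whatever the SM does to persist a batch of entries. Whether dragonboat tolerates an Update that blocks this long is exactly the open question.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStopped is a hypothetical sentinel returned when the replica is asked
// to stop while the apply is still stalled.
var ErrStopped = errors.New("stalled apply aborted: replica stopping")

// stallUntilApplied retries apply every interval until it succeeds or
// stop is closed. The failed write is never skipped, only retried.
func stallUntilApplied(stop <-chan struct{}, interval time.Duration, apply func() error) error {
	for {
		if err := apply(); err == nil {
			return nil
		}
		select {
		case <-stop:
			return ErrStopped
		case <-time.After(interval):
			// Disk may be writable again; try once more.
		}
	}
}

// simulate models a disk that rejects the first `failures` writes.
// With stopEarly set, the stop channel is closed before the first attempt.
func simulate(failures int, stopEarly bool) (attempts int, err error) {
	stop := make(chan struct{})
	if stopEarly {
		close(stop)
	}
	apply := func() error {
		attempts++
		if attempts <= failures {
			return errors.New("read-only file system")
		}
		return nil
	}
	err = stallUntilApplied(stop, time.Millisecond, apply)
	return attempts, err
}

func main() {
	attempts, err := simulate(3, false)
	fmt.Println(attempts, err) // prints "4 <nil>"
}
```

In a real SM the interval would be seconds rather than milliseconds, and the stop channel would have to be closed from wherever the replica is shut down.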
