Graceful shutdown of an OnDiskStateMachine #381

Open
i512 opened this issue Feb 4, 2025 · 1 comment
i512 commented Feb 4, 2025

Using dragonboat v4.

Is there a way to exit from a StateMachine Update with an error without causing a panic? Is this something you would consider adding? It could be triggered by returning a special kind of error from the Update, for example.

As I understand it, the assumption is that a state machine error always indicates a logic error, so the process should not continue. I would argue that this does not hold for OnDiskStateMachines: in our project, we have multiple replicas per NodeHost, and each replica's SM stores data on its own HDD. The HDDs can enter read-only (RO) mode or fail unexpectedly. That is a partial hardware fault, not a logic error. We would like to handle such cases gracefully by stopping the failed replica without affecting the other replicas on the same NodeHost.

We have to handle two scenarios of disk failures:

  • RO->RW: the FS enters RO mode, an operator fixes it and makes it RW again. In this case, the SM can continue applying updates after a long pause.
  • Disk fails completely: we wait for a new replica to be created and brought up to date, after which we can safely discard the failed replica.

The simplest way to handle these would be to just stop the replica without applying the last update.


i512 commented Feb 4, 2025

We can probably handle the first case by stalling in Update until the disk becomes RW again. The stall could last a couple of hours. This is not something you would expect an SM to do, so I'm not sure whether it will cause other problems.
