Is there a way to exit from a StateMachine Update with an error without causing a panic? Is this something you would consider adding? It could be triggered by returning a special kind of error from the Update, for example.
As I understand, the assumption is that a state machine error always indicates a logic error, and we should not continue. I would like to argue that this is not the case for OnDiskStateMachines: in our project, we have multiple replicas per node host, and each replica's SM stores data on its own HDD. The HDDs can enter RO mode or fail unexpectedly. This is not a logic error but a partial hardware fault in this case. We would like to handle such cases gracefully by stopping the failed replica without affecting other replicas on this NodeHost.
We have to handle two scenarios of disk failures:
RO->RW: the filesystem enters RO mode, then an operator fixes it and makes it RW again. In this case, the SM can continue applying updates after a long pause.
Disk fails completely: we wait for a new replica to be created and brought up to date; after that, we can safely discard the failed replica.
The simplest way to handle these would be to just stop the replica without applying the last update.
We can probably handle the first case by stalling in the Update until the disk becomes RW again. The stall could last a couple of hours. This is not something you would expect an SM to do, so I'm not sure whether it would cause other problems.
Using dragonboat v4.