Doc: refine FAQ, add "What actions are required when a node restarts" #928

Merged · 1 commit · Nov 13, 2023
`openraft/src/docs/faq/faq.md`: 125 changes (71 additions, 54 deletions)
# FAQ

### Why is log id a tuple of `(term, node_id, log_index)`?

In standard Raft the log id is `(term, log_index)`; in Openraft the log id `(term,
node_id, log_index)` is used to minimize the chance of election conflicts.
This way, more than one leader can be elected within a term, and the last one is valid.
See: [`leader-id`](`crate::docs::data::leader_id`) for details.
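
Conceptually, the `node_id` sits next to the `term` in the leader identifier
that every log id carries. A minimal sketch of this shape (field types
simplified; Openraft's actual definitions are generic over the application's
node id type):

```ignore
// Simplified sketch, not Openraft's exact types.
struct LeaderId {
    term: u64,
    node_id: u64, // distinguishes leaders elected in the same term
}

struct LogId {
    leader_id: LeaderId, // which leader proposed this entry
    index: u64,          // position of the entry in the log
}
```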
<br/>


### How to remove node-2 safely from a cluster `{1, 2, 3}`?

Call `Raft::change_membership(btreeset!{1, 3})` to exclude node-2 from
the cluster. Then wipe out node-2 data.
**NEVER** modify/erase the data of any node that is still in a raft cluster, unless you know what you are doing.
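
A minimal sketch of the removal flow, assuming a running `Raft` handle named
`raft`, `u64` node ids, and the two-argument `change_membership(members, retain)`
form; adapt it to your own type config:

```ignore
// Commit a new membership containing only nodes 1 and 3. `retain = false`
// removes node-2 entirely instead of demoting it to a learner.
raft.change_membership(btreeset! {1, 3}, false).await?;

// Only after this change is committed is it safe to wipe node-2's data.
```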
<br/>


### What actions are required when a node restarts?

None. No calls, such as [`add_learner()`][] or [`change_membership()`][],
are necessary.

Openraft maintains the membership configuration in [`Membership`][] for all
nodes in the cluster, including voters and non-voters (learners). When a
`follower` or `learner` restarts, the leader will automatically re-establish
replication.
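
In other words, a restarted node only rebuilds its `Raft` instance from the
state already on disk. A minimal sketch, assuming the usual construction
arguments (the exact signature depends on your storage and network
implementations):

```ignore
// On restart, re-open the persistent stores and rebuild the Raft instance
// with the same components the node was originally started with.
let raft = Raft::new(node_id, config, network, log_store, state_machine).await?;

// No `add_learner()` or `change_membership()` calls are needed: the leader
// finds this node in the committed `Membership` and resumes replication.
```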

### Can I wipe out the data of ONE node and wait for the leader to replicate all data to it again?

Avoid doing this. Doing so will panic the leader. But it is permitted
if the [`loosen-follower-log-revert`] feature flag is enabled.

In a raft cluster, although logs are replicated to multiple nodes,
wiping out a node and restarting it can still cause data loss.
Assume the leader is `N1` and the followers are `N2, N3, N4, N5`:
- A log (`a`) that is replicated by `N1` to `N2, N3` is considered committed.
- At this point, if `N3` is replaced with an empty node and the leader `N1`
  crashes at the same time, then `N5` may be elected as the new leader with
  votes granted by `N3, N4`;
- The new leader `N5` will then not have log `a`.

```text
Ni: Node i
Lj: Leader at term j
Fj: Follower at term j

N1 | L1 a      crashed
N2 | F1 a
N3 | F1 a      erased        F2
N4 |                         F2
N5 |                         elect L2
----------------------------+---------------> time
               Data loss: N5 does not have log `a`
```

But for a cluster with an even number of nodes, erasing **exactly one** node won't cause data loss.
Thus, in a special scenario like this, or for testing purposes, you can use
`--features loosen-follower-log-revert` to permit erasing a node.
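
For example, an application could enable the flag in its `Cargo.toml` (the
version number here is illustrative):

```toml
[dependencies]
openraft = { version = "0.8", features = ["loosen-follower-log-revert"] }
```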
<br/>

### Is Openraft resilient to incorrectly configured clusters?

No, Openraft, like standard raft, cannot identify errors in cluster configuration.

A common error is assigning a wrong network address to a node. In such
a scenario, if this node becomes the leader, it will attempt to replicate
logs to itself. This will cause Openraft to panic because replication
messages can only be received by a follower.

```text
thread 'main' panicked at openraft/src/engine/engine_impl.rs:793:9:
assertion failed: self.internal_server_state.is_following()
```

```ignore
// openraft/src/engine/engine_impl.rs:793
pub(crate) fn following_handler(&mut self) -> FollowingHandler<C> {
    debug_assert!(self.internal_server_state.is_following());
    // ...
}
```

<br/>

[`loosen-follower-log-revert`]: `crate::docs::feature_flags#loosen_follower_log_revert`

[`add_learner()`]: `crate::Raft::add_learner`
[`change_membership()`]: `crate::Raft::change_membership`
[`Membership`]: `crate::Membership`