
panic in src/membership/membership.rs:309:51 #1242

Closed
sebadob opened this issue Sep 5, 2024 · 5 comments
sebadob commented Sep 5, 2024

Describe the bug
I was able to get a panic from openraft where I would not expect one. I modified my application so that new nodes auto-join existing clusters if they are configured to do so. When I do a rolling release on Kubernetes and the pods restart fast enough, I always end up in a situation where openraft panics, but it is pretty specific and not easily reproducible.

The code panics here:

openraft-0.9.15/src/membership/membership.rs:309:51:
called `Option::unwrap()` on a `None` value
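
For illustration only, here is a minimal standalone sketch of that failure mode, assuming the joint config is held as a list of voter-id sets; the names below are hypothetical and are not openraft's actual types:

```rust
use std::collections::BTreeSet;

fn main() {
    // Hypothetical stand-in for a membership whose joint config is empty,
    // e.g. one loaded from an uninitialized state machine.
    let joint_config: Vec<BTreeSet<u64>> = Vec::new();

    // Mirrors the pattern at membership.rs:309: take the last joint entry
    // and unwrap it. With an empty config, `last()` returns `None`, so
    // `unwrap()` panics with "called `Option::unwrap()` on a `None` value".
    let last_voter_ids = joint_config.last().unwrap();
    println!("{last_voter_ids:?}");
}
```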

To Reproduce
Steps to reproduce the behavior:
It is pretty hard to reproduce, but probably easy to fix, since it's just an .unwrap() on an Option<_>. I don't think I can give a quick guide to reproduce it here without quite a bit of setup in advance, but I can try to put one together if it is actually needed. Removing the unwrap() will probably already solve it.

Expected behavior
To not panic.

Actual behavior

2024-09-05T09:06:48.865227Z  INFO change_membership: openraft::raft::impl_raft_blocking_write: change_membership: start to commit joint config changes=ReplaceAllVoters({3}) retain=true
2024-09-05T09:06:48.865299Z  INFO openraft::core::raft_core: received RaftMsg::ChangeMembership: openraft::core::raft_core::RaftCore<_, _, _, _>::handle_api_msg members=ReplaceAllVoters({3}) retain=true
thread 'tokio-runtime-worker' panicked at /home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/openraft-0.9.15/src/membership/membership.rs:309:51:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Env (please complete the following information):

  • openraft v0.9.15 with the serde and storage-v2 features enabled
  • I don't know whether the bug exists on main, since I can't easily test against it because of the many breaking changes.

Additional Information

What my application does in that scenario: I had a cluster with an in-memory cache that was running fine. I then trigger a rolling release on Kubernetes, for instance by specifying a new container image tag. The nodes shut down and restart one by one as usual. During startup, my application code checks whether the node is already a Raft cluster member; if not, it connects to the other nodes from the given config to find the leader and then tries to join the cluster via the API on the leader node. Since this is an in-memory cache, the node is of course uninitialized after every restart, because it cannot persist its state. It then tries to join the cluster on the leader, which I guess still considers this node a cluster member, while the node itself has restarted and lost its state.

I would assume that a new "join as learner" would simply change the config so that the node becomes a learner again, without any panics or issues. As a workaround I could check on the leader first whether the joining node currently exists in the config and remove it before rejoining, but I have not tried this yet. In the end, it is just an unwrap on an Option inside membership.rs:309.
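
For illustration, a rough sketch of that workaround, using hypothetical helper functions (these are not openraft APIs, just stand-ins for my own cluster/API calls):

```rust
#[derive(Debug)]
struct JoinError(String);

// Hypothetical: ask the current leader whether `node_id` is already in its config.
fn leader_has_member(_node_id: u64) -> Result<bool, JoinError> {
    Ok(false)
}

// Hypothetical: ask the leader to remove a stale membership entry for `node_id`.
fn leader_remove_member(_node_id: u64) -> Result<(), JoinError> {
    Ok(())
}

// Hypothetical: ask the leader to add this node back as a learner.
fn leader_add_learner(_node_id: u64) -> Result<(), JoinError> {
    Ok(())
}

// On startup of a stateless (in-memory) node: if the leader still lists this
// node from before the restart, drop the stale entry first, then rejoin.
fn rejoin_cluster(node_id: u64) -> Result<(), JoinError> {
    if leader_has_member(node_id)? {
        leader_remove_member(node_id)?;
    }
    leader_add_learner(node_id)
}

fn main() {
    rejoin_cluster(3).expect("rejoin failed");
}
```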

github-actions bot commented Sep 5, 2024

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer drmingdrmer self-assigned this Sep 5, 2024
drmingdrmer (Member) commented

I'm not very sure, but the panic you're encountering is likely due to the node losing its state upon startup. When this happens, RaftStateMachine::applied_state() returns an empty configuration, which leads to the panic at:

https://github.com/datafuselabs/openraft/blob/5f5d7e9f60ff2e86af078d2c20401636de2c19f7/openraft/src/storage/helper.rs#L194

This situation arises because Openraft currently doesn't prevent Raft::change_membership() from being called when the node is in an uninitialized state. There's no API to directly set a membership value that would cause self.get_joint_config().last() to return None.

To address this issue, I am going to implement a safeguard that prevents Raft::change_membership() from being called on an uninitialized node. This should help avoid the panic and ensure more robust behavior during node startup and state recovery.
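
Roughly, the safeguard would behave like this; the types below (RaftInner, ChangeMembershipError) are hypothetical stand-ins, not openraft's real internals:

```rust
use std::collections::BTreeSet;

// Hypothetical stand-ins for openraft internals; the real safeguard lives
// inside openraft's change-membership handling, not in user code.
#[derive(Debug)]
enum ChangeMembershipError {
    NotInitialized,
}

struct RaftInner {
    // `None` models a node whose applied state contains no membership yet,
    // i.e. an uninitialized node.
    joint_config: Option<Vec<BTreeSet<u64>>>,
}

impl RaftInner {
    fn change_membership(
        &mut self,
        new_voters: BTreeSet<u64>,
    ) -> Result<(), ChangeMembershipError> {
        // Guard: reject the request up front instead of panicking later on
        // an empty joint config.
        let Some(joint) = self.joint_config.as_mut() else {
            return Err(ChangeMembershipError::NotInitialized);
        };
        joint.push(new_voters);
        Ok(())
    }
}

fn main() {
    let mut raft = RaftInner { joint_config: None };
    // On an uninitialized node this now returns an error instead of panicking.
    let res = raft.change_membership(BTreeSet::from([3]));
    println!("{res:?}");
}
```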

sebadob (Author) commented Sep 5, 2024

I had another look at it and it happens in the following order:

  • k8s starts node 1, 2, 3
  • node 1 ends up as leader
  • rolling release triggered
    • node 3 is shut down and restarted
    • node 2 is shut down and restarted
    • node 1 is shut down and restarted
    • nodes 2 + 3 crash (only once) with the panic
    • node 1 is fine
    • k8s restarts the crashed nodes 2 + 3, and they then come up healthy without any panics

So it's probably a race condition on my side as well, because the restarts happen quite fast, maybe even within the heartbeat interval; I am not sure.

drmingdrmer (Member) commented

I added a check to change_membership() when the node is not initialized:

It should fix the panic, but calling Raft::change_membership() on a non-initialized node will still return an error.

sebadob (Author) commented Sep 5, 2024

Thanks! An error is not an issue, it will simply retry on error after a short timeout.
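
For completeness, the retry is nothing more than a loop with a short pause, sketched here with a hypothetical try_join() helper:

```rust
use std::{thread, time::Duration};

#[derive(Debug)]
struct JoinError;

// Hypothetical: a single join attempt against the current leader.
fn try_join() -> Result<(), JoinError> {
    Ok(())
}

// Keep retrying the join with a short pause between attempts, so a transient
// "not initialized" / "not yet a member" error is not fatal.
fn join_with_retry() {
    while let Err(err) = try_join() {
        eprintln!("join failed: {err:?}, retrying shortly");
        thread::sleep(Duration::from_secs(1));
    }
}

fn main() {
    join_with_retry();
}
```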

This situation is pretty hard to debug, because I can't reproduce it manually and it does not happen consistently, but I guess it involves a race condition somehow, as already mentioned.

Edit:

I will test again by referring to the git branch tomorrow, but it should be fine, so I guess we can close the issue.
