panic in src/membership/membership.rs:309:51
#1242
Comments
I'm not very sure, but the panic you're encountering is likely due to the node losing its state upon startup. When this happens, the node comes back empty and asks to join the cluster again, while the leader's membership config still lists it as a member. This situation arises because Openraft currently doesn't prevent such a re-join from reaching the membership code in this inconsistent state. To address this issue, I am going to implement a safeguard that prevents this case from panicking and returns an error instead.
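A minimal sketch of what such a safeguard could look like, using stand-in types only (Membership, JoinError and add_learner_guarded are invented for illustration and are not Openraft's API; the actual change in Openraft may well work differently): the leader refuses to add a node as a learner while its config still lists that node as a voter, and returns an error instead of letting a later lookup panic.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in membership type, not Openraft's internal one.
#[derive(Debug)]
struct Membership {
    voters: BTreeSet<u64>,
    // node id -> network address
    nodes: BTreeMap<u64, String>,
}

#[derive(Debug)]
enum JoinError {
    AlreadyAVoter(u64),
}

// Leader-side guard: reject a learner-add for a node that is still a voter,
// instead of letting a later lookup unwrap and panic.
fn add_learner_guarded(m: &mut Membership, id: u64, addr: String) -> Result<(), JoinError> {
    if m.voters.contains(&id) {
        return Err(JoinError::AlreadyAVoter(id));
    }
    m.nodes.insert(id, addr);
    Ok(())
}

fn main() {
    let mut m = Membership {
        voters: BTreeSet::from([1, 2, 3]),
        nodes: BTreeMap::from([
            (1, "10.0.0.1:8080".to_string()),
            (2, "10.0.0.2:8080".to_string()),
            (3, "10.0.0.3:8080".to_string()),
        ]),
    };

    // Node 2 restarted with an empty in-memory store and asks to join again.
    match add_learner_guarded(&mut m, 2, "10.0.0.2:8080".to_string()) {
        Ok(()) => println!("node 2 joined as a learner"),
        Err(e) => println!("join rejected: {:?}", e),
    }
}
```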
I had another look at it and traced the order in which things happen. So it's probably a race condition on my side as well, because the restarts happen quite fast, maybe even within the heartbeat interval; I am not sure.
I added a check for this case. It should fix the panic, but calling it in this situation will now return an error instead.
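In general terms, such a check replaces an unwrap on an Option with an error return. The following is a sketch of the pattern only, not the actual code in Openraft (node_addr and MembershipError are made up for illustration):

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
enum MembershipError {
    NodeNotFound(u64),
}

// A lookup such as `nodes.get(&id).unwrap()` panics when the id is missing;
// returning a Result lets the caller see an error and retry instead.
fn node_addr(nodes: &BTreeMap<u64, String>, id: u64) -> Result<&String, MembershipError> {
    nodes.get(&id).ok_or(MembershipError::NodeNotFound(id))
}

fn main() {
    let nodes = BTreeMap::from([(1u64, "10.0.0.1:8080".to_string())]);

    // Node 2 is not in the map: we get an error back instead of a panic.
    match node_addr(&nodes, 2) {
        Ok(addr) => println!("node 2 is at {addr}"),
        Err(e) => println!("membership change rejected: {e:?}"),
    }
}
```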
Thanks! An error is not an issue; it will simply retry on error after a short timeout. This situation is pretty hard to debug, because I can't reproduce it manually and it does not happen consistently, but I guess it involves a race condition somehow, as already mentioned. Edit: I will test again tomorrow using the git branch, but it should be fine, so I guess we can close the issue.
Describe the bug
I was able to get a panic using openraft when I would not expect it. I modified my application in a way that new nodes will auto-join existing clusters if they are configured to do that. When I do a rolling release on Kubernetes and the pods restart fast enough, I always get into the situation where openraft panics, but it is pretty specific and not easily reproducible. The code panics at src/membership/membership.rs:309:51.
To Reproduce
Steps to reproduce the behavior:
It is pretty hard to reproduce, but probably easy to fix, since it's just an .unwrap() on an Option<_>. I don't think that I can give an easy and quick guide to reproduce it here without quite a bit of setup in advance. However, I can try to do this if actually needed. Probably just removing the unwrap() will solve it.
Expected behavior
To not panic.
Actual behavior
A panic at src/membership/membership.rs:309:51.
Env (please complete the following information):
openraft v0.19.5 with serde and storage-v2 features enabled
Not tested against main, since I can't easily test because of many breaking changes.
Additional Information
What my application is doing in that scenario: I had a cluster with an in-memory cache that was running fine. I then trigger a rolling release on Kubernetes, for instance by specifying a new container image tag. The nodes start to shut down and restart one by one as usual. My application code then checks during startup whether the node is a raft cluster member; if not, it tries to connect to the other nodes from the given config to find out the leader, and then tries to join the cluster using the API on the leader node. Since it is an in-memory cache in this situation, the node is un-initialized with each restart of course, because it can't persist its state. It then tries to join the cluster via the leader, which I guess still thinks that this node is a cluster member, while the node itself did a restart and lost its state.
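For illustration, here is a rough sketch of that startup flow, with every type and helper stubbed out (NodeConfig, has_local_raft_state, ask_for_leader and join_via_leader are assumptions, not the real application or Openraft code): an uninitialized node walks the configured peers, asks who the leader is, and requests to join via the leader.

```rust
#[derive(Debug)]
struct NodeConfig {
    node_id: u64,
    peers: Vec<String>, // addresses of the other configured nodes
}

fn has_local_raft_state() -> bool {
    // With a purely in-memory store this is always false after a restart.
    false
}

fn ask_for_leader(peer: &str) -> Option<String> {
    // Stand-in for an HTTP/gRPC call that asks a peer who the current leader is.
    Some(peer.to_string())
}

fn join_via_leader(leader: &str, node_id: u64) -> Result<(), String> {
    // Stand-in for the join call against the leader's API.
    println!("asking leader {leader} to add node {node_id} as a learner");
    Ok(())
}

fn startup(cfg: &NodeConfig) -> Result<(), String> {
    if has_local_raft_state() {
        // Already a member with persisted state: just restart raft.
        return Ok(());
    }
    // No local state (in-memory cache after a restart): find the leader and join.
    for peer in &cfg.peers {
        if let Some(leader) = ask_for_leader(peer) {
            return join_via_leader(&leader, cfg.node_id);
        }
    }
    Err("no reachable peer could name a leader".to_string())
}

fn main() {
    let cfg = NodeConfig {
        node_id: 2,
        peers: vec!["10.0.0.1:8080".into(), "10.0.0.3:8080".into()],
    };
    startup(&cfg).expect("auto-join failed");
}
```

At this point the leader may still list the restarting node as a voter, which is exactly the situation that triggered the panic.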
I would assume that a new "join as learner" would simply change the config in a way that the node becomes a learner again, without any panics or issues. I guess I could do a workaround in this case and check on the leader first whether the joining node currently exists in the config, and remove it first, but I have not tried this yet. In the end, it is just an unwrap on some Option inside membership.rs:309.
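The workaround mentioned above could look roughly like this on the leader side (again with stand-in types; whether a remove-then-re-add is actually safe to do through Openraft's membership-change API, and how to call it, depends on the version in use and is not verified here):

```rust
use std::collections::BTreeSet;

// Stand-in for the leader's view of which nodes are in the membership config.
struct LeaderView {
    members: BTreeSet<u64>,
}

impl LeaderView {
    // Workaround sketch: if the joining node is still in the config,
    // drop the stale entry first, then add the node back as a learner.
    fn handle_join(&mut self, id: u64) {
        if self.members.contains(&id) {
            println!("node {id} is still in the config, removing the stale entry");
            self.members.remove(&id);
        }
        println!("adding node {id} back as a learner");
        self.members.insert(id);
    }
}

fn main() {
    let mut leader = LeaderView {
        members: BTreeSet::from([1, 2, 3]),
    };
    // Node 2 restarted with an empty state and asks to join again.
    leader.handle_join(2);
}
```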