Introducing piraeus.io/evacuate Taint for Node Draining Support #530
base: v2
Conversation
(force-pushed from 516d51a to 9f16929)
Thanks for the contribution! I'll need to think about it some more; I think some of the logic around "draining" could be simplified.
Having thought some more about it: would it make sense to use a custom taint instead of a node label? There were a few other things I noticed as well, and finally we need a way to clean up once the evacuation is done.
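The taint-based check suggested here could look roughly like this in the reconciler. This is a minimal sketch using simplified stand-ins for the k8s.io/api/core/v1 types; the taint key matches the one the PR title later adopted, but everything else is illustrative.

```go
package main

import "fmt"

// Taint and Node are simplified stand-ins for the corev1 types the real
// operator would use.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name   string
	Taints []Taint
}

const evacuateTaintKey = "piraeus.io/evacuate"

// hasEvacuateTaint reports whether the node carries the evacuation taint,
// replacing the label/annotation lookup from the first iteration of this PR.
func hasEvacuateTaint(n Node) bool {
	for _, t := range n.Taints {
		if t.Key == evacuateTaintKey {
			return true
		}
	}
	return false
}

func main() {
	n := Node{Name: "worker-1", Taints: []Taint{{Key: evacuateTaintKey, Effect: "NoSchedule"}}}
	fmt.Println(hasEvacuateTaint(n)) // prints: true
}
```

A taint also has the side effect of keeping new workloads off the node while it is being evacuated, which a plain label would not.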
Hey @WanzenBug, Thanks for your feedback. I'm on board with the idea of using a taint over an annotation. It does seem to fit the context better, especially considering the eventual node deletion. Reflecting on it, I think you're spot on about the other points as well. Lastly, taking your suggestions into consideration, I've mapped out a path for the cleanup once everything's in sync and the original node has been decommissioned. I'll try to make some time in the upcoming days to get these changes integrated into the PR.
Signed-off-by: boedy <[email protected]>
…annotation which triggers the operator to autoplace an additional replica to maintain data availability when the node is taken down. Signed-off-by: boedy <[email protected]>
(force-pushed from 9f16929 to 6873870)
(retitled from "linstor.linbit.com/prepare-for-removal Annotation for Node Draining Support" to "piraeus.io/evacuate Taint for Node Draining Support")
(force-pushed from 6873870 to f500bc0)
Signed-off-by: boedy <[email protected]>
(force-pushed from f500bc0 to 670dc7d)
Hey @WanzenBug, I've just pushed an update with my latest changes, taking into account the feedback you provided. I must admit, there were a few tricky edge cases that really made me scratch my head, particularly around handling nodes that housed resources from another node which had previously been tainted. After some iterations, I'm confident in the solution I've landed on for such scenarios.
One thing to note: the current approach does result in a higher number of API calls during each reconciliation cycle, which is something you might want to revisit and optimise down the line. Let me know what you think.
On first glance this looks good. I will (again) need some time to think about all the corner cases 😄
(force-pushed from 06a3a00 to b870d94)
(force-pushed from b870d94 to 2261105)
I just pushed some bugfixes for issues I ran into and had overlooked at first. One thing that isn't yet handled gracefully is the actual deletion of the node / satellite: after I delete the k8s node, the operator logs the same error on every reconcile loop.
When draining the node, the satellite pod also gets evicted. As the node is tainted, it won't get re-spawned, which in turn prevents the operator from gracefully removing the satellite. How should this be resolved? Lastly, as AutoplaceTarget is always reset in the undo method, evacuating multiple nodes isn't a pleasant experience.
I think the satellite needs to tolerate its own taint.
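In pod-spec terms, that would mean a toleration along these lines. This is a sketch based only on this thread: the taint key and effect are assumptions, not the merged configuration.

```yaml
# Toleration for the LINSTOR satellite pod, so it keeps running on a node
# tainted for evacuation and the operator can still deregister it cleanly.
tolerations:
  - key: piraeus.io/evacuate   # taint key proposed in this PR
    operator: Exists           # tolerate the taint regardless of its value
    effect: NoSchedule
```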
Ok, sorry for the long delay. I played around with this PR a bit more, but I wasn't really happy with the general workflow. There is some improvement coming on the LINSTOR front: namely, LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use LINSTOR's built-in node evacuation. I do plan to revisit this PR when that feature is available, and also work on making the behaviour configurable. Some users had trouble with the automatic evacuation when removing a node from k8s, so we should take a look at these processes in general. I envision two new settings.
Hey @WanzenBug, Thanks for the update. I think the two settings make sense.
Regarding the above: how would volumes which only have one replica be handled? Would they still be migrated?
Yes. When the evacuation is triggered, such a resource would temporarily have two replicas: the original on the evacuating node and a newly placed one elsewhere.
In the meantime the affinity controller should have updated the PV and k8s should have evicted the Pod, so the resource can become secondary, at which point it is removed by LINSTOR and you are again left with just one replica.
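The replica count over the course of that evacuation can be condensed into a small walk-through. This is purely illustrative of the workflow described above, not LINSTOR internals.

```go
package main

import "fmt"

// evacuationReplicaCounts returns the replica count of a resource at each
// stage of an evacuation: before, while old and new replicas coexist, and
// after the old replica is removed.
func evacuationReplicaCounts(start int) []int {
	during := start + 1 // a new replica is placed before the old one is touched
	after := during - 1 // the old replica is removed once it is Secondary and unused
	return []int{start, during, after}
}

func main() {
	fmt.Println(evacuationReplicaCounts(1)) // prints: [1 2 1]
}
```

The key property is that the count never drops below its starting value, so a single-replica volume stays available throughout.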
I'm pleased to share my first contribution, focused on enhancing node draining in Linstor/Piraeus clusters. Feedback welcome on areas that could be further improved / optimised or things I might have overlooked.
Note: This PR is not yet complete. While the core functionality is in place, documentation should be added once the initial review is done and any suggested changes are incorporated.
Summary:
This pull request introduces a new annotation, linstor.linbit.com/prepare-for-removal, for Kubernetes nodes. When a node is annotated with this label, the LinstorSatelliteReconciler prepares the node for draining by auto-placing additional replicas for all of its resources.
The new annotation serves a crucial role in scenarios where a resource does not already have replicas. By creating additional replicas, it ensures that data remains available even when the original node is offline, and it allows the workload to be migrated in cases where allowRemoteVolumeAccess: false is defined. This feature is also particularly useful because Piraeus currently lacks (please correct me if I'm wrong) auto-healing functionality to maintain the replica count automatically when a node is lost.
The annotation provides a simple and declarative way to prepare nodes for maintenance or decommissioning while enhancing data availability.
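The annotation-driven behaviour described above boils down to a single decision in the reconcile loop. Below is a minimal sketch with simplified stand-in types; the prepare/undo hooks are hypothetical placeholders for the PR's prepareNodeForDraining and undoNodeDrainingPreparation methods.

```go
package main

import "fmt"

const prepareForRemovalAnnotation = "linstor.linbit.com/prepare-for-removal"

// Node is a simplified stand-in for corev1.Node.
type Node struct {
	Name        string
	Annotations map[string]string
}

// reconcileDraining mirrors the high-level behaviour of this PR:
// prepare the node while the annotation is present, undo once it is removed.
func reconcileDraining(n Node, prepare, undo func(Node) error) error {
	if _, ok := n.Annotations[prepareForRemovalAnnotation]; ok {
		return prepare(n)
	}
	return undo(n)
}

func main() {
	n := Node{Name: "worker-1", Annotations: map[string]string{prepareForRemovalAnnotation: "true"}}
	_ = reconcileDraining(n,
		func(n Node) error { fmt.Println("preparing", n.Name); return nil },
		func(n Node) error { fmt.Println("undoing", n.Name); return nil },
	)
}
```

Note that running the undo path on every un-annotated node is what makes the operation declarative: removing the annotation is sufficient to roll the preparation back.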
Required Components:
For this feature to fully function and allow Kubernetes to reschedule the workload to a new node, the Persistent Volume (PV) needs to be updated. This is handled by the linstor-affinity-controller, which must be installed on the cluster. The controller updates the PV to enable the Pod to move to a new node, complementing the node draining and undo-draining functionalities introduced by this pull request.
Changes:
New Annotation: linstor.linbit.com/prepare-for-removal. When added to a node, it triggers the prepareNodeForDraining method. When removed, it will trigger the undoNodeDrainingPreparation method to reverse the changes.
prepareNodeForDraining method: prepares the node for draining by auto-placing additional replicas for all of its resources.
undoNodeDrainingPreparation method: reverses the changes made by prepareNodeForDraining.
Additional Information:
The lc.Resources.Autoplace method does not return this specific metadata. As per the Swagger documentation for the LINSTOR API, the obj_refs field could potentially contain the required information. To streamline this process, the golinstor library would need to be updated to accommodate these changes.
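For context, the REST call behind the golinstor Autoplace wrapper looks roughly like this. The endpoint path and body shape follow the LINSTOR OpenAPI (Swagger) spec, but treat this as an assumption-laden sketch, not a drop-in replacement for the library; the controller address and resource name are made up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// autoplaceRequest builds (but does not send) the HTTP request that asks the
// LINSTOR controller to auto-place replicas of a resource definition.
func autoplaceRequest(base, resource string, placeCount int) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"select_filter": map[string]any{"place_count": placeCount},
	})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/v1/resource-definitions/%s/autoplace", base, resource)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, _ := autoplaceRequest("http://linstor-controller:3370", "pvc-123", 2)
	fmt.Println(req.Method, req.URL.String())
}
```

The obj_refs mentioned above would appear in the ApiCallRc entries of the response to this call, which is why surfacing them would require a golinstor change rather than an operator-side one.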