Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introducing piraeus.io/evacuate Taint for Node Draining Support #530

Open
wants to merge 4 commits into
base: v2
Choose a base branch
from

Conversation

boedy
Copy link

@boedy boedy commented Sep 20, 2023

I'm pleased to share my first contribution, focused on enhancing node draining in Linstor/Piraeus clusters. Feedback welcome on areas that could be further improved / optimised or things I might have overlooked.

Note: This PR is not yet complete. While the core functionality is in place, documentation should be added once the initial review is done and any suggested changes are incorporated.

Summary:

This pull request introduces a new annotation, linstor.linbit.com/prepare-for-removal, for Kubernetes nodes. When a node is annotated with this label, the LinstorSatelliteReconciler prepares the node for draining by auto placing additional replicas for all its resources.

The new annotation serves a crucial role in scenarios where a resource does not already have replicas. By creating additional replicas, it ensures that data remains available even when the original node is offline and allows the workload to be migrated in cases where allowRemoteVolumeAccess: false is defined. This feature is also particularly useful as Piraeus currently lacks (please correct me if i'm wrong) an auto-healing functionality to maintain the replica count automatically in case a node is lost.

The annotation provides a simple and declarative way to prepare nodes for maintenance or decommissioning while enhancing data availability.

Required Components:

For this feature to fully function and allow Kubernetes to reschedule the workload to a new node, the Persistent Volume (PV) needs to be updated. This is handled by the linstor-affinity-controller, which must be installed on the cluster. The controller updates the PV to enable the Pod to move to a new node, complementing the node draining and undo-draining functionalities introduced by this pull request.

Changes:

  1. New Annotation:

    • linstor.linbit.com/prepare-for-removal: When added to a node, it triggers the prepareNodeForDraining method. When removed it will trigger the undoNodeDrainingPreparation method to reverse the changes.
  2. prepareNodeForDraining Method:

    • Fetches the list of resources on the annotated node.
    • Autoplaces an additional replica for each resource.
    • Updates node properties to mark it for removal and disable new resource placements.
  3. undoNodeDrainingPreparation Method:

    • Counterpart to prepareNodeForDraining.
    • Deletes either the newest replica or the one on the annotated node, depending on which resource is currently in use.
    • Removes the node properties set during the preparation for draining.

Additional Information:

  • The current implementation necessitates an extra API call and logic to ascertain the location of the newly-created replica. This is because the lc.Resources.Autoplace method does not return this specific metadata. As per the Swagger documentation for the Linstor API, the obj_refs field could potentially contain the required information. To streamline this process, the Golinstor library would need to be updated to accommodate these changes.

@boedy boedy force-pushed the feat-safe-drain branch 3 times, most recently from 516d51a to 9f16929 Compare September 20, 2023 07:47
Copy link
Member

@WanzenBug WanzenBug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I'll need to think about it some more, I think some of the logic around "draining" could be simplified.

@WanzenBug WanzenBug added enhancement New feature or request v2 This affects only Operator v2 labels Sep 25, 2023
@WanzenBug
Copy link
Member

Having thought some more about it: Would it make sense to use a custom taint instead of a node label? So piraeus.io/node-removal: NoSchedule? That way we would also automatically prevent kubernetes from trying to create new resources on the node.

Other things I noticed: I think the Aux/PrepareForRemoval property can be removed. With each reconcile attempt, we should check that a replacement resource is ready, by checking for a matching Aux/CopiedOverFromNode resource.
We would then also need to wait for the resource to actually fully sync up before reporting success.

Then finally, we need to remove the Aux/CopiedOverFromNode once the original node is gone, so we don't accidentally remove the resource at a later point because of it.

@boedy
Copy link
Author

boedy commented Oct 13, 2023

Hey @WanzenBug,

Thanks for your feedback. I'm on board with the idea of using a taint over an annotation. It does seem to fit the context better, especially considering the eventual node deletion.

Reflecting on it, I think you're spot on about relying exclusively on the Aux/CopiedOverFromNode property. I had initial reservations about potential race conditions, but given the synchronous nature of the reconciliation process, those concerns seem unfounded.

Lastly, taking your suggestions into consideration, I've mapped out a path for the cleanup once everything's in sync and the original node's been decommissioned. I'll try to make some time in the upcoming days and get these changes integrated into the PR.

boedy added 2 commits October 28, 2023 18:08
…annotation which triggers the operator to autoplace an additional replica to maintain data availability when the node is taken down.

Signed-off-by: boedy <[email protected]>
@boedy boedy changed the title Introducing linstor.linbit.com/prepare-for-removal Annotation for Node Draining Support Introducing piraeus.io/evacuate Taint for Node Draining Support Oct 28, 2023
@boedy
Copy link
Author

boedy commented Oct 28, 2023

Hey @WanzenBug,

I've just pushed and update with my latest changes, taking into account the feedback you've provided. I must admit, there were a few tricky edge cases that really made me scratch my head, particularly around handling nodes that housed resources from another node which had previously been tainted. After some iterations, I'm confident in the solution I've landed on.

To provide a high-level overview of the process in such scenarios:

  1. Consider a three node cluster with nodes A, B & C
  2. Node A is tainted, starting the evacuation process of node A
  3. All resources on Node A are individually copied to either Node B or Node C
  4. Once all resources from Node A are successfully synchronised to the other nodes, Node A is marked with a 'success' state, indicating it's ready to be drained / deleted
  5. At any point after Node A, Node B is also tainted
  6. All resources from Node B are copied to Node C. This includes any resources that were originally on Node A and got replicated to Node B in the earlier step.
  7. As soon as a resource (originally from Node A) is fully synchronised to Node C, the resource on Node B is automatically deleted. Additionally, the reference, Aux/EvacuatedFromNode, is updated on Node C to indicate the original source of the resource (Node A in this case).

One thing to note: the current approach does result in a higher number of API calls during each reconciliation cycle, which is something you might want to revisit and optimise down the line.

Let me know what you think.

Copy link
Member

@WanzenBug WanzenBug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first glace looks good. I will need (again) need some time to think about all the corner cases 😄

- findMatchingEvacuatedResource should only consider Diskfull resources
- Nil pointer check: can be nil when node is offline
@boedy
Copy link
Author

boedy commented Oct 31, 2023

I just pushed some bugfixes which I ran into and overlooked at first glance.

One think that isn't yet handled gracefully is the actual deletion of the node / satellite. After I delete the k8s node. The operator spams the following error on every reconciling loop:

ERROR Reconciler error {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "bb2b3d68-4b98-4a24-a715-0b46b1b3fa68", "error": "Message: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 preparing for deletion.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93 UUID is: 6a1d14e7-658d-4a89-9d3d-b4810f0d40cb' next error: Message: 'No connection to satellite 'h-fsn-ded3'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded2'' next error: Message: 'Preparing deletion of resource on 'h-fsn-ded5'' next error: Message: 'Deletion of resource 'pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93' on node 'h-fsn-ded3' failed due to an unknown exception.'; Details: 'Node: h-fsn-ded3, Resource: pvc-072225f9-b5ca-40d2-9ca2-2adbc2d04d93'; Reports: '[653FE239-00000-000018]'"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/Users/boedy/projects/go/me/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2023-10-30T18:19:25+01:00 INFO Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler {"controller": "linstorsatellite", "controllerGroup": "piraeus.io", "controllerKind": "LinstorSatellite", "LinstorSatellite": {"name":"h-fsn-ded3"}, "namespace": "", "name": "h-fsn-ded3", "reconcileID": "201d80fc-f00b-4a8d-bf52-6e77d07f4c20"}

When draining the node the satellite pod also gets evicted. As the node is tainted it won't get re-spawned, in turn not allowing the operator to gracefully remove the satellite. How should this be resolved?

Lastly as AutoplaceTarget is always reset in the undo method. Evacuating multiple nodes isn't a pleasant experience.

@WanzenBug
Copy link
Member

I think the satellite needs to tolerate it's own taint.

@WanzenBug WanzenBug mentioned this pull request Jan 31, 2024
@WanzenBug
Copy link
Member

Ok, sorry for the long delay. I played around with this PR a bit more, but I wasn't really happy with the general workflow.

There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

I do plan to revisit this PR when that feature is available, and also work on making the behaviour configurable. Some users had trouble with the automatic evacuation when removing a node from k8s, so we should take a look at these processes in general. I envision two new settings

  • nodeDrainPolicy, triggered when unschedulable is set on the node.
    • Ignore: just ignore it, satellite stays online
    • NoSchedule: do not place new resources, but keep existing
    • Evacuate: trigger node evacuation (what this PR tries to do)
  • nodeDeletePolicy, triggered when the matching k8s node gets deleted
    • Orphan: just delete the satellite resource, but leave the registered node in LINSTOR alone
    • Evict: try to evacuate/evict the node, placing the resources on other nodes.

@boedy
Copy link
Author

boedy commented Apr 1, 2024

Hey @WanzenBug, Thanks for the update. I think the two settings make sense.

There is some improvement coming on the LINSTOR front, namely LINSTOR will no longer remove the resource on the evacuating node while it is still in use. What that means is that we can simply use Evacuate() instead of having to move volumes around manually. At least once the new feature is released.

Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

@WanzenBug
Copy link
Member

Regarding above: How would volumes be handled which only have 1 replica? Would they still be migrated?

Yes. When the evacuation is triggered, such a resource would have 2 replicas:

  • on the node to evacuate
  • on the replacement node
    As long as the resource is Primary on the node to evacuate, it will not be deleted.

In the meantime the affinity controller should have updated the PV, k8s should have evicted the Pod, so the resource can become secondary, at which point it is removed by LINSTOR and you are again left with just one replica.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request v2 This affects only Operator v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants