[Draft] Improve CSI Snapshotting Performance #6860
base: main
Conversation
Signed-off-by: Anshul Ahuja <[email protected]>
Force-pushed from 9d2a551 to 8c3ddd7
Codecov Report
@@ Coverage Diff @@
## main #6860 +/- ##
==========================================
+ Coverage 60.19% 61.02% +0.83%
==========================================
Files 242 255 +13
Lines 25670 27066 +1396
==========================================
+ Hits 15451 16516 +1065
- Misses 9143 9368 +225
- Partials 1076 1182 +106
design/improve-csi-snaphot-perf.md
## Implementation
## For Approach 2
- Current code flow `backupItem` in backup.go is invoked for each resource -> this further invokes `itembackupper.backupItem` -> `backupItemInternal`
Before looking at the code, first read through the implementation section.
design/improve-csi-snaphot-perf.md
### Approach 1: Add support for VolumeGroupSnapshot in Velero.
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots of multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
The current code sample is based on this.
It is based on the principles of Approach 3, and Approach 1 can be further implemented on top of the current setup.
itemFiles = append(itemFiles, additionalItemFiles...)
}
wg.Wait()
close(additionalItemFilesChannel)
I'll refine the channel handling / code accuracy later; for now, consider this only as a representation of the draft idea for the approach.
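For illustration, here is a minimal, self-contained sketch of the pattern the draft snippet above is gesturing at. The helper name `backupAdditionalItem`, the channel buffering, and the return shape are assumptions for the sketch, not the actual Velero code:

```go
package main

import (
	"fmt"
	"sync"
)

// backupAdditionalItem is a stand-in for the real per-item backup call
// (itembackupper.backupItem in Velero); here it just fabricates a file name.
func backupAdditionalItem(item string) (string, error) {
	return fmt.Sprintf("resources/%s.json", item), nil
}

func backupAdditionalItems(items []string) ([]string, []error) {
	var wg sync.WaitGroup
	// Buffer the channels so no goroutine blocks after the work is done.
	filesCh := make(chan string, len(items))
	errCh := make(chan error, len(items))

	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			f, err := backupAdditionalItem(item)
			if err != nil {
				errCh <- fmt.Errorf("backing up additional item %q: %w", item, err)
				return
			}
			filesCh <- f
		}(item)
	}

	// Close the channels only after every worker has finished, then drain them.
	wg.Wait()
	close(filesCh)
	close(errCh)

	var files []string
	for f := range filesCh {
		files = append(files, f)
	}
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return files, errs
}

func main() {
	files, errs := backupAdditionalItems([]string{"pvc-a", "pvc-b", "pvc-c"})
	fmt.Println(files, errs)
}
```

The point is only that the channels are drained after `wg.Wait()` and `close()`; the real change would live in `backupItemInternal`.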
I actually think Approach 3 may be the cleanest. It doesn't require adding further special case code to the backup workflow. Create a new Pod action plugin in the CSI plugin which essentially does the work of the CSI PVC plugin for each PVC mounted by the pod (those for which the CSI plugin would match, not the FS backup volumes). Refactor the PVC plugin so that common functionality needed by this and pod plugin is shared, and before returning each PVC as an additional item, set an annotation on the PVC that the PVC CSI plugin will use to know it should ignore that PVC and not snapshot it.
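A rough sketch of what such a Pod plugin's Execute path could look like. The types, the annotation key, and the helper names here are all hypothetical simplifications, not the real Velero plugin API:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for Velero's plugin types; the real
// BackupItemAction interface lives in the Velero plugin framework.
type ResourceIdentifier struct {
	GroupResource string
	Namespace     string
	Name          string
}

type Pod struct {
	Namespace string
	PVCNames  []string
}

// skipAnnotation is an assumed marker the PVC plugin would check so it does
// not snapshot a PVC that the pod plugin has already handled.
const skipAnnotation = "velero.io/csi-snapshot-handled-by-pod-plugin"

// executePodAction sketches what a Pod BIA could do: run the shared snapshot
// logic for the pod's CSI PVCs as a group, annotate them, and return them as
// additional items so the PVC/PV resources themselves still get backed up.
func executePodAction(pod Pod, annotations map[string]map[string]string) []ResourceIdentifier {
	var additional []ResourceIdentifier
	for _, pvc := range pod.PVCNames {
		// Shared snapshot functionality (refactored out of the PVC plugin)
		// would be invoked here, in parallel per PVC.
		if annotations[pvc] == nil {
			annotations[pvc] = map[string]string{}
		}
		annotations[pvc][skipAnnotation] = "true"
		additional = append(additional, ResourceIdentifier{
			GroupResource: "persistentvolumeclaims",
			Namespace:     pod.Namespace,
			Name:          pvc,
		})
	}
	return additional
}

func main() {
	ann := map[string]map[string]string{}
	items := executePodAction(Pod{Namespace: "app", PVCNames: []string{"data-0", "data-1"}}, ann)
	fmt.Println(items, ann)
}
```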
design/improve-csi-snaphot-perf.md
## Approach 3: Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- This would lead to code and logic duplication across CSI Plugin and the pod plugin.
We could mitigate the duplication by pulling shared functionality into functions called by both plugins.
Will try to experiment with this route as well. I probably need to see in code how it fits together.
Signed-off-by: Anshul Ahuja <[email protected]>
for itemFilesFromChannel := range additionalItemFilesChannel {
	itemFiles = append(itemFiles, itemFilesFromChannel)
}
for err := range errChannel {
I think this is going to eat some errors
@shawn-hurley ahh yes, we were talking about this on slack earlier. Before this change, velero backs up each additionalItem in turn, and upon the first error, it's returned as an error, and the others aren't attempted. Now that we're doing them in parallel, all will start, so it's possible that more than one will error out.
Since the failing additionalItem should log an error for its own failure, the full error list shouldn't have anything missing. That being said, rather than just returning the error for the additional item, we probably want a more descriptive error here anyway, since `err` here doesn't reference the current item at all. Perhaps logic along the lines of this: if errChannel isn't empty, return an error with the message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel)".
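A small sketch of that error-joining idea (function and variable names are illustrative, not the actual PR code):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// combineAdditionalItemErrors assumes errCh has already been closed (after
// wg.Wait()); if anything failed, it returns one error that names the current
// item and joins the individual messages, as proposed in the review.
func combineAdditionalItemErrors(currentItem string, errCh <-chan error) error {
	var msgs []string
	for err := range errCh {
		msgs = append(msgs, err.Error())
	}
	if len(msgs) == 0 {
		return nil
	}
	return fmt.Errorf("one or more additional items for %s failed: %s",
		currentItem, strings.Join(msgs, "; "))
}

func main() {
	errCh := make(chan error, 2)
	errCh <- errors.New("snapshot of pvc data-0 timed out")
	errCh <- errors.New("snapshot of pvc data-1 timed out")
	close(errCh)
	fmt.Println(combineAdditionalItemErrors("app/my-pod", errCh))
}
```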
"Perhaps logic along the lines of this? If errChannel isn't empty return an error with message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel)."
I can take care of that.
If that's enough to address this concern
@anshulahuja98 That addresses the concern on my end. I think that should make sure that no errors are swallowed here. Net effect is if a pod has 2 PVCs and both PVC backups fail, then each PVC's backup error should show up as a PVC error, and then the pod will fail with the combined error message.
I don't know if it will be reasonable to put them in one message; with something like 5-10 errors, the message will become unwieldy in the logs.
Can we create a single error, like you said, and then log every other error?
Or is that what was proposed and I missed it?
A goroutine could be pulling things off the error channel as the backup runs, so you can see the log entries as items fail.
Then the returned error just reflects whether the error-handling goroutine was used.
Thoughts?
that works too. Makes sense to me
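A sketch of the drain-as-you-go variant suggested above: a goroutine logs each additional-item failure as it arrives and only a single summary error is returned (names and the simulated failures are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

func main() {
	errCh := make(chan error)

	// Drainer goroutine: logs each failure as it happens and remembers
	// whether anything failed at all.
	var drainWG sync.WaitGroup
	failed := false
	drainWG.Add(1)
	go func() {
		defer drainWG.Done()
		for err := range errCh {
			failed = true
			fmt.Println("additional item failed:", err) // real code would use the backup logger
		}
	}()

	// Workers backing up additional items in parallel.
	var wg sync.WaitGroup
	for _, pvc := range []string{"data-0", "data-1"} {
		wg.Add(1)
		go func(pvc string) {
			defer wg.Done()
			errCh <- errors.New("snapshot of " + pvc + " timed out") // simulated failure
		}(pvc)
	}
	wg.Wait()
	close(errCh)
	drainWG.Wait()

	if failed {
		fmt.Println("returning a single summary error for the current item")
	}
}
```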
- Pass a list of PVCs to CSI Plugin for Approach 1?

## How to Group Snapshots
- PVCs which are being used by a pod/deployment should be snapshotted together.
Practically, any set of PVCs could be correlated and so need to be grouped, not only PVCs from the same pod/deployment.
More accurately, we could say PVCs from one application may need to be snapshotted together, because the group snapshot is really about achieving data consistency across the application, similar to pre/post hooks. However, there is no real entity for an application in Kubernetes, so in order to achieve group snapshot support, more details need to be considered.
Moreover, we also need to consider how to support data movement of grouped snapshots.
So I suggest we either work through more details and create a sophisticated solution for group snapshots, or we drop the topic in this PR and leave it to a separate PR in the future.
I'm not personally opposed to waiting on the grouping of PVCs, but I did want to ask one question.
If the snapshots are taken in a consistent way, do we need to change the data movement? I assume that process could happen at slightly different times, because the data itself should already be consistent and the movement of the bytes shouldn't affect consistency, or am I missing something?
The data movement doesn't impact the consistency.
On the other hand, group snapshots do impact the data mover's snapshot exposure and manipulation; for example, the snapshot-associated objects are different (VolumeGroupSnapshot and VolumeGroupSnapshotContent), so without any changes the data movement cannot support them.
Oh I agree, once those APIs go GA, or rather before they go GA or beta, we should support them.
I guess I am confused about how this impacts the current idea.
The way I see this design, and I could be way off, is that when we have no other information about the system besides that a pod is using multiple PVCs and is using CSI, we should trigger the CSI backups in a way that leads to consistent data, as well as speeds up the backup time.
Is it not a good goal, given the above (ignoring all the work around groups and such), to do this incremental thing to help in multiple ways? @anshulahuja98 is the above understanding what you are going for?
Yes, correct @shawn-hurley.
We can decouple VolumeGroupSnapshot from this design since it's not even beta yet.
See these comments: to me, the current change won't help improve consistency or performance in the NORMAL case. At present, with the help of BIAv2, there are two phases to CSI snapshot creation:
- Create the snapshot and wait until the snapshot handle appears. This phase runs in sequence across PVCs. If we dig into the process of snapshot creation on the storage side, this should be very fast in the NORMAL case.
- Wait for the snapshot to become ReadyToUse. For this, data is going to be moved on the storage side, so it takes time. This phase runs in parallel.
Now, let's look at the differences with and without the current changes regarding consistency and performance improvement in the NORMAL case:
- For consistency: I think the idea is to have the PVCs in one pod snapshotted as close together as possible. Because phase 1 is very fast and phase 2 is asynchronous, PVCs from one pod are actually already snapshotted very close together; only simple resource backups happen between them. If we simply make the backup workflow async as in the current changes, I don't know how much more difference it makes. Moreover, don't forget the capability of the CSI driver; both the CSI driver and the storage may have limits on how many snapshots can run together.
- For performance: Since phase 2, the most time-consuming part, is already async, I don't know how much performance improvement there is in making the main backup workflow async once more.
You may argue that the statement that phase 1 is very fast is not always true for all storage; I agree, but I would rather regard that as a flaw of the storage itself, because technically this can be very fast.
Anyway, we at least need to do some tests to prove how much improvement there is in consistency and performance in various environments, and then we can come back to consider the current changes.
Forgive my caution on these changes: because they make the primary workflow very different, many unexpected problems could arise. I have listed some, but I cannot foresee them all. So if we don't know the benefits for sure, I don't think it is a good bargain.
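For reference, a rough sketch of the two phases described above, with a hypothetical `getSnapshotStatus` helper standing in for the real CSI snapshot client calls (timeouts and field names are simplified assumptions):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// snapshotStatus is a simplified stand-in for VolumeSnapshot status fields.
type snapshotStatus struct {
	SnapshotHandle string
	ReadyToUse     bool
}

// getSnapshotStatus is hypothetical; real code would read the VolumeSnapshot /
// VolumeSnapshotContent objects via the CSI snapshot client.
func getSnapshotStatus(name string) snapshotStatus {
	return snapshotStatus{SnapshotHandle: "snap-123", ReadyToUse: true}
}

// Phase 1 (runs in sequence per PVC): wait only until the snapshot handle exists.
func waitForHandle(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if getSnapshotStatus(name).SnapshotHandle != "" {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return errors.New("timed out waiting for snapshot handle of " + name)
}

// Phase 2 (asynchronous, via BIAv2 operations): wait for ReadyToUse.
func waitForReadyToUse(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if getSnapshotStatus(name).ReadyToUse {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return errors.New("timed out waiting for snapshot " + name + " to be ReadyToUse")
}

func main() {
	fmt.Println(waitForHandle("vs-data-0", time.Minute), waitForReadyToUse("vs-data-0", time.Hour))
}
```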
My understanding is that aws-ebs, as well as Azure disks, can take a long time to snapshot. I may be recalling wrong.
Can someone verify my recall? I can't find any documentation.
The speed of snapshot-taking is not only related to the storage provider.
It's been reported that snapshot handle creation for large GCP persistent disks can take more than 10 minutes, but IMO those are not common cases for most storage providers.
design/improve-csi-snaphot-perf.md
- Invoke current CSI plugin's Execute() in parallel for a group of PVCs.
- Perf implications of parallel calls?

## Approach 3: Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
For the same consideration of data consistency across the application, we cannot assume that the pre/post plugin is on a pod basis. Since we are changing the workflow dramatically, we need to target the ultimate solution instead of basing it on the current situation.
I don't think we even have a tracking issue for application-level pre/post hooks.
If that's not even on the near-term roadmap, I would suggest we base our assumptions on the current flow and optimize it.
If the app-consistent option comes in the future, we can try to accommodate it in a similar way. I am not sure how this blocks the current PR in any way.
<!-- ## Background -->

## Goals
- Reduce the time to take CSI snapshots.
I agree that there are cases where the call ends up in an unexpected wait, but I also think they are not entirely Velero's fault; they are mostly due to the specific CSI drivers. Otherwise, the drivers should detect these cases and fail earlier, and then Velero would fail earlier.
Therefore, for this single goal, I don't think a dramatic workflow change in Velero is a good bargain.
We can special-case the code to only work for CSI snapshots.
Further, we will make it configurable with a default parallelism of 1. This won't lead to any drastic workflow change.
If users have CSI drivers which are more performant, they can increase the parallel count.
In yesterday's community meeting we even discussed that, based on how this plays out in the real world, we can apply the same parallel approach to other parts of the backup/restore flow to enhance Velero's performance.
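A sketch of the configurable-parallelism idea using a buffered-channel semaphore (the function names are placeholders, not the PR's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// snapshotPVC is a placeholder for invoking the CSI plugin on one PVC.
func snapshotPVC(pvc string) { fmt.Println("snapshotting", pvc) }

// snapshotInParallel limits concurrency with a buffered-channel semaphore.
// With parallelism = 1 (the proposed default) the behaviour is identical to
// the current sequential flow.
func snapshotInParallel(pvcs []string, parallelism int) {
	if parallelism < 1 {
		parallelism = 1
	}
	sem := make(chan struct{}, parallelism)
	var wg sync.WaitGroup
	for _, pvc := range pvcs {
		wg.Add(1)
		sem <- struct{}{} // blocks once `parallelism` workers are in flight
		go func(pvc string) {
			defer wg.Done()
			defer func() { <-sem }()
			snapshotPVC(pvc)
		}(pvc)
	}
	wg.Wait()
}

func main() {
	snapshotInParallel([]string{"data-0", "data-1", "data-2"}, 1) // default: effectively sequential
	snapshotInParallel([]string{"data-0", "data-1", "data-2"}, 3) // opt-in parallel
}
```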
## Goals
- Reduce the time to take CSI snapshots.
- Ensure current behaviour of pre and post hooks in context of PVCs attached to a pod.
Pre/post hooks exist to achieve better data consistency. But ultimately speaking, consistency is not on a pod basis but on an application basis.
Therefore, the current solution of having hooks on a pod basis is not the ultimate solution.
Currently, the Velero code works that way, at a pod level.
I don't see any tracking issue to introduce app-level consistency. If that's not even on the roadmap, I'd suggest decoupling it from this perf enhancement.
design/improve-csi-snaphot-perf.md
### Approach 1: Add support for VolumeGroupSnapshot in Velero.
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots of multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
Some details I can come up with, and I think there may be more:
- There may be combinations of different backup types. For example, some volumes may be backed up by fs-backup and others by CSI snapshots, and we don't even know which type a volume goes to before the workflow for that backup type is actually launched (considering the opt-in/out options for fs-backup).
- Not only pods result in PVC backups; other resources may as well, for example, VirtualMachine resources for KubeVirt or other k8s-managed VM solutions. So I am afraid that, in order to get all PVCs, a PodAction plugin is not enough.
The aim is not to cover all PVCs; the scope of the current PR is to optimize for CSI snapshots.
We can look into other scenarios based on perf requirements.
@anshulahuja98 Let me reply here centrally, which answers your above comments. Actually, I am somewhat confused about what the current PR is going to achieve: 2. Do we want to solve the performance problem in the unexpected cases? If this goal is the primary thing we focus on, I think we may have other ways with smaller impacts; we can discuss. 3. Do we want to introduce a new way to group PVCs on a pod basis?
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots of multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.

### Approach 2: Invoke CSI Plugin in parallel for a group of PVCs.
- Invoke current CSI plugin's Execute() in parallel for a group of PVCs.
I think in practice the idea is to group CSI PVCs in some way (for example, group the PVCs per pod or namespace, with a maximum of n entries per group), and we back up the PVCs in each group in parallel, regardless of whether it goes through the CSI plugin or not.
At the code level, the backupper may evolve to support `item_groups`, and it will call `backupItem` in parallel for the entries in one group; will that work? This concept of `item_groups` may help us improve the parallelism within one backup in the future.
If you check the code changes in the current PR, it kind of solves for `item_group` by invoking `backupItem` in parallel for additionalItems.
In the future this can be extended to other resources in a similar way, where we write a BIA to group items and then back them up in parallel.
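A rough sketch of how the `item_groups` idea could look (the types and grouping names here are hypothetical, not existing Velero code):

```go
package main

import (
	"fmt"
	"sync"
)

// itemGroup is a hypothetical grouping the backupper could produce, e.g. the
// CSI PVCs of one pod or namespace, capped at n entries per group.
type itemGroup struct {
	Name  string
	Items []string
}

func backupItem(item string) { fmt.Println("backed up", item) }

// backupGroups backs up groups one after another, but the entries inside a
// group in parallel -- the "item_groups" idea from the review discussion.
func backupGroups(groups []itemGroup) {
	for _, g := range groups {
		var wg sync.WaitGroup
		for _, item := range g.Items {
			wg.Add(1)
			go func(item string) {
				defer wg.Done()
				backupItem(item)
			}(item)
		}
		wg.Wait() // finish the whole group before moving to the next one
	}
}

func main() {
	backupGroups([]itemGroup{
		{Name: "pod/app-0", Items: []string{"pvc/data-0", "pvc/logs-0"}},
		{Name: "pod/app-1", Items: []string{"pvc/data-1"}},
	})
}
```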
Signed-off-by: Anshul Ahuja <[email protected]>
@anshulahuja98 Will this speed up all plugin calls, or just the ones for pods? Is there a way to run all the plugins for a given resource X async? Considering that plugins can add significant time to the backup of a particular resource, I wonder if this could help many folks who have third-party plugins?
@shawn-hurley The problem with running plugin actions async is that the primary action of item action plugins is to modify the content of the resource that's being backed up or restored. So for the backup case, if an item has 3 plugins that apply to it, the Execute calls of the 3 plugins are called sequentially, each getting the resource as modified by the previous plugin. After this, the final modified resource content is saved in the backup. Note that this design doesn't call for running any plugin Execute calls asynchronously (even for pods) -- what happens asynchronously is the backup of any additional items returned by plugin calls (which represent other items that must be backed up right now, before continuing with the other items). If a plugin returns 5 items that all need to be backed up now, with this design, they will all be backed up in parallel, with the current item's backup func waiting until they're all done. For each of these in-parallel items being backed up, their own plugins will be called sequentially, just as with any other item. Also, this is not necessarily only going to happen for pods -- any resource with a plugin that returns a list of additional items will have those backed up in parallel before continuing -- although at the moment, I think pods may be the only type for which a plugin is likely to return multiple items -- PVCs in this case.
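To make that ordering concrete, a self-contained sketch with simplified stand-in types (not Velero's real plugin interfaces): plugins run sequentially over the current item, and the additional items they return are then backed up in parallel before the current item completes.

```go
package main

import (
	"fmt"
	"sync"
)

type item struct {
	Name   string
	Labels map[string]string
}

// plugin is a simplified BackupItemAction: it may mutate the item and return
// additional items that must be backed up before the current item completes.
type plugin func(item) (item, []string)

func backupOneItem(it item, plugins []plugin, backupAdditional func(string)) item {
	var additional []string
	for _, p := range plugins {
		var more []string
		it, more = p(it) // sequential: plugin N sees plugin N-1's changes
		additional = append(additional, more...)
	}

	var wg sync.WaitGroup
	for _, a := range additional {
		wg.Add(1)
		go func(a string) {
			defer wg.Done()
			backupAdditional(a) // each additional item runs its own plugins sequentially again
		}(a)
	}
	wg.Wait() // current item's backup waits for all additional items
	return it
}

func main() {
	addLabel := func(k, v string) plugin {
		return func(it item) (item, []string) { it.Labels[k] = v; return it, nil }
	}
	returnPVCs := func(it item) (item, []string) { return it, []string{"pvc/data-0", "pvc/data-1"} }

	out := backupOneItem(
		item{Name: "pod/app-0", Labels: map[string]string{}},
		[]plugin{addLabel("a", "1"), returnPVCs, addLabel("b", "2")},
		func(a string) { fmt.Println("backed up additional item", a) },
	)
	fmt.Println(out)
}
```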
@shawn-hurley Note that a plugin can have an asynchronous component to its action -- that's where the backup item operations fit in (and are used for CSI snapshotting), but the plugin itself doesn't launch a direct asynchronous action -- it simply starts an action that might take a while and runs in the background under the control of something else (e.g. create VolumeSnapshot -- CSI infrastructure then does its thing), and the plugin returns with the operation ID. The only other plugin involvement from that point forward is that Velero will later call the plugin's Progress method (passing in the operation ID), so that the plugin can tell velero that the operation is complete, or it's not yet done.
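A simplified sketch of that Execute / operation-ID / Progress flow (the interfaces and types here are stand-ins for illustration, not Velero's real BIAv2 types):

```go
package main

import "fmt"

// OperationProgress is a simplified stand-in for the progress report Velero
// receives when it polls an asynchronous backup item operation.
type OperationProgress struct {
	Completed   bool
	Description string
}

type asyncBIA interface {
	// Execute starts a long-running action (e.g. creating a VolumeSnapshot)
	// and returns an operation ID; it does not wait for completion.
	Execute(itemName string) (operationID string, err error)
	// Progress is polled later by Velero with that operation ID.
	Progress(operationID string) (OperationProgress, error)
}

// fakeCSIAction pretends the external CSI machinery finishes after two polls.
type fakeCSIAction struct{ polls map[string]int }

func (a *fakeCSIAction) Execute(itemName string) (string, error) {
	id := "op-" + itemName
	a.polls[id] = 0
	return id, nil // snapshot creation kicked off, control returned immediately
}

func (a *fakeCSIAction) Progress(operationID string) (OperationProgress, error) {
	a.polls[operationID]++
	return OperationProgress{
		Completed:   a.polls[operationID] >= 2,
		Description: fmt.Sprintf("poll #%d", a.polls[operationID]),
	}, nil
}

func main() {
	var bia asyncBIA = &fakeCSIAction{polls: map[string]int{}}
	id, _ := bia.Execute("pvc/data-0")
	for {
		p, _ := bia.Progress(id)
		fmt.Println(p.Description)
		if p.Completed {
			break
		}
	}
}
```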
When, say, three plugins are run sequentially, are they always in the same order? What determines that order?
@shawn-hurley The order is always the same with a given set of plugins, but it's best not to rely on it since when one plugin is written, you're never sure what other plugins will be added later. But the order is as follows. First, velero gets a list of binaries that include plugins -- first is "velero" -- to capture item actions included in velero itself, then velero looks in the plugin dir in the pod and grabs the plugin binaries, in filename order (so for example, in my current cluster:
Which means internal plugin actions are returned first, then aws plugin actions, then CSI, then openshift velero plugin actions. AWS plugin doesn't contain BIAs, so this means internal actions, then CSI, then openshift (but if the binary we built was called "openshift-velero-plugin", as it probably should be, then those actions would run before CSI). But since we don't have any interdependencies between CSI actions and openshift actions, that won't matter in practice. Then, for each plugin binary, a list of registered plugin actions is returned, sorted by the name each was registered under. In other words, if you're a plugin author, you can control the order in which plugins within your own binary are called, and you know they'll be called after any internal actions, but it's best not to make assumptions about whether there are other plugins registered and, if so, when their actions run.
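A toy illustration of that ordering; the binary and registered action names below are examples only, not the actual registered names:

```go
package main

import (
	"fmt"
	"sort"
)

// Rough illustration of the ordering described above: "velero" (internal
// actions) first, then plugin binaries by filename, then the registered
// actions inside each binary by registration name.
func main() {
	binaries := []string{"velero-plugin-for-csi", "velero-plugin-for-aws", "openshift-velero-plugin"}
	sort.Strings(binaries)
	ordered := append([]string{"velero"}, binaries...)

	registered := map[string][]string{
		"velero":                  {"velero.io/pod-example", "velero.io/pv-example"},
		"velero-plugin-for-csi":   {"velero.io/csi-pvc-example"},
		"velero-plugin-for-aws":   {},
		"openshift-velero-plugin": {"openshift.io/01-common-example"},
	}

	for _, bin := range ordered {
		actions := registered[bin]
		sort.Strings(actions) // within a binary, actions run in registration-name order
		for _, a := range actions {
			fmt.Println(bin, "->", a)
		}
	}
}
```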
Another possible solution is integrating the CSI plugin into the Velero code base; then we don't need to consider how to make the plugins work in parallel.
I think I discussed this with @sseago, but that would require a lot more effort since the current CSI data mover leverages the plugin format of returning additional items, etc. The item itself would be more expensive to light up, given we'd need to duplicate a bunch of plugin logic in core code.
Indeed, integrating the CSI plugin requires quite some effort, but it is not just duplicating the CSI plugin code into the Velero server code base. If we choose to go this way, there will be no CSI plugin in future releases.
Checked whether it's possible to skip the waiting on some specific errors. From the error-handling perspective, I think the vendor-provided CSI driver should mark the error as final when it's not reasonable to retry the snapshot creation.
We have actually done this exact workaround in our downstream consumption. Again, to call out: this is still a workaround, and scaling it to multiple vendors might be tricky.
@anshulahuja98
Fixes #6165
/kind changelog-not-required