[Draft] Improve CSI Snapshotting Performance #6860

Draft
wants to merge 7 commits into base: main
68 changes: 68 additions & 0 deletions design/improve-csi-snaphot-perf.md
@@ -0,0 +1,68 @@
# Proposal to improve performance of CSI snapshotting through velero.

- [Proposal to improve performance of CSI snapshotting through velero.]

## Abstract
Currently Velero uses the CSI plugin for taking CSI snapshots. The CSI plugin is modeled as a BackupItemAction (BIA): whenever the Velero code encounters a PVC, it invokes the CSI plugin's PVCAction BIA. In the plugin's Execute() phase, the CSI plugin waits a default of 10 minutes for snapshotting to complete. This is a blocking call, and the Velero code waits for the snapshotting to complete before proceeding to the next resource. In case of failures (due to permissions, etc.), Velero can end up waiting 10*N minutes. This tracking cannot be made async, since we need to ensure the appearance of the snapshotHandle on the VolumeSnapshotContent before proceeding: pre-hooks run on pods first, then PVCs are snapshotted, and then post-hooks. Waiting for the actual snapshot is key to ensuring that the post-hooks are not executed before the snapshotting is complete.
Further, the core Velero code waits up to 10 minutes for the CSI snapshot to reach readyToUse=true.
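For illustration, a minimal, self-contained sketch of the kind of blocking poll described above (not the actual plugin code; `getSnapshotHandle` is a hypothetical stand-in for reading the VolumeSnapshotContent status via the Kubernetes API):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getSnapshotHandle is a placeholder for reading status.snapshotHandle from a
// VolumeSnapshotContent; the real plugin queries the Kubernetes API for this.
func getSnapshotHandle(ctx context.Context, vscName string) (string, error) {
	return "", nil // API call elided in this sketch
}

// waitForSnapshotHandle blocks until the snapshotHandle appears or the timeout
// (10 minutes by default in the CSI plugin) expires. While this loop runs,
// nothing else in the backup proceeds, which is the behaviour this proposal
// wants to parallelize.
func waitForSnapshotHandle(ctx context.Context, vscName string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		handle, err := getSnapshotHandle(ctx, vscName)
		if err == nil && handle != "" {
			return handle, nil
		}
		select {
		case <-ctx.Done():
			return "", errors.New("timed out waiting for snapshotHandle on " + vscName)
		case <-ticker.C:
		}
	}
}

func main() {
	// Short timeout here just so the example terminates quickly.
	if _, err := waitForSnapshotHandle(context.Background(), "example-vsc", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}
```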

<!-- ## Background -->

## Goals
- Reduce the time to take CSI snapshots.
Contributor
@Lyndon-Li Lyndon-Li Oct 19, 2023

I agree there are cases that leave the call in an unexpected wait, but I also think they are not entirely Velero's fault; they are mostly down to the specific CSI drivers. Otherwise the drivers should detect those cases and fail earlier, and then Velero would fail earlier too.
Therefore, for this single goal, I don't think a dramatic workflow change in Velero is a good bargain.

Collaborator Author

We can special-case the code to only apply to CSI snapshots, and further make it configurable with a default parallelism of 1. This won't lead to any drastic workflow change. If users have CSI drivers that are more performant, they can increase the parallel count.

In yesterday's community meeting we even discussed that, based on how this plays out in the real world, we can apply this parallel approach to other parts of the backup/restore flow to enhance Velero's performance.

- Ensure the current behaviour of pre and post hooks in the context of PVCs attached to a pod.
Contributor

Pre/post hooks exist to achieve better data consistency. But ultimately speaking, consistency is not on a pod basis but on an application basis.
Therefore, the current solution of having hooks on a pod basis is not the ultimate solution.

Collaborator Author

Currently the Velero code works that way, at a pod level.
I don't see any tracking issue to introduce app-level consistency. If that's not even on the roadmap, I'd suggest decoupling it from this perf enhancement.


## Non Goals

## Considerations:
- Ensure no existing flows break.
- Ensure thread safety in case of concurrency.
- Provide control over max concurrency, tweakable by the end user (see the sketch below).
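As a minimal sketch (plain Go, not the actual Velero code) of how user-configurable max concurrency could be bounded, using a buffered channel as a counting semaphore; `snapshotPVC` is a hypothetical placeholder for the real per-PVC work:

```go
package main

import (
	"fmt"
	"sync"
)

// snapshotPVC is a placeholder for the real per-PVC snapshot call.
func snapshotPVC(name string) error {
	fmt.Println("snapshotting", name)
	return nil
}

// snapshotAll runs snapshotPVC for each PVC with at most maxParallel calls in
// flight. maxParallel would come from user configuration, defaulting to 1 to
// preserve today's sequential behaviour.
func snapshotAll(pvcs []string, maxParallel int) []error {
	sem := make(chan struct{}, maxParallel)
	errCh := make(chan error, len(pvcs)) // buffered so workers never block on send
	var wg sync.WaitGroup
	for _, pvc := range pvcs {
		wg.Add(1)
		go func(pvc string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			if err := snapshotPVC(pvc); err != nil {
				errCh <- err
			}
		}(pvc)
	}
	wg.Wait()
	close(errCh)
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	_ = snapshotAll([]string{"pvc-a", "pvc-b", "pvc-c"}, 2)
}
```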

## How to Group Snapshots
- PVCs which are being used by a pod/deployment should be snapshotted together (see the grouping sketch after this list).
Contributor

Practically, any set of PVCs could be correlated and therefore need to be grouped, not only PVCs from the same pod/deployment.
More accurately, we could say PVCs from one application may need to be snapshotted together, because the group snapshot is really about achieving data consistency across the application, similar to pre/post hooks. However, there is no real entity for an application in Kubernetes, so in order to support group snapshots, more details need to be considered.

Moreover, we also need to consider how to support data movement of grouped snapshots.

So I suggest we either work through the details and create a sophisticated solution for group snapshots, or drop the topic in this PR and leave it to a separate PR in future.

Contributor

I'm not personally opposed to waiting on the grouping of PVCs, but I did want to ask one question.

If the snapshot is taken in a consistent way, do we need to change the data movement? I assume this process could happen at slightly different times because the data itself should be consistent, and the movement of the bytes shouldn't impact the consistency, or am I missing something?

Contributor
@Lyndon-Li Lyndon-Li Oct 19, 2023

The data movement doesn't impact the consistency.
On the other hand, the group snapshot does impact the data mover in its snapshot exposure and manipulation. For example, the snapshot-associated objects are different (i.e., VolumeGroupSnapshot and VolumeGroupSnapshotContent), so without any changes the data movement cannot support it.

Contributor

Oh I agree; once those APIs go GA, or rather before they go GA or beta, we should support them.

I guess I am confused about how this impacts the current idea.

The way I see this design, and I could be way off, is that when we have no information about the system other than that a pod is using multiple PVCs and is using CSI, we should trigger the CSI backups in a way that leads to consistent data, as well as speeding up the backup time.

Is it not a good goal, given the above (ignoring all the work around groups and such), to do this incremental thing that helps in multiple ways? @anshulahuja98 is the above understanding what you are going for?

Collaborator Author

Yes, correct @shawn-hurley.
We can decouple VolumeGroupSnapshot from this design since it's not even beta yet.

Contributor

See these comments: to me the current change won't help to improve consistency and performance in the NORMAL case. At present, with the help of BIAv2, there are two phases for CSI snapshot creation:

  1. Create the snapshot and wait for the snapshot handle to appear. This phase runs in sequence across PVCs. If we dig into the process of snapshot creation on the storage side, this should be very fast in the NORMAL case.
  2. Wait for the snapshot to reach ReadyToUse. Here data is being moved on the storage side, so it takes time. This phase runs in parallel.

Now, let's look at the differences with or without the current changes, regarding consistency and performance improvement in the NORMAL case:

  • For consistency: I think the idea is to have PVCs in one pod snapshotted as closely together as possible. Because phase 1 is very fast and phase 2 is asynchronous, PVCs from one pod are actually already snapshotted very closely; in between are only simple resource backups. If we simply make the backup workflow async as in the current changes, I don't know how much more difference it makes. Moreover, don't forget the capability of the CSI driver: both the CSI driver and the storage may have limits on running snapshots together.
  • For performance: Since phase 2, the most time-consuming part, is already async, I don't know how much performance improvement there is in making the main backup workflow async once more.

You may argue that the statement that phase 1 is very fast is not always true for all storages. I agree, but I would rather regard those as flaws of the storage itself, because technically this can be very fast.

Anyway, we at least need to do some tests to prove how much improvement there is for consistency and performance in various environments, and then come back to consider the current changes.
Forgive my caution on these changes: they make the primary workflow very different, and many unexpected problems could arise. I have listed some, but I cannot tell all. So if we don't know the benefits for sure, I don't think it is a good bargain.

Contributor

My understanding is that among the systems that take a long time to snapshot are aws-ebs as well as Azure disks. I may be recalling this wrong.

Can someone verify my recollection? I can't find any documentation.

Contributor

The speed of snapshot-taking is not only related to the storage provider.
It's reported that snapshot handle creation for large GCP persistent disks takes more than 10 minutes, but IMO those are not common cases for most storage providers.

- PVCs which are not being used by any pod can be snapshotted at any time.
- If there are no hooks provided, snapshotting for all PVCs can be done in parallel.
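To make the grouping concrete, a minimal sketch (plain Go, not the actual Velero code) that groups PVC names by the pod that mounts them and separates out unattached PVCs; the pod and PVC names are made up for the example:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// groupPVCsByPod returns, per namespace/pod, the names of the PVCs that pod
// mounts, plus the PVCs not referenced by any pod (which can be snapshotted
// at any time).
func groupPVCsByPod(pods []corev1.Pod, allPVCs []string) (map[string][]string, []string) {
	byPod := map[string][]string{}
	used := map[string]bool{}
	for _, pod := range pods {
		podKey := pod.Namespace + "/" + pod.Name
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				claim := vol.PersistentVolumeClaim.ClaimName
				byPod[podKey] = append(byPod[podKey], claim)
				used[claim] = true
			}
		}
	}
	var unattached []string
	for _, pvc := range allPVCs {
		if !used[pvc] {
			unattached = append(unattached, pvc)
		}
	}
	return byPod, unattached
}

func main() {
	pod := corev1.Pod{}
	pod.Namespace, pod.Name = "default", "web-0"
	pod.Spec.Volumes = []corev1.Volume{{
		Name: "data",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "data-web-0"},
		},
	}}
	groups, loose := groupPVCsByPod([]corev1.Pod{pod}, []string{"data-web-0", "scratch"})
	fmt.Println(groups, loose) // map[default/web-0:[data-web-0]] [scratch]
}
```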

## How to Group Resources:
- This is additional thinking on how to group resources in Velero so that they can be backed up more efficiently, potentially in parallel.
- Current issues with PVs being backed up through different flows may be solved by items that are still under discussion.

## Approaches

### Invoke CSI Plugin in parallel for a group of PVCs.
- Invoke current CSI plugin's Execute() in parallel for a group of PVCs.
Contributor

I think in practice the idea is to group CSI PVCs in some way (for example, group the PVCs per pod or namespace, with a maximum of n entries per group), and we back up the PVCs in each group in parallel, regardless of whether it's via the CSI plugin or not.

At the code level, the backupper may evolve to support item_groups, and it would call backupItem in parallel for the entries in one group; will that work? This concept of item_groups may help us improve the parallelism within one backup in the future.

Collaborator Author

If you check the code changes in the current PR, it kind of solves for item_group by invoking backupItem in parallel for additionalItems.

In future this can be extended to other resources in a similar way, where we write a BIA to group items and then back them up in parallel.


## Implementation
- In the current code flow, `backupItem` in backup.go is invoked for each resource -> this further invokes `itembackupper.backupItem` -> `backupItemInternal`.
- For a Pod -> first pre-hooks are run -> then `executeActions` -> iterate over all BIAs applicable to the Pod -> which will invoke the `PodAction`.
- After all actions are run, `executeActions` gets the additional items to back up (the PVCs).
- For all these PVCs and other additional items we iterate and call `itembackupper.backupItem`.
- After all additional items are backed up -> control returns to `backupItemInternal` -> post-hooks are run -> and then `backupItem` returns.
- The change here is that when the additionalItems are backed up, for PVCs we will run `itembackupper.backupItem` asynchronously, as sketched below.
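A minimal, self-contained sketch of that change (assuming simplified signatures; the real `itembackupper.backupItem` also returns archive files, which are omitted here). Failures from the parallel children are aggregated into a single error for the parent item, along the lines discussed in the review comments below:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// backupAdditionalItem stands in for itembackupper.backupItem in this sketch.
func backupAdditionalItem(name string) error {
	fmt.Println("backing up", name)
	return nil
}

// backupAdditionalItemsAsync backs up the additional items (e.g. a pod's PVCs)
// in parallel and aggregates any failures, so the parent item still fails if
// any child does.
func backupAdditionalItemsAsync(parent string, items []string) error {
	errCh := make(chan error, len(items)) // buffered: senders never block
	var wg sync.WaitGroup
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			if err := backupAdditionalItem(item); err != nil {
				errCh <- fmt.Errorf("%s: %w", item, err)
			}
		}(item)
	}
	wg.Wait()
	close(errCh)
	var msgs []string
	for err := range errCh {
		msgs = append(msgs, err.Error())
	}
	if len(msgs) > 0 {
		return fmt.Errorf("one or more additional items for %s failed: %s", parent, strings.Join(msgs, "; "))
	}
	return nil
}

func main() {
	_ = backupAdditionalItemsAsync("pod/web-0", []string{"pvc/data-web-0", "pvc/logs-web-0"})
}
```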

## Open Problems
- Ensuring thread safety
- Ensuring that PVBs triggered in parallel work as expected.

## Alternatives Considered
### Approach 1: Add support for VolumeGroupSnapshot in Velero.
- [Volume Group Snapshots](https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/) is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash-consistent snapshots of multiple volumes together. It uses a **label selector to group multiple PersistentVolumeClaims** for snapshotting.
- This is out of scope for the current design since the API is not even beta yet, and it does not impact the current perf improvements.

### Approach 2: Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- Create a Pod BIA Plugin which will invoke CSI Plugin in parallel for a group of PVCs.
- This would lead to code and logic duplication across CSI Plugin and the pod plugin.
- With BIAv2 this is complicated to achieve, since a single pod plugin would have to return N operation IDs for N PVCs, while there is only support for one operation ID at a time. Hacking around the single operation ID field would lead to code complications and would not be a clean approach.

## Security Considerations
No security impact.

## Compatibility



## Future enhancement

## Open Issues
NA
53 changes: 45 additions & 8 deletions pkg/backup/item_backupper.go
@@ -22,6 +22,7 @@
"encoding/json"
"fmt"
"strings"
"sync"
"time"

"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
@@ -166,7 +167,7 @@
namespace: namespace,
name: name,
}

// mutex on BackedUpItems needed.
if _, exists := ib.backupRequest.BackedUpItems[key]; exists {
log.Info("Skipping item because it's already been backed up.")
// returning true since this item *is* in the backup, even though we're not backing it up here
@@ -408,7 +409,22 @@
itemOperList := ib.backupRequest.GetItemOperationsList()
*itemOperList = append(*itemOperList, &newOperation)
}

// ## Approach 1
// Extract all PVCs from the Additional Items
// Create a label such as <podname>:<>
// Apply label to all these PVCs
// Create VolumeSnapshotGroup CR with label selector
// Invoke VolumeSnapshotGroup Action with the the CR
// Poll for VSG in CSI VSG Plugin's Execute()
// Return Additional Items and continue backup

// ## Approach 2
// Async call the current CSI Plugin's Execute()
// Wait for all snapshots to complete.
// Max parallelism can be controlled by further tweaking the WaitGroup.
additionalItemFilesChannel := make(chan FileForArchive)
errChannel := make(chan error)
var wg sync.WaitGroup
for _, additionalItem := range additionalItemIdentifiers {
gvr, resource, err := ib.discoveryHelper.ResourceFor(additionalItem.GroupResource.WithVersion(""))
if err != nil {
@@ -433,12 +449,32 @@
if err != nil {
return nil, itemFiles, errors.WithStack(err)
}

_, additionalItemFiles, err := ib.backupItem(log, item, gvr.GroupResource(), gvr, mustInclude, finalize)
if err != nil {
return nil, itemFiles, err
}
itemFiles = append(itemFiles, additionalItemFiles...)
wg.Add(1)
go func(additionalItem velero.ResourceIdentifier, log logrus.FieldLogger, item runtime.Unstructured, gvr schema.GroupVersionResource, mustInclude, finalize bool) {
defer wg.Done()
log.WithFields(logrus.Fields{
"groupResource": additionalItem.GroupResource,
"namespace": additionalItem.Namespace,
"name": additionalItem.Name,
}).Infof("Triggering async backupitem for additional item")
_, additionalItemFiles, err := ib.backupItem(log, item, gvr.GroupResource(), gvr, mustInclude, finalize)
if err != nil {

(Codecov: added lines #L460 - L461 were not covered by tests)
errChannel <- err
return
}
for _, file := range additionalItemFiles {
additionalItemFilesChannel <- file
}
}(additionalItem, log, item.DeepCopy(), gvr, mustInclude, finalize)
}
wg.Wait()
close(additionalItemFilesChannel)
Collaborator Author

I'll refine the channel handling / code accuracy later; for now, consider this only as representing the draft idea for the approach.

close(errChannel)
for itemFilesFromChannel := range additionalItemFilesChannel {
itemFiles = append(itemFiles, itemFilesFromChannel)

(Codecov: added lines #L472 - L474 were not covered by tests)
}
for err := range errChannel {
Contributor

I think this is going to eat some errors

Collaborator

@shawn-hurley ahh yes, we were talking about this on slack earlier. Before this change, velero backs up each additionalItem in turn, and upon the first error, it's returned as an error, and the others aren't attempted. Now that we're doing them in parallel, all will start, so it's possible that more than one will error out.

Since the failing additionalItem should log an error for its own failure, the full error list shouldn't have anything missing. That being said, rather than just returning the error for the additional item, we probably want a more descriptive error here anyway, since err here doesn't reference the current item at all. Perhaps logic along the lines of this? If errChannel isn't empty return an error with message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel).

Collaborator Author

"Perhaps logic along the lines of this? If errChannel isn't empty return an error with message "One or more additional items for $currentItem failed: (string join of individual err messages from errChannel)."
I can take care of that.
If that's enough to address this concern

Collaborator

@anshulahuja98 That addresses the concern on my end. I think that should make sure that no errors are swallowed here. Net effect is if a pod has 2 PVCs and both PVC backups fail, then each PVC's backup error should show up as a PVC error, and then the pod will fail with the combined error message.

Contributor

I don't know if it's reasonable to put them all in one message; with something like 5-10 of them, the message will become unwieldy in the logs.

Can we create a single error, like you said, and then log every other error?

Or is that what was proposed and I missed it?

Contributor

A goroutine could be pulling things off the error channel as the backups run too, so you can see the logs as they fail.

Then the returned error just reflects whether the error-handling goroutine was ever used.

Thoughts?

Collaborator Author

that works too. Makes sense to me
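For illustration, a minimal standalone sketch (plain Go, not the code in this PR) of the pattern discussed in this thread: a dedicated goroutine drains the error channel while the workers run, logging each failure as it happens, and the caller only surfaces a summary at the end. The item names and failure condition are made up for the example:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	items := []string{"pvc-a", "pvc-b", "pvc-c"}
	errCh := make(chan error)
	var failed int64

	// Drain errors concurrently so unbuffered sends never block the workers.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for err := range errCh {
			atomic.AddInt64(&failed, 1)
			fmt.Println("additional item failed:", err) // logged as they fail
		}
	}()

	var wg sync.WaitGroup
	for _, item := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			if item == "pvc-b" { // pretend one backup fails
				errCh <- fmt.Errorf("backup of %s failed", item)
			}
		}(item)
	}
	wg.Wait()
	close(errCh)
	<-done // wait for the drainer to finish logging

	if failed > 0 {
		fmt.Printf("%d additional item(s) failed\n", failed)
	}
}
```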

return nil, itemFiles, err

(Codecov: added lines #L476 - L477 were not covered by tests)
}
}
return obj, itemFiles, nil
@@ -658,6 +694,7 @@
// this function will be called throughout the process of backup, it needs to handle any object
func (ib *itemBackupper) trackSkippedPV(obj runtime.Unstructured, groupResource schema.GroupResource, approach string, reason string, log logrus.FieldLogger) {
if name, err := getPVName(obj, groupResource); len(name) > 0 && err == nil {
// Skip PV Tracker already has Mutex lock.

(Codecov: added line #L697 was not covered by tests)
ib.backupRequest.SkippedPVTracker.Track(name, approach, reason)
} else if err != nil {
log.WithError(err).Warnf("unable to get PV name, skip tracking.")
2 changes: 2 additions & 0 deletions pkg/backup/pvc_snapshot_tracker.go
@@ -77,12 +77,14 @@ func (t *pvcSnapshotTracker) recordStatus(pod *corev1api.Pod, volumeName string,
for _, volume := range pod.Spec.Volumes {
if volume.Name == volumeName {
if volume.PersistentVolumeClaim != nil {
// lock on pvcPod
t.pvcPod[key(pod.Namespace, volume.PersistentVolumeClaim.ClaimName)] = pod.Name
currStatus, ok := t.pvcs[key(pod.Namespace, volume.PersistentVolumeClaim.ClaimName)]
if !ok {
currStatus = pvcSnapshotStatusNotTracked
}
if currStatus == preReqStatus {
// lock on pvcs
t.pvcs[key(pod.Namespace, volume.PersistentVolumeClaim.ClaimName)] = status
}
}