
velero completely ignores --csi-snapshot-timeout flag #8473

Open
jmtx1020 opened this issue Dec 3, 2024 · 14 comments


jmtx1020 commented Dec 3, 2024

What steps did you take and what happened:

I created a volume like this:

velero backup create data-backup \
  --include-namespaces app-12345-425 \
  --include-resources=persistentvolumeclaims \
  --selector='test=velero-backup' \
  --csi-snapshot-timeout=20m

This results in the following logs. It actually does create the volume snapshot but then deletes it. I've timed how long a manual snapshot takes, and it's about 47 seconds, so I think the issue might be that Velero is only waiting about 50 seconds.

time="2024-12-03T04:09:00Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:712" pluginName=velero
time="2024-12-03T04:09:05Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:10Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:15Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:20Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:25Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:30Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:35Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:40Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:45Z" level=info msg="Waiting for VolumeSnapshotContents snapcontent-0bc2e3a0-28bd-488c-98fd-edf09db53c83 to have snapshot handle. Retrying in 5s" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:739" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="Deleting Volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq" backup=velero/data-backup-with-hook cmd=/velero logSource="pkg/util/csi/volume_snapshot.go:485" pluginName=velero
time="2024-12-03T04:09:50Z" level=info msg="1 errors encountered backup up item" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:720" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=error msg="Error backing up item" backup=velero/data-backup-with-hook error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=app-12345-425, name=velero-app-12345-425-as-dc1-app-123456pfrq): rpc error: code = Unknown desc = failed to get volumesnapshot app-12345-425/velero-app-12345-425-as-dc1-app-123456pfrq: volumesnapshots.snapshot.storage.k8s.io \"velero-app-12345-425-as-dc1-app-123456pfrq\" not found" logSource="pkg/backup/backup.go:724" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is cluster-scoped and is excluded by cluster filter." backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:133" name=csi-d3a39800-f02e-4456-83ec-0127bba2ad71 namespace= resource=persistentvolumes
time="2024-12-03T04:09:50Z" level=info msg="Excluding item because resource is excluded" backup=velero/data-backup-with-hook logSource="pkg/backup/item_backupper.go:140" name=app-12345-425-as-dc1-0 namespace=app-12345-425 resource=pods
time="2024-12-03T04:09:50Z" level=info msg="Backed up 2 items out of an estimated total of 2 (estimate will change throughout the backup)" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:499" name=app-12345-425-as-dc1-app-12345-425-as-dc1-0 namespace=app-12345-425 progress= resource=persistentvolumeclaims
time="2024-12-03T04:09:50Z" level=info msg="Summary for skipped PVs: []" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:542"
time="2024-12-03T04:09:50Z" level=info msg="Backed up a total of 2 items" backup=velero/data-backup-with-hook logSource="pkg/backup/backup.go:546" progress=

What did you expect to happen:
A successfully created backup.

Anything else you would like to add:

Creating a volume snapshot manually like the one below works, but it takes about 47 seconds to complete and reach the "Ready" state in Kubernetes. We should be able to set the wait time to a few minutes so the backup can always succeed.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-manual-snapshot
  namespace: app-12345-425
spec:
  volumeSnapshotClassName: my-snapclass
  source:
    persistentVolumeClaimName: <claim_name>
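
For reference, a quick way to time how long the snapshot takes to reach Ready (the manifest filename below is a placeholder; the namespace and snapshot name come from the manifest above):

kubectl -n app-12345-425 apply -f volumesnapshot.yaml
time kubectl -n app-12345-425 wait volumesnapshot/test-manual-snapshot \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=5m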

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.15.0
        Git commit: -
Server:
        Version: v1.15.0
  • Velero features (use velero client config get features):
velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1
  • Cloud provider or hardware configuration: Oracle Cloud Kubernetes

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

kaovilai (Member) commented Dec 3, 2024

kubectl get backup -oyaml
and see if the timeout propagated to the Backup object Velero would use to run the backup.
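
For example, to check just that field (assuming a default install where Backup objects live in the velero namespace; the backup name is a placeholder):

kubectl -n velero get backup <backup-name> -o jsonpath='{.spec.csiSnapshotTimeout}{"\n"}'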

@kaovilai kaovilai self-assigned this Dec 3, 2024
@kaovilai kaovilai added the Area/CSI Related to Container Storage Interface support label Dec 3, 2024

kaovilai (Member) commented Dec 3, 2024

Also, please give an example of how you used --csi-snapshot-timeout.

jmtx1020 (Author) commented Dec 3, 2024

Also, please give an example of how you used --csi-snapshot-timeout.

Sorry about missing that detail. I just updated the original shell command to show it. What's really strange is that I have a very small volume that I can snapshot just fine; it's only the ones with actual data (8 GB) that take longer and hit this problem.

My demo volume is 50 GB but probably has less than 1 MB of dummy data, and it snapshots fine. The one with real data always fails.

jmtx1020 (Author) commented Dec 3, 2024

I guess it is sending that value over? But then why is it timing out before it sees the VolumeSnapshotContents in the attached logs 🤔

For the record, manual backups (outside of Velero) work just fine if I create a volume snapshot using Kubernetes manifests. So the Oracle CSI driver is working as expected, and I've also tried a different backup scheduler, which isn't having any issues at all.

I would like to keep using Velero though, since it has the neat CLI that goes with it.

kubectl get backup -oyaml and see if the timeout propagated to the Backup object Velero would use to run the backup.

velero backup create manual-backup-11 \
  --include-resources="pvc" \
  --include-namespaces app-12345-425 \
  --include-resources=persistentvolumeclaims \
  --snapshot-volumes=true \
  --selector='test=velero-backup' \
  --csi-snapshot-timeout=20m
- apiVersion: velero.io/v1
  kind: Backup
  metadata:
    annotations:
      velero.io/resource-timeout: 10m0s
      velero.io/source-cluster-k8s-gitversion: v1.29.1
      velero.io/source-cluster-k8s-major-version: "1"
      velero.io/source-cluster-k8s-minor-version: "29"
    creationTimestamp: "2024-12-03T17:45:36Z"
    generation: 6
    labels:
      velero.io/storage-location: default
    name: manual-backup-11
    namespace: velero
    resourceVersion: "509394137"
    uid: 62b9fd3d-abfe-4a3e-b804-0d2278e21d17
  spec:
    csiSnapshotTimeout: 20m0s
    defaultVolumesToFsBackup: false
    hooks: {}
    includedNamespaces:
    - app-12345-425
    includedResources:
    - persistentvolumeclaims
    itemOperationTimeout: 4h0m0s
    labelSelector:
      matchLabels:
        test: velero-backup
    metadata: {}
    snapshotMoveData: false
    snapshotVolumes: true
    storageLocation: default
    ttl: 720h0m0s
  status:
    completionTimestamp: "2024-12-03T17:46:32Z"
    errors: 1
    expiration: "2025-01-02T17:45:41Z"
    formatVersion: 1.1.0
    hookStatus: {}
    phase: PartiallyFailed
    progress:
      itemsBackedUp: 2
      totalItems: 2
    startTimestamp: "2024-12-03T17:45:41Z"
    version: 1

kaovilai (Member) commented Dec 3, 2024

Yours did not fail due to a timeout. Otherwise you would see the log "Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v":

if err != nil {
    if wait.Interrupted(err) {
        if vsc != nil &&
            vsc.Status != nil &&
            vsc.Status.Error != nil {
            log.Errorf(
                "Timed out awaiting reconciliation of VolumeSnapshot, VolumeSnapshotContent %s has error: %v",
                vsc.Name, *vsc.Status.Error.Message)
            return nil,
                errors.Errorf("CSI got timed out with error: %v",
                    *vsc.Status.Error.Message)
        } else {
            log.Errorf(
                "Timed out awaiting reconciliation of volumesnapshot %s/%s",
                volSnap.Namespace, volSnap.Name)
        }
    }
    return nil, err
}
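
For example, one way to check whether that timeout path was actually hit is to grep the backup logs for that message (backup name taken from the logs above):

velero backup logs data-backup-with-hook | grep -i "timed out awaiting reconciliation"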

jmtx1020 (Author) commented Dec 3, 2024

So I changed nothing, and it's now working on a schedule.

app-12345-425-20241203190531   Completed         0        0          2024-12-03 12:05:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203190031   Completed         0        0          2024-12-03 12:00:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203185531   Completed         0        0          2024-12-03 11:55:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203185031   Completed         0        0          2024-12-03 11:50:31 -0700 MST   29d       default            test=velero-backup
app-12345-425-20241203184531   Completed         0        0          2024-12-03 11:45:31 -0700 MST   29d       default            test=velero-backup

kaovilai (Member) commented Dec 3, 2024

Was it reliably reproducible (how many failures out of 10, etc.)? I think it might've been a one-off. We can add some retry in here to make it more reliable if the apiserver is slow.

jmtx1020 (Author) commented Dec 3, 2024

Was it reliably reproducible (how many failures out of 10, etc.)? I think it might've been a one-off. We can add some retry in here to make it more reliable if the apiserver is slow.

Yes it was; for the past 24 hours it ran every 5 minutes and only one run succeeded, and now it's succeeding every time.

kaovilai (Member) commented Dec 3, 2024

ok.. so not 100% reproducible given enough retries.. so env flakes.

kaovilai (Member) commented Dec 3, 2024

I can try to PR something. But since your env is now succeeding every time, I assume you won't be able to test it.

jmtx1020 (Author) commented Dec 3, 2024

ok.. so not 100% reproducible given enough retries.. so env flakes.

This is what I did just before it started working:
1.) I installed / set up a scheduled backup using Backube instead of Velero
2.) I scaled the StatefulSet down to 0
3.) I deleted the PVC
4.) I restored using the manifest below
5.) I scaled the StatefulSet back up to 1
6.) Now the backups work.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0
  namespace: app-12345-425
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 250Gi
  dataSource:
    name: app-12345-425-as-as-dc1-app-app-12345-425-as-as-dc1-app-0-app-12345-425-snapshot-202412031840 
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Then I scaled my StatefulSet back up.
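
The scale steps were plain kubectl scale calls, roughly like this (the StatefulSet name is a placeholder, not taken from the issue):

kubectl -n app-12345-425 scale statefulset <statefulset-name> --replicas=0
# delete the PVC and apply the restore manifest above, then:
kubectl -n app-12345-425 scale statefulset <statefulset-name> --replicas=1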

jmtx1020 (Author) commented Dec 3, 2024

I can try to PR something. But since your env is now succeeding every time, I assume you won't be able to test it.

I can still test it because I have a lot more deployments that need to be backed up as well.

kaovilai (Member) commented Dec 4, 2024

@jmtx1020 do you have Argo by any chance?

If so, you could be hitting #7905

kaovilai (Member) commented Dec 4, 2024

Btw, I made an image for you from #8480:

ghcr.io/kaovilai/velero:verboseCleanup-r1.15-b26183826
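
One way to test that image, assuming a default Velero install where the deployment, container, and namespace are all named velero:

kubectl -n velero set image deployment/velero velero=ghcr.io/kaovilai/velero:verboseCleanup-r1.15-b26183826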
