velero is unable to prune PartiallyFailed backup from object storage #5940

Open
jhuisss opened this issue Mar 2, 2023 · 9 comments

jhuisss commented Mar 2, 2023

What steps did you take and what happened:
I'm using Velero 1.9.5. There is a PartiallyFailed backup (a restic backup of an NFS PV) that results in 4.1G of disk usage on the MinIO server. When I use the velero backup delete <partiallyfailed backup> command to delete the backup, it does delete the backup CR, but it does not release the disk space in the bucket in object storage, even though I have set default-restic-prune-frequency=0h30m0s and confirmed that maintenanceFrequency in the ResticRepository equals 30m0s.
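
For reference, roughly the commands and settings involved (just a sketch; the namespace and deployment name below are the install defaults, and the backup name is a placeholder):

# The prune frequency is a velero server flag (set at install time or on the deployment):
#   --default-restic-prune-frequency=30m
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.containers[0].args}'

# Deleting the backup removes the backup CR; the expectation is that the next
# restic prune run then reclaims the space in the bucket.
velero backup delete <partiallyfailed-backup> --confirm

The ResticRepository and the disk usage on the MinIO server: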

# kubectl -n velero describe resticrepository testmypv-large-ns-default-pb6mq
Name:         testmypv-large-ns-default-pb6mq
Namespace:    velero
Labels:       velero.io/storage-location=default
              velero.io/volume-namespace=testmypv-large-ns
Annotations:  <none>
API Version:  velero.io/v1
Kind:         ResticRepository
Metadata:
  Creation Timestamp:  2023-03-02T07:50:13Z
  Generate Name:       testmypv-large-ns-default-
  Generation:          4
  Managed Fields:
    API Version:  velero.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/storage-location:
          f:velero.io/volume-namespace:
      f:spec:
        .:
        f:backupStorageLocation:
        f:maintenanceFrequency:
        f:resticIdentifier:
        f:volumeNamespace:
      f:status:
        .:
        f:lastMaintenanceTime:
        f:phase:
    Manager:         velero-server
    Operation:       Update
    Time:            2023-03-02T07:50:15Z
  Resource Version:  5610009
  UID:               db87d044-2d57-43e6-9763-376abb727c4d
Spec:
  Backup Storage Location:  default
  Maintenance Frequency:    30m0s
  Restic Identifier:        s3:https://***:9000/restic7.xkou/restic/testmypv-large-ns
  Volume Namespace:         testmypv-large-ns
Status:
  Last Maintenance Time:  2023-03-02T08:23:39Z
  Phase:                  Ready
Events:                   <none>
root@minio:/minio/data# du -h -d 1
16K	./lost+found
2.9M	./.minio.sys
4.0K	./second-bucket.xkou
4.0K	./first-bucket.xkou
28K	./restic2.xkou
4.1G	./restic7.xkou          <----- after 1 hour it is still 4.1G; expected to be 4.0K as an empty dir.
4.1G	.

What did you expect to happen:
After running velero backup delete <partiallyfailed backup>, the disk space used by this backup should be released within half an hour.

The following information will help us better understand what's going on:
One weird thing is that deleting a Completed backup releases the disk space used in object storage, but for a PartiallyFailed backup the disk space is not released.

BTW, there is only one backup in this namespace and this object storage bucket, and no incremental backups, so deleting the backup should release all disk usage in this bucket.
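
For completeness, a rough sketch of how to check the restic repository directly (the credentials, host, and repository password below are placeholders; the repository URL comes from the Restic Identifier in the ResticRepository CR, and the password is whatever Velero's restic credentials secret holds):

# Placeholders -- fill in from the BackupStorageLocation credentials and the ResticRepository CR.
export AWS_ACCESS_KEY_ID=<minio-access-key>
export AWS_SECRET_ACCESS_KEY=<minio-secret-key>
export RESTIC_REPOSITORY='s3:https://<minio-host>:9000/restic7.xkou/restic/testmypv-large-ns'
export RESTIC_PASSWORD=<velero-restic-repository-password>

# If restic saved a snapshot that Velero never recorded, it will still be listed here
# even after velero backup delete completes, and the repo size will reflect its data.
restic snapshots
restic stats --mode raw-data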

Anything else you would like to add:
Velero logs while running velero backup delete <partiallyfailed backup>:

time="2023-03-02T01:50:58Z" level=info msg="Starting to check for items in namespace" logSource="internal/delete/delete_item_action_handler.go:95" namespace=testmypv-large-ns
time="2023-03-02T01:50:58Z" level=info msg="invoking DeleteItemAction plugins" item=pvc-test1 logSource="internal/delete/delete_item_action_handler.go:111" namespace=testmypv-large-ns
time="2023-03-02T01:50:58Z" level=info msg="Skipping PVCDeleteItemAction for PVC testmypv-large-ns/pvc-test1, PVC does not have a vSphere BackupItemAction snapshot." backup=br-e2e-wc1-backup cmd=/plugins/velero-plugin-for-vsphere controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/plugin/delete_pvc_action_plugin.go:57" pluginName=velero-plugin-for-vsphere
time="2023-03-02T01:50:58Z" level=info msg="Removing PV snapshots" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:278"
time="2023-03-02T01:50:58Z" level=info msg="Removing restic snapshots" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:303"
time="2023-03-02T01:50:58Z" level=info msg="Removing backup from backup storage" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:311"
time="2023-03-02T01:50:58Z" level=info msg="Removing restores" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:317"

Environment:

  • Velero version (use velero version): 1.9.5
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li (Contributor) commented:

@jhuisss Please also share the podVolumeBackup CRs included in the partially failed backup.

Lyndon-Li self-assigned this Mar 6, 2023
Lyndon-Li added the Restic and Needs investigation labels Mar 6, 2023
jhuisss (Author) commented Mar 6, 2023

Hi, I've now attached the log bundle for the PartiallyFailed backup and the podVolumeBackup CR YAML file.
bundle-2023-03-06-07-30-48.tar.gz
podvolumebackup-br-e2e-wc1-backup-mts58.yaml.txt

Lyndon-Li (Contributor) commented:

This looks like a bug in how a restic snapshot error is handled:

  • When restic backup encounters this kind of error, it just prints the error to the terminal
  • However, restic doesn't stop halfway; instead, it continues the backup
  • Finally, restic saves the snapshot into the backup store

However, on the Velero side:

  • When the restic backup command returns, Velero checks the output from restic
  • When Velero sees the error, it treats it as a fatal error and aborts
  • Therefore the restic snapshot is not retrieved and recorded in the PVB
  • As a result, the snapshot and its data are lost forever, orphaned in the repository where Velero can never prune them (a possible manual cleanup is sketched below)
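
If that is the case, a possible manual workaround (just a sketch, using the restic CLI directly against the repository, with RESTIC_REPOSITORY/RESTIC_PASSWORD set as in the earlier sketch) is to forget the orphaned snapshot yourself, since no PodVolumeBackup references it and Velero will therefore never issue restic forget for it when the backup is deleted:

# List snapshots and identify the one left behind by the PartiallyFailed backup.
restic snapshots
# Drop the orphaned snapshot and reclaim the space in the bucket.
restic forget <orphaned-snapshot-id>
restic prune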

Lyndon-Li (Contributor) commented:

Steps to reproduce:

  • Deploy a pod which has a PV mounted
  • Log in to the pod and cd to the mounted volume
  • Generate 5 files, each with 10G of data: dd if=/dev/urandom bs=1024M count=10 iflag=fullblock of=tmpfilea
  • Create a Velero backup with parameters that include the namespace of the above pod
  • During the backup, e.g. 20s after it starts, log in to the pod again and cd to the mounted volume
  • Delete 3 of the files
  • Wait for the Velero backup to finish; it should end as PartiallyFailed (a rough command sketch follows below)
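
Roughly, as commands (the backup name, namespace, and file names are placeholders):

# Inside the pod, in the mounted volume: generate 5 files of 10G each.
for i in 1 2 3 4 5; do
  dd if=/dev/urandom bs=1024M count=10 iflag=fullblock of=tmpfile$i
done

# From a workstation: back up the pod's namespace with restic.
velero backup create repro-backup \
  --include-namespaces <pod-namespace> \
  --default-volumes-to-restic

# ~20s after the backup starts, exec into the pod again and delete 3 of the files
# (e.g. rm tmpfile1 tmpfile2 tmpfile3), then wait for the backup to finish:
velero backup describe repro-backup   # phase should end up PartiallyFailed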

Lyndon-Li (Contributor) commented:

This problem also happens with the Kopia uploader.

Lyndon-Li (Contributor) commented:

After double-checking the code and testing, the Kopia path doesn't have this problem: the snapshot for the Kopia path is saved by Velero, and if the Kopia uploader reports any error, Velero aborts saving the snapshot, so the backed-up data will be removed by Kopia GC some time later.

Since the Restic path will be suppressed in v1.12 and later, this issue is not high priority.
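
For anyone who wants to verify this on the Kopia path, a small sketch (assuming Velero >= 1.10, where the ResticRepository CRD is replaced by BackupRepository): the repository maintenance that performs this garbage collection is tracked on the BackupRepository CR, and lastMaintenanceTime shows when unreferenced data was last cleaned up.

# Check when repository maintenance (which removes unreferenced data) last ran.
kubectl -n velero get backuprepositories.velero.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastMaintenanceTime}{"\n"}{end}'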

Lyndon-Li removed the Kopia label May 23, 2023
sseago (Collaborator) commented May 23, 2023

@Lyndon-Li note that restic will still be available as a non-default option in 1.12 and, assuming we deprecate it in 1.12, until 1.14, assuming the proposed deprecation policy is the one we end up with -- i.e. the earliest we can remove a feature is two releases post-deprecation. This assumes that we deprecate restic in the same release we make it non-default. If we deprecate later, then those numbers shift as well.

Lyndon-Li (Contributor) commented:

@sseago
Yes, you are right. Here, by "suppress", I meant we will make the Restic path non-default in v1.12.

ctrought commented:

Also experiencing this with volume backups using CSI snapshots and the data mover: after the DeleteBackupRequest completes, the objects for the volume backup remain in the S3 bucket.
