velero is unable to prune PartiallyFailed backup from object storage #5940

Open
jhuisss opened this issue Mar 2, 2023 · 9 comments

jhuisss commented Mar 2, 2023

What steps did you take and what happened:
I'm using Velero 1.9.5. There is a PartiallyFailed backup (a restic backup of an NFS PV) that results in 4.1G of disk usage on the MinIO server. When I use the velero backup delete <partiallyfailed backup> command to delete the backup, it does delete the backup CR, but it does not release the disk space in the bucket in object storage, even though I have set default-restic-prune-frequency=0h30m0s and confirmed that maintenanceFrequency in the ResticRepository equals 30m0s.
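
For reference, roughly the commands and settings involved (just a sketch; the namespace and deployment name below are the install defaults, and the backup name is a placeholder):

# The prune frequency is a velero server flag (set at install time or on the deployment):
#   --default-restic-prune-frequency=30m
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.containers[0].args}'

# Deleting the backup removes the backup CR; the expectation is that the next
# restic prune run then reclaims the space in the bucket.
velero backup delete <partiallyfailed-backup> --confirm

The ResticRepository and the disk usage on the MinIO server: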

# kubectl -n velero describe resticrepository testmypv-large-ns-default-pb6mq
Name:         testmypv-large-ns-default-pb6mq
Namespace:    velero
Labels:       velero.io/storage-location=default
              velero.io/volume-namespace=testmypv-large-ns
Annotations:  <none>
API Version:  velero.io/v1
Kind:         ResticRepository
Metadata:
  Creation Timestamp:  2023-03-02T07:50:13Z
  Generate Name:       testmypv-large-ns-default-
  Generation:          4
  Managed Fields:
    API Version:  velero.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/storage-location:
          f:velero.io/volume-namespace:
      f:spec:
        .:
        f:backupStorageLocation:
        f:maintenanceFrequency:
        f:resticIdentifier:
        f:volumeNamespace:
      f:status:
        .:
        f:lastMaintenanceTime:
        f:phase:
    Manager:         velero-server
    Operation:       Update
    Time:            2023-03-02T07:50:15Z
  Resource Version:  5610009
  UID:               db87d044-2d57-43e6-9763-376abb727c4d
Spec:
  Backup Storage Location:  default
  Maintenance Frequency:    30m0s
  Restic Identifier:        s3:https://***:9000/restic7.xkou/restic/testmypv-large-ns
  Volume Namespace:         testmypv-large-ns
Status:
  Last Maintenance Time:  2023-03-02T08:23:39Z
  Phase:                  Ready
Events:                   <none>
root@minio:/minio/data# du -h -d 1
16K	./lost+found
2.9M	./.minio.sys
4.0K	./second-bucket.xkou
4.0K	./first-bucket.xkou
28K	./restic2.xkou
4.1G	./restic7.xkou          <----- after 1 hour it is still 4.1G; expected to be 4.0K as an empty dir.
4.1G	.

What did you expect to happen:
After running velero backup delete <partiallyfailed backup>, the disk space used by this backup should be released within half an hour.

The following information will help us better understand what's going on:
One weird thing is that deleting a Completed backup releases the disk space used in object storage, but for a PartiallyFailed backup the disk space is not released.

BTW, there is only one backup in this namespace and this object storage bucket, and no incremental backups, so deleting the backup should release all disk usage in this bucket.
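
For completeness, a rough sketch of how to check the restic repository directly (the credentials, host, and repository password below are placeholders; the repository URL comes from the Restic Identifier in the ResticRepository CR, and the password is whatever Velero's restic credentials secret holds):

# Placeholders -- fill in from the BackupStorageLocation credentials and the ResticRepository CR.
export AWS_ACCESS_KEY_ID=<minio-access-key>
export AWS_SECRET_ACCESS_KEY=<minio-secret-key>
export RESTIC_REPOSITORY='s3:https://<minio-host>:9000/restic7.xkou/restic/testmypv-large-ns'
export RESTIC_PASSWORD=<velero-restic-repository-password>

# If restic saved a snapshot that Velero never recorded, it will still be listed here
# even after velero backup delete completes, and the repo size will reflect its data.
restic snapshots
restic stats --mode raw-data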

Anything else you would like to add:
Velero logs while running velero backup delete <partiallyfailed backup>:

time="2023-03-02T01:50:58Z" level=info msg="Starting to check for items in namespace" logSource="internal/delete/delete_item_action_handler.go:95" namespace=testmypv-large-ns
time="2023-03-02T01:50:58Z" level=info msg="invoking DeleteItemAction plugins" item=pvc-test1 logSource="internal/delete/delete_item_action_handler.go:111" namespace=testmypv-large-ns
time="2023-03-02T01:50:58Z" level=info msg="Skipping PVCDeleteItemAction for PVC testmypv-large-ns/pvc-test1, PVC does not have a vSphere BackupItemAction snapshot." backup=br-e2e-wc1-backup cmd=/plugins/velero-plugin-for-vsphere controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="/go/src/github.com/vmware-tanzu/velero-plugin-for-vsphere/pkg/plugin/delete_pvc_action_plugin.go:57" pluginName=velero-plugin-for-vsphere
time="2023-03-02T01:50:58Z" level=info msg="Removing PV snapshots" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:278"
time="2023-03-02T01:50:58Z" level=info msg="Removing restic snapshots" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:303"
time="2023-03-02T01:50:58Z" level=info msg="Removing backup from backup storage" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:311"
time="2023-03-02T01:50:58Z" level=info msg="Removing restores" backup=br-e2e-wc1-backup controller=backup-deletion deletebackuprequest=velero/br-e2e-wc1-backup-lzpck logSource="pkg/controller/backup_deletion_controller.go:317"

Environment:

  • Velero version (use velero version): 1.9.5
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li (Contributor) commented:

@jhuisss Please also share the podVolumeBackup CRs included in the partially failed backup.

Lyndon-Li self-assigned this Mar 6, 2023
Lyndon-Li added the Restic and Needs investigation labels Mar 6, 2023
jhuisss (Author) commented Mar 6, 2023

Hi, I've now attached the log bundle for the PartiallyFailed backup and the podVolumeBackup CR YAML file.
bundle-2023-03-06-07-30-48.tar.gz
podvolumebackup-br-e2e-wc1-backup-mts58.yaml.txt

Lyndon-Li (Contributor) commented:

This looks like a bug in how a restic snapshot error is handled:

  • When restic backup encounters this kind of error, it just prints the error to the terminal
  • However, restic doesn't stop halfway; instead, it continues the backup
  • Finally, restic saves the snapshot into the backup store

However, on the Velero side:

  • When the restic backup command returns, Velero checks the output from restic
  • When Velero sees the error, it treats it as a fatal error and aborts
  • Therefore the restic snapshot is not retrieved and recorded in the PVB
  • As a result, the snapshot and its data are lost forever, orphaned in the repository where Velero can never prune them (a possible manual cleanup is sketched below)
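
If that is the case, a possible manual workaround (just a sketch, using the restic CLI directly against the repository, with RESTIC_REPOSITORY/RESTIC_PASSWORD set as in the earlier sketch) is to forget the orphaned snapshot yourself, since no PodVolumeBackup references it and Velero will therefore never issue restic forget for it when the backup is deleted:

# List snapshots and identify the one left behind by the PartiallyFailed backup.
restic snapshots
# Drop the orphaned snapshot and reclaim the space in the bucket.
restic forget <orphaned-snapshot-id>
restic prune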

Lyndon-Li (Contributor) commented:

Steps to reproduce:

  • Deploy a pod which has a PV mounted
  • Log in to the pod and cd to the mounted volume
  • Generate 5 files, each with 10G of data: dd if=/dev/urandom bs=1024M count=10 iflag=fullblock of=tmpfilea
  • Create a Velero backup with parameters that include the namespace of the above pod
  • During the backup, e.g. 20s after it starts, log in to the pod again and cd to the mounted volume
  • Delete 3 of the files
  • Wait for the Velero backup to finish; it should end as PartiallyFailed (a rough command sketch follows below)
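
Roughly, as commands (the backup name, namespace, and file names are placeholders):

# Inside the pod, in the mounted volume: generate 5 files of 10G each.
for i in 1 2 3 4 5; do
  dd if=/dev/urandom bs=1024M count=10 iflag=fullblock of=tmpfile$i
done

# From a workstation: back up the pod's namespace with restic.
velero backup create repro-backup \
  --include-namespaces <pod-namespace> \
  --default-volumes-to-restic

# ~20s after the backup starts, exec into the pod again and delete 3 of the files
# (e.g. rm tmpfile1 tmpfile2 tmpfile3), then wait for the backup to finish:
velero backup describe repro-backup   # phase should end up PartiallyFailed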

Lyndon-Li (Contributor) commented:

This problem also happens with the Kopia uploader.

Lyndon-Li (Contributor) commented:

After double-checking the code and testing, the Kopia path doesn't have this problem: the snapshot for the Kopia path is saved by Velero, and if the Kopia uploader reports any error, Velero aborts saving the snapshot, so the backed-up data will be removed by Kopia GC some time later.

Since the Restic path will be suppressed in v1.12 and later, this issue is not high priority.
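
For anyone who wants to verify this on the Kopia path, a small sketch (assuming Velero >= 1.10, where the ResticRepository CRD is replaced by BackupRepository): the repository maintenance that performs this garbage collection is tracked on the BackupRepository CR, and lastMaintenanceTime shows when unreferenced data was last cleaned up.

# Check when repository maintenance (which removes unreferenced data) last ran.
kubectl -n velero get backuprepositories.velero.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.lastMaintenanceTime}{"\n"}{end}'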

Lyndon-Li removed the Kopia label May 23, 2023
sseago (Collaborator) commented May 23, 2023

@Lyndon-Li note that restic will still be available as a non-default option in 1.12 and, assuming we deprecate it in 1.12, until 1.14, assuming the proposed deprecation policy is the one we end up with -- i.e. the earliest we can remove a feature is two releases post-deprecation. This assumes that we deprecate restic in the same release we make it non-default. If we deprecate later, then those numbers shift as well.

Lyndon-Li (Contributor) commented:

@sseago
Yes, you are right. Here, by "suppress", I meant we will make the Restic path non-default in v1.12.

ctrought commented:

Also experiencing this with volume backups using CSI snapshots and the data mover: after the DeleteBackupRequest completes, the objects for the volume backup remain in the S3 bucket.
