VolumeSnapshots not cleaned up after backup is completed #7556

Closed

abh opened this issue Mar 25, 2024 · 12 comments

abh commented Mar 25, 2024

With snapshotMoveData: true, Velero seems to leave VolumeSnapshots behind after the backup has completed.

I'm not sure if this is a bug or a deliberate feature, but it makes the feature unusable in my environment (backing up Ceph volumes to an S3 store).
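
For anyone checking the same thing, a quick way to list the snapshots left behind (a sketch; it assumes the velero.io/backup-name label that Velero's CSI plugin puts on the VolumeSnapshots it creates):

# list VolumeSnapshots carrying a Velero backup label, across all namespaces
kubectl get volumesnapshots -A -l velero.io/backup-name --show-labels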

This has some overlap with #7550.

As best I can tell, this schedule snapshots every volume in the chosen namespaces, but only volumes associated with pods carrying the backup.velero.io/backup-volumes annotation are backed up and (more importantly for me) deleted afterwards. All the snapshots that were not backed up are left behind. (Some of our volumes have very high data churn, so the Ceph cluster fills up.)

I'd expect either every PVC to be backed up by the volume snapshot + mover feature (or an option to annotate or label the PVCs that should be), or only the ones otherwise annotated -- but most importantly, I'd expect every snapshot that Velero creates when using the snapshotMoveData feature to be cleaned up when the backup is done.
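
(For context, the opt-in annotation mentioned above is set on the pod and lists the volume names to include -- a sketch using a hypothetical pod mail-0 and volume data in one of the namespaces from the schedule:)

# hypothetical pod/volume names; the annotation is Velero's fs-backup opt-in
kubectl -n mailserver annotate pod/mail-0 backup.velero.io/backup-volumes=data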

Debug bundle attached below.

apiVersion: velero.io/v1
kind: Schedule
metadata:
  creationTimestamp: "2024-03-23T00:21:05Z"
  generation: 7
  name: test-mover
  namespace: velero
  resourceVersion: "1738203723"
  uid: c84d9f1a-5f55-4ec4-9bd8-e78b9409f200
spec:
  schedule: '@every 168h'
  skipImmediately: false
  template:
    csiSnapshotTimeout: 10m
    hooks: {}
    includedNamespaces:
    - ask
    - mailserver
    - postgres
    - mirrors
    itemOperationTimeout: 8h
    metadata: {}
    snapshotMoveData: true
    ttl: 720h
    uploaderConfig:
      parallelFilesUpload: 10
  useOwnerReferencesInBackup: false
status:
  phase: Enabled

bundle-2024-03-20-17-52-52.tar.gz

blackpiglet (Contributor) commented

@abh
No, leaving VolumeSnapshots behind is not the expected behavior of the snapshot data mover.
I just checked the debug bundle, but I couldn't find a backup generated from the test-mover schedule, nor the test-mover
schedule itself. None of the backups in the bundle enable the snapshotMoveData option.
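
A quick way to check which backups have the option set (a sketch; adjust the namespace to your install):

# print each backup's name alongside its spec.snapshotMoveData value
kubectl -n velero get backups.velero.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.snapshotMoveData}{"\n"}{end}'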

Lyndon-Li (Contributor) commented

From the description ("only volumes associated with pods annotated with the backup.velero.io/backup-volumes annotation are backed up"), it looks like you are using the wrong backup type -- fs-backup instead of a data mover backup.
The log shows the same: there is one fs-backup and no data mover backup.

Therefore, please double-check the prerequisites for data mover backups (a sketch for verifying them follows below):

1. The CSI plugin is installed.
2. The EnableCSI feature gate is enabled.
3. snapshotMoveData is set and default-volumes-to-fs-backup is not set.
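
Assuming the default namespace/deployment names from velero install ("dm-test" is just a placeholder backup name):

# 1. the CSI plugin should appear among the init container images
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.initContainers[*].image}'

# 2. EnableCSI should appear in the server's --features argument
kubectl -n velero get deploy velero -o jsonpath='{.spec.template.spec.containers[0].args}'

# 3. an ad-hoc backup with data movement enabled
velero backup create dm-test --include-namespaces ask --snapshot-move-data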

abh (Author) commented Mar 25, 2024

Thanks @Lyndon-Li & @blackpiglet! I must have specified the wrong --backup when running velero debug, sorry!

This bundle has more data, including the backups from test-mover, which should be configured as you described, @Lyndon-Li.

bundle-2024-03-25-00-14-40.tar.gz

Lyndon-Li (Contributor) commented

This log bundle still doesn't have the data mover backups.
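
A data mover backup would also generate DataUpload CRs in the Velero namespace; a quick check (a sketch):

# data mover backups create one DataUpload per moved volume
kubectl -n velero get datauploads.velero.io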

abh (Author) commented Mar 25, 2024

I thought these were it? From backups-202403250014.5897.json in the bundle:

{
  "apiVersion": "velero.io/v1",
  "kind": "Backup",
  "metadata": {
    "annotations": {
      "velero.io/resource-timeout": "10m0s",
      "velero.io/source-cluster-k8s-gitversion": "v1.27.10",
      "velero.io/source-cluster-k8s-major-version": "1",
      "velero.io/source-cluster-k8s-minor-version": "27"
    },
    "creationTimestamp": "2024-03-24T04:33:33Z",
    "generation": 212,
    "labels": {
      "velero.io/schedule-name": "test-mover",
      "velero.io/storage-location": "default"
    },
    "managedFields": [
      {
        "apiVersion": "velero.io/v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:labels": {
              ".": {},
              "f:velero.io/schedule-name": {}
            }
          },
          "f:spec": {
            ".": {},
            "f:csiSnapshotTimeout": {},
            "f:hooks": {},
            "f:includedNamespaces": {},
            "f:itemOperationTimeout": {},
            "f:metadata": {},
            "f:snapshotMoveData": {},
            "f:ttl": {},
            "f:uploaderConfig": {
              ".": {},
              "f:parallelFilesUpload": {}
            }
          },
          "f:status": {}
        },
        "manager": "velero",
        "operation": "Update",
        "time": "2024-03-24T04:33:33Z"
      },
      {
        "apiVersion": "velero.io/v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:annotations": {
              ".": {},
              "f:velero.io/resource-timeout": {},
              "f:velero.io/source-cluster-k8s-gitversion": {},
              "f:velero.io/source-cluster-k8s-major-version": {},
              "f:velero.io/source-cluster-k8s-minor-version": {}
            },
            "f:labels": {
              "f:velero.io/storage-location": {}
            }
          },
          "f:spec": {
            "f:defaultVolumesToFsBackup": {},
            "f:storageLocation": {}
          },
          "f:status": {
            "f:completionTimestamp": {},
            "f:expiration": {},
            "f:formatVersion": {},
            "f:hookStatus": {},
            "f:phase": {},
            "f:progress": {
              ".": {},
              "f:itemsBackedUp": {},
              "f:totalItems": {}
            },
            "f:startTimestamp": {},
            "f:version": {}
          }
        },
        "manager": "velero-server",
        "operation": "Update",
        "time": "2024-03-24T04:53:44Z"
      }
    ],
    "name": "test-mover-20240324043333",
    "namespace": "velero",
    "resourceVersion": "1737365217",
    "uid": "e3bb47a2-9afa-4abd-9d99-52b4adbad13f"
  },
  "spec": {
    "csiSnapshotTimeout": "10m0s",
    "defaultVolumesToFsBackup": false,
    "hooks": {},
    "includedNamespaces": [
      "*"
    ],
    "itemOperationTimeout": "3m0s",
    "metadata": {},
    "snapshotMoveData": true,
    "storageLocation": "default",
    "ttl": "720h0m0s",
    "uploaderConfig": {
      "parallelFilesUpload": 10
    }
  },
  "status": {
    "completionTimestamp": "2024-03-24T04:53:44Z",
    "expiration": "2024-04-23T04:33:33Z",
    "formatVersion": "1.1.0",
    "hookStatus": {},
    "phase": "Completed",
    "progress": {
      "itemsBackedUp": 18752,
      "totalItems": 18752
    },
    "startTimestamp": "2024-03-24T04:33:33Z",
    "version": 1
  }
}

and backup_describe_test-mover-20240324192457.txt:

Name:         test-mover-20240324192457
Namespace:    velero
Labels:       velero.io/schedule-name=test-mover
              velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.27.10
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=27

Phase:  Completed

Uploader config:
  Parallel files upload:  10


Namespaces:
  Included:  ask, ntppool, ntpbeta, ntpdb, askntp, mailserver, geodns-data, postgres, mirrors, ntpvault, spamsources, robert
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          true
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  8h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-03-24 12:24:57 -0700 PDT
Completed:  2024-03-24 12:37:42 -0700 PDT

Expiration:  2024-04-23 12:24:57 -0700 PDT

Total items to be backed up:  4983
Items backed up:              4983

Resource List:  <error getting backup resource list: Get "https://kube-backup-store.tailscale.svc.cluster.local:9000/velero/backups/test-mover-20240324192457/test-mover-20240324192457-resource-list.json.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=G7sHZ6yEbvG0ZUBLGNAe%2F20240325%2Fminio%2Fs3%2Faws4_request&X-Amz-Date=20240325T071802Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=0e8c9d0ed3aa25c4323d87a507a81659ccb351531663abcc49d02f2f8c76863a": tls: failed to verify certificate: x509: certificate is valid for kube-backup-store.ntp.ts.net, not kube-backup-store.tailscale.svc.cluster.local>

Backup Volumes:
  <error getting backup volume info: Get "https://kube-backup-store.tailscale.svc.cluster.local:9000/velero/backups/test-mover-20240324192457/test-mover-20240324192457-volumeinfo.json.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=G7sHZ6yEbvG0ZUBLGNAe%2F20240325%2Fminio%2Fs3%2Faws4_request&X-Amz-Date=20240325T071803Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=925f255464fb2d8e0a04f1c1971aab08740ea6b05aafa916fff6dd5f3acadf01": tls: failed to verify certificate: x509: certificate is valid for kube-backup-store.ntp.ts.net, not kube-backup-store.tailscale.svc.cluster.local>

HooksAttempted:  0
HooksFailed:     0

blackpiglet (Contributor) commented

Indeed, the backup enables the snapshot data move feature, but I still didn't find any snapshot-data-move-related logs.
Could you post the Velero installation command here?
We need to find out whether the Velero environment was set up correctly for the data mover feature.

abh (Author) commented Mar 27, 2024

This is the installation command I used (plus editing the deployment and daemonset to increase the memory limits):

velero install \
     --provider aws \
     --features=EnableCSI \
     --plugins velero/velero-plugin-for-aws:v1.9.1,velero/velero-plugin-for-csi:v0.3.0 \
     --bucket velero \
     --secret-file ./credentials-velero \
     --use-volume-snapshots=false \
     --use-node-agent \
     --privileged-node-agent \
     --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=https://kube-backup-store.tailscale.svc.cluster.local:9000

blackpiglet (Contributor) commented

Please use a newer version of the Velero CSI plugin.
For Velero v1.13.x, the compatible CSI plugin is velero/velero-plugin-for-csi:v0.7.0.
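
If you'd rather not re-run velero install, a sketch of swapping the plugin image in place (verify the registered plugin name in your deployment first; it is usually derived from the image name):

# remove the old plugin init container, then add the compatible one
velero plugin remove velero-plugin-for-csi
velero plugin add velero/velero-plugin-for-csi:v0.7.0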

abh (Author) commented Mar 27, 2024

Oh! Thank you; I will upgrade.

v0.3.0 is documented as the version to use for 1.13 at https://velero.io/docs/v1.13/csi/

[Screenshot 2024-03-26 at 19 37 50: the v1.13 CSI docs page listing velero-plugin-for-csi v0.3.0]

blackpiglet (Contributor) commented

Thanks for the feedback. I will update the document.

abh (Author) commented Mar 27, 2024

Upgrading to 0.7.0 made the snapshots get deleted as expected. 🎉🥳 Thank you so much for the prompt assistance.

@blackpiglet I'll leave it to you whether to close this issue or keep it open, perhaps to add a feature where the plugin or Velero checks that the other component is within an expected version range. (A manual version of that check is sketched below.)
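
Until such a check exists, a manual way to see the installed plugin images next to the server image (a sketch, assuming the deployment layout from velero install):

# server image, then each plugin init container's name and image
kubectl -n velero get deploy velero \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{range .spec.template.spec.initContainers[*]}{.name}{"\t"}{.image}{"\n"}{end}'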

I now have new feature requests around how the PVCs are chosen to be snapshotted and moved, but that's a separate issue. :-)

blackpiglet (Contributor) commented

Closing this issue for now.
Please check whether this in-progress design addresses your needs: #6956
