
Velero crashes with an "invalid memory address or nil pointer dereference" #8440

Closed
sentia-be-ops opened this issue Nov 21, 2024 · 4 comments
Assignees
Labels
Bug target/1.15.1 Volumes Relating to volume backup and restore
Milestone

Comments

@sentia-be-ops

sentia-be-ops commented Nov 21, 2024

What steps did you take and what happened:
Installed velero v1.15.0 using the helm chart v8.0.0 on a k8s v1.30 cluster with vmware-csi based PVs.

Velero starts correctly after install, but as soon as the first backup runs, the velero pod goes into a CrashLoopBackOff state.
The error shown in the log mentions an "invalid memory address or nil pointer dereference".

I initially installed with the previous helm chart and figured the problem might be fixed by chart v8.0.0, as that includes new CRD fields, but it made no difference.

What did you expect to happen:
I expected Velero not to crash and to back up my cluster.

The following information will help us better understand what's going on:

Logs

Environment:

  • Velero version: v1.15.0
  • Velero features: EnableCSI
  • Kubernetes version: v1.30.5
  • Kubernetes installer & version: ClusterAPI v1.8.4
  • Cloud provider or hardware configuration: vsphere-cpi
  • OS: Flatcar Container Linux by Kinvolk 3975.2.2 (Oklo)

Other Info
Backup repository is backed by a NetApp based S3 bucket.

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@blackpiglet blackpiglet self-assigned this Nov 22, 2024
@blackpiglet blackpiglet added Volumes Relating to volume backup and restore Needs info Waiting for information Needs triage We need discussion to understand problem and decide the priority labels Nov 22, 2024
@blackpiglet
Contributor

volumeInfos[index].SnapshotDataMovementInfo.SnapshotHandle = dataUpload.Status.SnapshotID

The code that triggers the panic is clear, but it is still important to find out how this happens.

The panic only happens when the DataUpload CR's related VolumeInfo doesn't have the SnapshotDataMovementInfo section.
The Velero backup creates a VolumeInfo metadata file in the Object Storage bucket.
[Screenshot (2024-11-22): VolumeInfo metadata file in the object storage bucket]
Could you help check the content of your failed backup's VolumeInfo metadata file?
The name should be something like this: backup-name-volumeinfo.json.gz.
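In Go, writing through a nil struct pointer such as `SnapshotDataMovementInfo` panics with exactly this "invalid memory address or nil pointer dereference" message. A minimal sketch of the guard the fix needs, using simplified, hypothetical stand-ins for Velero's types (the real definitions live in the velero repo):

```go
package main

import "fmt"

// Hypothetical, simplified shapes of the volume metadata types,
// for illustration only.
type SnapshotDataMovementInfo struct {
	SnapshotHandle string
}

type VolumeInfo struct {
	PVName                   string
	Skipped                  bool
	SnapshotDataMovementInfo *SnapshotDataMovementInfo // nil for skipped volumes
}

// patchSnapshotHandle mimics the guarded update: entries whose
// SnapshotDataMovementInfo is nil are skipped instead of dereferenced.
func patchSnapshotHandle(infos []*VolumeInfo, snapshotID string) {
	for _, info := range infos {
		if info == nil || info.SnapshotDataMovementInfo == nil {
			continue // skipped / non-CSI PV: nothing to patch
		}
		info.SnapshotDataMovementInfo.SnapshotHandle = snapshotID
	}
}

func main() {
	infos := []*VolumeInfo{
		{PVName: "csi-pv", SnapshotDataMovementInfo: &SnapshotDataMovementInfo{}},
		{PVName: "legacy-pv", Skipped: true}, // no SnapshotDataMovementInfo
	}
	patchSnapshotHandle(infos, "snap-123")
	fmt.Println(infos[0].SnapshotDataMovementInfo.SnapshotHandle)
	fmt.Println(infos[1].SnapshotDataMovementInfo == nil)
}
```

Without the nil check in the loop, the second entry would reproduce the reported panic.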

@sentia-be-ops
Author

I've added the volumeinfo.json to the gist.

Could the issue be caused by a mix of CSI-based and non-CSI-based PVs on the cluster? The PVs Velero can't back up for that reason have skipped: true and no SnapshotDataMovementInfo section.

The goal is to get rid of the non-CSI-based PVs, but that is a work in progress. In the meantime Velero can't back up the data from those PVs, which is fine, though I don't think it should fail completely because of them.
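For illustration, the difference between the two kinds of entries in the VolumeInfo metadata might look roughly like this (field names here are assumptions for the sketch, not copied from an actual file):

```json
[
  {
    "pvName": "csi-pv-1",
    "skipped": false,
    "snapshotDataMovementInfo": { "snapshotHandle": "snap-abc" }
  },
  {
    "pvName": "legacy-pv-1",
    "skipped": true
  }
]
```

The second entry carries no snapshotDataMovementInfo at all, so code that unconditionally dereferences that section would panic on it.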

@blackpiglet
Contributor

Thanks for the information.
It's a reasonable request. I will create a PR to fix it, and the fix should be included in the coming v1.15.1 patch release.

@reasonerjt reasonerjt removed Needs info Waiting for information Needs triage We need discussion to understand problem and decide the priority labels Nov 25, 2024
@reasonerjt reasonerjt added this to the v1.16 milestone Nov 25, 2024
@reasonerjt reasonerjt added the Bug label Nov 25, 2024
@blackpiglet
Contributor

Hi @sentia-be-ops,
I created PR #8465 to fix this issue, but after thinking it over, I still couldn't figure out a scenario that could trigger this error.
I built an image based on PR #8465, and the image address is blackpiglet/velero:8440. It's a public image. You should have permission to access it.
Could you help verify whether this PR can fix your issue?
