Backups stuck Deleting because of "filename too long" #8434

Open
ameer2rock opened this issue Nov 20, 2024 · 18 comments · May be fixed by #8449

@ameer2rock

What steps did you take and what happened:
Backups fail to delete from the cluster (they get stuck in the Deleting state) and have to be removed manually from S3 and with kubectl delete backups.velero.io <backup-name>

What did you expect to happen:
Backup maintenance to occur without backups getting stuck.

The following information will help us better understand what's going on:
I cannot attach the debug logs due to privacy concerns, but the issue is easily reproducible by backing up a filename longer than 255 characters. The backup succeeds, but it will not delete. I assume a restore would fail in a similar fashion. With debug logging enabled, the following error appears in the log while deleting:

error invoking delete item actions: error extracting backup: open file name too long

Anything else you would like to add:
It appears that the container image is based on Ubuntu 22.04, which has a 255-character filename limit that gets triggered when Velero tries to unpack the backup bundle.

The error is coming from here:
https://github.com/vmware-tanzu/velero/blob/main/internal/delete/delete_item_action_handler.go
And from the os package here:
https://github.com/golang/go/blob/a3c068c57ae3f71a7720fe68da379143bb579362/src/os/getwd.go#L57
Ubuntu 22.04 ext4 filename limit
https://help.ubuntu.com/stable/ubuntu-help/files-rename.html.he#:~:text=This%20255%20character%20limit%20includes,and%20folder%20names%20where%20possible.

Is it possible to use a container image that supports longer filenames? If not, would it be possible to skip files with filenames longer than 255 characters (logging an error for them) so they don't cause the issue I describe?
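
A minimal sketch of what such a skip check could look like (the helper name is hypothetical, not part of Velero):

// Hypothetical pre-flight check (not part of Velero): skip items whose
// generated file name would exceed the 255-byte NAME_MAX limit instead of
// letting extraction fail later.
package main

import "fmt"

const nameMax = 255

// fitsNameMax reports whether a generated backup file name stays within the
// per-component filesystem limit.
func fitsNameMax(fileName string) bool {
	return len(fileName) <= nameMax
}

func main() {
	name := "some-resource-name.json"
	if !fitsNameMax(name) {
		fmt.Printf("skipping %q: file name exceeds %d bytes\n", name, nameMax)
		return
	}
	fmt.Println("ok to archive:", name)
}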

Environment:

  • Velero version (use velero version): v1.13.2
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): 1.25.13
  • Kubernetes installer & version: kubeadm + Spectrocloud
  • Cloud provider or hardware configuration: VMWare ESX
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.3 LTS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@kaovilai
Member

❯ docker run --rm -it ubuntu:22.04 sh 
Unable to find image 'ubuntu:22.04' locally
a186900671ab: Download complete 
981912c48e9a: Download complete 
# printf '%255s\n' | tr ' ' 'a'
# touch $(printf '%255s\n' | tr ' ' 'a').txt
touch: cannot touch 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.txt': File name too long

It appears that the container image is based on Ubuntu 22.04

Who decided to use Ubuntu 22.04? Is it Velero, or your workload?

@blackpiglet
Contributor

Thanks for the detailed analysis.
It's valuable to address this corner case.
I still don't quite understand: did the backup deletion fail because of the backup metadata file name, or because of a data file name in the backed-up volumes?

If the reason is that the backup metadata file name is too long, is it possible to shorten the backup name to resolve the issue?

@blackpiglet
Contributor

@kaovilai
The Velero image is based on the Ubuntu Jammy version.

FROM paketobuildpacks/run-jammy-tiny:0.2.19

@kaovilai
Member

❯ docker run --rm -it ubuntu:22.04 sh -c "touch $(printf '%255s\n' | tr ' ' 'a').txt"
touch: cannot touch 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.txt': File name too long

~
❯ docker run --rm -it fedora sh -c "touch $(printf '%255s\n' | tr ' ' 'a').txt"
Unable to find image 'fedora:latest' locally
e3c408ecb0c2: Download complete 
28e90bb0aca9: Download complete 
touch: cannot touch 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.txt': File name too long

~
❯ docker run --rm -it alpine sh -c "touch $(printf '%255s\n' | tr ' ' 'a').txt"
Unable to find image 'alpine:latest' locally
9986a736f7d3: Download complete 
511a44083d3a: Download complete 
touch: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.txt: Filename too long

I think 255 is a common limit.
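
A minimal Go sketch (assuming ext4/tmpfs defaults) suggesting the 255-byte limit applies to each path component (NAME_MAX), not to the full path:

// Sketch (not Velero code): the 255-byte limit is per path component, so a
// short extraction directory does not help if a single file name is too long.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dir, err := os.MkdirTemp("", "")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// A 256-byte component fails no matter how short the parent path is.
	if err := os.WriteFile(filepath.Join(dir, strings.Repeat("a", 256)), nil, 0o600); err != nil {
		fmt.Println("long component:", err) // "file name too long" on ext4/tmpfs
	}

	// A total path far longer than 255 bytes is fine if each component is short.
	deep := dir
	for i := 0; i < 40; i++ {
		deep = filepath.Join(deep, "component")
	}
	if err := os.MkdirAll(deep, 0o755); err != nil {
		fmt.Println("deep path:", err)
	} else {
		fmt.Println("deep path ok, total length:", len(deep))
	}
}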

@kaovilai
Member

The goal is to make sure this line stays under 255 characters, because it is the path+filename that is going to be opened.

target := filepath.Join(dir, header.Name) //nolint:gosec // Internal usage. No need to check.

A possible workaround could be to not write directly to this filename, but instead to generate a UUID short enough that path+UUID fits within 255 characters. We would then look up the full file name in a map (if the file name matters at all).

If the file name does not matter, because it's read in and then thrown away, we could just write to files named 1, 2, 3, ...

The resulting dir with filenames is read here.

backupResources, err := archive.NewParser(ctx.Log, ctx.Filesystem).Parse(dir)

At first glance, the Parse func does not appear to care about file names, just dir names.

		for _, namespaceDir := range namespaceDirs {
			if !namespaceDir.IsDir() {
				p.log.Warnf("Ignoring unexpected file %q in directory %q", namespaceDir.Name(), strings.TrimPrefix(namespaceScopedDir, dir+"/"))
				continue
			}
			items, err := p.getResourceItemsForScope(filepath.Join(namespaceScopedDir, namespaceDir.Name()), dir)
			if err != nil {
				return nil, err
			}
			if len(items) > 0 {
				resourceItems.ItemsByNamespace[namespaceDir.Name()] = items
			}
		}
	}
	resources[resourceDir.Name()] = resourceItems
}
return resources, nil

Ok, but looking further... the file name is used here:

items = append(items, strings.TrimSuffix(file.Name(), ".json"))

Depending on how common this is, we could certainly do something about it:

  • use shorter file names, moving the long names into an index.json or similar file-name mapping mechanism (a rough sketch of this idea follows below).
  • use a tempdir with a shorter path.
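
A rough sketch of the first option, with hypothetical names (the counter-based naming and index map are illustrative, not Velero's actual implementation):

// Rough sketch of the short-name mapping idea; extractWithShortNames is a
// hypothetical helper, not Velero's actual code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// nameIndex maps the short on-disk name back to the original (possibly
// over-long) name from the tar header.
type nameIndex map[string]string

func extractWithShortNames(dir string, originalNames []string) (nameIndex, error) {
	idx := make(nameIndex)
	for i, original := range originalNames {
		short := fmt.Sprintf("%d.json", i) // always well under NAME_MAX
		idx[short] = original
		// The real extractor would copy the tar entry's contents here; an
		// empty file stands in for that in this sketch.
		if err := os.WriteFile(filepath.Join(dir, short), nil, 0o600); err != nil {
			return nil, err
		}
	}
	return idx, nil
}

func main() {
	dir, err := os.MkdirTemp("", "")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	longName := strings.Repeat("a", 254) + ".json" // would exceed NAME_MAX on disk
	idx, err := extractWithShortNames(dir, []string{longName})
	if err != nil {
		panic(err)
	}
	fmt.Println("stored as 0.json, original length:", len(idx["0.json"]))
}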

@ameer2rock
Author

Thanks for the detailed analysis. It's valuable to address this corner case. I still don't quite understand: did the backup deletion fail because of the backup metadata file name, or because of a data file name in the backed-up volumes?

If the reason is that the backup metadata file name is too long, is it possible to shorten the backup name to resolve the issue?

It is the length of the filename IN the backup, not the length of the backup name itself. I will communicate with the customer to see if they can reduce the length of the names, and to skip those workloads so the rest can be backed up normally and the retention deletes succeed.

Seems like, for the long term, it may be worthwhile not to add files with over-long names to the tarball, to prevent issues with restores and retention deletes. Or perhaps change the container image to one with a higher filename length limit. It is definitely interesting that the container image of the application we are backing up can itself support the longer filenames, but the Velero pod cannot untar the long file.

@kaovilai
Member

container image of the application we are backing up itself

enlighten us?

@ameer2rock
Author

container image of the application we are backing up itself

enlighten us?

I stand corrected... The container being backed up is RHEL 8.6 and has the same filename length limit. These files are part of Kafka topics, so I think they must be less than 255 characters, but since the Velero pod adds a temp directory path, it exceeds the limit when trying to untar.

@kaovilai
Member

@ameer2rock how far are you beyond 255? I.e., is this something where Velero untarring into a very short mountpoint name would help?

@kaovilai
Member

The tempdir name came from

dir, err := e.fs.TempDir("", "")

which calls

return os.MkdirTemp(dir, prefix)

An example run produces /tmp/2614631023; the Go playground gives a similar result.
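
A quick standalone check of what that produces (run outside Velero):

// Standalone check of os.MkdirTemp("", ""): with empty arguments it falls
// back to os.TempDir() plus a short random name, e.g. /tmp/2614631023.
package main

import (
	"fmt"
	"os"
)

func main() {
	dir, err := os.MkdirTemp("", "")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	fmt.Println(dir, "length:", len(dir)) // e.g. /tmp/2614631023 length: 15
}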

@ameer2rock
Author

@ameer2rock how far are you beyond 255? ie. is this something where velero untarring into a very short mountpoint name would help?

Looks like the tmp path is 80 characters, and the filename is 259. I am going to check with the app team, as it's possible that the file is problematic within the container itself. It may also be a good idea to use a container filesystem that can handle longer filenames to prevent this sort of thing.

This is the path:
/tmp/3555729486/resources/kafkatopics.kafka.strimzi.io/namespaces/xxxx-preprod/

This is the filename:
api-xx-xxxx102.xxx.x-xxxxxx.com.api-xx-xxxx102.xxx.x-xxxxxx.com.api-xx-xxxx102.xxx.x-xxxxxx.com.api-xx-xxxx102.xxx.x-xxxxxx.com.api-xx-xxxx102.xxx.x-xxxxxx.com.api-xx-xxxx102.xxx.x-xxxxxx.com.stmig-data-transac---b5730eb1164e2e41214d1fef9b727062e8757868.json

@kaovilai kaovilai linked a pull request Nov 24, 2024 that will close this issue
@kaovilai
Member

@ameer2rock try ghcr.io/kaovilai/velero:maxpathlimits-a2699e765 from #8449 and lmk if it works.

@kaovilai
Member

kaovilai commented Dec 3, 2024

@ameer2rock any updates?

@kaovilai kaovilai added this to the v1.16 milestone Dec 4, 2024
@ameer2rock
Author

Sorry for the delay getting back to you. Working with the application owner, we found Kafka topics that were at 254 characters, basically just under the limit of their container. When Velero backs that up, it is saved as long-file-name.json, so the .json suffix is what pushes it past the 255-character limit. They are working on fixing those topics right now.
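
For concreteness, the arithmetic behind that (a toy check, not Velero code):

// Toy check of the arithmetic above: a 254-character topic name plus the
// ".json" suffix comes out to 259 bytes, past the 255-byte limit.
package main

import (
	"fmt"
	"strings"
)

func main() {
	topic := strings.Repeat("x", 254) // stand-in for the real 254-char topic name
	fmt.Println(len(topic + ".json")) // 259
}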

@shubham-pampattiwar
Collaborator

@ameer2rock please close the issue if fixing the topics works for you.

@kaovilai
Member

kaovilai commented Dec 5, 2024

Do we still want this fix? At least the PR is done and ready when this becomes needed again.

@ameer2rock
Author

We can get around this by modifying Kafka, so I did not test the fix. I still think it's valuable because the failure condition causes backups to get stuck deleting and can build up etcd objects. Closing the issue, thank you for your help.

@kaovilai
Member

kaovilai commented Dec 6, 2024

@ameer2rock thanks for confirming. We'll reopen the issue, though, to track the #8449 fix for the 1.16 release.

@kaovilai kaovilai reopened this Dec 6, 2024