Debug request #39

Open
syguan96 opened this issue Apr 3, 2024 · 4 comments

syguan96 commented Apr 3, 2024

Sorry to bother you. I was only able to download 653 videos of the test split, so I tried to debug. Have you run into this problem?

Failed to delete the file: /tmp/3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4. Error: [Errno 2] No such file or directory: '/tmp/3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4'

syguan96 commented Apr 3, 2024

I can find 3c7cd5e6-e6ff-4bc5-82ca-32208f049f25.mp4.mkv in /tmp


kuno989 commented Apr 5, 2024

I'm hitting this bug too, but I haven't found the cause yet. Haha.


Zhidong-Gao commented Apr 5, 2024

I got the same error. It is caused by a mismatch between the name of the downloaded file and the pre-defined name (yt_dlp appends an extra .mkv extension to the original name).

I solved the problem by removing the .mp4 extension from the pre-defined name and globbing for the matching files in the cache folder.

Below is my modification to dataset_dataloading/video2dataset/video2dataset/data_reader.py:

line 214
original:
video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}.mp4"
now:
video_path = f"{self.tmp_dir}/{str(uuid.uuid4())}"

lines 269-271
original:

with portalocker.Lock(modality_path, 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
os.remove(modality_path)

now:

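# Needs `import glob` if data_reader.py does not already import it.
# yt_dlp may append an extra extension, so match every file that starts with modality_path.
matching_files = glob.glob(modality_path + '*')
with portalocker.Lock(matching_files[0], 'rb', timeout=180) as locked_file:
    streams[modality] = locked_file.read()
for file in matching_files:
    os.remove(file)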

The above solution works for me, but it's not perfect. I hope the authors can fix this bug.
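
For readers who want to try the workaround outside the repository, here is a minimal standalone sketch of the same idea: look up the downloaded file by its name prefix, read it, and delete every match. The helper name `read_and_remove_by_prefix` is hypothetical, the file locking done with portalocker in the original code is left out for brevity, and the empty-match guard is an addition that is not part of the modification above.

```python
import glob
import os


def read_and_remove_by_prefix(path_prefix):
    """Read the first file whose path starts with path_prefix, then delete all matches.

    Returns the file's bytes, or None if no file matched.
    (Hypothetical helper; not part of video2dataset.)
    """
    # glob.escape protects against glob metacharacters in the prefix itself.
    matching_files = sorted(glob.glob(glob.escape(path_prefix) + "*"))
    if not matching_files:
        # Nothing was downloaded under this prefix; avoid an IndexError.
        return None
    with open(matching_files[0], "rb") as f:
        data = f.read()
    for path in matching_files:
        os.remove(path)
    return data
```

Returning None when nothing matches lets the caller decide whether to skip the sample or report an error instead of crashing on `matching_files[0]`.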

tsaishien-chen (Contributor) commented

Hi @syguan96, @kuno989, @Zhidong-Gao,
Thanks for your interest in the dataset, and thanks for reporting the bug.
I have dug into the problem and found the cause: there is no extension constraint on the audio stream.
So if the audio is not downloaded in mp4 format, the downloaded video ends up with a double extension.
To fix that, please replace the format string here:

video_format_string = (
f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}/"
f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}/"
f"bv/b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}"
)

to this one:

    video_format_string = (
        f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"bv[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}/"
        f"b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba[ext=mp4]' if self.download_audio else ''}"
    )

This should help. I'll also update the code in this repo soon.
By the way, the workaround from @Zhidong-Gao will miss lots of samples, so I strongly recommend following the steps above to fix this issue.
Please let me know if there are any problems!
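
To check what the corrected format string expands to for a given configuration, the sketch below rebuilds it as a standalone function. The function name `build_video_format_string` and its plain arguments are assumptions for illustration; in the repository the string is built inline from `self.video_size`, `self.specify_codec`, and `self.download_audio`.

```python
def build_video_format_string(video_size, specify_codec=False, download_audio=True):
    """Rebuild the corrected format string from the comment above.

    Hypothetical standalone helper; the repository builds this string inline.
    """
    codec = "[codec=avc1]" if specify_codec else ""
    # The fix: constrain the audio stream to mp4 as well.
    audio = "+ba[ext=mp4]" if download_audio else ""
    return (
        f"wv*[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"w[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"bv[ext=mp4]{codec}{audio}/"
        f"b[ext=mp4]{codec}{audio}"
    )


if __name__ == "__main__":
    print(build_video_format_string(360))
```

Printing the result for a few settings makes it easy to confirm that every fallback alternative carries the `[ext=mp4]` constraint on both the video and the audio selector.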
