Is your feature request related to a problem? Please describe.
There are multiple popular sites that we do not currently support. Creating a custom crawler for each one of them is possible, but it would require significant time and consideration. Instead, we could leverage other tools to do the heavy lifting. One of those tools is yt-dlp.
Although yt-dlp has its own "database"-like functionality to skip already downloaded URLs with the --download-archive flag, integrating it into CDL will allow the user to use CDL's deduplication functions and unify the databases in one place.
Describe the solution you'd like
We can expand the supported site list by creating a generic crawler that uses yt-dlp as a back-end downloader.
Limitations
Supported sites within CDL will be a subset of the sites supported by yt-dlp. The list of supported sites will be hardcoded into CDL.
Supported sites will be those that work out-of-the-box in yt-dlp with no configuration, e.g. no cookies needed, no geo-location bypass, no need to impersonate browser X/Y/Z, no OAuth.
yt-dlp's code is entirely synchronous, which means we will have to run it on a different thread to avoid blocking the main loop while still being able to read from its progress hooks (see the sketch after this list).
Only natively supported sites will be enabled, i.e. any site that does not require ffmpeg to download. ffmpeg is called as a subprocess, so it would block the entire thread.
CDL will have no control over the request behavior of yt-dlp. CDL options like rate limits, size limits and extension filters will not work with yt-dlp. It is possible to implement a custom filter to translate CDL options into yt-dlp options, but initially there will just be a limit on the maximum number of concurrent downloads.
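For reference, here is a minimal sketch of how the threading could look, assuming asyncio and a hypothetical `download_with_ytdlp` entry point; the concurrency cap and option values are placeholders, not existing CDL settings:

```python
import asyncio

import yt_dlp

# Hypothetical cap on concurrent yt-dlp downloads (placeholder value).
_semaphore = asyncio.Semaphore(2)


def _progress_hook(status: dict) -> None:
    # yt-dlp calls this synchronously from the worker thread;
    # status["status"] is "downloading", "finished" or "error".
    if status["status"] == "downloading":
        done = status.get("downloaded_bytes", 0)
        total = status.get("total_bytes") or status.get("total_bytes_estimate") or 0
        print(f"\r{status.get('filename', '?')}: {done}/{total} bytes", end="")
    elif status["status"] == "finished":
        print(f"\nFinished: {status.get('filename', '?')}")


def _download_sync(url: str, download_dir: str) -> None:
    # Entirely blocking; must never run on the event loop thread.
    opts = {
        "outtmpl": f"{download_dir}/%(title)s.%(ext)s",
        "progress_hooks": [_progress_hook],
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([url])


async def download_with_ytdlp(url: str, download_dir: str) -> None:
    # Push the blocking yt-dlp call onto a worker thread so the main
    # asyncio loop (and the rest of CDL) keeps running.
    async with _semaphore:
        await asyncio.to_thread(_download_sync, url, download_dir)
```

Note that the hook runs on the worker thread, so anything that needs to touch the event loop (e.g. updating a progress bar) would have to go through `loop.call_soon_threadsafe`.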
I got a working crawler for this, but integrating it into CDL would require a major refactor of the downloader and some changes in the logic up to the scraper mapper.
The probability of breaking something is really high and we don't have any tests. I don't think it's worth implementing and troubleshooting.
What I'm thinking right now instead is to add yt-dlp as an external downloader (similar to jdownloader).
This is way simpler and would require 2 changes:
Make CDL capable of reading the archive.txt file generated by yt-dlp. This is to skip URLs that have already been downloaded (see the second sketch below).
Check if a URL is supported by yt-dlp. This should be easy by just calling the YoutubeDL parser (see the first sketch below). The code is synchronous, but it's all internal Python logic, no requests are made¹. We can execute it on the same thread.
CDL would save all the supported URLs to a special file. At the end of a run, it will call yt-dlp itself to process that file, with some options equivalent to CDL's settings. The archive.txt file will be saved inside AppData/Cache.
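Here is a rough sketch of the support check. It uses yt-dlp's public extractor list; the helper name is made up, and the actual parser entry point CDL ends up calling may differ (see the footnote below):

```python
from functools import lru_cache

from yt_dlp.extractor import gen_extractors


@lru_cache(maxsize=1)
def _extractors():
    # Building the extractor list is plain Python object construction.
    return gen_extractors()


def is_supported_by_ytdlp(url: str) -> bool:
    """Return True if any non-generic yt-dlp extractor claims this URL."""
    for extractor in _extractors():
        # GenericIE matches almost anything, so it has to be excluded or
        # every unsupported link would get routed to yt-dlp.
        if extractor.IE_NAME == "generic":
            continue
        if extractor.suitable(url):
            return True
    return False
```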
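And a rough sketch of the archive/batch-file handling; the file names, the cache location and the exact flags passed to yt-dlp at the end of the run are assumptions:

```python
import subprocess
from pathlib import Path

# Assumed locations; CDL's real cache folder may differ.
CACHE_DIR = Path("AppData/Cache")
ARCHIVE_FILE = CACHE_DIR / "yt_dlp_archive.txt"
BATCH_FILE = CACHE_DIR / "yt_dlp_urls.txt"


def load_archive() -> set[str]:
    # Each line of a yt-dlp download archive is "<extractor> <video_id>".
    if not ARCHIVE_FILE.is_file():
        return set()
    return {line.strip() for line in ARCHIVE_FILE.read_text().splitlines() if line.strip()}


def queue_url(url: str) -> None:
    # Collect supported URLs during the CDL run; yt-dlp processes them afterwards.
    with BATCH_FILE.open("a", encoding="utf-8") as file:
        file.write(url + "\n")


def run_ytdlp(download_dir: Path) -> None:
    # Called once at the end of a CDL run. --download-archive makes yt-dlp
    # skip entries it has already recorded, mirroring CDL's own deduplication.
    subprocess.run(
        [
            "yt-dlp",
            "--batch-file", str(BATCH_FILE),
            "--download-archive", str(ARCHIVE_FILE),
            "--paths", str(download_dir),
        ],
        check=False,
    )
```

Note that archive entries are `<extractor> <id>` pairs rather than URLs, so mapping a queued URL back to an archive entry would still have to go through the extractor.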
Pros
This is way easier to maintain and scale up
It will allow CDL to "support" every URL that yt-dlp supports
CDL code will be completely independent of yt-dlp updates
The database schema does not need to be updated
Cons
Only some cosmetic inconveniences, like yt-dlp downloads not being included in the logs
Even though yt-dlp will run after CDL, we still need to include it as a dependency to use the parser
Footnotes
¹ I was wrong. It does make a request. We can call the parser within the generic request limiter and it should be fine.