Is your feature request related to a problem? Please describe.
There are multiple popular sites that we do not currently support. Creating a custom crawler for each one of them is possible, but it would require significant time and consideration. Instead, we could leverage other tools to do the heavy lifting. One of those tools is yt-dlp.
Although yt-dlp has its own "database"-like functionality to skip already downloaded URLs with the --download-archive flag, integrating it into CDL will allow the user to use CDL's deduplication functions and unify the databases in one place.
Describe the solution you'd like
We can expand the supported site list by creating a generic crawler that uses yt-dlp as a back-end downloader.
Limitations
Supported sites within CDL will be a subset of the sites supported by yt-dlp. The list of supported sites will be hardcoded into CDL.
Supported sites will be those that work out-of-the-box in yt-dlp with no configuration, e.g. no cookies needed, no geo-location bypass, no need to impersonate browser X/Y/Z, no OAuth.
yt-dlp's code is entirely synchronous, which means we will have to run it on a different thread to avoid blocking the main loop while still being able to read from its progress hooks (see the sketch after this list).
Only natively supported sites will be enabled, i.e. any site that does not require ffmpeg to download. ffmpeg is called as a subprocess, so it would block the entire thread.
CDL will have no control over the request behavior of yt-dlp. CDL options like rate limits, size limits and extension filters will not work with yt-dlp. It is possible to implement a custom filter to translate CDL options into yt-dlp options, but initially there will just be a limit on the maximum number of concurrent downloads.
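For reference, here is a minimal sketch of how the threading could look, assuming asyncio and a hypothetical `download_with_ytdlp` entry point; the concurrency cap and option values are placeholders, not existing CDL settings:

```python
import asyncio

import yt_dlp

# Hypothetical cap on concurrent yt-dlp downloads (placeholder value).
_semaphore = asyncio.Semaphore(2)


def _progress_hook(status: dict) -> None:
    # yt-dlp calls this synchronously from the worker thread;
    # status["status"] is "downloading", "finished" or "error".
    if status["status"] == "downloading":
        done = status.get("downloaded_bytes", 0)
        total = status.get("total_bytes") or status.get("total_bytes_estimate") or 0
        print(f"\r{status.get('filename', '?')}: {done}/{total} bytes", end="")
    elif status["status"] == "finished":
        print(f"\nFinished: {status.get('filename', '?')}")


def _download_sync(url: str, download_dir: str) -> None:
    # Entirely blocking; must never run on the event loop thread.
    opts = {
        "outtmpl": f"{download_dir}/%(title)s.%(ext)s",
        "progress_hooks": [_progress_hook],
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([url])


async def download_with_ytdlp(url: str, download_dir: str) -> None:
    # Push the blocking yt-dlp call onto a worker thread so the main
    # asyncio loop (and the rest of CDL) keeps running.
    async with _semaphore:
        await asyncio.to_thread(_download_sync, url, download_dir)
```

Note that the hook runs on the worker thread, so anything that needs to touch the event loop (e.g. updating a progress bar) would have to go through `loop.call_soon_threadsafe`.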
I got a working crawler for this, but integrating it into CDL would require a major refactor of the downloader and some changes in the logic up to the scraper mapper.
The probability of breaking something is really high and we don't have any tests. I don't think it's worth implementing and troubleshooting.
What I'm thinking right now instead is to add yt-dlp as an external downloader (similar to jdownloader).
This is way simpler and would require 2 changes:
Make CDL capable of reading the archive.txt file generated by yt-dlp. This is to skip URLs that have already been downloaded (see the second sketch below).
Check if a URL is supported by yt-dlp. This should be easy by just calling the YoutubeDL parser (see the first sketch below). The code is synchronous, but it's all internal Python logic, no requests are made¹. We can execute it on the same thread.
CDL would save all the supported URLs to a special file. At the end of a run, it will call yt-dlp itself to process that file, with some options equivalent to CDL's settings. The archive.txt file will be saved inside AppData/Cache.
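Here is a rough sketch of the support check. It uses yt-dlp's public extractor list; the helper name is made up, and the actual parser entry point CDL ends up calling may differ (see the footnote below):

```python
from functools import lru_cache

from yt_dlp.extractor import gen_extractors


@lru_cache(maxsize=1)
def _extractors():
    # Building the extractor list is plain Python object construction.
    return gen_extractors()


def is_supported_by_ytdlp(url: str) -> bool:
    """Return True if any non-generic yt-dlp extractor claims this URL."""
    for extractor in _extractors():
        # GenericIE matches almost anything, so it has to be excluded or
        # every unsupported link would get routed to yt-dlp.
        if extractor.IE_NAME == "generic":
            continue
        if extractor.suitable(url):
            return True
    return False
```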
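And a rough sketch of the archive/batch-file handling; the file names, the cache location and the exact flags passed to yt-dlp at the end of the run are assumptions:

```python
import subprocess
from pathlib import Path

# Assumed locations; CDL's real cache folder may differ.
CACHE_DIR = Path("AppData/Cache")
ARCHIVE_FILE = CACHE_DIR / "yt_dlp_archive.txt"
BATCH_FILE = CACHE_DIR / "yt_dlp_urls.txt"


def load_archive() -> set[str]:
    # Each line of a yt-dlp download archive is "<extractor> <video_id>".
    if not ARCHIVE_FILE.is_file():
        return set()
    return {line.strip() for line in ARCHIVE_FILE.read_text().splitlines() if line.strip()}


def queue_url(url: str) -> None:
    # Collect supported URLs during the CDL run; yt-dlp processes them afterwards.
    with BATCH_FILE.open("a", encoding="utf-8") as file:
        file.write(url + "\n")


def run_ytdlp(download_dir: Path) -> None:
    # Called once at the end of a CDL run. --download-archive makes yt-dlp
    # skip entries it has already recorded, mirroring CDL's own deduplication.
    subprocess.run(
        [
            "yt-dlp",
            "--batch-file", str(BATCH_FILE),
            "--download-archive", str(ARCHIVE_FILE),
            "--paths", str(download_dir),
        ],
        check=False,
    )
```

Note that archive entries are `<extractor> <id>` pairs rather than URLs, so mapping a queued URL back to an archive entry would still have to go through the extractor.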
Pros
This is way easier to maintain and scale up
It will allow CDL to "support" every URL that yt-dlp supports
CDL code will be completely independent of yt-dlp updates
The database schema does not need to be updated
Cons
Only some cosmetic inconveniences, like yt-dlp downloads not being included in the logs
Even though yt-dlp will run after CDL, we still need to include it as a dependency to use the parser
Footnotes
¹ I was wrong. It does make a request. We can call the parser within the generic request limiter and it should be fine.