[FEATURE] add support for sites that yt-dlp supports #276

Open
NTFSvolume opened this issue Nov 14, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments

@NTFSvolume
Collaborator

Is your feature request related to a problem? Please describe.
There are multiple popular sites that we do not currently support. Creating a custom crawler for each of them is possible but would require significant time and consideration. Instead, we could leverage other tools to do the heavy lifting. One of those tools is yt-dlp.

Although yt-dlp has its own database-like functionality to skip already downloaded URLs (the --download-archive flag), integrating it into CDL would let the user apply CDL's deduplication functions and keep all the databases in one place.

Describe the solution you'd like
We can expand the supported site list by creating a generic crawler that uses yt-dlp as a back-end downloader.

Limitations

  1. The sites supported within CDL will be a subset of the sites supported by yt-dlp. The list of supported sites will be hardcoded into CDL.
  2. Supported sites will be sites that work out-of-the-box in yt-dlp with no configuration. Ex: no cookies needed, no geo-location bypass, no impersonating X/Y/Z browser, no OAuth.
  3. yt-dlp's code is entirely synchronous, which means we will have to run it on a different thread to avoid blocking the main loop while still being able to read from its progress hooks.
  4. Only natively supported sites will be enabled, i.e. any site that does not require ffmpeg to download from it. ffmpeg is called as a subprocess, so it will block the entire thread.
  5. CDL will have no control over yt-dlp's request behavior. CDL options like rate limits, size limits and extension filters will not work with yt-dlp. It is possible to implement a custom filter that translates CDL options into yt-dlp options, but initially there will only be a limit on the maximum number of concurrent downloads.
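
Limitations 3 and 5 could be sketched roughly as follows. This is only an illustration, not CDL code: `blocking_download` is a hypothetical stand-in for a synchronous yt-dlp call, the `percent` key in the progress dict is made up, and `download_in_thread` is an assumed name. The blocking call runs in a worker thread, progress updates are handed back to the event loop, and a semaphore caps the number of concurrent downloads.

```python
import asyncio
import time
from typing import Callable, Optional

ProgressHook = Callable[[dict], None]


def blocking_download(url: str, hook: ProgressHook) -> None:
    """Hypothetical stand-in for a synchronous yt-dlp download call."""
    for percent in (25, 50, 100):
        time.sleep(0.01)  # simulates blocking network I/O
        status = "finished" if percent == 100 else "downloading"
        hook({"url": url, "status": status, "percent": percent})


async def download_in_thread(url: str, limiter: asyncio.Semaphore) -> list:
    """Run the blocking download off the event loop and collect its progress."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def hook(update: Optional[dict]) -> None:
        # Runs in the worker thread, so hand the update over thread-safely.
        loop.call_soon_threadsafe(queue.put_nowait, update)

    def worker() -> None:
        try:
            blocking_download(url, hook)
        finally:
            hook(None)  # sentinel: no more updates, even if the download failed

    updates = []
    async with limiter:  # caps the number of concurrent downloads
        task = asyncio.create_task(asyncio.to_thread(worker))
        while (update := await queue.get()) is not None:
            updates.append(update)  # a real UI would refresh a progress bar here
        await task  # re-raises any exception from the worker thread
    return updates
```

The sentinel is pushed in a `finally` block so the consumer loop cannot hang if the download raises.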
@NTFSvolume NTFSvolume added the enhancement New feature or request label Nov 14, 2024
@NTFSvolume NTFSvolume self-assigned this Nov 14, 2024
@datawhores
Collaborator

datawhores commented Nov 14, 2024

The user could use a yt-dlp config file to pass any argument they want:

https://old.reddit.com/r/youtubedl/comments/o88mtn/ytdlp_ytdl_config_files_line_order/

@NTFSvolume
Collaborator Author

NTFSvolume commented Jan 7, 2025

I got a working crawler for this, but integrating it into CDL would require a major refactor of the downloader and some changes in the logic all the way up to the scraper mapper.

The probability of breaking something is really high and we don't have any tests. I don't think it is worth implementing and troubleshooting.

What I'm thinking right now instead is to add yt-dlp as an external downloader (similar to JDownloader).

This is way simpler and would require 2 changes:

  1. Make CDL capable of reading the archive.txt file generated by yt-dlp. This is to skip URLs already downloaded

  2. Check if a URL is supported by yt-dlp. This should be easy by just calling the YoutubeDL parser. The code is synchronous but it's all internal Python logic, no requests are made [1]. We can execute it on the same thread
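
A minimal sketch of change 1, assuming the archive file uses the format I believe yt-dlp writes: one entry per line, made of the lowercase extractor key and the media ID separated by a space (e.g. `youtube dQw4w9WgXcQ`). The function names are hypothetical, and mapping a URL to its extractor and ID would still need the yt-dlp parser from change 2, which is not shown here.

```python
from pathlib import Path


def read_archive(path: Path) -> set:
    """Load (extractor, media_id) pairs from a yt-dlp style archive file."""
    entries = set()
    if not path.is_file():
        return entries  # no archive yet: nothing has been downloaded
    for line in path.read_text(encoding="utf-8").splitlines():
        extractor, _, media_id = line.strip().partition(" ")
        if extractor and media_id:  # skip blank or malformed lines
            entries.add((extractor, media_id))
    return entries


def is_archived(entries: set, extractor: str, media_id: str) -> bool:
    """Check whether an item was already downloaded according to the archive."""
    return (extractor.lower(), media_id) in entries
```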

CDL would save all the supported URLs to a special file. At the end of a run, it would call yt-dlp itself to process that file, passing yt-dlp options equivalent to the CDL options in use. The archive.txt file will be saved inside AppData/Cache
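
The end-of-run call could look roughly like the sketch below. The function name, parameters and file names are hypothetical; the yt-dlp flags used (`--batch-file`, `--download-archive`, `--limit-rate`) are existing CLI options, and the rate limit stands in for one possible CDL-to-yt-dlp option translation.

```python
from pathlib import Path
from typing import Optional


def build_ytdlp_command(
    url_file: Path, archive_file: Path, rate_limit: Optional[str] = None
) -> list:
    """Build the yt-dlp command line for the URLs collected during a run."""
    cmd = [
        "yt-dlp",
        "--batch-file", str(url_file),            # one URL per line
        "--download-archive", str(archive_file),  # skip and record downloaded items
    ]
    if rate_limit:
        cmd += ["--limit-rate", rate_limit]  # e.g. "1M", taken from a CDL setting
    return cmd


# Usage after the CDL run has finished (requires yt-dlp on PATH):
# import subprocess
# subprocess.run(build_ytdlp_command(Path("urls.txt"), Path("archive.txt")), check=False)
```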

Pros

  • This is way easier to maintain and scale up
  • It will allow CDL to "support" every URL that yt-dlp supports
  • CDL code will be completely independent from yt-dlp updates
  • The database schema does not need to be updated

Cons

  • Only some cosmetic inconveniences, like the yt-dlp downloads not being included in the logs
  • Even though yt-dlp will run after CDL, we still need to include it as a dependency to use the parser

Footnotes

  1. I was wrong. It does make a request. We can call the parser within the generic request limiter and it should be fine

2 participants