
perf: remove crawler scrape lock #425

Merged: 1 commit merged into jbsparrow:master from remove_scrape_lock on Jan 5, 2025

Conversation

NTFSvolume (Collaborator)

This PR removes the async lock from the crawlers and replaces it with a semaphore with a capacity of 20.

In reality, neither the lock nor the semaphore is needed. The requests are actually limited by the `request_limiter` of each crawler, not the lock. However, I could not remove the lock outright because that would break the logic of the UI scrape queue, which is why I replaced it with a semaphore instead.

The lock was making each crawler behave synchronously, and the `request_limiter` was never close to being reached. This only affects the crawlers. The downloaders already have a semaphore with a separate capacity per domain, so they were not affected. The default capacity for each downloader is 3, defined by `--max-simultaneous-downloads-per-domain`.
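
For context, here is a minimal sketch of the gating described above (the class and method names are illustrative, not the actual CDL crawler API, and the sketch assumes an aiolimiter-style `request_limiter`):

```python
import asyncio

from aiolimiter import AsyncLimiter  # assumption: the real request_limiter behaves like aiolimiter's


class CrawlerSketch:
    """Illustrative only; not the real CDL crawler classes."""

    # Before: a per-crawler lock forced scrape requests to run one at a time.
    scrape_lock = asyncio.Lock()
    # After: a semaphore keeps the UI scrape-queue logic intact while allowing
    # up to 20 concurrent scrape tasks.
    scrape_semaphore = asyncio.Semaphore(20)
    # The real throttle: 10 requests per second, defined per crawler.
    request_limiter = AsyncLimiter(10, 1)

    async def request(self, url: str) -> None:
        async with self.scrape_semaphore:      # was: async with self.scrape_lock
            async with self.request_limiter:   # rate limiting happens here, not in the lock
                await self._fetch(url)

    async def _fetch(self, url: str) -> None:
        await asyncio.sleep(0)  # placeholder for the actual HTTP call
```

With the lock, only one `request()` per crawler could be in flight, so the 10 req/s limiter never came close to being the bottleneck; with the semaphore, up to 20 can run and the limiter becomes the effective cap.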

Disadvantages

  • The main drawback of replacing the lock with a semaphore (or eventually removing the lock altogether) is that the `request_limiter` is defined per crawler, and almost all crawlers currently use a generic `10 requests / sec` limit. Some crawlers may need their limiter fine-tuned to make sure CDL does not trigger `429`s (see the sketch after this list).

  • The I/O to the requests cache will increase, but that cache is already overloaded in some cases and needs a fix for the database lockups.
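
A hypothetical example of the kind of per-crawler tuning this might require (the domains, rates, and lookup helper are made up for illustration and are not CDL's actual configuration):

```python
from aiolimiter import AsyncLimiter

# Generic default used by almost every crawler today.
DEFAULT_REQUEST_LIMITER = AsyncLimiter(10, 1)  # 10 requests / sec

# Hypothetical overrides for hosts that return 429s once requests actually run
# concurrently; domains and values are illustrative only.
REQUEST_LIMITERS = {
    "example-strict-host.com": AsyncLimiter(2, 1),  # 2 requests / sec
    "example-api-host.com": AsyncLimiter(5, 1),     # 5 requests / sec
}


def limiter_for(domain: str) -> AsyncLimiter:
    """Pick a tuned limiter when one exists, otherwise fall back to the generic default."""
    return REQUEST_LIMITERS.get(domain, DEFAULT_REQUEST_LIMITER)
```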

NTFSvolume added the refactor (No user facing changes) label on Jan 4, 2025
NTFSvolume requested a review from jbsparrow on January 4, 2025 at 22:52
NTFSvolume added a commit to NTFSvolume/CyberDropDownloader that referenced this pull request on Jan 5, 2025
jbsparrow merged commit 7b89d53 into jbsparrow:master on Jan 5, 2025 (5 checks passed)
NTFSvolume deleted the remove_scrape_lock branch on January 5, 2025 at 22:47
NTFSvolume added a commit that referenced this pull request on Jan 5, 2025:
* fix: use a dataclass for reddit posts

Should fix #426

* refactor: pass `scrape_item` as origin for `web_pager` (chevereto)

* fix: "Loose Files" not being created (all crawlers)

* refactor: add custom MediaFireError (mediafire)

* fix: scrape error codes

* refactor: make `RealDebridError` inherit from `CDLBaseError`

* refactor: move `RealDebridError` to `clients.errors`

* refactor: move `VALIDATION_ERROR_FOOTER` to `constants`

* fix: undo crawler semaphore

Moved to PR #425