
perf: remove crawler scrape lock #425

Merged: 1 commit merged into jbsparrow:master from remove_scrape_lock on Jan 5, 2025

Conversation

NTFSvolume (Collaborator)

This PR removes the async lock from the crawlers and replaces it with a semaphore with a capacity of 20.

In reality, neither the lock nor the semaphore is needed. The requests are actually limited by the `request_limiter` of each crawler, not the lock. However, I could not remove the lock outright because that would break the logic of the UI scrape queue, which is why I replaced it with a semaphore instead.

The lock was making each crawler behave synchronously, and the `request_limiter` was never close to being reached. This only affects the crawlers. The downloaders already have a semaphore with a separate capacity per domain, so they were not affected. The default capacity for each downloader is 3, defined by `--max-simultaneous-downloads-per-domain`.
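
For context, here is a minimal sketch of the gating described above (the class and method names are illustrative, not the actual CDL crawler API, and the sketch assumes an aiolimiter-style `request_limiter`):

```python
import asyncio

from aiolimiter import AsyncLimiter  # assumption: the real request_limiter behaves like aiolimiter's


class CrawlerSketch:
    """Illustrative only; not the real CDL crawler classes."""

    # Before: a per-crawler lock forced scrape requests to run one at a time.
    scrape_lock = asyncio.Lock()
    # After: a semaphore keeps the UI scrape-queue logic intact while allowing
    # up to 20 concurrent scrape tasks.
    scrape_semaphore = asyncio.Semaphore(20)
    # The real throttle: 10 requests per second, defined per crawler.
    request_limiter = AsyncLimiter(10, 1)

    async def request(self, url: str) -> None:
        async with self.scrape_semaphore:      # was: async with self.scrape_lock
            async with self.request_limiter:   # rate limiting happens here, not in the lock
                await self._fetch(url)

    async def _fetch(self, url: str) -> None:
        await asyncio.sleep(0)  # placeholder for the actual HTTP call
```

With the lock, only one `request()` per crawler could be in flight, so the 10 req/s limiter never came close to being the bottleneck; with the semaphore, up to 20 can run and the limiter becomes the effective cap.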

Disadvantages

  • The main drawback of replacing the lock with a semaphore (or eventually removing the lock altogether) is that the `request_limiter` is defined per crawler, and almost all crawlers currently use a generic `10 requests / sec` limit. Some crawlers may need their limiter fine-tuned to make sure CDL does not trigger `429`s (see the sketch after this list).

  • The I/O to the requests cache will increase, but that cache is already overloaded in some cases and needs a fix for the database lockups.
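
A hypothetical example of the kind of per-crawler tuning this might require (the domains, rates, and lookup helper are made up for illustration and are not CDL's actual configuration):

```python
from aiolimiter import AsyncLimiter

# Generic default used by almost every crawler today.
DEFAULT_REQUEST_LIMITER = AsyncLimiter(10, 1)  # 10 requests / sec

# Hypothetical overrides for hosts that return 429s once requests actually run
# concurrently; domains and values are illustrative only.
REQUEST_LIMITERS = {
    "example-strict-host.com": AsyncLimiter(2, 1),  # 2 requests / sec
    "example-api-host.com": AsyncLimiter(5, 1),     # 5 requests / sec
}


def limiter_for(domain: str) -> AsyncLimiter:
    """Pick a tuned limiter when one exists, otherwise fall back to the generic default."""
    return REQUEST_LIMITERS.get(domain, DEFAULT_REQUEST_LIMITER)
```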

NTFSvolume added the refactor (No user facing changes) label on Jan 4, 2025
NTFSvolume requested a review from jbsparrow on January 4, 2025 at 22:52
NTFSvolume added a commit to NTFSvolume/CyberDropDownloader that referenced this pull request on Jan 5, 2025
jbsparrow merged commit 7b89d53 into jbsparrow:master on Jan 5, 2025 (5 checks passed)
NTFSvolume deleted the remove_scrape_lock branch on January 5, 2025 at 22:47
NTFSvolume added a commit that referenced this pull request on Jan 5, 2025:
* fix: use a dataclass for reddit posts

Should fix #426

* refactor: pass `scrape_item` as origin for `web_pager` (chevereto)

* fix: "Loose Files" not being created (all crawlers)

* refactor: add custom MediaFireError (mediafire)

* fix: scrape error codes

* refactor: make `RealDebridError` inherit from `CDLBaseError`

* refactor: move `RealDebridError` to `clients.errors`

* refactor: move `VALIDATION_ERROR_FOOTER` to `constants`

* fix: undo crawler semaphore

Moved to PR #425