# perf: remove crawler scrape lock
This PR removes the async lock from the crawlers and replaces it with a semaphore of capacity 20.

In reality, neither the lock nor the semaphore is needed: requests are actually limited by the `request_limiter` of each crawler, not by the lock. However, I could not remove the lock outright because that would break the logic for the UI scrape queue, which is why I replaced it with a semaphore instead.
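
For context, here is a condensed sketch of the bookkeeping in question (adapted from the diff below; the UI-side details are an assumption): the waiting counter only drops once a task acquires the gate, so removing the gate entirely would immediately mark every queued item as running.

```python
import asyncio

class CrawlerSketch:
    def __init__(self) -> None:
        self.waiting_items = 0
        self._semaphore = asyncio.Semaphore(20)  # previously asyncio.Lock()

    async def run(self) -> None:
        self.waiting_items += 1       # queued: the UI counts this as waiting
        async with self._semaphore:   # up to 20 tasks may hold this at once
            self.waiting_items -= 1   # acquired: the UI counts it as running
            await asyncio.sleep(0.1)  # stand-in for the actual scrape
```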

The lock was making each crawler behave synchronously, and the `request_limiter` was never close to being reached. This change only affects the crawlers; the downloaders already have a semaphore with a separate capacity per domain, so they are unaffected. The default capacity for each downloader is 3, set by `--max-simultaneous-downloads-per-domain`.
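
To illustrate the throughput difference, here is a standalone sketch (not CDL code): behind a lock, queued items run strictly one at a time, while a semaphore of capacity 20 lets up to 20 run concurrently.

```python
import asyncio
import time

async def scrape(gate: asyncio.Lock | asyncio.Semaphore) -> None:
    async with gate:
        await asyncio.sleep(1)  # stand-in for one scrape request

async def main() -> None:
    for gate in (asyncio.Lock(), asyncio.Semaphore(20)):
        start = time.perf_counter()
        # 40 queued items: ~40s serialized behind the lock,
        # ~2s with up to 20 running concurrently.
        await asyncio.gather(*(scrape(gate) for _ in range(40)))
        print(type(gate).__name__, f"{time.perf_counter() - start:.0f}s")

asyncio.run(main())
```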

## Disadvantages

The only drawback of replacing the lock with a semaphore (or eventually removing the lock altogether) is that the `request_limiter` is defined per crawler, and almost all crawlers currently use a generic limit of `10 requests / sec`.

Some crawlers may need their limiter fine-tuned to make sure CDL does not trigger `429` responses.
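
As a sketch of what per-crawler tuning could look like, assuming an `aiolimiter`-style limiter (the `AsyncLimiter` API is real; the variable names and rate values here are illustrative):

```python
from aiolimiter import AsyncLimiter

# The current generic default: 10 requests per 1-second window.
generic_limiter = AsyncLimiter(10, 1)

# A hypothetical stricter limit for a host that returns 429s under load.
strict_limiter = AsyncLimiter(2, 1)

async def request(url: str) -> None:
    async with strict_limiter:  # waits here until the rate budget allows it
        ...  # perform the actual HTTP request
```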
NTFSvolume committed Jan 4, 2025
1 parent 30a03db commit d3a4cd5
Showing 1 changed file with 2 additions and 2 deletions.
```diff
--- a/cyberdrop_dl/scraper/crawler.py
+++ b/cyberdrop_dl/scraper/crawler.py
@@ -44,7 +44,7 @@ def __init__(self, manager: Manager, domain: str, folder_domain: str | None = None
         self.downloader = field(init=False)
         self.scraping_progress = manager.progress_manager.scraping_progress
         self.client: ScraperClient = field(init=False)
-        self._lock = asyncio.Lock()
+        self._semaphore = asyncio.Semaphore(20)

         self.domain = domain
         self.folder_domain = folder_domain or domain.capitalize()
@@ -65,7 +65,7 @@ async def run(self, item: ScrapeItem) -> None:
         if not item.url.host:
             return
         self.waiting_items += 1
-        async with self._lock:
+        async with self._semaphore:
             self.waiting_items -= 1
             if item.url.path_qs not in self.scraped_items:
                 log(f"Scraping: {item.url}", 20)
```
