# perf: remove crawler scrape lock
This PR removes the async lock from the crawlers and replaces it with a semaphore of capacity 20.

In reality, neither the lock nor the semaphore is needed: requests are actually limited by the `request_limiter` of each crawler, not by the lock. However, I could not remove the lock outright because that would break the logic for the UI scrape queue, which is why I replaced it with a semaphore instead.
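
For context, here is a condensed sketch of the bookkeeping in question (adapted from the diff below; the UI-side details are an assumption): the waiting counter only drops once a task acquires the gate, so removing the gate entirely would immediately mark every queued item as running.

```python
import asyncio

class CrawlerSketch:
    def __init__(self) -> None:
        self.waiting_items = 0
        self._semaphore = asyncio.Semaphore(20)  # previously asyncio.Lock()

    async def run(self) -> None:
        self.waiting_items += 1       # queued: the UI counts this as waiting
        async with self._semaphore:   # up to 20 tasks may hold this at once
            self.waiting_items -= 1   # acquired: the UI counts it as running
            await asyncio.sleep(0.1)  # stand-in for the actual scrape
```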

The lock was making each crawler behave synchronously, and the `request_limiter` was never close to being reached. This change only affects the crawlers; the downloaders already have a semaphore with a separate capacity per domain, so they are unaffected. The default capacity for each downloader is 3, set by `--max-simultaneous-downloads-per-domain`.
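
To illustrate the throughput difference, here is a standalone sketch (not CDL code): behind a lock, queued items run strictly one at a time, while a semaphore of capacity 20 lets up to 20 run concurrently.

```python
import asyncio
import time

async def scrape(gate: asyncio.Lock | asyncio.Semaphore) -> None:
    async with gate:
        await asyncio.sleep(1)  # stand-in for one scrape request

async def main() -> None:
    for gate in (asyncio.Lock(), asyncio.Semaphore(20)):
        start = time.perf_counter()
        # 40 queued items: ~40s serialized behind the lock,
        # ~2s with up to 20 running concurrently.
        await asyncio.gather(*(scrape(gate) for _ in range(40)))
        print(type(gate).__name__, f"{time.perf_counter() - start:.0f}s")

asyncio.run(main())
```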

## Disadvantages

The only drawback of replacing the lock with a semaphore (or eventually removing the lock altogether) is that the `request_limiter` is defined per crawler, and almost all crawlers currently use a generic limit of `10 requests / sec`.

Some crawlers may need their limiter fine-tuned to make sure CDL does not trigger `429` responses.
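
As a sketch of what per-crawler tuning could look like, assuming an `aiolimiter`-style limiter (the `AsyncLimiter` API is real; the variable names and rate values here are illustrative):

```python
from aiolimiter import AsyncLimiter

# The current generic default: 10 requests per 1-second window.
generic_limiter = AsyncLimiter(10, 1)

# A hypothetical stricter limit for a host that returns 429s under load.
strict_limiter = AsyncLimiter(2, 1)

async def request(url: str) -> None:
    async with strict_limiter:  # waits here until the rate budget allows it
        ...  # perform the actual HTTP request
```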
NTFSvolume committed Jan 4, 2025
1 parent 30a03db commit d3a4cd5
Showing 1 changed file with 2 additions and 2 deletions.
```diff
--- a/cyberdrop_dl/scraper/crawler.py
+++ b/cyberdrop_dl/scraper/crawler.py
@@ -44,7 +44,7 @@ def __init__(self, manager: Manager, domain: str, folder_domain: str | None = None
         self.downloader = field(init=False)
         self.scraping_progress = manager.progress_manager.scraping_progress
         self.client: ScraperClient = field(init=False)
-        self._lock = asyncio.Lock()
+        self._semaphore = asyncio.Semaphore(20)

         self.domain = domain
         self.folder_domain = folder_domain or domain.capitalize()
@@ -65,7 +65,7 @@ async def run(self, item: ScrapeItem) -> None:
         if not item.url.host:
             return
         self.waiting_items += 1
-        async with self._lock:
+        async with self._semaphore:
             self.waiting_items -= 1
             if item.url.path_qs not in self.scraped_items:
                 log(f"Scraping: {item.url}", 20)
```
