I've created a CustomCrawlProfile that extends the base CrawlProfile class.
My CustomCrawlProfile allows subdomains to be crawled and adds some extra filtering, but the crawler was stopping too early: not enough URLs were being pushed to the queue.
I've noticed that the CrawlRequestFulfilled class contains this code:
```php
if (! $this->crawler->getCrawlProfile() instanceof CrawlSubdomains) {
    if ($crawlUrl->url->getHost() !== $this->crawler->getBaseUrl()->getHost()) {
        return;
    }
}
```
So even though the URL got crawled and its HTML was extracted, the URLs found on it won't be pushed to the queue.
Is this behavior correct? Shouldn't the CrawlProfile decide whether something gets crawled, instead of checking here whether the profile extends CrawlSubdomains?
Maybe it was my mistake, but I couldn't find anything in the documentation saying I should extend the CrawlSubdomains class. I assumed it was just a "ready to go" class and that I didn't have to extend it. It took me some time to figure out why the crawl was ending early.
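Given the `instanceof CrawlSubdomains` check above, one workaround is to make the custom profile extend CrawlSubdomains instead of CrawlProfile, so the check passes and links on subdomain pages keep getting queued. A minimal sketch, assuming the `CrawlSubdomains` class from spatie/crawler (its namespace varies between versions) and a purely hypothetical `/private` path filter as the "extra filtering":

```php
<?php

use Psr\Http\Message\UriInterface;
// Namespace may be Spatie\Crawler\CrawlSubdomains in older versions.
use Spatie\Crawler\CrawlProfiles\CrawlSubdomains;

// Hypothetical profile: keeps the subdomain check from CrawlSubdomains
// (so the instanceof check in CrawlRequestFulfilled passes) and layers
// extra filtering on top of it.
class CustomCrawlProfile extends CrawlSubdomains
{
    public function shouldCrawl(UriInterface $url): bool
    {
        // Reuse the built-in subdomain check first.
        if (! parent::shouldCrawl($url)) {
            return false;
        }

        // Example of extra filtering: skip a path prefix.
        return ! str_starts_with($url->getPath(), '/private');
    }
}
```

This keeps the filtering logic in the profile while still satisfying the `instanceof` check, though it arguably confirms the point above: the check ties behavior to a concrete class rather than to what `shouldCrawl()` returns.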
This discussion was converted from issue #370 on August 01, 2021 20:48.