I've created a CustomCrawlProfile that extends the base CrawlProfile class.
My CustomCrawlProfile allows subdomains to be crawled and adds some extra filtering, but the crawler was stopping too early: not enough URLs were being pushed to the queue.
I've noticed that the CrawlRequestFulfilled class contains this code:
```php
if (! $this->crawler->getCrawlProfile() instanceof CrawlSubdomains) {
    if ($crawlUrl->url->getHost() !== $this->crawler->getBaseUrl()->getHost()) {
        return;
    }
}
```
So even though the URL got crawled and its HTML was extracted, the URLs found on it won't be pushed to the queue.
Is this behavior correct? Shouldn't the CrawlProfile decide whether something gets crawled, instead of checking here whether the profile extends CrawlSubdomains?
Maybe it was my mistake, but I couldn't find anything in the documentation saying I should extend the CrawlSubdomains class. I assumed it was just a "ready to go" class and that I didn't have to extend it. It took me some time to figure out why the crawl was ending early.
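Given the `instanceof CrawlSubdomains` check above, one workaround is to make the custom profile extend CrawlSubdomains instead of CrawlProfile, so the check passes and links on subdomain pages keep getting queued. A minimal sketch, assuming the `CrawlSubdomains` class from spatie/crawler (its namespace varies between versions) and a purely hypothetical `/private` path filter as the "extra filtering":

```php
<?php

use Psr\Http\Message\UriInterface;
// Namespace may be Spatie\Crawler\CrawlSubdomains in older versions.
use Spatie\Crawler\CrawlProfiles\CrawlSubdomains;

// Hypothetical profile: keeps the subdomain check from CrawlSubdomains
// (so the instanceof check in CrawlRequestFulfilled passes) and layers
// extra filtering on top of it.
class CustomCrawlProfile extends CrawlSubdomains
{
    public function shouldCrawl(UriInterface $url): bool
    {
        // Reuse the built-in subdomain check first.
        if (! parent::shouldCrawl($url)) {
            return false;
        }

        // Example of extra filtering: skip a path prefix.
        return ! str_starts_with($url->getPath(), '/private');
    }
}
```

This keeps the filtering logic in the profile while still satisfying the `instanceof` check, though it arguably confirms the point above: the check ties behavior to a concrete class rather than to what `shouldCrawl()` returns.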
This discussion was converted from issue #370 on August 01, 2021 20:48.