Crawl any page, parse only whitelisted and not blacklisted pages #9
Comments
Here's how I'd like to implement this: the crawler will get two new functions.

There are times when you want to crawl a page for links, but you don't want the response. In the ecom example you mentioned, that might be a category page with a filtered view of products. In this case, you wouldn't blacklist the category URL pattern, but you would provide the discardItem function, perhaps like so:
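A minimal sketch of what that could look like; `discardItem` is the name used in this thread, but the option shape and the `Page` type are assumptions, not the project's confirmed API:

```ts
// Hypothetical crawler options: the page is still crawled for links,
// but discardItem can veto emitting its response as an item.
interface Page {
  url: string;
  html: string;
}

interface CrawlerOptions {
  whitelist: RegExp[];
  blacklist: RegExp[];
  // Return true to follow the page's links but drop its response.
  discardItem?: (page: Page) => boolean;
}

export const options: CrawlerOptions = {
  whitelist: [/\/products\/\d+/],
  blacklist: [/\/login/, /\/account\//],
  // A filtered category view: follow its links to product pages,
  // but never emit the category page itself as an item.
  discardItem: (page) => /\/category\/.*[?&](filter|sort)=/.test(page.url),
};
```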
There are other times when you want to crawl a page for an item but don't want to follow any of its links. Some crawler frameworks have a maxDepth param that tries to solve this, but it doesn't work as well, since many times you want to control the crawl depth based on the page content (a second callback, sketched below).

The whitelist and blacklist would still be useful for other purposes. You would still want to blacklist things like auth pages, since you don't want to even attempt to crawl those, or pages with very slow or large responses that would drag down your crawl rate.

How does this sound?
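And a hedged sketch of that second function; the comment doesn't name it, so `followLinks` here is purely hypothetical. The point is that the decision to stop following links comes from the page itself rather than a fixed maxDepth:

```ts
// Hypothetical second callback: keep the page's item, but decide per page
// whether the crawl should expand from it.
interface Page {
  url: string;
  html: string;
}

interface CrawlerOptions {
  whitelist: RegExp[];
  blacklist: RegExp[];
  discardItem?: (page: Page) => boolean;
  // Return false to keep the item but not follow any of the page's links.
  followLinks?: (page: Page) => boolean;
}

export const options: CrawlerOptions = {
  whitelist: [/\/products\/\d+/],
  blacklist: [/\/login/],
  // A product detail page is a leaf of the crawl: keep the item,
  // but don't expand the crawl from here.
  followLinks: (page) => !/\/products\/\d+/.test(page.url),
};
```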
Sounds better and it's more flexible too.
On many websites, you are only interested in one kind of page, like a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.

I think it could work like this: the crawler crawls pages freely (respecting the site's robots.txt, of course) but only parses the qualifying pages to extract items.
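A minimal sketch of that idea, with illustrative names only (none of these are the project's actual API): the crawler follows links from any non-blacklisted page, but only whitelisted pages are parsed into items.

```ts
// Crawl freely, parse selectively: the queue grows from every allowed page,
// so the crawler can't get stuck on a page with no whitelisted neighbours,
// but items are only extracted from pages matching the whitelist.
type Item = Record<string, string>;

interface Rules {
  whitelist: RegExp[];
  blacklist: RegExp[];
}

const allowedToCrawl = (url: string, rules: Rules) =>
  !rules.blacklist.some((re) => re.test(url));

const shouldParse = (url: string, rules: Rules) =>
  rules.whitelist.some((re) => re.test(url));

async function crawl(
  seed: string,
  rules: Rules,
  // robots.txt handling is assumed to live inside fetchPage.
  fetchPage: (url: string) => Promise<{ html: string; links: string[] }>,
  parseItem: (html: string) => Item,
  onItem: (item: Item) => void
): Promise<void> {
  const seen = new Set<string>();
  const queue = [seed];
  while (queue.length > 0) {
    const url = queue.shift()!;
    if (seen.has(url) || !allowedToCrawl(url, rules)) continue;
    seen.add(url);
    const page = await fetchPage(url);
    // Crawl any page for links...
    queue.push(...page.links);
    // ...but only parse the qualifying ones.
    if (shouldParse(url, rules)) onItem(parseItem(page.html));
  }
}
```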