Allow controlling which links are visited #63
For example, I was thinking of using this library to crawl a single site for pages.

This library looks great by the way - much higher quality than any of the other existing crawler libraries I've investigated in C#. Good job!
Thanks @YairHalberstadt for the kind words! Yep, the library can cover your example - given a URL (the root URL of the site), it will crawl all the pages on the site. It will only crawl additional pages on other sites (eg. subdomains) if you specifically allow it. Continuing the example from the readme:

```csharp
using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "example.net", "subdomain.example.org" }
});
```

In that example, the domains "example.net" and "subdomain.example.org" will additionally be crawled if (and only if) links are found to them from "example.org".
That's great! Is there any way to deal with more complex logic? For example, visiting all subdomains of this site, but not other sites?
Currently there isn't a catch-all for aliases, however that may be a reasonable future addition - probably a wildcard on the HostAliases property.
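For illustration only, such a wildcard might look something like the sketch below. This is a hypothetical form of the suggestion above, not an API InfinityCrawler currently has:

```csharp
// Hypothetical: "*.example.org" as a catch-all for any subdomain.
// Sketched only to show what the suggested wildcard could look like;
// InfinityCrawler does not support this today.
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	HostAliases = new [] { "*.example.org" }
});
```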
A more general solution might be to accept a delegate that decides, per link, whether it should be visited.
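Something along these lines, say - a hypothetical sketch of that delegate idea, where `CrawlSettingsSketch` and `ShouldCrawl` are made-up names and not part of InfinityCrawler:

```csharp
using System;

// Example: visit example.org and any of its subdomains, but no other sites.
var settings = new CrawlSettingsSketch
{
	ShouldCrawl = uri => uri.Host == "example.org" || uri.Host.EndsWith(".example.org")
};

Console.WriteLine(settings.ShouldCrawl(new Uri("https://blog.example.org/"))); // True
Console.WriteLine(settings.ShouldCrawl(new Uri("https://example.net/")));      // False

// Hypothetical sketch of the delegate idea above; not an existing API.
public class CrawlSettingsSketch
{
	// Return true to visit the link, false to skip it.
	public Func<Uri, bool> ShouldCrawl { get; set; }
}
```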
That might be an option, however having full flexibility like that can make simpler cases, like crawling subdomains, more complex. Being able to write a simple wildcard, for example, is easier to reason about. Functionality where you want to control crawling to very specific pages, like what could be achieved with a custom handler, is likely to be quite rare.
I would like to see include/exclude URLs using regular expressions. This would allow handling almost everything.
Not that I am committing one way or another, but would you want multiple regular expressions for each? Do you want the scheme/host/port separate from the path? I just want to understand the full scope so I can achieve a good developer experience - I don't really want lots of repetitive rules, etc.
It would be a list for each. I don't care about port or scheme since public sites mostly use HTTPS these days and work off standard ports. Maybe others would find those useful. But since these are part of the URL, and the regex would work off full URLs, it seems to me no extra work is needed.
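As a rough sketch of that idea - matching each candidate URL's absolute string against include and exclude lists - where `ShouldCrawl` and the rule sets are illustrative names rather than InfinityCrawler API:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch: a URL is crawled when it matches at least one include rule
// (or no include rules are given) and matches no exclude rule.
static bool ShouldCrawl(Uri uri, Regex[] includes, Regex[] excludes)
{
	var url = uri.AbsoluteUri;
	var included = includes.Length == 0 || includes.Any(r => r.IsMatch(url));
	var excluded = excludes.Any(r => r.IsMatch(url));
	return included && !excluded;
}

// Example: any subdomain of example.org, but skip /admin pages.
var includes = new [] { new Regex(@"^https?://([a-z0-9-]+\.)*example\.org/") };
var excludes = new [] { new Regex(@"/admin(/|$)") };

Console.WriteLine(ShouldCrawl(new Uri("https://blog.example.org/post/1"), includes, excludes)); // True
Console.WriteLine(ShouldCrawl(new Uri("https://blog.example.org/admin/"), includes, excludes)); // False
```

Since scheme, host, and port all appear in the absolute URL, a single pair of lists covers them without separate per-component rules, which is the point made above.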