
Allow controlling which links are visited #63

Open
YairHalberstadt opened this issue Dec 21, 2020 · 8 comments

Comments

@YairHalberstadt

For example, I was thinking of using this library to crawl a single site for pages.

This library looks great by the way - much higher quality than any of the other existing crawler libraries I've investigated in C#. Good job!

@Turnerj (Member) commented Dec 22, 2020

Thanks @YairHalberstadt for the kind words!

Yep, the library can cover your example - give it a URL (the root URL of the site) and it will crawl all the pages on that site. It will only crawl additional pages on other sites (e.g. subdomains) if you specifically allow it.

Continuing the example from the readme:

using System;
using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "example.net", "subdomain.example.org" }
});

In that example, the domains "example.net" and "subdomain.example.org" will additionally be crawled if (and only if) links are found to them from "example.org".

@YairHalberstadt (Author)

That's great!

Is there any way to deal with more complex logic? For example, visiting all subdomains of a site, but not other sites?

@Turnerj (Member) commented Dec 22, 2020

Currently there isn't a catch-all for aliases, however that may be a reasonable future addition - probably a wildcard on the host alias (e.g. "*.example.org"). I've opened #64 to cover adding that feature in a future release.

@YairHalberstadt (Author)

A more general solution might be to accept a Func<Uri, bool> (or whatever) to control which pages are visited.
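Something along these lines, just as an illustration - shouldCrawl is an invented name here, not an existing setting:

using System;

// Sketch of the kind of predicate I mean - not something the library accepts today.
// The crawler would invoke it for every discovered link and skip any link that returns false.
Func<Uri, bool> shouldCrawl = uri =>
	uri.Host.Equals("example.org", StringComparison.OrdinalIgnoreCase)
	|| uri.Host.EndsWith(".example.org", StringComparison.OrdinalIgnoreCase);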

@Turnerj (Member) commented Dec 23, 2020

That might be an option, however having full flexibility like that can make simpler cases, like crawling subdomains, more complex. Being able to write, for example, *.example.org is a lot easier than writing the logic manually in C# to support that directly. Going further, I could probably have an allow/block list for paths that also uses wildcards, rather than someone needing to code that too.

Cases where you want to restrict crawling to very specific pages, like what could be achieved with a custom handler, are likely to be quite rare.
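
For comparison, this is roughly the check a wildcard alias like *.example.org would boil down to internally (a sketch only, not actual library code):

using System;

// Matches a host against an alias that may start with "*." to mean
// "this domain and any of its subdomains". Sketch only - not part of InfinityCrawler.
static bool MatchesHostAlias(string alias, string host)
{
	if (alias.StartsWith("*.", StringComparison.Ordinal))
	{
		var baseDomain = alias.Substring(2); // e.g. "example.org"
		return host.Equals(baseDomain, StringComparison.OrdinalIgnoreCase)
			|| host.EndsWith("." + baseDomain, StringComparison.OrdinalIgnoreCase);
	}
	return host.Equals(alias, StringComparison.OrdinalIgnoreCase);
}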

@Tony20221

I would like to see include/exclude URLs using regular expressions. This would allow handling almost everything.

@Turnerj (Member) commented Nov 14, 2022

> I would like to see include/exclude URLs using regular expressions. This would allow handling almost everything.

Not that I am committing one way or another, but would you want multiple regular expressions for each? Do you want the scheme/host/port separate from the path?

Just want to understand the full scope so I can achieve a good developer experience. I don't really want lots of repetitive rules, etc.

@Tony20221

It would be a list for each. I don't care about port or scheme, since public sites are mostly using HTTPS these days and run on the default port. Maybe others would find those useful. But since these are part of the URL, and the regex would be working off URLs, it seems to me no extra work is needed.
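
Roughly what I have in mind, as a sketch (includePatterns/excludePatterns are made-up names, not existing settings): a URL is visited when it matches at least one include regex (or the include list is empty) and matches no exclude regex, with the patterns run against the absolute URL.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch only - hypothetical include/exclude filtering applied to the full URL.
static bool ShouldVisit(Uri uri, IReadOnlyList<Regex> includePatterns, IReadOnlyList<Regex> excludePatterns)
{
	var url = uri.AbsoluteUri;
	var included = includePatterns.Count == 0 || includePatterns.Any(pattern => pattern.IsMatch(url));
	var excluded = excludePatterns.Any(pattern => pattern.IsMatch(url));
	return included && !excluded;
}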
