
Allow controlling which links are visited #63

Open
YairHalberstadt opened this issue Dec 21, 2020 · 8 comments

Comments

@YairHalberstadt

For example, I was thinking of using this library to crawl a single site for pages.

This library looks great by the way - much higher quality than any of the other existing crawler libraries I've investigated in C#. Good job!

@Turnerj (Member) commented Dec 22, 2020

Thanks @YairHalberstadt for the kind words!

Yep, the library can cover your example - give it a URL (the root URL of the site) and it will crawl all the pages on that site. It will only crawl additional pages on other sites (e.g. subdomains) if you specifically allow it.

Continuing the example from the readme:

using System;
using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "example.net", "subdomain.example.org" }
});

In that example, the domains "example.net" and "subdomain.example.org" will additionally be crawled if (and only if) links are found to them from "example.org".

@YairHalberstadt (Author)

That's great!

Is there any way to deal with more complex logic? For example, visiting all subdomains of a site, but not other sites?

@Turnerj (Member) commented Dec 22, 2020

Currently there isn't a catch-all for aliases, however that may be a reasonable future addition - probably a wildcard on the host alias (e.g. "*.example.org"). I've opened #64 to cover adding that feature in a future release.

@YairHalberstadt (Author)

A more general solution might be to accept a Func<Uri, bool> (or whatever) to control which pages are visited.
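Something along these lines, just as an illustration - shouldCrawl is an invented name here, not an existing setting:

using System;

// Sketch of the kind of predicate I mean - not something the library accepts today.
// The crawler would invoke it for every discovered link and skip any link that returns false.
Func<Uri, bool> shouldCrawl = uri =>
	uri.Host.Equals("example.org", StringComparison.OrdinalIgnoreCase)
	|| uri.Host.EndsWith(".example.org", StringComparison.OrdinalIgnoreCase);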

@Turnerj (Member) commented Dec 23, 2020

That might be an option, however having full flexibility like that can make simpler cases, like crawling subdomains, more complex. Being able to write, for example, *.example.org is a lot easier than writing the logic manually in C# to support that directly. Going further, I could probably have an allow/block list for paths that also uses wildcards, rather than someone needing to code that too.

Cases where you want to restrict crawling to very specific pages, like what could be achieved with a custom handler, are likely to be quite rare.
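
For comparison, this is roughly the check a wildcard alias like *.example.org would boil down to internally (a sketch only, not actual library code):

using System;

// Matches a host against an alias that may start with "*." to mean
// "this domain and any of its subdomains". Sketch only - not part of InfinityCrawler.
static bool MatchesHostAlias(string alias, string host)
{
	if (alias.StartsWith("*.", StringComparison.Ordinal))
	{
		var baseDomain = alias.Substring(2); // e.g. "example.org"
		return host.Equals(baseDomain, StringComparison.OrdinalIgnoreCase)
			|| host.EndsWith("." + baseDomain, StringComparison.OrdinalIgnoreCase);
	}
	return host.Equals(alias, StringComparison.OrdinalIgnoreCase);
}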

@Tony20221

I would like to see include/exclude URLs using regular expressions. This would allow handling almost everything.

@Turnerj (Member) commented Nov 14, 2022

> I would like to see include/exclude URLs using regular expressions. This would allow handling almost everything.

Not that I am committing one way or another, but would you want multiple regular expressions for each? Do you want the scheme/host/port separate from the path?

Just want to understand the full scope so I can achieve a good developer experience. I don't really want lots of repetitive rules, etc.

@Tony20221

It would be a list for each. I don't care about port or scheme, since public sites are mostly using HTTPS these days and run on the default port. Maybe others would find those useful. But since these are part of the URL, and the regex would be working off URLs, it seems to me no extra work is needed.
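
Roughly what I have in mind, as a sketch (includePatterns/excludePatterns are made-up names, not existing settings): a URL is visited when it matches at least one include regex (or the include list is empty) and matches no exclude regex, with the patterns run against the absolute URL.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch only - hypothetical include/exclude filtering applied to the full URL.
static bool ShouldVisit(Uri uri, IReadOnlyList<Regex> includePatterns, IReadOnlyList<Regex> excludePatterns)
{
	var url = uri.AbsoluteUri;
	var included = includePatterns.Count == 0 || includePatterns.Any(pattern => pattern.IsMatch(url));
	var excluded = excludePatterns.Any(pattern => pattern.IsMatch(url));
	return included && !excluded;
}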
