Add support for wildcard HostAliases #64

Open
Turnerj opened this issue Dec 22, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@Turnerj
Member

Turnerj commented Dec 22, 2020

Follows on from discussions in #63 - currently the HostAliases setting is relatively limited, requiring an exact match before the crawler follows a link on that domain.

To make crawling a large number of subdomains easier, support for a wildcard (*) would be useful.

For example:

using InfinityCrawler;

var crawler = new Crawler();
var result = await crawler.Crawl(new Uri("http://example.org/"), new CrawlSettings {
	UserAgent = "MyVeryOwnWebCrawler/1.0",
	RequestProcessorOptions = new RequestProcessorOptions
	{
		MaxNumberOfSimultaneousRequests = 5
	},
	HostAliases = new [] { "*.example.org" }
});

There likely don't need to be any specific rules around wildcard handling. A host alias that is only a wildcard would indicate crawling any domain linked to. This is likely where analyzers of some kind would be useful, as well as additional documentation.

A full wildcard setup does allow crawling of more complex subdomains like web.*.example.org, which may help in some specific use cases.
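As a rough illustration of the matching described above, here is a minimal sketch of a wildcard host matcher that avoids regular expressions. `HostAliasMatcher` and `IsMatch` are hypothetical names, not part of InfinityCrawler, and the sketch assumes `*` may span any run of characters, including dots. Under that assumption `*.example.org` matches subdomains but not the bare `example.org`, so both would probably need to be listed.

```csharp
using System;

// Hypothetical helper, not part of InfinityCrawler: matches a host against
// an alias in which '*' stands for any run of characters (including dots).
public static class HostAliasMatcher
{
    public static bool IsMatch(string alias, string host)
    {
        // Host names are case-insensitive, so compare in lower case.
        alias = alias.ToLowerInvariant();
        host = host.ToLowerInvariant();

        int a = 0, h = 0;             // current positions in alias and host
        int starAt = -1, resume = 0;  // last '*' seen and where to retry from

        while (h < host.Length)
        {
            if (a < alias.Length && alias[a] == '*')
            {
                // Remember the wildcard and first try matching zero characters.
                starAt = a++;
                resume = h;
            }
            else if (a < alias.Length && alias[a] == host[h])
            {
                a++;
                h++;
            }
            else if (starAt >= 0)
            {
                // Mismatch: let the last wildcard absorb one more character.
                a = starAt + 1;
                h = ++resume;
            }
            else
            {
                return false;
            }
        }

        // Any trailing wildcards can match the empty string.
        while (a < alias.Length && alias[a] == '*') a++;
        return a == alias.Length;
    }
}
```

With this sketch, `HostAliasMatcher.IsMatch("web.*.example.org", "web.eu.example.org")` returns true, and a lone `*` matches every host.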

@Turnerj
Member Author

Turnerj commented Dec 23, 2020

Additionally it may be worth looking at extending support for paths too (also with wildcards). If someone specifies that all URLs like example.org/shop/* are not to be crawled, that would be easier for them than needing to write that in C#.

Example in C#:

return !url.Path.StartsWith("/shop/");

That may, in turn, end up deprecating HostAliases for AllowUrls/BlockUrls.

@mguinness

Wildcards would be good, but maybe a regular expression would provide a greater flexibility? Another option could be MicroRuleEngine which I've used in another project.

@Turnerj
Member Author

Turnerj commented Mar 31, 2021

Hey @mguinness, thanks for the link - MicroRuleEngine does look interesting though probably won't take on a dependency for something like that at this stage (maybe in the future if I had a plugin system for this). That said, may look at it further for other projects of mine!

Regular expressions definitely could work though I'd be cautious about the performance impact. I mean, compared to an HTTP request, the performance cost is negligible, however if the vast majority of use cases can be accomplished with basic wildcards (and assuming I can make it efficient), I'd probably go that route.

For your own use cases, what types of expressions would you be wanting to do? Like, would you need something like *.mydomain.com/some-path/*.html or something else? The more I understand what people need, the better I can target the implementation.
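To make the pattern in that question concrete, here is one possible sketch that translates a wildcard pattern covering host and path into an anchored regular expression. `UrlWildcard`, `ToRegex` and `IsMatch` are hypothetical names, not an InfinityCrawler API, and matching against host plus path only (ignoring scheme, port and query string) is an assumption made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: turns a wildcard pattern into an anchored regex.
public static class UrlWildcard
{
    public static Regex ToRegex(string pattern)
    {
        // Regex.Escape turns '*' into "\*"; swap that for ".*" afterwards
        // so every other character in the pattern is matched literally.
        var body = Regex.Escape(pattern).Replace("\\*", ".*");
        return new Regex(
            "^" + body + "$",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Assumption: only host + path are compared; scheme, port and query are ignored.
    public static bool IsMatch(Regex regex, Uri url) =>
        regex.IsMatch(url.Host + url.AbsolutePath);
}
```

Building the `Regex` once per pattern and reusing it for every link keeps the translation cost out of the crawl loop.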

@mguinness

There is Compilation and Reuse in Regular Expressions to improve performance, but as you say wildcards would be faster.

I don't have a use case atm as I just happened across your repo. But I guess you could use (jpg|png|gif)$ to only grab images.
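For illustration, the image expression above could be combined with the compile-and-reuse advice roughly like this. `ImageLinkFilter` and `IsImage` are hypothetical names, and applying the expression to the URL path rather than the full URL is an assumption, made so that a query string does not hide the extension.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical filter, not part of InfinityCrawler.
public static class ImageLinkFilter
{
    // Built once and reused for every URL; RegexOptions.Compiled trades
    // a higher construction cost for faster repeated matching.
    private static readonly Regex ImageExtension =
        new Regex("(jpg|png|gif)$", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Assumption: test the path only, so "?size=large" does not defeat the '$' anchor.
    public static bool IsImage(Uri url) => ImageExtension.IsMatch(url.AbsolutePath);
}
```

As written, the expression accepts any path ending in those three letters; a stricter variant such as `\.(jpg|png|gif)$` would require the dot as well.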
