Add support for wildcard HostAliases #64
Comments
Additionally, it may be worth looking at extending support for paths too (also with wildcards). If someone specifies that all URLs like Example in C#: `return !url.Path.StartsWith("/shop/");` That may, in turn, end up deprecating
Wildcards would be good, but maybe a regular expression would provide greater flexibility? Another option could be MicroRuleEngine, which I've used in another project.
Hey @mguinness, thanks for the link - MicroRuleEngine does look interesting, though I probably won't take on a dependency for something like that at this stage (maybe in the future if I had a plugin system for this). That said, I may look at it further for other projects of mine! Regular expressions definitely could work, though I'd be cautious about the performance impact. I mean, compared to an HTTP request, the cost is negligible; however, if the vast majority of use cases can be accomplished with basic wildcards (and assuming I can make it efficient), I'd probably go that route. For your own use cases, what types of expressions would you be wanting to do? Like, would you need something like
There is Compilation and Reuse in Regular Expressions to improve performance, but as you say wildcards would be faster. I don't have a use case atm as I just happened across your repo. But I guess you could use
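To make the compile-and-reuse idea concrete, here is a minimal sketch of translating a wildcard alias into a regular expression that is compiled once and then reused for every link. The `WildcardRegexCache` class and its `IsMatch` method are hypothetical names for illustration, not part of this project's API, and treating `*` as "any run of characters, including dots" is an assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the crawler's API.
public static class WildcardRegexCache
{
    // One compiled Regex per wildcard pattern, built on first use and reused afterwards.
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.OrdinalIgnoreCase);

    public static bool IsMatch(string pattern, string host)
    {
        var regex = Cache.GetOrAdd(pattern, Build);
        return regex.IsMatch(host);
    }

    private static Regex Build(string pattern)
    {
        // Escape every regex metacharacter, then turn the escaped "*" back into ".*".
        // Assumption: "*" may match any characters, including dots.
        var body = Regex.Escape(pattern).Replace("\\*", ".*");
        return new Regex(
            "^" + body + "$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

With this sketch, `WildcardRegexCache.IsMatch("*.example.org", "blog.example.org")` pays the construction cost on the first call only; later calls with the same pattern reuse the cached instance.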
Follows on from discussions in #63 - currently the `HostAlias` setting is relatively limited, requiring an exact match before it crawls a link with that domain.

To make crawling a large number of subdomains easier, support for a wildcard (`*`) would be useful.

eg.
There likely doesn't need to be any specific rules around wildcard handling. A host alias that is only a wildcard would indicate crawling any domain linked to. This is likely where analyzers of some kind would be useful as well as additional documentation.
A full wildcard setup does allow crawling of more complex subdomains like `web.*.example.org`, which may help in some specific use cases.
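As a rough illustration of how such matching could work without regular expressions, the sketch below compares a host against an alias containing any number of `*` wildcards. `HostAliasMatcher` is a hypothetical name, not the project's actual implementation, and the semantics (case-insensitive, `*` matches zero or more characters including dots) are assumptions made for the example.

```csharp
using System;

// Hypothetical helper for illustration only; not the crawler's actual implementation.
public static class HostAliasMatcher
{
    public static bool IsMatch(string pattern, string host)
    {
        int p = 0, h = 0;
        int starIndex = -1, resumeAt = 0;

        while (h < host.Length)
        {
            if (p < pattern.Length && pattern[p] == '*')
            {
                // Remember the wildcard and first try letting it match zero characters.
                starIndex = p++;
                resumeAt = h;
            }
            else if (p < pattern.Length &&
                     char.ToLowerInvariant(pattern[p]) == char.ToLowerInvariant(host[h]))
            {
                p++;
                h++;
            }
            else if (starIndex >= 0)
            {
                // Mismatch: let the most recent wildcard absorb one more character.
                p = starIndex + 1;
                h = ++resumeAt;
            }
            else
            {
                return false;
            }
        }

        // Any trailing wildcards can match the empty string.
        while (p < pattern.Length && pattern[p] == '*')
        {
            p++;
        }
        return p == pattern.Length;
    }
}
```

Under these assumptions, an alias of just `*` accepts every host, `web.*.example.org` accepts `web.eu.example.org`, and `*.example.org` rejects the bare `example.org` because the literal dot still has to be present.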