Exclude & Allowed Switches Not Behaving as Expected #91
Comments
Looking at that line of code, it only checks the path, not the domain.
Are you expecting it to check the domain as well?
Right. I'd like to be able to limit the spider from crawling certain domains and allow it to crawl others based on a regex.
That's not currently possible. You could easily tweak that line to check the
domain instead. I don't know the property offhand, but try domain instead
of path.
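To illustrate the difference being discussed, here is a minimal sketch using Ruby's standard `URI` class, where the domain is exposed as `host`. The helper names `host_matches?` and `path_matches?` are hypothetical and are not part of CeWL, and whether the object at that line of cewl.rb is a plain `URI` is an assumption, so treat this as a starting point rather than a drop-in patch.

```ruby
require "uri"

# Hypothetical helper (not part of CeWL): true when the URL's host
# matches the pattern. URI#host is nil for URLs without a host part.
def host_matches?(url, pattern)
  host = URI.parse(url).host
  !host.nil? && host.match?(pattern)
end

# Hypothetical helper (not part of CeWL): true when the URL's path
# matches the pattern. to_s guards against a nil path on opaque URIs.
def path_matches?(url, pattern)
  URI.parse(url).path.to_s.match?(pattern)
end

pattern = /example\.org/
url = "https://example.org/docs/index.html"

# The path here is "/docs/index.html", so a domain regex never matches it.
puts "path match: #{path_matches?(url, pattern)}"
# The host here is "example.org", which is what a domain rule needs to see.
puts "host match: #{host_matches?(url, pattern)}"
```

A pattern written against a domain can only succeed if it is tested against the host, which is why a path-only check appears to ignore domain rules.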
Sounds good. Will do. By the way, after seeing you on the interwebs all these years, I had no idea you were the author of CeWL, so I'm even more impressed and grateful for your contributions to the community.
Glad you like it.
If you get stuck, let me know, and I'll have a look for the right property
in the morning.
https://github.com/spencer-dollahite/CeWL/blob/master/cewl.rb
This is the sort of approach I'd like to see: both an allowed and an exclude pattern switch that apply to the domain as well as the path. I know the code here isn't perfect, but I think it is close enough for demo purposes. Thoughts?
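For readers who want a feel for the requested behaviour without opening the fork, the following is a rough sketch of one way a combined filter could work. It is not the code from the linked fork or from CeWL itself: the method name `crawl_url?`, its keyword arguments, and the choice to match against host plus path as one string are all assumptions made for illustration.

```ruby
require "uri"

# Hypothetical filter (not CeWL's implementation): decides whether a URL
# should be crawled, testing patterns against "host + path".
#   allowed:  a single Regexp the URL must match, or nil for no restriction
#   excluded: a list of Regexps, any of which rejects the URL
def crawl_url?(url, allowed: nil, excluded: [])
  uri = URI.parse(url)
  target = "#{uri.host}#{uri.path}"

  # Exclusions win over the allowed pattern in this sketch.
  return false if excluded.any? { |re| target.match?(re) }

  allowed.nil? || target.match?(allowed)
rescue URI::InvalidURIError
  # Unparseable URLs are skipped rather than crawled.
  false
end

rules = { allowed: /example\.org/, excluded: [%r{/admin}] }

puts crawl_url?("https://blog.example.org/post/1", **rules)
puts crawl_url?("https://example.org/admin/login", **rules)
puts crawl_url?("https://other.test/", **rules)
```

Letting exclusions take priority is a design choice in this sketch; the opposite ordering would be equally easy to express if that is what the switches are meant to do.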
I'll have a look as soon as I get chance.
https://github.com/digininja/CeWL/blob/280bfe6f8f57a783cf447c47cfb38ad568177d00/cewl.rb#L814
When regex patterns are provided in a file for --exclude or on the command line for --allowed, cewl does not exclude or allow offsite URLs according to those rules.