Add option to honor robots.txt information #4
Provide an option to honor the robots.txt file.

Comments
Spec for robots is here. Things to note:
Honoring robots will mean:
Question: where to dump robots info, if anywhere. A sitemap could have a ton of links, and so could a crawl; not everyone looks in logs, though.
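One option is to write exclusions to a small report file in addition to the logs, so the information stays visible. A minimal sketch, assuming a Python crawler; the helper names and the `robots_report.json` filename are made up for illustration:

```python
import json

excluded: list[dict] = []

def record_exclusion(url: str, reason: str) -> None:
    # collect every robots-excluded URL while the crawl runs
    excluded.append({"url": url, "reason": reason})

def write_report(path: str = "robots_report.json") -> None:
    # dump the collected exclusions at the end,
    # so nobody has to dig through logs to find them
    with open(path, "w") as f:
        json.dump(excluded, f, indent=2)
```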
Also, where do we decide to exclude a page? We could drop it silently during the crawl, but then it's not immediately obvious that that happened. We also want to truly honor a robots file and not request the page at all if it is listed in the robots file.
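A minimal sketch of that check-before-fetch behavior, assuming Python and the stdlib urllib.robotparser; the fetch wrapper and user-agent string are hypothetical:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt once

def fetch(url: str, agent: str = "mycrawler"):
    # never issue the request if robots.txt disallows it
    if not rp.can_fetch(agent, url):
        print(f"skipped (disallowed by robots.txt): {url}")
        return None
    ...  # the real HTTP request would go here
```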
Robots class is created. The next step will be to also read a disallowed.json file, so that this covers both robots behavior and "filtering" of URLs; that should probably be covered in the "filtering urls" issue.
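A rough sketch of what that combined class could look like, assuming Python. The Robots name and the disallowed.json file come from the comment above; the method names and the JSON layout (a plain list of URL prefixes) are assumptions:

```python
import json
from urllib.robotparser import RobotFileParser

class Robots:
    """Combines a site's robots.txt rules with a local disallowed.json list."""

    def __init__(self, robots_url: str,
                 disallowed_path: str = "disallowed.json",
                 agent: str = "*") -> None:
        self.agent = agent
        self.parser = RobotFileParser(robots_url)
        self.parser.read()  # fetch and parse the live robots.txt
        try:
            with open(disallowed_path) as f:
                # assumed layout: a JSON array of URL prefixes to skip
                self.manual = json.load(f)
        except FileNotFoundError:
            self.manual = []  # no manual filter list is fine

    def allowed(self, url: str) -> bool:
        # robots.txt rules first, then the user's own filter list
        if not self.parser.can_fetch(self.agent, url):
            return False
        return not any(url.startswith(prefix) for prefix in self.manual)
```

With something like this, the crawler only needs one allowed() check per URL, and both kinds of exclusion flow through the same code path.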
This is mostly finished, I think, but it needs to be tested on a live site.
Finally found a better explanation of the syntax: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax
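For quick reference, an illustrative robots.txt showing the syntax that page describes (per-agent groups, Allow/Disallow with longest-match precedence, * and $ wildcards, and Sitemap lines); the paths and hostname here are invented:

```
User-agent: *
Disallow: /private/
Allow: /private/public.html   # the more specific (longer) match wins, so this stays crawlable

User-agent: examplebot
Disallow: /*.pdf$             # * matches any characters, $ anchors the end of the URL

Sitemap: https://example.com/sitemap.xml
```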