
Add option to honor robots.txt information #4

Open · paceaux opened this issue Mar 2, 2022 · 7 comments
Labels: enhancement (New feature or request)


paceaux commented Mar 2, 2022

Provide an option to honor a robots.txt file (a rough sketch follows this list):

  1. It looks for robots.txt to begin with
  2. It finds the sitemap from there, if one is listed
  3. It ignores any pages explicitly listed in robots.txt
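
A rough sketch of those three steps, assuming Node 18+ for the global fetch; the function name and return shape are hypothetical, and User-agent grouping is skipped for brevity:

```js
// Hypothetical sketch: fetch robots.txt and pull out the Sitemap URLs
// and Disallow values. User-agent groups are ignored for brevity.
async function fetchRobotsInfo(origin) {
  const response = await fetch(new URL('/robots.txt', origin));
  if (!response.ok) return { sitemaps: [], disallows: [] };

  const sitemaps = [];
  const disallows = [];

  for (const rawLine of (await response.text()).split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const [directive, ...rest] = line.split(':');
    const value = rest.join(':').trim(); // re-join, since URLs contain ':'

    if (/^sitemap$/i.test(directive)) sitemaps.push(value);
    if (/^disallow$/i.test(directive) && value) disallows.push(value);
  }

  return { sitemaps, disallows };
}
```
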
paceaux added the enhancement label Sep 2, 2022

paceaux commented Aug 28, 2024

Spec for robots is here

Things to note:

  • a page can have a <meta name="robots"> tag (see the sketch below)
  • a robots file can have Disallow rules
  • pages or directories can both be disallowed
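
A hypothetical sketch of reading that meta tag from a fetched page. It assumes cheerio; the project's actual HTML parser isn't named in this thread:

```js
// Hypothetical sketch: read the directives out of <meta name="robots">.
import * as cheerio from 'cheerio';

function getMetaRobotsDirectives(html) {
  const $ = cheerio.load(html);
  const content = $('meta[name="robots"]').attr('content') || '';
  // content is a comma-separated list, e.g. "noindex, nofollow"
  return content
    .split(',')
    .map((directive) => directive.trim().toLowerCase())
    .filter(Boolean);
}

const directives = getMetaRobotsDirectives(
  '<meta name="robots" content="noindex, nofollow">'
);
const shouldSkip = directives.includes('noindex'); // true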


paceaux commented Aug 28, 2024

Honoring robots will mean:

  1. First grab a robots.txt
  2. Collect the Disallow rules from it
  3. Generate the sitemap based only on allowed pages (a matching sketch follows this list)
  4. When scanning a page, first check whether a <meta name="robots"> exists, and then honor it
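
A minimal sketch of the filtering in step 3, using plain prefix matching; the function name is hypothetical, and wildcard syntax comes up in a later comment:

```js
// Hypothetical sketch: drop any URL whose path starts with a Disallow value.
// Plain prefix matching only; '*' and '$' handling is covered further down.
function isDisallowed(url, disallows) {
  const { pathname } = new URL(url);
  return disallows.some((rule) => pathname.startsWith(rule));
}

const disallows = ['/admin/', '/drafts'];
isDisallowed('https://example.com/admin/users', disallows); // true
isDisallowed('https://example.com/blog/post-1', disallows); // false
```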

paceaux self-assigned this Aug 28, 2024

paceaux commented Sep 3, 2024

Question:

Where to surface the robots info, if anywhere?

A sitemap could have a ton of links, and so could a crawl. If a user sees pages missing, they'll want to know why. But if we add robots info into the sitemap JSON, that could confuse things.

Logging is an option, but not everyone looks in logs.


paceaux commented Sep 3, 2024

Also: where do we decide to exclude a page?

setSitemaps() would be a great place, because it takes the sitemap file and calls an addLinks() method, so we could just filter out the pages that match the Disallow rules (a hypothetical sketch follows).

But then it's not immediately obvious that the filtering happened.

We also want to truly honor a robots file and never request a disallowed page at all.
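
One way that filtering could look. The class shape here is a guess based only on the method names mentioned above, not the project's real code:

```js
// Hypothetical sketch of the setSitemaps() idea: filter matches out of
// the sitemap links before they ever reach addLinks().
class Crawler {
  constructor(disallows = []) {
    this.disallows = disallows;
    this.links = new Set();
  }

  addLinks(urls) {
    urls.forEach((url) => this.links.add(url));
  }

  setSitemaps(sitemapUrls) {
    const allowed = sitemapUrls.filter(
      (url) => !this.disallows.some((rule) => new URL(url).pathname.startsWith(rule))
    );
    this.addLinks(allowed);
  }
}

const crawler = new Crawler(['/admin/']);
crawler.setSitemaps(['https://example.com/', 'https://example.com/admin/panel']);
// crawler.links now only contains https://example.com/
```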


paceaux commented Sep 27, 2024

The Robots class is created.

We will have a -b option to honor robots. This will only affect crawling, but it will produce a disallowed.json file (a sketch of that output is below).

The next step will be to also read a disallowed.json file, so that this covers both robots behavior and "filtering" of URLs. That should probably be covered in the "filtering urls" issue.
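
A sketch of writing that file; the actual shape the -b option produces isn't shown in this thread, so the structure below is an assumption:

```js
// Hypothetical sketch of producing disallowed.json after a crawl.
import { writeFile } from 'node:fs/promises';

async function writeDisallowedReport(skippedUrls, fileName = 'disallowed.json') {
  await writeFile(fileName, JSON.stringify({ disallowed: skippedUrls }, null, 2));
}

// top-level await (ESM)
await writeDisallowedReport(['https://example.com/admin/panel']);
```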


paceaux commented Oct 2, 2024

This is mostly finished, I think, but it needs to be tested on a live site.


paceaux commented Oct 2, 2024

Finally found a better explanation of syntax: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax
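
Based on that doc's syntax rules ('*' matches any sequence of characters, '$' anchors the end of the URL), here is one way to turn a rule value into a matcher. A sketch, not the project's actual implementation:

```js
// Sketch: convert one robots.txt rule value into a RegExp, per the
// syntax in the linked doc.
function robotsRuleToRegExp(rule) {
  const pattern = rule
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\\\$$/, '$')                 // restore a trailing '$' as an end anchor
    .replace(/\*/g, '.*');                 // '*' matches any sequence of characters
  return new RegExp(`^${pattern}`);
}

robotsRuleToRegExp('/private/*').test('/private/page.html'); // true
robotsRuleToRegExp('/*.pdf$').test('/files/report.pdf');     // true
robotsRuleToRegExp('/fish$').test('/fishing');               // false
```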
