
Add option to honor robots.txt information #4

Open · paceaux opened this issue Mar 2, 2022 · 7 comments
Labels: enhancement (New feature or request)


paceaux commented Mar 2, 2022

Provide an option to honor a robots.txt file (a rough sketch follows this list):

  1. It looks for robots.txt to begin with
  2. It finds the sitemap from there, if one is listed
  3. It ignores any pages explicitly listed in robots.txt
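
A rough sketch of those three steps, assuming Node 18+ for the global fetch; the function name and return shape are hypothetical, and User-agent grouping is skipped for brevity:

```js
// Hypothetical sketch: fetch robots.txt and pull out the Sitemap URLs
// and Disallow values. User-agent groups are ignored for brevity.
async function fetchRobotsInfo(origin) {
  const response = await fetch(new URL('/robots.txt', origin));
  if (!response.ok) return { sitemaps: [], disallows: [] };

  const sitemaps = [];
  const disallows = [];

  for (const rawLine of (await response.text()).split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const [directive, ...rest] = line.split(':');
    const value = rest.join(':').trim(); // re-join, since URLs contain ':'

    if (/^sitemap$/i.test(directive)) sitemaps.push(value);
    if (/^disallow$/i.test(directive) && value) disallows.push(value);
  }

  return { sitemaps, disallows };
}
```
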
paceaux added the enhancement label Sep 2, 2022

paceaux commented Aug 28, 2024

Spec for robots is here

Things to note:

  • a page can have a <meta name="robots"> tag (see the sketch below)
  • a robots file can have Disallow rules
  • pages or directories can both be disallowed
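
A hypothetical sketch of reading that meta tag from a fetched page. It assumes cheerio; the project's actual HTML parser isn't named in this thread:

```js
// Hypothetical sketch: read the directives out of <meta name="robots">.
import * as cheerio from 'cheerio';

function getMetaRobotsDirectives(html) {
  const $ = cheerio.load(html);
  const content = $('meta[name="robots"]').attr('content') || '';
  // content is a comma-separated list, e.g. "noindex, nofollow"
  return content
    .split(',')
    .map((directive) => directive.trim().toLowerCase())
    .filter(Boolean);
}

const directives = getMetaRobotsDirectives(
  '<meta name="robots" content="noindex, nofollow">'
);
const shouldSkip = directives.includes('noindex'); // true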


paceaux commented Aug 28, 2024

Honoring robots will mean:

  1. First grab a robots.txt
  2. Collect the Disallow rules from it
  3. Generate the sitemap based only on allowed pages (a matching sketch follows this list)
  4. When scanning a page, first check whether a <meta name="robots"> exists, and then honor it
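
A minimal sketch of the filtering in step 3, using plain prefix matching; the function name is hypothetical, and wildcard syntax comes up in a later comment:

```js
// Hypothetical sketch: drop any URL whose path starts with a Disallow value.
// Plain prefix matching only; '*' and '$' handling is covered further down.
function isDisallowed(url, disallows) {
  const { pathname } = new URL(url);
  return disallows.some((rule) => pathname.startsWith(rule));
}

const disallows = ['/admin/', '/drafts'];
isDisallowed('https://example.com/admin/users', disallows); // true
isDisallowed('https://example.com/blog/post-1', disallows); // false
```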

paceaux self-assigned this Aug 28, 2024

paceaux commented Sep 3, 2024

Question:

Where to surface the robots info, if anywhere?

A sitemap could have a ton of links, and so could a crawl. If a user sees pages missing, they'll want to know why. But if we add robots info into the sitemap JSON, that could confuse things.

Logging is an option, but not everyone looks in logs.


paceaux commented Sep 3, 2024

Also: where do we decide to exclude a page?

setSitemaps() would be a great place, because it takes the sitemap file and calls an addLinks() method, so we could just filter out the pages that match the Disallow rules (a hypothetical sketch follows).

But then it's not immediately obvious that the filtering happened.

We also want to truly honor a robots file and never request a disallowed page at all.
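
One way that filtering could look. The class shape here is a guess based only on the method names mentioned above, not the project's real code:

```js
// Hypothetical sketch of the setSitemaps() idea: filter matches out of
// the sitemap links before they ever reach addLinks().
class Crawler {
  constructor(disallows = []) {
    this.disallows = disallows;
    this.links = new Set();
  }

  addLinks(urls) {
    urls.forEach((url) => this.links.add(url));
  }

  setSitemaps(sitemapUrls) {
    const allowed = sitemapUrls.filter(
      (url) => !this.disallows.some((rule) => new URL(url).pathname.startsWith(rule))
    );
    this.addLinks(allowed);
  }
}

const crawler = new Crawler(['/admin/']);
crawler.setSitemaps(['https://example.com/', 'https://example.com/admin/panel']);
// crawler.links now only contains https://example.com/
```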


paceaux commented Sep 27, 2024

The Robots class is created.

We will have a -b option to honor robots. This will only affect crawling, but it will produce a disallowed.json file (a sketch of that output is below).

The next step will be to also read a disallowed.json file, so that this covers both robots behavior and "filtering" of URLs. That should probably be covered in the "filtering urls" issue.
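
A sketch of writing that file; the actual shape the -b option produces isn't shown in this thread, so the structure below is an assumption:

```js
// Hypothetical sketch of producing disallowed.json after a crawl.
import { writeFile } from 'node:fs/promises';

async function writeDisallowedReport(skippedUrls, fileName = 'disallowed.json') {
  await writeFile(fileName, JSON.stringify({ disallowed: skippedUrls }, null, 2));
}

// top-level await (ESM)
await writeDisallowedReport(['https://example.com/admin/panel']);
```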


paceaux commented Oct 2, 2024

This is mostly finished, I think, but it needs to be tested on a live site.


paceaux commented Oct 2, 2024

Finally found a better explanation of syntax: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax
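
Based on that doc's syntax rules ('*' matches any sequence of characters, '$' anchors the end of the URL), here is one way to turn a rule value into a matcher. A sketch, not the project's actual implementation:

```js
// Sketch: convert one robots.txt rule value into a RegExp, per the
// syntax in the linked doc.
function robotsRuleToRegExp(rule) {
  const pattern = rule
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters
    .replace(/\\\$$/, '$')                 // restore a trailing '$' as an end anchor
    .replace(/\*/g, '.*');                 // '*' matches any sequence of characters
  return new RegExp(`^${pattern}`);
}

robotsRuleToRegExp('/private/*').test('/private/page.html'); // true
robotsRuleToRegExp('/*.pdf$').test('/files/report.pdf');     // true
robotsRuleToRegExp('/fish$').test('/fishing');               // false
```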
