Disallow crawlers on the /data subdirectory
#334
Merged
The /data directory just contains our data files. We noticed crawlers consuming 95% of the total bandwidth by scraping these data files, and we do not see a use case for that.

Additionally, the explicit mention of the different user-agents is unnecessary. Most likely it did not even do what was intended in the first place: googlebot (and, I assume, other crawlers) look for the most specific user-agent match and follow only those rules. That means all the explicit user-agent sections were doing was explicitly allowing those bots to also crawl /cgi-bin/, in addition to allowing everything else. That did not seem intentional, so I took the liberty of simplifying.
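
For reference, a minimal sketch of the simplified robots.txt this change describes. The /cgi-bin/ path is inferred from the description above; any other details of the actual file are assumptions:

```
# One rule set for all crawlers. Per-bot sections are gone: crawlers
# follow only their most specific User-agent match, so the old per-bot
# entries effectively re-allowed /cgi-bin/ for those bots.
User-agent: *
Disallow: /cgi-bin/
Disallow: /data
```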