From 5030d150691c5581d9d8b252b679ce90099264ce Mon Sep 17 00:00:00 2001
From: Pieter Gijsbers
Date: Wed, 10 Jul 2024 14:58:53 +0200
Subject: [PATCH] Disallow crawlers on the `/data` subdirectory

The `/data` directory only contains our data files. We noticed crawlers
using 95% of the total bandwidth by scraping these files, and we do not
see a use case for that.

Additionally, listing the different user-agents explicitly is unnecessary.
Most likely it did not even do what was intended in the first place:
Googlebot (and, I assume, other crawlers) uses the most specific matching
user-agent group and follows only those rules. That means the only effect
of the explicit user-agent entries was to allow those crawlers to also
crawl /cgi-bin/, in addition to everything else. That did not seem
intentional, so I took the liberty to simplify.
---
 server/src/client/app/public/robots.txt | 34 +------------------------
 1 file changed, 1 insertion(+), 33 deletions(-)

diff --git a/server/src/client/app/public/robots.txt b/server/src/client/app/public/robots.txt
index df371334..8671b559 100644
--- a/server/src/client/app/public/robots.txt
+++ b/server/src/client/app/public/robots.txt
@@ -1,35 +1,3 @@
-User-agent: Googlebot
-Disallow:
-User-agent: googlebot-image
-Disallow:
-User-agent: googlebot-mobile
-Disallow:
-User-agent: MSNBot
-Disallow:
-User-agent: Slurp
-Disallow:
-User-agent: Teoma
-Disallow:
-User-agent: Gigabot
-Disallow:
-User-agent: Robozilla
-Disallow:
-User-agent: Nutch
-Disallow:
-User-agent: ia_archiver
-Disallow:
-User-agent: baiduspider
-Disallow:
-User-agent: naverbot
-Disallow:
-User-agent: yeti
-Disallow:
-User-agent: yahoo-mmcrawler
-Disallow:
-User-agent: psbot
-Disallow:
-User-agent: yahoo-blogs/v3.9
-Disallow:
 User-agent: *
-Disallow:
+Disallow: /data/
 Disallow: /cgi-bin/
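
For context, this is roughly how the group-matching rule described above
plays out; a sketch only, and individual crawlers may differ in how
strictly they follow it:

    # Old file (most specific group wins):
    User-agent: Googlebot
    Disallow:               # Googlebot matches this group only, so it may
                            # crawl everything, including /cgi-bin/; the
                            # '*' group below is ignored for it.

    User-agent: *
    Disallow: /cgi-bin/     # only crawlers without their own group land here

    # New file: every crawler falls through to the single '*' group
    # and is kept out of /data/ and /cgi-bin/.
    User-agent: *
    Disallow: /data/
    Disallow: /cgi-bin/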