From 5030d150691c5581d9d8b252b679ce90099264ce Mon Sep 17 00:00:00 2001
From: Pieter Gijsbers
Date: Wed, 10 Jul 2024 14:58:53 +0200
Subject: [PATCH] Disallow crawlers on the `/data` subdirectory

The `/data` directory only contains our data files. We noticed crawlers consuming 95% of total bandwidth by scraping these files, and we see no use case for having them crawled.

Additionally, explicitly listing the individual user-agents is unnecessary, and most likely never did what was intended in the first place: Googlebot (and, I assume, other crawlers) follow only the rules in the most specific matching user-agent group. So the only effect of the explicit per-bot entries was to allow those bots to also crawl `/cgi-bin/`, in addition to allowing everything else. That did not seem intentional, so I took the liberty to simplify.
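
For illustration, a minimal sketch of the matching behavior (assuming a crawler that, like Googlebot, obeys only the most specific matching group):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /cgi-bin/

With a file shaped like this, Googlebot matches its own group and ignores the `*` group entirely, so `/cgi-bin/` is not disallowed for it; only crawlers without a dedicated group follow the `*` rules. The old robots.txt had exactly this shape.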
---
server/src/client/app/public/robots.txt | 34 +------------------------
1 file changed, 1 insertion(+), 33 deletions(-)
diff --git a/server/src/client/app/public/robots.txt b/server/src/client/app/public/robots.txt
index df371334..8671b559 100644
--- a/server/src/client/app/public/robots.txt
+++ b/server/src/client/app/public/robots.txt
@@ -1,35 +1,3 @@
-User-agent: Googlebot
-Disallow:
-User-agent: googlebot-image
-Disallow:
-User-agent: googlebot-mobile
-Disallow:
-User-agent: MSNBot
-Disallow:
-User-agent: Slurp
-Disallow:
-User-agent: Teoma
-Disallow:
-User-agent: Gigabot
-Disallow:
-User-agent: Robozilla
-Disallow:
-User-agent: Nutch
-Disallow:
-User-agent: ia_archiver
-Disallow:
-User-agent: baiduspider
-Disallow:
-User-agent: naverbot
-Disallow:
-User-agent: yeti
-Disallow:
-User-agent: yahoo-mmcrawler
-Disallow:
-User-agent: psbot
-Disallow:
-User-agent: yahoo-blogs/v3.9
-Disallow:
User-agent: *
-Disallow:
+Disallow: /data/
Disallow: /cgi-bin/