Disallow crawlers on the /data subdirectory
#334
Merged
The /data directory just contains our data files. We noticed crawlers consuming 95% of the total bandwidth by scraping these data files, and we do not see a use case for that.

Additionally, the explicit mention of the different user-agents is unnecessary. Most likely it did not even do what was intended in the first place: googlebot (and, I assume, other crawlers) look for the most specific user-agent match and follow only those rules. That means all the explicit user-agent sections were doing was explicitly allowing those bots to also crawl /cgi-bin/, in addition to allowing everything else. That did not seem intentional, so I took the liberty of simplifying.
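
For reference, a minimal sketch of the simplified robots.txt this change describes. The /cgi-bin/ path is inferred from the description above; any other details of the actual file are assumptions:

```
# One rule set for all crawlers. Per-bot sections are gone: crawlers
# follow only their most specific User-agent match, so the old per-bot
# entries effectively re-allowed /cgi-bin/ for those bots.
User-agent: *
Disallow: /cgi-bin/
Disallow: /data
```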