[FCL-284] Add ADR explaining the change to our robots.txt #177

Merged Sep 16, 2024
# 20. Allow crawling in robots.txt and use sitemap to promote discovery

Date: 2024-09-16

## Status

Accepted

## Context

At the moment our `robots.txt` explicitly disallows crawling of the site. This is nominally to discourage search engines from indexing judgments; however, search engines may still index pages that are linked from other sites, using related terms on those pages to infer the title or content of the document.
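For illustration, a disallow-all `robots.txt` of the kind described here typically looks like the following (the actual file is not reproduced in this ADR, so this is a representative sketch):

```
User-agent: *
Disallow: /
```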

For each document on the service we do provide a `noindex` robots directive, through both HTML meta tags and HTTP headers, but because crawling and scraping of the page is forbidden by `robots.txt`, search engines will never fetch the page and so never discover that they shouldn't index it. As a result, pages may still appear in search results.
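The two delivery mechanisms mentioned above take roughly this form; these are the standard `noindex` directives, not the service's exact markup:

```html
<!-- In the document's <head> -->
<meta name="robots" content="noindex">
```

```
HTTP response header equivalent:
X-Robots-Tag: noindex
```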

The spirit of the service is that individual documents should not appear in search indexes, and to achieve this we (somewhat counter-intuitively) need to allow crawling so that search engines can discover that they shouldn't include pages in their index.

## Decision

- `robots.txt` on Public UI will be changed to allow crawling of the entire site (see the sketch after this list).
- A new sitemap, rooted at `sitemap.xml`, will be provided to allow easy discovery of all documents for the purposes of crawling. This means search engines can rapidly discover the entire corpus of documents, crawl them, and mark them for exclusion from their search indexes.
- This sitemap will be referenced in `robots.txt`, as well as manually submitted to major search engines.
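Taken together, the new `robots.txt` and the sitemap might look something like the sketch below. The hostname and document path are placeholders, not the service's actual URLs:

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per document on the service -->
  <url>
    <loc>https://example.com/judgment/abc-123</loc>
    <lastmod>2024-09-16</lastmod>
  </url>
</urlset>
```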

## Consequences

- Search engines that obey `noindex` robots directives will be better able to proactively exclude documents from their indexes.
- Users looking to crawl the site will be able to discover all site content.