From 24d8a34da576f86b10923e426f66df48ab6201b9 Mon Sep 17 00:00:00 2001
From: Brian Smith
Date: Wed, 11 Dec 2024 17:17:32 +0100
Subject: [PATCH] feat(http): Add X-Robots-Tag header (#37079)

* feat(http): Add X-Robots-Tag header

* Update files/en-us/web/http/headers/x-robots-tag/index.md

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* Apply suggestions from code review

Co-authored-by: Estelle Weyl

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* Update files/en-us/web/http/headers/x-robots-tag/index.md

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* feat(http): X-Robots-Tag header, robots.txt

* Update files/en-us/web/http/headers/x-robots-tag/index.md

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Estelle Weyl

* Apply suggestions from code review

Co-authored-by: Vadim Makeev

* Update files/en-us/web/http/headers/x-robots-tag/index.md

Co-authored-by: Vadim Makeev

* Update files/en-us/web/http/headers/x-robots-tag/index.md

* Update files/en-us/web/http/headers/x-robots-tag/index.md

* chore(http): improvements following reviewer feedback

* chore(http): improvements following reviewer feedback

---------

Co-authored-by: Estelle Weyl
Co-authored-by: Vadim Makeev
---
 files/en-us/glossary/robots.txt/index.md      |  18 +-
 .../en-us/web/html/element/meta/name/index.md |   4 +-
 .../web/http/headers/x-robots-tag/index.md    | 196 ++++++++++++++++++
 3 files changed, 211 insertions(+), 7 deletions(-)
 create mode 100644 files/en-us/web/http/headers/x-robots-tag/index.md

diff --git a/files/en-us/glossary/robots.txt/index.md b/files/en-us/glossary/robots.txt/index.md
index f8c59e870d7dcf3..0f8d22c657c6f32 100644
--- a/files/en-us/glossary/robots.txt/index.md
+++ b/files/en-us/glossary/robots.txt/index.md
@@ -6,13 +6,21 @@ page-type: glossary-definition
 
 {{GlossarySidebar}}
 
-Robots.txt is a file which is usually placed in the root of any website. It decides whether {{Glossary("crawler", "crawlers")}} are permitted or forbidden access to the website.
+A **robots.txt** file is usually placed in the root of a website (for example, `https://www.example.com/robots.txt`).
+It specifies whether {{Glossary("crawler", "crawlers")}} are allowed or disallowed access to an entire website or to certain resources on it.
+A restrictive `robots.txt` file can prevent bandwidth consumption by crawlers.
 
-For example, the site admin can forbid crawlers to visit a certain folder (and all the files therein contained) or to crawl a specific file, usually to prevent those files being indexed by other search engines.
+A site owner can forbid crawlers from crawling a certain path (and all files in that path) or a specific file.
+This is often done to prevent these resources from being indexed or served by search engines.
+
+If a crawler is allowed to access resources, you can define [indexing rules](/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#directives) for those resources via `<meta name="robots">` elements and {{HTTPHeader("X-Robots-Tag")}} HTTP headers.
+Search-related crawlers use these rules to determine how to index and serve resources in search results, or to adjust the crawl rate for specific resources over time.
 
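+For example, the following minimal `robots.txt` file (an illustrative sketch; the `/internal/` path is hypothetical) blocks all cooperating crawlers from one directory while leaving the rest of the site crawlable:
+
+```plain
+User-agent: *
+Disallow: /internal/
+```
+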
 ## See also
 
+- {{HTTPHeader("X-Robots-Tag")}}
+- {{Glossary("Search engine")}}
+- {{RFC("9309", "Robots Exclusion Protocol")}}
+- [How Google interprets the robots.txt specification](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt) on developers.google.com
+- <https://www.robotstxt.org>
 - [Robots.txt](https://en.wikipedia.org/wiki/Robots.txt) on Wikipedia
-- <https://www.robotstxt.org>
-- Standard specification: [RFC9309](https://www.rfc-editor.org/rfc/rfc9309.html)
-- <https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt>

diff --git a/files/en-us/web/html/element/meta/name/index.md b/files/en-us/web/html/element/meta/name/index.md
index 3d27923a43ce778..f7ad03baa657923 100644
--- a/files/en-us/web/html/element/meta/name/index.md
+++ b/files/en-us/web/html/element/meta/name/index.md
@@ -241,10 +241,10 @@ The [WHATWG Wiki MetaExtensions page](https://wiki.whatwg.org/wiki/MetaExtensions
 >
 > - Only cooperative robots follow these rules. Do not expect to prevent email harvesters with them.
 > - The robot still needs to access the page in order to read these rules. To prevent bandwidth consumption, consider if using a _{{Glossary("robots.txt")}}_ file is more appropriate.
-> - The `robots` `<meta>` tag and `robots.txt` file serve different purposes: `robots.txt` controls the crawling of pages, and does not affect indexing or other behavior controlled by `robots` meta. A page that can't be crawled may still be indexed if it is referenced by another document.
+> - The `<meta name="robots">` element and `robots.txt` file serve different purposes: `robots.txt` controls the crawling of pages, and does not affect indexing or other behavior controlled by `robots` meta. A page that can't be crawled may still be indexed if it is referenced by another document.
 > - If you want to remove a page, `noindex` will work, but only after the robot visits the page again. Ensure that the `robots.txt` file is not preventing revisits.
 > - Some values are mutually exclusive, like `index` and `noindex`, or `follow` and `nofollow`. In these cases the robot's behavior is undefined and may vary between them.
-> - Some crawler robots, like Google, Yahoo and Bing, support the same values for the HTTP header `X-Robots-Tag`; this allows non-HTML documents like images to use these rules.
+> - Some crawler robots, like Google, Yahoo and Bing, support the same values for the HTTP header {{HTTPHeader("X-Robots-Tag")}}; this allows non-HTML documents like images to use these rules.

diff --git a/files/en-us/web/http/headers/x-robots-tag/index.md b/files/en-us/web/http/headers/x-robots-tag/index.md
new file mode 100644
index 000000000000000..029138228b3329f
--- /dev/null
+++ b/files/en-us/web/http/headers/x-robots-tag/index.md
@@ -0,0 +1,196 @@
+---
+title: X-Robots-Tag
+slug: Web/HTTP/Headers/X-Robots-Tag
+page-type: http-header
+status:
+  - non-standard
+---
+
+{{HTTPSidebar}}
+
+The **`X-Robots-Tag`** {{Glossary("response header")}} defines how {{glossary("Crawler", "crawlers")}} should index URLs.
+While not part of any specification, it is a de facto standard method for communicating with search bots, web crawlers, and similar user agents.
+Search-related crawlers use the rules from the `X-Robots-Tag` header to adjust how they present web pages or other resources in search results.
+
+Indexing rules defined via `<meta name="robots">` elements and `X-Robots-Tag` headers are discovered when a URL is crawled.
+Specifying indexing rules in an HTTP header is useful for non-HTML documents like images, PDFs, or other media.
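+
+For example, a response serving a PDF document can carry indexing rules that markup cannot express (an illustrative sketch; the `Date` value and rules are placeholders):
+
+```http
+HTTP/1.1 200 OK
+Date: Tue, 03 Dec 2024 17:08:49 GMT
+Content-Type: application/pdf
+X-Robots-Tag: noindex, nofollow
+```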
+
+> [!NOTE]
+> Only cooperative robots follow these rules, and a crawler still needs to access the resource to read headers and meta elements (see [Interaction with robots.txt](#interaction_with_robots.txt)).
+> If you want to prevent bandwidth consumption by crawlers, a restrictive {{Glossary("robots.txt")}} file is more effective than indexing rules, as it blocks resources from being crawled entirely.
+
+<table class="properties">
+  <tbody>
+    <tr>
+      <th scope="row">Header type</th>
+      <td>{{Glossary("Response header")}}</td>
+    </tr>
+    <tr>
+      <th scope="row">{{Glossary("Forbidden header name")}}</th>
+      <td>No</td>
+    </tr>
+  </tbody>
+</table>
+
+## Syntax
+
+One or more indexing rules as a comma-separated list:
+
+```http
+X-Robots-Tag: <indexing-rule>
+X-Robots-Tag: <indexing-rule>, …, <indexing-rule>
+```
+
+An optional `<bot-name>:` specifies the user agent that the subsequent rules should apply to:
+
+```http
+X-Robots-Tag: <indexing-rule>, <bot-name>: <indexing-rule>
+X-Robots-Tag: <bot-name>: <indexing-rule>, …, <indexing-rule>
+```
+
+See [Specifying user agents](#specifying_user_agents) for an example.
+
+## Directives
+
+Any of the following indexing rules may be used:
+
+- `all`
+  - : No restrictions for indexing or serving in search results.
+    This rule is the default value and has no effect if explicitly listed.
+- `noindex`
+  - : Do not show this page, media, or resource in search results.
+    If omitted, the page, media, or resource may be indexed and shown in search results.
+- `nofollow`
+  - : Do not follow the links on this page.
+    If omitted, search engines may use the links on the page to discover those linked pages.
+- `none`
+  - : Equivalent to `noindex, nofollow`.
+- `nosnippet`
+  - : Do not show a text snippet or video preview in the search results for this page.
+    A static image thumbnail (if available) may still be visible.
+    If omitted, search engines may generate a text snippet and video preview based on information found on the page.
+    To exclude certain sections of your content from appearing in search result snippets, use the [`data-nosnippet` HTML attribute](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#data-nosnippet-attr).
+- `indexifembedded`
+  - : A search engine is allowed to index the content of a page if it's embedded in another page through iframes or similar HTML elements, in spite of a `noindex` rule.
+    `indexifembedded` only has an effect if it's accompanied by `noindex`.
+- `max-snippet: <number>`
+  - : Use a maximum of `<number>` characters as a textual snippet for this search result.
+    Ignored if no valid `<number>` is specified.
+- `max-image-preview: <setting>`
+  - : The maximum size of an image preview for this page in search results.
+    If omitted, search engines may show an image preview of the default size.
+    If you don't want search engines to use larger thumbnail images, specify a `max-image-preview` value of `standard` or `none`. Values include:
+    - `none`
+      - : No image preview is to be shown.
+    - `standard`
+      - : A default image preview may be shown.
+    - `large`
+      - : A larger image preview, up to the width of the viewport, may be shown.
+- `max-video-preview: <number>`
+  - : Use a maximum of `<number>` seconds as a video snippet for videos on this page in search results.
+    If omitted, search engines may show a video snippet in search results, and the search engine decides how long the preview may be.
+    Ignored if no valid `<number>` is specified.
+    Special values are as follows:
+    - `0`
+      - : At most, a static image may be used, in accordance with the `max-image-preview` setting.
+    - `-1`
+      - : No video length limit.
+- `notranslate`
+  - : Don't offer translation of this page in search results.
+    If omitted, search engines may translate the search result title and snippet into the language of the search query.
+- `noimageindex`
+  - : Do not index images on this page.
+    If omitted, images on the page may be indexed and shown in search results.
+- `unavailable_after: <date/time>`
+  - : Requests not to show this page in search results after the specified `<date/time>`.
+    Ignored if no valid `<date/time>` is specified.
+    A date must be specified in a format such as {{RFC("822")}}, {{RFC("850")}}, or ISO 8601.
+    By default there is no expiration date for content; if omitted, this page may be shown in search results indefinitely.
+    Crawlers are expected to considerably decrease the crawl rate of the URL after the specified date and time.
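+
+For example, several of these rules can be combined in a single header (an illustrative sketch; the values are arbitrary):
+
+```http
+X-Robots-Tag: max-snippet:50, max-image-preview:standard, noimageindex
+```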
+
+## Description
+
+Indexing rules via `<meta name="robots">` and `X-Robots-Tag` are discovered when a URL is crawled.
+Most crawlers support the same rules in the `X-Robots-Tag` HTTP header that can be used in a `<meta name="robots">` element.
+
+In the case of conflicting robot rules within the `X-Robots-Tag` or between the `X-Robots-Tag` HTTP header and the `<meta name="robots">` element, the more restrictive rule applies.
+For example, if a page has both `max-snippet:50` and `nosnippet` rules, the `nosnippet` rule applies.
+Indexing rules won't be discovered or applied if paths are blocked from being crawled by a `robots.txt` file.
+
+Some values are mutually exclusive, such as `index` and `noindex`, or `follow` and `nofollow`.
+In these cases, the crawler's behavior is undefined and may vary.
+
+### Interaction with robots.txt
+
+If a resource is blocked from crawling through a `robots.txt` file, then any information about indexing or serving rules specified using `<meta name="robots">` or the `X-Robots-Tag` HTTP header will not be detected and will therefore be ignored.
+
+A page that's blocked from crawling may still be indexed if it is referenced from another document (see the [`nofollow`](#nofollow) directive).
+If you want to remove a page from search indexes, `X-Robots-Tag: noindex` will typically work, but a robot must first revisit the page to detect the `X-Robots-Tag` rule.
+
+## Examples
+
+### Using X-Robots-Tag
+
+The following `X-Robots-Tag` header adds `noindex`, asking crawlers not to show this page, media, or resource in search results:
+
+```http
+HTTP/1.1 200 OK
+Date: Tue, 03 Dec 2024 17:08:49 GMT
+X-Robots-Tag: noindex
+```
+
+### Multiple headers
+
+The following response has two `X-Robots-Tag` headers, each with an indexing rule specified:
+
+```http
+HTTP/1.1 200 OK
+Date: Tue, 03 Dec 2024 17:08:49 GMT
+X-Robots-Tag: noimageindex
+X-Robots-Tag: unavailable_after: Wed, 03 Dec 2025 13:09:53 GMT
+```
+
+### Specifying user agents
+
+It's possible to specify which user agent the rules should apply to.
+The following example contains two `X-Robots-Tag` headers which ask that `googlebot` not follow the links on this page and that a fictional `BadBot` crawler not index the page or follow any links on it, either:
+
+```http
+HTTP/1.1 200 OK
+Date: Tue, 03 Dec 2024 17:08:49 GMT
+X-Robots-Tag: BadBot: noindex, nofollow
+X-Robots-Tag: googlebot: nofollow
+```
+
+In the response below, the same indexing rules are defined, but in a single header.
+Each indexing rule applies to the user agent specified before it:
+
+```http
+HTTP/1.1 200 OK
+Date: Tue, 03 Dec 2024 17:08:49 GMT
+X-Robots-Tag: BadBot: noindex, nofollow, googlebot: nofollow
+```
+
+For situations where multiple crawlers are specified along with different rules, the search engine will use the sum of the negative rules.
+For example:
+
+```http
+X-Robots-Tag: nofollow
+X-Robots-Tag: googlebot: noindex
+```
+
+The page containing these headers will be interpreted as having a `noindex, nofollow` rule when crawled by `googlebot`.
+
+## Specifications
+
+Not part of any current specification.
+
+## See also
+
+- {{Glossary("Robots.txt")}}
+- {{Glossary("Search engine")}}
+- {{RFC("9309", "Robots Exclusion Protocol")}}
+- [Using the X-Robots-Tag HTTP header](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag) on developers.google.com