Merge pull request #60 from bobmatyas/updates/100424
Updates/100424
bobmatyas authored Oct 7, 2024
2 parents 0b798c2 + 2b1f82c commit 2942b3c
Showing 4 changed files with 129 additions and 41 deletions.
40 changes: 36 additions & 4 deletions README.md
@@ -1,9 +1,41 @@
# Block AI Crawlers

A WordPress plugin for blocking AI crawlers. It uses `robots.txt` and an experimental Meta Tag to tell AI crawlers not to use your site as part of their training data.
This WordPress plugin helps prevent AI crawlers from using your content as training data for their products. By updating your site's `robots.txt`, it blocks common AI crawlers and scrapers, aiming to protect your content from being used in the training of Large Language Models (LLMs).

## Installing
## Features

You can install the plugin via the WordPress.org plugin directory:
### Blocks AI Crawlers

- https://wordpress.org/plugins/block-ai-crawlers
Includes:

- **OpenAI** - Blocks crawlers used for ChatGPT
- **Google** - Blocks crawlers used by Google's Gemini AI products
- **Facebook / Meta** - Blocks crawlers used for Meta's AI training
- **Anthropic AI** - Blocks crawlers used by Anthropic
- **Perplexity** - Blocks crawlers used by Perplexity
- **Applebot** - Blocks crawlers used by Apple
- ... and more!

### Experimental Meta Tags

The plugin adds the "noai, noimageai" directive to your site's meta tags, instructing AI bots not to use your content in their datasets. Please note that these tags are experimental and have not been standardized.
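
To illustrate, here is a minimal sketch of how a plugin can emit such a tag through the `wp_head` hook (the function name is hypothetical, and this is not necessarily this plugin's exact implementation):

```php
<?php
// Hypothetical sketch: emit the experimental "noai, noimageai" directive
// in the <head> of every page via the core wp_head action.
function myprefix_output_noai_meta() {
	echo '<meta name="robots" content="noai, noimageai">' . "\n";
}
add_action( 'wp_head', 'myprefix_output_noai_meta' );
```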

## Installation

1. Download the plugin zip file.
2. Go to your WordPress admin panel.
3. Navigate to Plugins > Add New > Upload Plugin.
4. Choose the zip file and click "Install Now."
5. Activate the plugin.

## Usage

After activation, the plugin will automatically update your `robots.txt` and add the necessary meta tags. No further configuration is required, but you can check the settings page for a full list of blocked crawlers.

## Limitations

While this plugin aims to block specified crawlers, it cannot guarantee complete protection against all forms of scraping, as some bots may disregard `robots.txt` directives.

## Support

For questions or support, [please post on the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).
9 changes: 7 additions & 2 deletions block-ai-crawlers.php
@@ -5,7 +5,7 @@
* Author: Bob Matyas
* Author URI: https://www.bobmatyas.com
* Text Domain: block-ai-crawlers
* Version: 1.3.9
* Version: 1.4.0
* License: GPL-2.0-or-later
* License URI: https://www.gnu.org/licenses/gpl-2.0.html
*
@@ -40,16 +40,21 @@ function block_ai_robots_txt( $robots ) {
$robots .= "User-agent: cohere-ai\n";
$robots .= "User-agent: Diffbot\n";
$robots .= "User-agent: FacebookBot\n";
$robots .= "User-agent: FriendlyCrawler\n";
$robots .= "User-agent: GPTBot\n";
$robots .= "User-agent: Google-Extended\n";
$robots .= "User-agent: ImagesiftBot\n";
$robots .= "User-agent: Kangaroo Bot\n";
$robots .= "User-agent: Meta-ExternalAgent\n";
$robots .= "User-agent: Meta-ExternalFetcher\n";
$robots .= "User-agent: OAI-SearchBot\n";
$robots .= "User-agent: Omgili\n";
$robots .= "User-agent: Omgilibot\n";
$robots .= "User-agent: PetalBot\n";
$robots .= "User-agent: PerplexityBot\n";
$robots .= "User-agent: Scrapy\n";
$robots .= "User-agent: SentiBot\n";
$robots .= "User-agent: sentibot\n";
$robots .= "User-agent: Timpibot\n";
$robots .= "User-agent: YouBot\n";
$robots .= "User-agent: webzio\n";
@@ -91,7 +96,7 @@ function block_ai_activate() {
*/
function block_ai_prepend_plugin_settings_link( $links_array, $plugin_file_name ) {
if ( strpos( $plugin_file_name, basename( __FILE__ ) ) ) {
array_unshift( $links_array, '<a href=" ' . get_admin_url() . ' options-general.php?page=block-ai-crawlers">Settings</a>' );
array_unshift( $links_array, '<a href="' . get_admin_url() . 'options-general.php?page=block-ai-crawlers">Settings</a>' );
}
return $links_array;
}
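
For context, these callbacks only take effect once they are attached to WordPress hooks. That wiring sits outside this hunk; an assumed sketch of what it presumably looks like (the hook names are core WordPress, but the actual call sites are not shown in this diff):

```php
<?php
// Assumed wiring (not visible in this hunk): attach the robots.txt
// builder and the settings-link callback to their core hooks.
add_filter( 'robots_txt', 'block_ai_robots_txt' );
add_filter( 'plugin_action_links', 'block_ai_prepend_plugin_settings_link', 10, 2 );
```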
44 changes: 32 additions & 12 deletions inc/settings-html.php
@@ -39,52 +39,62 @@
</tr>
<tr>
<th>Bytespider</th>
<td><p>Used by TikTok for AI training</p></td>
<td><p>Used by TikTok for AI training.</p></td>
<td><a href="https://darkvisitors.com/agents/bytespider" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Cohere</th>
<td><p>Used by Cohere to scrape data for AI training</p></td>
<td><p>Used by Cohere to scrape data for AI training.</p></td>
<td><a href="https://darkvisitors.com/agents/cohere-ai" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ChatGPT</th>
<td><p>Used by OpenAI</p></td>
<td><p>Used by OpenAI to power ChatGPT.</p></td>
<td><a href="https://platform.openai.com/docs/plugins/bot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ClaudeBot and Claude-Web</th>
<td><p>Used by Anthropic's Claude</p></td>
<td><p>Used by Anthropic's Claude.</p></td>
<td><a href="https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>CommonCrawl</th>
<td><p>Compiles datasets used to train AI models</p></td>
<td><p>Compiles datasets used to train AI models.</p></td>
<td><a href="https://commoncrawl.org/big-picture/frequently-asked-questions/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Diffbot</th>
<td><p>Used by Diffbot to scrape data for AI training</p></td>
<td><p>Used by Diffbot to scrape data for AI training.</p></td>
<td><a href="https://docs.diffbot.com/reference/crawl-introduction" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>FacebookBot</th>
<td><p>Used by Meta (Facebook) for their AI</p></td>
<td><p>Used by Meta (Facebook) for their AI.</p></td>
<td><a href="https://developers.facebook.com/docs/sharing/bot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Friendly Crawler</th>
<td><p>Crawls websites to build datasets for machine learning experiments.</p></td>
<td><a href="https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Google Extended</th>
<td><p>Used by Google to power Gemini (formerly known as Bard)</p></td>
<td><p>Used by Google to power Gemini (formerly known as Bard).</p></td>
<td><a href="https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers?hl=en#common-crawlers" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ImagesiftBot</th>
<td><p>Used by Hive's Imagesift tool that scrapes images. This may be used for the company's generative AI product </p></td>
<td><p>Used by Hive's Imagesift tool that scrapes images. This may be used for the company's generative AI product.</p></td>
<td><a href="https://imagesift.com/about" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Kangaroo Bot</th>
<td><p>Used to power the Australia-focused Kangaroo LLM.</p></td>
<td><a href="https://kangaroollm.com.au/kangaroo-bot/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Meta-ExternalAgent / Meta-ExternalFetcher</th>
<td><p>Used by Meta to train AI products</p></td>
<td><p>Used by Meta (Facebook) to train AI products.</p></td>
<td><a href="https://developers.facebook.com/docs/sharing/webmasters/web-crawlers" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
@@ -94,14 +104,24 @@
</tr>
<tr>
<th>Omgilibot</th>
<td><p>Used by Omgili to scrape data for AI training</p></td>
<td><p>Used by Omgili to scrape data for AI training.</p></td>
<td><a href="https://webz.io/blog/machine-learning/common-crawl-vs-webz-io-data-which-one-works-best-for-large-language-models/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>PerplexityBot</th>
<td><p>Used by Perplexity for their AI products</p></td>
<td><p>Used by Perplexity for their AI products.</p></td>
<td><a href="https://docs.perplexity.ai/docs/perplexitybot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Scrapy</th>
<td><p>An open-source web scraping framework commonly used to harvest website content.</p></td>
<td><a href="https://scrapy.org/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>SentiBot</th>
<td><p>Used by SentiOne for its AI-powered social media listening and analysis tools.</p></td>
<td><a href="https://sentione.com/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Timpibot</th>
<td><p>Used by Timpi; likely for their Wilson AI Product.</p></td>
77 changes: 54 additions & 23 deletions readme.txt
@@ -2,42 +2,57 @@
Contributors: lastsplash
Tags: ai, robots.txt, chatgpt, crawlers
Requires at least: 5.6
Tested up to: 6.6
Tested up to: 6.6.2
Requires PHP: 7.4
Stable tag: 1.3.9
Stable tag: 1.4.0
License: GPLv2 or later
License URI: https://www.gnu.org/licenses/gpl-2.0.html

Tells AI companies not to access and scrape your site for AI.
Tells AI (Artificial Intelligence) companies not to scrape your site for their AI products.

== Description ==

Tells AI crawlers (such as OpenAI ChatGPT) not to use your website as training data for their Artificial Intelligence (AI) products. It does this by updating your site's `robots.txt` to block common AI crawlers and scrapers. This should prevent your content from being used to train Large Language Models (LLMs).
# Protect Your Content from AI Scraping

It blocks these AI crawlers and bots:
This plugin helps you prevent AI crawlers from using your content as training data for their products. By updating your site's `robots.txt`, it blocks common AI crawlers and scrapers, aiming to protect your content from being used in the training of Large Language Models (LLMs).

- **ChatGPT and GPTBot** - Crawlers and browsing agents used by OpenAI
- **Google Extended** - Crawler used for Google's Gemini (formerly Google Bard) AI training
- **FacebookBot** - Crawler used for Facebook's AI training
- **Meta** - Blocks crawlers used for Meta's AI training
- **CommonCrawl** - Crawler that compiles datasets used to train AI models
- **Anthropic AI / Claude** - Crawler used by Anthropic
- **Omgili** - Crawler used by Omgili for AI training
- **Bytespider** - Crawler used by TikTok for AI training
- **PerplexityBot** - Used by Perplexity for its AI products
- **Applebot** - Used by Apple to train its AI products
- **Cohere** - Crawler used by Cohere for AI training
- **DiffBot** - Crawler used by Diffbot for AI training
- **Imagesift** - Crawler used by Imagesift for images
- ... and more!
## Features

## Experimental Meta Tags
### Blocks AI Crawlers

The plugin adds the "noai, noimageai" directive to your site's meta tags. These tags tell AI bots not to use your content as part of their data sets. These are experimental and they have not been standardized.
Includes:

## Disclaimer
- **OpenAI** - Blocks crawlers used for ChatGPT
- **Google** - Blocks crawlers used by Google's Gemini AI products
- **Facebook / Meta** - Blocks crawlers used for Meta's AI training
- **Anthropic AI** - Blocks crawlers used by Anthropic
- **Perplexity** - Blocks crawlers used by Perplexity
- **Applebot** - Blocks crawlers used by Apple
- ... and more!

*Note:* While the plugin adds these markers, it is up to the crawlers themselves to honor these requests.
### Experimental Meta Tags

The plugin adds the "noai, noimageai" directive to your site's meta tags, instructing AI bots not to use your content in their datasets. Please note that these tags are experimental and have not been standardized.

## Installation

1. Download the plugin zip file.
2. Go to your WordPress admin panel.
3. Navigate to Plugins > Add New > Upload Plugin.
4. Choose the zip file and click "Install Now."
5. Activate the plugin.

## Usage

After activation, the plugin will automatically update your `robots.txt` and add the necessary meta tags. No further configuration is required, but you can check the settings page for a full list of blocked crawlers.

## Limitations

While this plugin aims to block specified crawlers, it cannot guarantee complete protection against all forms of scraping, as some bots may disregard `robots.txt` directives.

## Support

For questions or support, [please post on the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).

== Installation ==

@@ -55,6 +70,14 @@ Unfortunately, no. However, it does tell bots that your site shouldn't be used f

The plugin adds directives to the `robots.txt` file to tell AI crawlers that they shouldn't index your site. It also adds the `noai` meta tag to your site's header to do the same.
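
WordPress builds its virtual `robots.txt` through the core `robots_txt` filter, so the final output (including the rules this plugin appends) can be inspected with a late-priority filter. An illustrative sketch, not part of the plugin:

```php
<?php
// Illustrative only: log the final virtual robots.txt at a late priority
// so the plugin's User-agent blocks are already included in $output.
add_filter( 'robots_txt', function ( $output, $public ) {
	error_log( $output );
	return $output;
}, 99, 2 );
```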

= How often is this updated? =

I try to keep up with new crawlers and update the block list regularly.

= Can I suggest crawlers for blocking? =

Yes! Please share suggestions on [the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).

= What if I already have a `robots.txt` file on my web server? =

If you have a physical `robots.txt` file on your web server, you won't be able to activate this plugin. The plugin only works when using WordPress' built-in virtual `robots.txt`.
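
A hedged sketch of how such an activation guard might look (an assumption about the approach; the plugin's actual `block_ai_activate()` code is not shown in full in this diff):

```php
<?php
// Hypothetical activation guard: abort when a physical robots.txt exists,
// because WordPress only serves its virtual robots.txt when no file does.
function myprefix_activation_check() {
	if ( file_exists( ABSPATH . 'robots.txt' ) ) {
		deactivate_plugins( plugin_basename( __FILE__ ) );
		wp_die( 'This plugin requires the WordPress virtual robots.txt; a physical robots.txt file was found on the server.' );
	}
}
register_activation_hook( __FILE__, 'myprefix_activation_check' );
```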
@@ -72,6 +95,14 @@ No. Search engines follow different `robots.txt` rules.

== Changelog ==

= 1.4.0 =
- New: Block Kangaroo Bot
- New: Block sentibot
- New: Block FriendlyCrawler
- New: Block Scrapy
- Fix: Broken link to settings page from Plugins page
- Enhancement: Improve `readme.md` and `readme.txt`

= 1.3.9 =
- New: Block PetalBot
- New: Block AI2Bot
