Merge pull request #60 from bobmatyas/updates/100424
Updates/100424
bobmatyas authored Oct 7, 2024
2 parents 0b798c2 + 2b1f82c commit 2942b3c
Showing 4 changed files with 129 additions and 41 deletions.
40 changes: 36 additions & 4 deletions README.md
@@ -1,9 +1,41 @@
# Block AI Crawlers

A WordPress plugin for blocking AI crawlers. It uses `robots.txt` and an experimental Meta Tag to tell AI crawlers not to use your site as part of their training data.
This WordPress plugin helps prevent AI crawlers from using your content as training data for their products. By updating your site's `robots.txt`, it blocks common AI crawlers and scrapers, aiming to protect your content from being used in the training of Large Language Models (LLMs).

## Installing
## Features

You can install the plugin via the WordPress.org plugin directory:
### Blocks AI Crawlers

- https://wordpress.org/plugins/block-ai-crawlers
Includes:

- **OpenAI** - Blocks crawlers used for ChatGPT
- **Google** - Blocks crawlers used by Google's Gemini AI products
- **Facebook / Meta** - Blocks crawlers used for Meta's AI training
- **Anthropic AI** - Blocks crawlers used by Anthropic
- **Perplexity** - Blocks crawlers used by Perplexity
- **Applebot** - Blocks crawlers used by Apple
- ... and more!

### Experimental Meta Tags

The plugin adds the "noai, noimageai" directive to your site's meta tags, instructing AI bots not to use your content in their datasets. Please note that these tags are experimental and have not been standardized.
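
To illustrate, here is a minimal sketch of how a plugin can emit such a tag through the `wp_head` hook (the function name is hypothetical, and this is not necessarily this plugin's exact implementation):

```php
<?php
// Hypothetical sketch: emit the experimental "noai, noimageai" directive
// in the <head> of every page via the core wp_head action.
function myprefix_output_noai_meta() {
	echo '<meta name="robots" content="noai, noimageai">' . "\n";
}
add_action( 'wp_head', 'myprefix_output_noai_meta' );
```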

## Installation

1. Download the plugin zip file.
2. Go to your WordPress admin panel.
3. Navigate to Plugins > Add New > Upload Plugin.
4. Choose the zip file and click "Install Now."
5. Activate the plugin.

## Usage

After activation, the plugin will automatically update your `robots.txt` and add the necessary meta tags. No further configuration is required, but you can check the settings page for a full list of blocked crawlers.

## Limitations

While this plugin aims to block specified crawlers, it cannot guarantee complete protection against all forms of scraping, as some bots may disregard `robots.txt` directives.

## Support

For questions or support, [please post on the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).
9 changes: 7 additions & 2 deletions block-ai-crawlers.php
@@ -5,7 +5,7 @@
* Author: Bob Matyas
* Author URI: https://www.bobmatyas.com
* Text Domain: block-ai-crawlers
* Version: 1.3.9
* Version: 1.4.0
* License: GPL-2.0-or-later
* License URI: https://www.gnu.org/licenses/gpl-2.0.html
*
@@ -40,16 +40,21 @@ function block_ai_robots_txt( $robots ) {
$robots .= "User-agent: cohere-ai\n";
$robots .= "User-agent: Diffbot\n";
$robots .= "User-agent: FacebookBot\n";
$robots .= "User-agent: FriendlyCrawler\n";
$robots .= "User-agent: GPTBot\n";
$robots .= "User-agent: Google-Extended\n";
$robots .= "User-agent: ImagesiftBot\n";
$robots .= "User-agent: Kangaroo Bot\n";
$robots .= "User-agent: Meta-ExternalAgent\n";
$robots .= "User-agent: Meta-ExternalFetcher\n";
$robots .= "User-agent: OAI-SearchBot\n";
$robots .= "User-agent: Omgili\n";
$robots .= "User-agent: Omgilibot\n";
$robots .= "User-agent: PetalBot\n";
$robots .= "User-agent: PerplexityBot\n";
$robots .= "User-agent: Scrapy\n";
$robots .= "User-agent: SentiBot\n";
$robots .= "User-agent: sentibot\n";
$robots .= "User-agent: Timpibot\n";
$robots .= "User-agent: YouBot\n";
$robots .= "User-agent: webzio\n";
@@ -91,7 +96,7 @@ function block_ai_activate() {
*/
function block_ai_prepend_plugin_settings_link( $links_array, $plugin_file_name ) {
if ( strpos( $plugin_file_name, basename( __FILE__ ) ) ) {
array_unshift( $links_array, '<a href=" ' . get_admin_url() . ' options-general.php?page=block-ai-crawlers">Settings</a>' );
array_unshift( $links_array, '<a href="' . get_admin_url() . 'options-general.php?page=block-ai-crawlers">Settings</a>' );
}
return $links_array;
}
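
For context, these callbacks only take effect once they are attached to WordPress hooks. That wiring sits outside this hunk; an assumed sketch of what it presumably looks like (the hook names are core WordPress, but the actual call sites are not shown in this diff):

```php
<?php
// Assumed wiring (not visible in this hunk): attach the robots.txt
// builder and the settings-link callback to their core hooks.
add_filter( 'robots_txt', 'block_ai_robots_txt' );
add_filter( 'plugin_action_links', 'block_ai_prepend_plugin_settings_link', 10, 2 );
```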
44 changes: 32 additions & 12 deletions inc/settings-html.php
@@ -39,52 +39,62 @@
</tr>
<tr>
<th>Bytespider</th>
<td><p>Used by TikTok for AI training</p></td>
<td><p>Used by TikTok for AI training.</p></td>
<td><a href="https://darkvisitors.com/agents/bytespider" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Cohere</th>
<td><p>Used by Cohere to scrape data for AI training</p></td>
<td><p>Used by Cohere to scrape data for AI training.</p></td>
<td><a href="https://darkvisitors.com/agents/cohere-ai" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ChatGPT</th>
<td><p>Used by OpenAI</p></td>
<td><p>Used by OpenAI to power ChatGPT.</p></td>
<td><a href="https://platform.openai.com/docs/plugins/bot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ClaudeBot and Claude-Web</th>
<td><p>Used by Anthropic's Claude</p></td>
<td><p>Used by Anthropic's Claude.</p></td>
<td><a href="https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>CommonCrawl</th>
<td><p>Compiles datasets used to train AI models</p></td>
<td><p>Compiles datasets used to train AI models.</p></td>
<td><a href="https://commoncrawl.org/big-picture/frequently-asked-questions/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Diffbot</th>
<td><p>Used by Diffbot to scrape data for AI training</p></td>
<td><p>Used by Diffbot to scrape data for AI training.</p></td>
<td><a href="https://docs.diffbot.com/reference/crawl-introduction" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>FacebookBot</th>
<td><p>Used by Meta (Facebook) for their AI</p></td>
<td><p>Used by Meta (Facebook) for their AI.</p></td>
<td><a href="https://developers.facebook.com/docs/sharing/bot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Friendly Crawler</th>
<td><p>Crawls websites to build datasets for machine learning experiments.</p></td>
<td><a href="https://imho.alex-kunz.com/2024/01/25/an-update-on-friendly-crawler/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Google Extended</th>
<td><p>Used by Google to power Gemini (formerly known as Bard)</p></td>
<td><p>Used by Google to power Gemini (formerly known as Bard).</p></td>
<td><a href="https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers?hl=en#common-crawlers" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>ImagesiftBot</th>
<td><p>Used by Hive's Imagesift tool that scrapes images. This may be used for the company's generative AI product </p></td>
<td><p>Used by Hive's Imagesift tool that scrapes images. This may be used for the company's generative AI product.</p></td>
<td><a href="https://imagesift.com/about" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Kangaroo Bot</th>
<td><p>Used to power the Australia-focused Kangaroo LLM.</p></td>
<td><a href="https://kangaroollm.com.au/kangaroo-bot/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Meta-ExternalAgent / Meta-ExternalFetcher</th>
<td><p>Used by Meta to train AI products</p></td>
<td><p>Used by Meta (Facebook) to train AI products.</p></td>
<td><a href="https://developers.facebook.com/docs/sharing/webmasters/web-crawlers" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
@@ -94,14 +104,24 @@
</tr>
<tr>
<th>Omgilibot</th>
<td><p>Used by Omgili to scrape data for AI training</p></td>
<td><p>Used by Omgili to scrape data for AI training.</p></td>
<td><a href="https://webz.io/blog/machine-learning/common-crawl-vs-webz-io-data-which-one-works-best-for-large-language-models/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>PerplexityBot</th>
<td><p>Used by Perplexity for their AI products</p></td>
<td><p>Used by Perplexity for their AI products.</p></td>
<td><a href="https://docs.perplexity.ai/docs/perplexitybot" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Scrapy</th>
<td><p>An open-source web scraping framework commonly used to harvest website content.</p></td>
<td><a href="https://scrapy.org/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>SentiBot</th>
<td><p>Used by SentiOne for its AI-powered social media listening and analysis tools.</p></td>
<td><a href="https://sentione.com/" target=_blank>More Info <span class="dashicons dashicons-external link"></span></a></td>
</tr>
<tr>
<th>Timpibot</th>
<td><p>Used by Timpi; likely for their Wilson AI Product.</p></td>
77 changes: 54 additions & 23 deletions readme.txt
@@ -2,42 +2,57 @@
Contributors: lastsplash
Tags: ai, robots.txt, chatgpt, crawlers
Requires at least: 5.6
Tested up to: 6.6
Tested up to: 6.6.2
Requires PHP: 7.4
Stable tag: 1.3.9
Stable tag: 1.4.0
License: GPLv2 or later
License URI: https://www.gnu.org/licenses/gpl-2.0.html

Tells AI companies not to access and scrape your site for AI.
Tells AI (Artificial Intelligence) companies not to scrape your site for their AI products.

== Description ==

Tells AI crawlers (such as OpenAI ChatGPT) not to use your website as training data for their Artificial Intelligence (AI) products. It does this by updating your site's `robots.txt` to block common AI crawlers and scrapers. This should prevent your content from being used to train Large Language Models (LLMs).
# Protect Your Content from AI Scraping

It blocks these AI crawlers and bots:
This plugin helps you prevent AI crawlers from using your content as training data for their products. By updating your site's `robots.txt`, it blocks common AI crawlers and scrapers, aiming to protect your content from being used in the training of Large Language Models (LLMs).

- **ChatGPT and GPTBot** - Crawlers and browsing agents used by OpenAI
- **Google Extended** - Crawler used for Google's Gemini (formerly Google Bard) AI training
- **FacebookBot** - Crawler used for Facebook's AI training
- **Meta** - Blocks crawlers used for Meta's AI training
- **CommonCrawl** - Crawler that compiles datasets used to train AI models
- **Anthropic AI / Claude** - Crawler used by Anthropic
- **Omgili** - Crawler used by Omgili for AI training
- **Bytespider** - Crawler used by TikTok for AI training
- **PerplexityBot** - Used by Perplexity for its AI products
- **Applebot** - Used by Apple to train its AI products
- **Cohere** - Crawler used by Cohere for AI training
- **DiffBot** - Crawler used by Diffbot for AI training
- **Imagesift** - Crawler used by Imagesift for images
- ... and more!
## Features

## Experimental Meta Tags
### Blocks AI Crawlers

The plugin adds the "noai, noimageai" directive to your site's meta tags. These tags tell AI bots not to use your content as part of their data sets. These are experimental and they have not been standardized.
Includes:

## Disclaimer
- **OpenAI** - Blocks crawlers used for ChatGPT
- **Google** - Blocks crawlers used by Google's Gemini AI products
- **Facebook / Meta** - Blocks crawlers used for Meta's AI training
- **Anthropic AI** - Blocks crawlers used by Anthropic
- **Perplexity** - Blocks crawlers used by Perplexity
- **Applebot** - Blocks crawlers used by Apple
- ... and more!

*Note:* While the plugin adds these markers, it is up to the crawlers themselves to honor these requests.
### Experimental Meta Tags

The plugin adds the "noai, noimageai" directive to your site's meta tags, instructing AI bots not to use your content in their datasets. Please note that these tags are experimental and have not been standardized.

## Installation

1. Download the plugin zip file.
2. Go to your WordPress admin panel.
3. Navigate to Plugins > Add New > Upload Plugin.
4. Choose the zip file and click "Install Now."
5. Activate the plugin.

## Usage

After activation, the plugin will automatically update your `robots.txt` and add the necessary meta tags. No further configuration is required, but you can check the settings page for a full list of blocked crawlers.

## Limitations

While this plugin aims to block specified crawlers, it cannot guarantee complete protection against all forms of scraping, as some bots may disregard `robots.txt` directives.

## Support

For questions or support, [please post on the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).

== Installation ==

@@ -55,6 +70,14 @@ Unfortunately, no. However, it does tell bots that your site shouldn't be used f

The plugin adds directives to the `robots.txt` file to tell AI crawlers that they shouldn't index your site. It also adds the `noai` meta tag to your site's header to do the same.
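
WordPress builds its virtual `robots.txt` through the core `robots_txt` filter, so the final output (including the rules this plugin appends) can be inspected with a late-priority filter. An illustrative sketch, not part of the plugin:

```php
<?php
// Illustrative only: log the final virtual robots.txt at a late priority
// so the plugin's User-agent blocks are already included in $output.
add_filter( 'robots_txt', function ( $output, $public ) {
	error_log( $output );
	return $output;
}, 99, 2 );
```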

= How often is this updated? =

I try to keep up with new crawlers and update the block list regularly.

= Can I suggest crawlers for blocking? =

Yes! Please share suggestions on [the forums](https://wordpress.org/support/plugin/block-ai-crawlers/) or [on GitHub](https://github.com/bobmatyas/wp-block-ai-crawlers/issues).

= What if I already have a `robots.txt` file on my web server? =

If you have a physical `robots.txt` file on your web server, you won't be able to activate this plugin. The plugin only works when using WordPress' built-in virtual `robots.txt`.
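
A hedged sketch of how such an activation guard might look (an assumption about the approach; the plugin's actual `block_ai_activate()` code is not shown in full in this diff):

```php
<?php
// Hypothetical activation guard: abort when a physical robots.txt exists,
// because WordPress only serves its virtual robots.txt when no file does.
function myprefix_activation_check() {
	if ( file_exists( ABSPATH . 'robots.txt' ) ) {
		deactivate_plugins( plugin_basename( __FILE__ ) );
		wp_die( 'This plugin requires the WordPress virtual robots.txt; a physical robots.txt file was found on the server.' );
	}
}
register_activation_hook( __FILE__, 'myprefix_activation_check' );
```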
@@ -72,6 +95,14 @@ No. Search engines follow different `robots.txt` rules.

== Changelog ==

= 1.4.0 =
- New: Block Kangaroo Bot
- New: Block sentibot
- New: Block FriendlyCrawler
- New: Block Scrapy
- Fix: Broken link to settings page from Plugins page
- Enhancement: Improve `readme.md` and `readme.txt`

= 1.3.9 =
- New: Block PetalBot
- New: Block AI2Bot
