Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CANTINA-995: Disable crawling for VIP convenience domains #5129

Merged
merged 1 commit into from
Jan 5, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions 001-core/privacy.php
Original file line number Diff line number Diff line change
Expand Up @@ -457,3 +457,22 @@ function delete_old_export_files() {
}
}
}

/**
* Disable crawling for go-vip.co and go-vip.net domains.
*
* @param string $output The robots.txt content.
* @return string The modified robots.txt content.
*/
function vip_convenience_domain_robots_txt( $output ) {
$host = strtolower( $_SERVER['HTTP_HOST'] ?? '' ); // phpcs:ignore WordPress.Security.ValidatedSanitizedInput.InputNotSanitized
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is $_SERVER['HTTP_HOST'] set for curl requests?

I see the nullcoalescing, but want to avoid customers making requests with curl and a browser on their convenience domain sites and getting different results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, unless the host parameter is set otherwise?

if ( false !== strpos( $host, '.go-vip.co' ) || false !== strpos( $host, '.go-vip.net' ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these check for the strings at the end of the $host, rather than just the presence of it? develop.go-vip.company.com would satisfy this condition, for instance.

Does any IPv4 or IPv6 need to be taken into account as potential values of $host?

$output = "# Crawling is blocked for go-vip.co and go-vip.net domains\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$output = "# Crawling is blocked for go-vip.co and go-vip.net domains\n";
$output = "# Crawling is blocked for go-vip.co and go-vip.net domains.\n";

$output .= "User-agent: *\n";
$output .= "Disallow: /\n";
}

return $output;
}
// phpcs:ignore WordPressVIPMinimum.Hooks.RestrictedHooks.robots_txt
add_filter( 'robots_txt', __NAMESPACE__ . '\vip_convenience_domain_robots_txt' );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WP must have loaded by this point to be able to use add_filter(). Has any customer application code been loaded yet? i.e. this being run at priority 10 isn't going to overwrite any existing changes they may have made?

But, they still have chance to unhook it if they wish? Would be good to document an example of the right hook and process to run that remove_filter( 'robots_txt', '\Automattic\VIP\Core\Privacy\vip_convenience_domain_robots_txt' ) call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No customer application code should be loaded at this point yet. But yeah, @yolih, if you wanted to add this to documentation, that'd be fine.

Loading