WordPress ignore init for crawlers

I just realized that crawlers like Google's are triggering a lot of activity from every add_action callback bound to ‘init’.

Is this normal behaviour? Is it possible to trigger ‘init’ only for legit visitors?


Just added this to functions.php:

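// Returns true if the request's User-Agent looks like a crawler.
// Requests without a User-Agent header are treated as regular visitors.
function check_is_crawler() {
    if (empty($_SERVER['HTTP_USER_AGENT'])) {
        return false;
    }
    // Case-insensitive match against common crawler keywords.
    return (bool) preg_match('/bot|wget|crawl|google|slurp|spider/i', $_SERVER['HTTP_USER_AGENT']);
}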

I’m using it in the most resource-heavy functions so that they return early for crawlers, which lowers resource usage.
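As a rough sketch of what that can look like: the callback name myads_refresh_stats is made up for illustration, and the crawler check is repeated only so the snippet runs on its own. The function_exists() guards are there so the file also loads outside WordPress.

```php
<?php
// Crawler check, repeated here so this sketch is self-contained.
if (!function_exists('check_is_crawler')) {
    function check_is_crawler() {
        if (empty($_SERVER['HTTP_USER_AGENT'])) {
            return false;
        }
        return (bool) preg_match('/bot|wget|crawl|google|slurp|spider/i', $_SERVER['HTTP_USER_AGENT']);
    }
}

// Hypothetical internal task hooked to 'init'; the name is an assumption.
// Returns false when the work was skipped, true when it ran.
function myads_refresh_stats() {
    if (check_is_crawler()) {
        return false; // Skip the expensive work for crawlers.
    }
    // ... expensive internal work that produces no visible output ...
    return true;
}

// add_action() only exists inside WordPress, hence the guard.
if (function_exists('add_action')) {
    add_action('init', 'myads_refresh_stats');
}
```

The check sits inside the callback rather than around add_action(), so the hook is still registered for everyone and only the heavy part is skipped.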

I also created a robots.txt with the following content:

User-agent: *
Crawl-delay: 10

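This asks crawlers to wait 10 seconds between requests, so they don’t “spam” your website and consume all your resources. It only helps with crawlers that honor the Crawl-delay directive, and not all of them do; Googlebot, for one, ignores it.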

Be warned, however: Google does not like this AT ALL. If Google notices that your website behaves differently for crawlers than for visitors, it may treat your website as spam.

Thanks for the tip @Jack Johansson, I’ll use it only on internal functions. It’s an ads website, and there are a lot of things going on under the hood that don’t output anything to the user.

If your website is consuming a lot of resources on each pageload, you should also look into a caching solution to help your pages load faster and reduce your overall server usage.

If caching is not possible, moving the heavy work into cron jobs (i.e. not WordPress crons but good old server crons) would be a good idea. That way your visitors always have the data ready for them instead of having to wait for it to compile/refresh.
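A minimal sketch of that idea, assuming a standalone script that the server's cron runs on a schedule and that writes its result to a file the page requests only read. The script name, the schedule in the comment, the cache location and the compute_stats() placeholder are all assumptions for illustration.

```php
<?php
// refresh-stats.php: hypothetical script run by the server's cron, e.g.
//   */10 * * * * php /path/to/refresh-stats.php
// The file name, path and schedule are assumptions.

// Placeholder for the expensive computation.
function compute_stats() {
    return array('generated_at' => time(), 'total_ads' => 0);
}

// Write to a temporary file first, then rename it into place,
// so a page request does not read a half-written file.
function write_cache($path, array $data) {
    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, json_encode($data)) === false) {
        return false;
    }
    return rename($tmp, $path);
}

// Page requests call this instead of recomputing anything.
// Returns null if the cron job has not produced a file yet.
function read_cache($path) {
    if (!is_readable($path)) {
        return null;
    }
    return json_decode(file_get_contents($path), true);
}

$cache_file = sys_get_temp_dir() . '/myads-stats.json';
write_cache($cache_file, compute_stats());
```

On the WordPress side, the page would then call read_cache() and fall back to some default when it returns null.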

The short answer to both of your questions is: yes.

  • A Google crawler bot is supposed to crawl every page of your website
    and index the contents. So when a crawler from Google is accessing
    your website, it is legit. If you simply want to exclude some of your pages
    from Google, use the Google webmasters console to do it.
  • You can disable the add_action for crawlers. You need to get the
    user agent, and then use a simple if() in your theme’s
    functions.php to disable it for crawlers. There are plenty of guides on the internet that can help you do this.
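The second point could be sketched roughly like this. The helper is_crawler_user_agent(), the callback mytheme_internal_task() and the keyword list are hypothetical, and the function_exists() guard is only there so the snippet also runs outside WordPress.

```php
<?php
// Hypothetical helper: takes the user agent as an argument so it is easy to test.
// The keyword list is an assumption and will not catch every crawler.
function is_crawler_user_agent($user_agent) {
    return (bool) preg_match('/bot|crawl|slurp|spider/i', (string) $user_agent);
}

// Hypothetical callback that does internal work and outputs nothing.
function mytheme_internal_task() {
    // ... internal work ...
}

$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Register the hook only for requests that do not look like a crawler.
if (function_exists('add_action') && !is_crawler_user_agent($user_agent)) {
    add_action('init', 'mytheme_internal_task');
}
```

Here the if() wraps the add_action() call itself, so for crawlers the callback is never registered at all.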

Be warned, though: Google does not like this AT ALL. If Google notices that your website behaves differently for crawlers than for visitors, it may treat your website as spam.

You may have noticed that there are 2 render outputs in Fetch as Google, in the Google webmasters console. One is the output as rendered for visitors, the other is the output as rendered by Googlebot. These 2 must be as close as possible to each other. Even a single missing CSS file may make your website look messy to Googlebot, and over time Google will conclude that your website’s appearance is poor.

Many websites try to fool the search engines by providing different output for visitors and crawlers. Google, at least, does not appreciate this, and your website may be penalized for it.