How to Optimize WP site for millions of posts

I’m working on a website for a company that will very likely be creating millions of posts via a custom post type. They’re prayers, so basically the user on the front end just submits a short phrase via a form. All the company cares about is the post content and the date posted. The site hasn’t even launched yet and they already have over 120,000 posts, so I’m dead serious when I say millions.

So, a couple optimization questions:

  1. Let’s say I have a category ‘featured’ in a custom post type that has 500,000 posts. The featured category only has 500 posts. If I create a query for the featured posts, am I querying the entire 500,000 posts, or just the 500 featured? What if I only want to display the ten most recent posts that are featured?
  2. When saving this custom post type to the database are there any things I can do to cut down on server resources, especially since the only thing really needed is the content of the post and the date?
  3. Should I even be using a custom post type? I like it in principle because it’s well integrated into the WordPress admin, but if there are significant disadvantages in performance then I suppose I can do something different.

I’ve never worked on a project at this scale, so I’m a bit more concerned about performance than usual. Thanks for any help!

Solutions Collecting From Web of "How to Optimize WP site for millions of posts"

1. Set the query before WP_Query is run

This seems to be the most important thing to keep in mind when trying to keep database queries to a minimum since the only opportunity to alter the query is, of course, before it is run in the SQL database.

Normal Queries
For a normal query, WordPress uses the wp() function, which in turn calls $wp->main( $query_vars ). The “is_ variables” from conditional tags are set before passing them to WP_Query->get_posts(), which converts it into a MySQL database query and finally stores them in the $wp_query object. It is possible to filter the query before it is actually run in the SQL database.

The pre_get_posts action hooks into this process, allowing you to change the query before it is passed to WP_Query->get_posts().

For example, if you want to filter the query for posts in the “featured” category, you would use add_action( 'pre_get_posts', 'your_function_name' ); and include the in_category conditional tag within your_function_name.

function your_function_name( $query ) {
    if ( $query->in_category( 'featured' ) && $query->is_main_query() ) {
        // Replace 123 with the category ID of the featured category.
        $query->set( 'cat', '123' );
    }
}
add_action( 'pre_get_posts', 'your_function_name' );

See Plugin API/Action Reference/pre get posts « WordPress Codex

Page Requests
As for page templates, such as the archive page for the “featured” category, conditional tags won’t work from the pre_get_posts filter. For example, you can’t use is_category to check for the archive page because WP_Query has not run.

Instead, you would have to alter the main query for page requests with a new WP_Query which would look something like $query = new WP_Query( 'cat=123' );. This runs the query with the appropriate argument set from the start.

See Class Reference/WP Query « WordPress Codex

2. Saving to the database

You can use the filter wp_insert_post_data ensuring that only the $data relevant to your custom post type are returned to wp_insert_post. Be sure to include a conditional statement to check for your custom post type.
Plugin API/Filter Reference/wp insert post data « WordPress Codex

This hook is called by the wp_insert_post function, which is called by wp_update_post when you update your custom post type, usually by saving a draft or publishing the post.

You’ll have to benchmark it yourself though as I can’t personally speak as to the optimization significance of reducing the data that is updated in the database.

3. Do custom post types affect performance?

In my experience, custom post types are a powerful tool for managing content. I don’t know of any other way to manage posts in all the ways that it allows in a manner that would use less resources. I would personally focus on finding ways to reduce the number of queries made wherever possible.

There used to be a performance issue related to permalink structure causing it to take a hit when it begins with text instead of a number.3 This was particularly troublesome for sites hosting a large number of pages, but has been solved since WordPress version 3.3.

I’m only bringing up permalinks here because the slug is usually the first part of the permalink structure which may or may not have affected performance prior to version 3.3. Aside from that I’m not aware of any performance issues that arise from the use of custom post types.

Other Performance Options

Transients
This isn’t a replacement for keeping queries to a minimum in your code, but you can use set_transient to store the queries for some time so that new queries are not necessary. Here is the example used in Dave Clements’ post. Also, note that he recommends adding a save_post action to delete the transient any time a given post type is updated.

<?php // IN THE SPOTLIGHT QUERY
if( false === ( $its_query = get_transient( 'its_query' ) ) ) {
    $pttimestamp = time() + get_option('gmt_offset') * 60*60;
    $its_query = new WP_Query( array(
        'post_type' => 'spotlight',
        'posts_per_page' => 1,
            'post__not_in' => $do_not_duplicate,
        'meta_query' => array(
            array(
                'key' => '_hpc_spotlight_end_time',
                'value' => $pttimestamp,
                'compare' => '>'
            )
        )
    ) );
    set_transient( 'its_query', $its_query, 60*60*4 );
}
if( have_posts() ) { // HIDE SECTION IF NO CURRENT ITS FEATURE ?>
    // LOOP GOES HERE: NOT IMPORTANT TO EXAMPLE
<?php } ?>

More query optimization
Thomas Griffin has a few good tips in his Optimize WordPress Queries tutorial. Here’s a brief list of his suggestions:

  • Set 'cache_results' => false in one-off queries if your server is
    not using persistent caching such as Memcached. One-off queries are
    described as “queries that are used to show small amounts of data. It
    could be that you just want to display linked post titles related to
    the current post, or you may want to display a dropdown of posts to
    select for a particular option setting.”

    His example: $query = get_posts( array( 'posts_per_page' => 1,
    'cache_results' => false ) );

  • Set 'no_found_rows' => true where pagination is not needed. This
    will “bypass MySQL counting the results to see if we need pagination
    or not.”

    His example: $query = new WP_Query( array( 'posts_per_page' => 1,
    'no_found_rows' => true ) );

  • Query for post IDs only if this is all you need 'fields' => 'ids'
    in get_posts. This should significantly reduce the amount of data
    being returned, which is quite a lot per post if you look at
    Database Description « WordPress
    Codex

    His example: $query = get_posts( array( 'posts_per_page' => 1,
    'fields' => 'ids' ) );

In addition to that last tip, the same reasoning can be applied when you only need one or a few post fields by using get_post_field.

Having a solid understanding of how the query works is essential. The more specific you can be with your queries, the less work you will be demanding from your SQL database. This means there are a vast number of possibilities for managing database queries. Be careful with custom queries as far as where they run (is it an admin page?), use proper sanitization in direct queries and try to use native WordPress functions where it allows you to achieve the same performance.

As all premature optimization type of questions this one can’t really be answered without knowing exact usage patterns which too many times are being discovered only when you go live.

In general, per MYSQL specs there should not be any problem with the amount of data. Of course searching data even with the best algorithms will be slower then with a much smaller tables but the solution for that is simple, stronger CPU.

You might want to optimize how meta data is stored (for example not to store ping related data), but this kind of thing depends on what exactly you do, and in the end you might still need stronger CPU so might not be worth your trouble.