The MU forums have moved to WordPress.org

mysterious load spikes crash server (68 posts)

  1. Klark123
    Member
    Posted 14 years ago #

    We found that turning keepalives off has helped, so far.

    If we turn them on, even with a low timeout limit, the server load spikes at times.

  2. honewatson
    Member
    Posted 14 years ago #

    I discovered some interesting things on one of my boxes.

    I'm currently refining a stats package for Nginx and testing it on a live box which serves around 30,000 page views per day on 1440 MB of RAM. (That's not counting anything from /wp-admin/.*php or static files, just the dynamically generated WordPress pages that visitors see on the live site.)

    I searched for the IPs that visit more than 40 times a day, excluding legitimate search bots. I found that 126 IPs generated around 10,000 WordPress page views in a day.

    Each of these IPs was using multiple pseudo browser agents.

    So these are most likely all spammers, putting about a third of the load on PHP/MySQL.

    For the moment I just chucked them in a blacklist and blocked them with iptables.

    Also, some legitimate search bots like Baidu and Yandex will visit your site 1000 times per day to send you 10 visitors. These legitimate bots add up, and really they don't need to visit every day.
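
    For anyone who wants to repeat that kind of count, here is a rough sketch in PHP. It assumes access log lines in the common "combined" format (IP first, user agent as the last quoted field). The function name find_heavy_clients, the bot list, and the default threshold of 40 are illustrative assumptions, not part of any stats package. User agents are self-reported, so the bot exclusion is only as trustworthy as the strings themselves.

```php
<?php
// Sketch only: assumes "combined" format access log lines.
// The function name, bot list, and threshold default are assumptions.
function find_heavy_clients(array $lines, $threshold = 40, array $botNeedles = array('Googlebot', 'msnbot', 'Baiduspider', 'Yandex')) {
    $hits = array();   // ip => request count
    $agents = array(); // ip => set of distinct user agent strings

    foreach ($lines as $line) {
        // Capture the IP and the final quoted field (the user agent).
        if (!preg_match('/^(\S+) .*?"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/', $line, $m)) {
            continue; // skip lines that do not match the assumed format
        }
        list(, $ip, $ua) = $m;

        foreach ($botNeedles as $needle) {
            if (stripos($ua, $needle) !== false) {
                continue 2; // self-identified search bot, leave it out
            }
        }

        $hits[$ip] = isset($hits[$ip]) ? $hits[$ip] + 1 : 1;
        $agents[$ip][$ua] = true;
    }

    $heavy = array();
    foreach ($hits as $ip => $count) {
        if ($count > $threshold) {
            $heavy[$ip] = array('hits' => $count, 'agents' => count($agents[$ip]));
        }
    }

    // Busiest clients first.
    uasort($heavy, function ($a, $b) {
        return $b['hits'] - $a['hits'];
    });

    return $heavy;
}
```

    Feeding it file('access.log', FILE_IGNORE_NEW_LINES) for one day's log gives an array keyed by IP, with the hit count and the number of distinct user agents seen for each. An IP with many hits and several different agents is worth a closer look before it goes into a blacklist.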

  3. andrea_r
    Moderator
    Posted 14 years ago #

    "Each of these IP's was using multiple pseudo browser agents.
    So these are basically all most likely spammers putting a third of the load on PHP/mysql. "

    Yep. They are.

  4. Mixologic
    Member
    Posted 14 years ago #

    It's probably not spammers. You are likely blocking legitimate traffic, and possibly more of it than you realize, since some of the IPs you are blocking might be proxy servers.

    I've been performance tuning a pretty heavily loaded site, and have encountered pretty much the same symptoms.

    If you look a little closer at your access logs you will likely notice a pattern.

    There will be a flood of traffic coming from one IP address, and most of its requests are likely for your 'archives':

    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2010/03/ HTTP/1.1" 200 140438 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2010/01/ HTTP/1.1" 200 138445 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2009/10/ HTTP/1.1" 200 141295 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2008/11/ HTTP/1.1" 200 140874 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2009/09/ HTTP/1.1" 200 140911 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2008/09/ HTTP/1.1" 200 141257 "-" "Mozilla/4.0 (compatible;)"
    13.8.137.11 - - [14/Apr/2010:10:36:08 -0500] "GET /2009/01/ HTTP/1.1" 200 142280 "-" "Mozilla/4.0 (compatible;)"
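
    A quick way to check a log for this pattern, sketched in PHP: group the monthly archive requests by IP and timestamp, then report any group with several hits in the same second. The function name find_archive_bursts and the minimum of 5 hits are assumptions chosen for illustration, and the regex assumes the log format shown above with /YYYY/MM/ archive URLs.

```php
<?php
// Sketch only: the function name and the $minHits default are assumptions.
function find_archive_bursts(array $lines, $minHits = 5) {
    $buckets = array(); // "ip|timestamp" => list of archive paths requested

    foreach ($lines as $line) {
        // Capture the IP, the bracketed timestamp (one-second resolution),
        // and a GET for a /YYYY/MM/ path with an optional query string.
        if (!preg_match('#^(\S+) \S+ \S+ \[([^\]]+)\] "GET (/\d{4}/\d{2}/)(?:\?\S*)? HTTP#', $line, $m)) {
            continue; // not a monthly archive request
        }
        $buckets[$m[1] . '|' . $m[2]][] = $m[3];
    }

    $bursts = array();
    foreach ($buckets as $key => $paths) {
        if (count($paths) >= $minHits) {
            list($ip, $when) = explode('|', $key, 2);
            $bursts[] = array('ip' => $ip, 'when' => $when, 'paths' => $paths);
        }
    }

    return $bursts;
}
```

    Run against the seven lines above, this reports a single burst from 13.8.137.11 covering all seven archive paths.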

    I'd bet dollars to doughnuts that you *also* have the following line in header.php in your theme:

    <?php wp_get_archives('type=monthly&format=link'); ?>

    which is spewing out the following helpful html in the <head> section of every page:

    <link rel='archives' title='March 2010' href='http://yourwebsite.com/2010/03/' />
    <link rel='archives' title='February 2010' href='http://yourwebsite.com/2010/02/' />
    <link rel='archives' title='January 2010' href='http://yourwebsite.com/2010/01/' />
    <link rel='archives' title='December 2009' href='http://yourwebsite.com/2009/12/' />
    <link rel='archives' title='November 2009' href='http://yourwebsite.com/2009/11/' />
    <link rel='archives' title='October 2009' href='http://yourwebsite.com/2009/10/' />

    What happens is that browsers that don't know any better, particularly poorly behaved ones like IE6, interpret the <link> tag as a resource that must be loaded in order to render the page. But I'm seeing a lot of others doing this too (like the aforementioned "Mozilla/4.0" browsers). This really hurts when you have a site that's been around for, oh, 60 months: a single page view can then trigger a request for every monthly archive, so the server gets beaten down.

    The simple solution is to strip that "wp_get_archives" reference from header.php, and voila! The load average on our servers dropped from 16 to 2 almost instantly. The problem now is that without those handy links at the top of the document, googlebot/msnbot etc. won't see/index all the valuable content in our archives. The archives are still listed on the page, just way, way further down in the HTML.

    At first I thought we could control the behavior of the browsers, or at least detect it. Firefox has this awful notion of "prefetch", which will grab the previous/next document if you tell it to, and I thought that it, or the "Fasterfox" plugin, was possibly the culprit in this case.

    I blocked Fasterfox prefetching as detailed here:
    http://fasterfox.mozdev.org/faq.html#Im_a_webmaster,_how_can_I_prevent_prefetching

    And additionally enabled logging of the prefetch requests to see how much of an impact it was having (see here: http://www.edochan.com/programming/pf.htm)

    Neither one seemed to solve it, so my best guess is that older browsers simply don't understand the rel='archives' portion of the link tag.

    So as a final solution (which I'm still working on), I tweaked line 720 of wp-includes/general-template.php and added a phony query string to the link URL so that I can write some .htaccess rules around it (rules that allow bots to hit those links, but nobody else).

    function get_archives_link($url, $text, $format = 'html', $before = '', $after = '') {
         $text = wptexturize($text);
         $title_text = esc_attr($text);
         $url = esc_url($url);
    
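         // Modified: append a marker query string so .htaccess rules can match these requests.
         // Assumes pretty permalinks, i.e. $url has no query string of its own.
         if ('link' == $format)
              $link_html = "\t<link rel='archives' title='$title_text' href='$url?LINK' />\n";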
         elseif ('option' == $format)
              $link_html = "\t<option value='$url'>$before $text $after</option>\n";
         elseif ('html' == $format)
              $link_html = "\t<li>$before<a href='$url' title='$title_text'>$text</a>$after</li>\n";
         else // custom
              $link_html = "\t$before<a href='$url' title='$title_text'>$text</a>$after\n";
    
         $link_html = apply_filters( "get_archives_link", $link_html );
    
         return $link_html;
    }

    There may be a better way to tweak this in WordPress, though I haven't dug deeper yet.

    Once I get an .htaccess rule written up that does what I need it to do, I'll post an update.
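
    One possibly cleaner route, given that the function above passes its output through apply_filters( "get_archives_link", ... ): do the same tweak from the theme's functions.php instead of editing a core file, so it survives WordPress upgrades. This is an untested sketch, not a confirmed fix. The function name mark_archive_link_tags is made up for illustration, and it assumes the <link rel='archives' ...> output format shown above.

```php
<?php
// Sketch only: mark_archive_link_tags is a hypothetical helper name.
// It adds the same LINK marker as the core edit, but only to
// <link rel='archives' ...> output; other formats pass through unchanged.
function mark_archive_link_tags($link_html) {
    if (strpos($link_html, "<link rel='archives'") === false) {
        return $link_html; // <li>, <option>, and custom formats are left alone
    }

    return preg_replace_callback(
        "/href='([^']*)'/",
        function ($m) {
            // Use & instead of ? if the URL already carries a query string.
            $sep = (strpos($m[1], '?') === false) ? '?' : '&amp;';
            return "href='" . $m[1] . $sep . "LINK'";
        },
        $link_html,
        1 // only the first href in the tag
    );
}

// Register the filter only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('get_archives_link', 'mark_archive_link_tags');
}
```

    With this in place the core function can stay unmodified, and the .htaccess rules can still key off the LINK marker in the query string.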

  5. agreda
    Member
    Posted 14 years ago #

    @mixologic Thanks for the suggestion! Looking forward to testing your .htaccess tweak.

  6. andrea_r
    Moderator
    Posted 14 years ago #

    The simple solution is to simply strip that "wp_get_archives" reference from the header.php, and voila! load average on our servers drop from 16 to 2 almost instantly.

    A handy tip, but not applicable in all cases. YMMV. :)

    (I'm gonna keep it in mind tho. :D )

    I've seen brand new setups get absolutely pounded with splog signups. They blocked proxy connections, and presto - sploggers went away. Box load quietened down.

  7. agreda
    Member
    Posted 14 years ago #

    I'd bet dollars to doughnuts that you *also* have the following line in header.php

    @mixologic You owe me a doughnut. ;-) Our theme has no such call in the header. Mandigo seems to have its own get_author_archives function in a separate theme file. Regardless, the output HTML does not include the spew you referenced either.

    blocked proxy connections, and presto - sploggers went away

    @andrea_r Another great suggestion, though with our HughesNet connection (via proxy) I suppose we would be blocking ourselves! FYI: We're having great success stopping Splogs with the WPMU Dev Premium Anti-Splog plugin.

  8. andrea_r
    Moderator
    Posted 14 years ago #

    I have my own splog stopping plugins I use for clients. :) They stop things dead cold.
