Staying One Step Ahead
Almost 2 out of every 3 visitors to your website are not human.
That’s according to a recent report by Incapsula, which found that up to 61.5% of all website traffic is bot (non-human) traffic. If Incapsula’s figures are correct, bot traffic has grown by 20.6% since 2012.
As shown in the pie chart, about 31% of all traffic (roughly half of the bot traffic) comes from search engines or other “good” bots.
The remaining 30.5% of all traffic comes from “bad” bots: 5% from scrapers, 4.5% from hacking tools, 0.5% from spammers, and 20.5% from other impersonators. This is an 8% increase in impersonator bots since 2012.
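The breakdown is easier to follow as arithmetic. The short Python sketch below assumes every quoted percentage is a share of all website traffic, which is the reading under which the categories add up to the 61.5% total. The variable names are illustrative only.

```python
# Shares of ALL website traffic, in percent, as quoted in this post.
# Assumption: each figure is a share of total traffic, not of bot traffic.
TOTAL_BOT_SHARE = 61.5
BAD_BOT_SHARES = {
    "scrapers": 5.0,
    "hacking tools": 4.5,
    "spammers": 0.5,
    "other impersonators": 20.5,
}

# Everything that is a bot but not a "bad" bot is counted as a "good" bot.
bad_bot_share = sum(BAD_BOT_SHARES.values())
good_bot_share = TOTAL_BOT_SHARE - bad_bot_share
human_share = 100.0 - TOTAL_BOT_SHARE

# Good bots as a fraction of bot traffic only, rounded to one decimal place.
good_fraction_of_bots = round(good_bot_share / TOTAL_BOT_SHARE * 100, 1)

print(f"Bad bots:   {bad_bot_share}% of all traffic")
print(f"Good bots:  {good_bot_share}% of all traffic")
print(f"Humans:     {human_share}% of all traffic")
print(f"Good bots make up {good_fraction_of_bots}% of bot traffic")
```

Read this way, good bots account for roughly a third of all visits and about half of all bot visits.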
Good Bot, Bad Bot
Bots are software applications that run automated tasks over the internet, and some of them are malicious.
Malicious bots do things like leave spam comments and links, or attempt to exploit sites in order to install malware. Obviously you don’t want these sorts of bots attacking your website. There are a number of things that we here at WP Engine do to stay ahead of these types of bots (more on this below).
Other bots have good intentions. The “good” bots crawl the web to perform automated tasks that are helpful, such as indexing websites, collecting analytical data, or archiving internet content. You can read more about the benefits of good bots in this previous WP Engine blog post.
Although good bots are necessary for site growth, the increase in overall bot traffic can put more pressure on your website’s servers. Luckily, so long as you are hosting with WP Engine you don’t need to worry. This is because our caching infrastructure is robust enough to handle a flood of traffic while still serving up fresh content.
While the general message of the Incapsula report is accurate—there is certainly a trend of more non-human traffic—the numbers and breakdown of different types of bots may not be representative of the wider web.
How Representative is the Data?
Incapsula obtained their data by observing 1.45 billion bot visits to the 20,000 sites operated by their clients over a 90-day period. As noted by Dr Ian Brown, quoted in this story by the BBC, the figures are useful as an indication of the growth in non-human traffic, but may or may not be representative of the wider web.
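To put the sample size in perspective, here is a rough back-of-the-envelope calculation in Python using only the figures quoted above. It gives a plain average, so it says nothing about how unevenly the visits were spread across sites. The variable names are illustrative.

```python
# Figures quoted from the Incapsula report.
BOT_VISITS = 1_450_000_000   # bot visits observed
SITES = 20_000               # client sites in the sample
DAYS = 90                    # length of the observation period

# Average bot visits per site over the whole period, then per day.
visits_per_site = BOT_VISITS / SITES
visits_per_site_per_day = visits_per_site / DAYS

print(f"Average bot visits per site: {visits_per_site:,.0f}")
print(f"Average bot visits per site per day: {visits_per_site_per_day:,.1f}")
```

On average that is about 72,500 bot visits per site over the period, or roughly 800 per site per day. A sample drawn from one vendor's clients may well differ from a typical WordPress site, which is the point Dr Brown makes.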
Jason Cosper, a WordPress and security expert here at WP Engine, agreed that while there certainly has been an increase in bot traffic, the Incapsula data does not necessarily reflect everyone’s experience. He explained:
If you look at WordPress sites in general, you’d see a lot more spam trying to hit them. People and bots trying to throw links into comments and things like that.
There was also a concerted attack during this past spring where a massive botnet was trying to guess weak administrator account passwords. That attack was handled admirably by our network configuration, but it was constantly hitting some sites well into the summer. That amounted to much more than 0.5% of our traffic [the number quoted in the Incapsula report].
Their [Incapsula’s] numbers are what they’re seeing based on their usage. However, it’s not the whole picture.
So while Incapsula has seen a 75% drop in spam bots since their last report, that hasn’t necessarily been experienced by WP Engine or the wider WordPress community.
Being Proactive Against Attacks
WP Engine has a number of measures in place for dealing with impersonators, spammers, and other types of malicious bots. We are constantly trying to stay one step ahead of hackers, as Jason notes:
While I’m the sort of person who finds this sort of thing fun, this is a constant uphill battle. I feel like Sisyphus sometimes, pushing the boulder up a hill.
One of the measures WP Engine has in place is to automatically filter a number of user agents that are known to be malicious. This basically blocks those attacks, preventing them from even hitting the server in the first place.
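As a rough illustration of the idea, the Python sketch below checks each request's User-Agent header against a blocklist. It is not WP Engine's actual implementation: the blocklist entries and the function names `is_blocked` and `filter_requests` are hypothetical placeholders, and a real filter would usually sit at the network edge rather than in application code.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical blocklist: illustrative placeholder substrings,
# not a real list of malicious user agents.
BLOCKED_AGENT_SUBSTRINGS = ("evilscraper", "spambot", "badcrawler")


def is_blocked(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent contains a blocklisted substring."""
    if not user_agent:
        # This sketch lets requests without a User-Agent through;
        # a stricter policy could reject them instead.
        return False
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BLOCKED_AGENT_SUBSTRINGS)


def filter_requests(
    requests: Iterable[Tuple[str, Optional[str]]],
) -> Tuple[List[str], List[str]]:
    """Split (path, user_agent) pairs into allowed and blocked paths."""
    allowed: List[str] = []
    blocked: List[str] = []
    for path, user_agent in requests:
        (blocked if is_blocked(user_agent) else allowed).append(path)
    return allowed, blocked


sample = [
    ("/blog/", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("/wp-login.php", "Mozilla/5.0 (compatible; EvilScraper/1.0)"),
    ("/feed/", None),
]
allowed_paths, blocked_paths = filter_requests(sample)
print("Allowed:", allowed_paths)
print("Blocked:", blocked_paths)
```

A filter like this only catches bots that announce themselves. The User-Agent header is easy to fake, so impersonator bots need other signals to be detected.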
Not only do we block known attacks, we also stay on top of incoming attacks from unknown user agents. So as an attack is coming, we add on-the-fly rules to handle the attack before it even gets to most customers’ WordPress installs.
In addition, we constantly monitor the server, so if an attack like that does happen, we can find ways to clean it up very easily. This is done through our partnership with Sucuri.
Finally, WP Engine and Sucuri team members follow hacker-oriented blogs and other nefarious fringe sites. That way, we can keep an eye on the attackers, following all of the latest tricks as they’re made available. By staying on top of the most recent developments in malicious software, we can block, prevent, or scan for new attacks.
Want to Know More?
While we at WP Engine do our best to ensure the safety of your site, if you’d like to know about additional security measures you can take, check out Jason’s post on Advanced Anti-Spam Techniques for Torque.
You can also check out these resources on anti-scraper techniques:
- How to Install and Setup WordPress SEO Plugin by Yoast. In particular, check out Step 9, which explains how to set up RSS footers. These are perfect for combatting scrapers because they indicate where the content came from.
- This article, which tells you how to verify your authorship to Google. Setting yourself as the author of your content in Google marks your site as the canonical source. That means that if your site does get scraped, Google won’t penalize you for having content that also appears elsewhere (SEOs refer to this as “duplicate content”).
Do you have any thoughts on handling increased bot traffic?
Try Cloudflare
I love that WP Engine maintains an active blog.
So when you say traffic, are these percentages you’ve referenced the same numbers bloggers use when determining how many visits are made to their blogs? To encourage ads or paid content, established bloggers will often list their site stats on a special page. When they say they have 75,000 visitors a month, what does this really mean?
Thanks!!
Thanks for your feedback Carla! Here’s a previous post of ours that might help answer your question – Robots May Not Have Feelings But They Do Have Eyes. Basically, it depends what the bloggers are using to count their visits. If they are using Google Analytics for example, the number will be different because Google Analytics does not count non-human (bot) traffic. I hope that helps! Kirby
Thank you for those resources. If we have a “top 20” list of urls that continually scrape our content is there a way we can detect and block them? I assume they use servers and IPs different from the url where they constantly re-post our content.