For years I crunched server logs with WebLog Expert to obtain visitor statistics for my website. During the last four months I also ran my own Referrer Filter to cut down on referrer spam. It was laborious but I thought I was getting a pretty accurate picture of human visitors, without the kind of intrusive (and easily blocked) tracking employed by Google Analytics.
Then, in the first week of September, the global bot networks that are for some reason still allowed to operate with impunity discovered my website. The following is a non-exhaustive list of attacks that I identified in the logs. Please keep in mind that I typically get fewer than 2,000 real visitors a week, for weblog and website combined.
- On 03 September, a break-in attempt with ~100 requests for invalid category/* files, as well as ~50 requests for an older page. Looked like a Chinese bot net in training.
- On 04 September, about 150 requests for HexkitGuide.pdf in a denial-of-service manner, with less than one second between requests.
- On 04 and 06 September, a total of about 600 attempts to break into the blog’s login page, in one-second intervals. Various international IPs, so likely a zombie network.
- On 03 September, an incredible 26,319 login attacks, part of an identifiable series of 56,727 for the year so far. Once again the attacks came back-to-back, DoS style. The day’s server log was bloated to five times its usual size.
- Things I’m seeing routinely: the same IP trying 10-20 known WordPress attack vectors in a short interval; a group of bots hitting some random page hourly for a day or two.
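Login floods like the ones listed above can be spotted by counting login-page requests per day in the raw log. The following is only a rough sketch, not the filter I actually used: it assumes the server writes Common or Combined Log Format, and that the login page lives at /wp-login.php (the WordPress default). The names LOGIN_PATH, loginHitsPerDay, and floodDays are made up for illustration.

```javascript
// Sketch: count login-page requests per day in Common/Combined Log Format lines.
// Assumptions (not from the article): the log format and the login path below.
const LOGIN_PATH = '/wp-login.php';

// Captures: client IP, [timestamp], request method, request path.
const LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*"/;

// Returns a Map from day (e.g. "03/Sep/2013") to number of login-page requests.
function loginHitsPerDay(lines) {
  const hits = new Map();
  for (const line of lines) {
    const match = LOG_LINE.exec(line);
    if (!match) continue; // skip lines that do not parse
    const day = match[2].split(':')[0]; // timestamp is day:hour:minute:second zone
    const path = match[4].split('?')[0]; // ignore any query string
    if (path !== LOGIN_PATH) continue;
    hits.set(day, (hits.get(day) || 0) + 1);
  }
  return hits;
}

// Returns the days whose login-page request count reaches the threshold.
function floodDays(lines, threshold) {
  return [...loginHitsPerDay(lines)]
    .filter(([, count]) => count >= threshold)
    .map(([day]) => day);
}
```

Counting per day rather than per IP is deliberate here, since a zombie network spreads its requests over many addresses. A suitable threshold depends entirely on the site's normal traffic.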
All this in addition to a vast amount of benign bot traffic. The shutdown of Google Reader has indeed led to a Cambrian explosion of feed readers… and they all send their bots to me. I’m also seeing a growing number of new search spiders that apparently got wind of my little website as its popularity increased. That’s nice, but WebLog Expert knows only a fraction of these bots, and its database is unlikely to ever fully catch up.
With so many bots around, I got uneasy about the accuracy of my server log analysis and decided to run a cross-check with Google Analytics. GA relies on client-side JavaScript which most bots won’t run, so they are automatically excluded. I now have the results for a full week using both methods, and the difference is devastating.
- Overall visitors – which are already supposed to exclude spiders, spammers, and other bots – were overestimated by a factor of two.
- Unique IPs were overestimated by “only” 30%, probably because benign bots have the courtesy to reuse the same IP.
- Page visits were overestimated by anywhere from 10% to 400%! All the statistics I had recorded over the years turned out to be worthless.
- The website’s index page – a popular bot target – dropped from 179 visits to 20.
Could Google Analytics be horribly inaccurate? That’s not very likely, given its widespread use and the evidence of bot infestation in the server logs. GA’s use of remote JavaScript execution does underestimate visitors, just like WordPress Statistics, but even a generous estimation of people using JS/GA blockers cannot account for such discrepancies.
Google Analytics
So as of today, I’m retiring my public visitor counts page that was based on server log analysis, as its existing content was far too inaccurate. Website & weblog now use Google Analytics to give me a reliable baseline of human visitors. Like WordPress Statistics, GA remotely loads some JavaScript and also leaves a first-party cookie to identify repeat visits. The JS code is loaded asynchronously and should not impede user experience.
Why use an external service at all? Various open-source packages for self-hosted visitor tracking exist, but those are not an option on my cheap hosting service. Its slow and tiny MySQL databases can’t even reliably show WordPress posts without additional caching, and the contract explicitly forbids write-heavy uses such as logging.
Why Google Analytics? I’m not a huge fan of Google but GA has a rather compelling set of advantages. It’s widely used by commercial websites, so its results probably aren’t terrible. It’s free up to 10 million hits a month, and too important to Google for a sudden Reader-style shutdown. Versatile reporting, good documentation, and a liberal license round out the package. If you consider another tracking service, make absolutely sure to read the fine print – I saw one license agreement that forbade modifying the tracking code, and even tried to impose content restrictions on tracked websites!
Usage Tips
The “Tracking Info” panel of the admin module is supposed to show the status of your website’s Google Analytics tracking code. However, Google retrieves this information from its own much-delayed internal data stores, not directly from your website. So expect it to say “Status: Tracking Not Installed” for a day or more, even while tracking is already working and producing data. Check the “Real-Time Overview” on your reports page for incoming hits instead.
Google Analytics supports aggregating multiple top-level domains in a single report, but you must modify the tracking script for that purpose. The current help page for cross-domain tracking suggests a solution that requires different scripts for each top-level domain, which is awkward for me as kynosarges.org and kynosarges.de both point to the same web storage.
What I did instead was use a tip from an older version of that help page which still works fine. I added some custom JavaScript code to extract the entry domain for the current page, and send that back to GA using ga('set', 'hostname', url).
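A minimal sketch of what such custom code might look like follows. It is not my exact script: the helper pickHostname and its fallback argument are hypothetical, and I am assuming the call sits in the standard analytics.js snippet after ga('create', …) and before ga('send', 'pageview').

```javascript
// Sketch: tell Google Analytics which domain the visitor actually entered through.
// pickHostname and its fallback argument are hypothetical, for illustration only.
function pickHostname(loc, fallback) {
  // loc is a Location-like object such as window.location.
  const host = loc && typeof loc.hostname === 'string'
    ? loc.hostname.toLowerCase()
    : '';
  return host || fallback;
}

// Runs only in a browser where the analytics.js snippet has already defined ga().
if (typeof window !== 'undefined' && typeof window.ga === 'function') {
  window.ga('set', 'hostname', pickHostname(window.location, 'kynosarges.org'));
}
```

Since the same script file serves both domains, reading the hostname at run time avoids maintaining one tracking script per top-level domain.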
WordPress Integration
You don’t need a plugin for Google Analytics if you can edit your theme. First, save your complete GA snippet (without the surrounding script tags) to a new JavaScript file below your current theme folder, e.g.
wp-content/themes/twentytwelve/js/google-analytics.js
Now append the following block to the existing file functions.php in the theme directory itself. This is the standard WordPress method for adding new functionality to themes. You should see a long list of similar action handlers in the file.
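    function add_google_analytics() {
        // Skip logged-in users so administrators and contributors aren't counted.
        if ( ! is_user_logged_in() ) {
            // Arguments: handle, script URL, dependencies, version, load in footer?
            wp_enqueue_script( 'google-analytics', get_template_directory_uri() . '/js/google-analytics.js', array(), '20170308', false );
        }
    }
    add_action( 'wp_enqueue_scripts', 'add_google_analytics' );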
The date 20170308 acts as a serial number for the loaded JavaScript file. Change it when you update the file so visitors can be told to re-fetch it. Finally, the is_user_logged_in condition ensures that administrators and other weblog contributors aren’t counted, as usual. This integration method appears to work well with WP Super Cache, too.
2017-03-08: Switched from “Classic Analytics” (script ga.js) to the current “Universal Analytics” (script analytics.js) which now works fine on my site. Also updated PHP snippet to place GA code in the page header rather than footer, for faster concurrent loading.