The new Referrer Filter utility, or ReferFilter for short, is the result of my efforts to get a realistic idea of legitimate traffic from my server logs. ReferFilter is a small Java command-line program that reads a server log from stdin, checks the entries against a referrer whitelist, and produces a filtered log on stdout, as well as copious diagnostic information on stderr.
ReferFilter comes in a small ZIP package with usage instructions and source code. The operation is described in the enclosed ReadMe file. See the project page for background information on referrer spam and sample statistics from my server logs. Some highlights:
60% of visitors have no referrer, and nearly another 20% refer from my own website. About one third of the rest is spam. Deleting all requests with any external referrer would be a fairly effective spam-fighting technique!
Spam accounts for less than 10% of all visitors, but nearly 80% of all referrer domains. Fewer than 10% of those are reused from year to year. That’s why I decided manual filtering needs a whitelist rather than a blacklist.
Spam incidence varies hugely between pages, from zero to nearly 60% of all page hits. That’s why proper spam filtering is necessary. Subtracting a flat 10% from all hits would grossly distort the relative popularity of different pages.
With a hand-knitted list of about 400 domains, ReferFilter works quite well for me – at least as far as referrer spam is concerned. Unfortunately, there’s plenty of other unwanted traffic out there that doesn’t conveniently identify itself with a referrer header, for example attempts to post spam comments or attack bots probing for exploits.
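To make the whitelist idea concrete, here is a minimal sketch of the kind of filtering described above. It is not the actual ReferFilter source. It assumes the log is in Apache's Combined Log Format (each line ends with the quoted referrer and the quoted user agent), and it matches on exact host names. The class name ReferrerSketch, the host www.example.org, and the two whitelist entries are all made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferrerSketch {
    public enum Kind { NONE, INTERNAL, WHITELISTED, UNKNOWN }

    // Assumption: Combined Log Format, so the line ends with "referrer" "user-agent".
    private static final Pattern TAIL =
            Pattern.compile("\"([^\"]*)\" \"[^\"]*\"\\s*$");

    /** Returns "" for an absent referrer, the lower-cased host, or null if unparseable. */
    public static String referrerHost(String line) {
        Matcher m = TAIL.matcher(line);
        if (!m.find()) return null;
        String ref = m.group(1);
        if (ref.isEmpty() || ref.equals("-")) return "";
        try {
            String host = URI.create(ref).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed referrer URL
        }
    }

    /** Classifies one log line; anything unparseable or not whitelisted is UNKNOWN. */
    public static Kind classify(String line, String ownHost, Set<String> whitelist) {
        String host = referrerHost(line);
        if (host == null) return Kind.UNKNOWN;
        if (host.isEmpty()) return Kind.NONE;
        if (host.equals(ownHost)) return Kind.INTERNAL;
        return whitelist.contains(host) ? Kind.WHITELISTED : Kind.UNKNOWN;
    }

    public static void main(String[] args) throws IOException {
        String ownHost = "www.example.org";                    // hypothetical site
        Set<String> whitelist =
                Set.of("www.google.com", "news.ycombinator.com"); // hypothetical entries
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        int kept = 0, dropped = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (classify(line, ownHost, whitelist) == Kind.UNKNOWN) {
                dropped++;
                System.err.println("dropped: " + line); // diagnostics go to stderr
            } else {
                kept++;
                System.out.println(line);               // filtered log goes to stdout
            }
        }
        System.err.println("kept " + kept + ", dropped " + dropped);
    }
}
```

Run it as a pipe, for example `java ReferrerSketch < access.log > filtered.log`, where both file names are placeholders. Requests without a referrer and requests referred from the site itself pass through untouched, so only external referrers are checked against the list.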
At present I don’t seem to get a lot of those on my website, but WordPress installations present juicy targets for hackers. I already found hundreds of hits from attack bots that didn’t supply a referrer header and could only be filtered by originating IP. Not much I can do about that, other than look out for sudden inexplicable traffic spikes.
2013-05-06: I had overlooked another source of misleading hits, namely HTTP requests other than GET. WebLog Expert, the commercial analysis program I’m using, can already filter by HTTP method so I won’t build this feature into ReferFilter. Dropping POST requests found another targeted bot attack that recently boosted a page by 100 fake hits. Moreover, dropping HEAD requests eliminates OpenGraph previews that are legitimate but shouldn’t count as page views. I updated all sample statistics counting only GET requests. The relative changes in terms of referrer spam were quite small, so the conclusions still stand.
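For readers whose analysis tool cannot filter by HTTP method, here is a rough sketch of doing it as a separate step. It is not part of ReferFilter. It again assumes Combined Log Format, where the quoted request line follows the bracketed timestamp. The class name MethodFilterSketch and the sample log lines are invented for illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MethodFilterSketch {
    // Assumption: Combined Log Format, i.e. ... [timestamp] "METHOD /path HTTP/x.y" ...
    private static final Pattern REQUEST = Pattern.compile("\\] \"([A-Z]+) ");

    /** Returns the HTTP method of a log line, or null if the line does not parse. */
    public static String method(String line) {
        Matcher m = REQUEST.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Keeps only GET requests; POST, HEAD and unparseable lines are dropped. */
    public static List<String> keepGetOnly(List<String> lines) {
        return lines.stream()
                .filter(l -> "GET".equals(method(l)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from any real log.
        String prefix = "1.2.3.4 - - [06/May/2013:10:00:00 +0000] \"";
        String suffix = " HTTP/1.1\" 200 512 \"-\" \"UA\"";
        List<String> sample = List.of(
                prefix + "GET /page" + suffix,
                prefix + "POST /wp-comments-post.php" + suffix,
                prefix + "HEAD /page" + suffix);
        // Only the GET line is printed.
        keepGetOnly(sample).forEach(System.out::println);
    }
}
```

Keeping GET and dropping everything else is the simplest rule that covers both cases mentioned above: the POST bot attack and the HEAD requests from preview fetchers.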