To begin, where do third-party click fraud numbers come from? At Google, whenever we detect malicious activity against an advertiser's account, we mark those clicks as invalid, and thus don't charge the advertiser for them. We utilize a number of different automated techniques and algorithms, as well as proactive manual analysis, to do this, analyzing hundreds of different factors. The analysis that we see from third-party auditing firms (including ClickForensics) seems to essentially rely on just one factor, which we call IP frequency. IP frequency is the number of times an IP address clicks within a certain time window. If it clicks too many times, it could be click fraud. On our end, this is a very simple rule which runs in an automated fashion, protecting Google advertisers 24/7. Third-party firms sometimes find the same suspicious IP frequency patterns that our systems do, and include them in their click fraud reports - leading advertisers to request refunds for clicks they were never charged for in the first place.
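To make the IP frequency idea concrete, here is a minimal sketch of such a rule. It is an illustration only, not Google's or any third party's actual implementation: the function name, the 60-second window and the threshold of 3 clicks are all hypothetical values chosen for the example.

```python
from collections import defaultdict, deque


def flag_high_frequency_ips(clicks, window_seconds=60, max_clicks=3):
    """Return the set of IPs that click more than max_clicks times
    within any sliding window of window_seconds.

    clicks: iterable of (timestamp_in_seconds, ip_address) pairs.
    The window and threshold defaults are hypothetical.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(clicks):
        timestamps = recent[ip]
        timestamps.append(ts)
        # Drop clicks that have fallen out of the window.
        while ts - timestamps[0] > window_seconds:
            timestamps.popleft()
        if len(timestamps) > max_clicks:
            flagged.add(ip)
    return flagged


# Hypothetical data: "a" clicks four times in 30 seconds, "b" four times over 5 minutes.
sample = [(0, "a"), (10, "a"), (20, "a"), (30, "a"),
          (0, "b"), (100, "b"), (200, "b"), (300, "b")]
suspicious = flag_high_frequency_ips(sample)
```

With this sample data only "a" is flagged. A rule this simple is easy to run, which is also why it says little on its own: it knows nothing about whether the "clicks" it is fed are real.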
But that is actually not even the most common problem with their analyses. What is far more common is that the reports we receive from them ask for refunds for clicks which do not even exist. This more serious problem comes from the issues we addressed in our August report on fictitious clicks. In that report, we demonstrated the limits of web log based analysis for any analytics purpose (including click fraud analysis) due to the way Internet Explorer, Firefox and other browsers work. Unfortunately, that was a very technical report, which was difficult for many readers to parse. I'll try to provide a simpler explanation here.
Here's the problem: web logs, whether generated by an advertiser or by third-party code on an advertiser's site, cannot directly track ad clicks. Instead, they track visits to a special landing page URL on the advertiser's site (e.g. http://example.com/?adwords ) as a proxy for how many ad clicks occurred. The assumption they're relying upon is that each visit to that URL corresponds to a unique click, and vice versa. But in practice this is not the case. Once a user visits that page, they often browse through the site, navigating through sub pages, and then return to the original landing page by hitting the back button. When the landing page is reloaded in the browser, it appears in the web log as though additional ad "clicks" are occurring. Google can count ad clicks reliably because a click on a Google ad causes the web browser to contact Google first, and we then redirect it to the advertiser's landing page. A reload of the advertiser's landing page does not contact Google again. In addition, the referrer URL passed by the browser when users hit the back button is the original, cached referrer URL (which says the page came from an ad click), so no analysis based on logs alone can resolve this. This is where the fictitious clicks come from.
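The difference between the two counting methods can be sketched in a few lines. This is a simplified model under stated assumptions, not a description of any real logging system: the event names ("ad_click", "subpage", "back_to_landing") and the function name are hypothetical labels for the browsing steps described above.

```python
def count_clicks(session_events):
    """Count one browsing session two ways.

    Returns (redirect_count, landing_page_hits):
      redirect_count    - what a redirect-based counter sees
      landing_page_hits - what a web log of the landing page URL sees
    Event names are hypothetical labels for this illustration.
    """
    redirect_count = 0
    landing_page_hits = 0
    for event in session_events:
        if event == "ad_click":
            redirect_count += 1      # browser contacts the ad server first
            landing_page_hits += 1   # then loads the landing page
        elif event == "back_to_landing":
            landing_page_hits += 1   # reload is logged, ad server is not contacted
        # "subpage" visits touch neither counter
    return redirect_count, landing_page_hits


# One real ad click, followed by browsing and two back-button returns.
session = ["ad_click", "subpage", "back_to_landing", "subpage", "back_to_landing"]
real_clicks, logged_clicks = count_clicks(session)
```

For this session the redirect counter sees 1 click while the web log sees 3 landing page visits, so the log reports two clicks that never happened.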
When we analyze data from web logs under these default conditions, we find that it inflates click estimates by 40% on average. You can think of it this way: if 1000 clicks actually occurred, a log based analysis would estimate on average that there were 1400 clicks, 400 of which are fictitious and did not actually occur.
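The same arithmetic, written out as a small sketch. The function name is hypothetical, and the 40% figure is the average quoted above rather than a constant that holds for any individual site.

```python
def log_based_estimate(actual_clicks, inflation=0.40):
    """Apply an average inflation rate to a true click count.

    Returns (logged_clicks, fictitious_clicks, fictitious_share), where
    fictitious_share is the fraction of logged "clicks" that never occurred.
    The default inflation is the 40% average from the text.
    """
    logged_clicks = round(actual_clicks * (1 + inflation))
    fictitious_clicks = logged_clicks - actual_clicks
    fictitious_share = fictitious_clicks / logged_clicks
    return logged_clicks, fictitious_clicks, fictitious_share


logged, fictitious, share = log_based_estimate(1000)
```

Note that a 40% inflation of the true count means that 400 of the 1400 logged "clicks", or roughly 29% of the log, are fictitious.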
Now consider the principal analytical tool of third-party click fraud firms: IP frequency. When they see a user browsing through the site, and reloading the landing page multiple times in a short time window, they will classify it as click fraud - even though those "clicks" do not actually exist. It also results in the misclassification of advertisers' best users (the ones who are spending time browsing through their sites) as "fraudulent".
Thus, while click estimates were inflated by 40% on average, click fraud estimates were inflated by much, much higher amounts. As we detailed in our report, we found cases of firms reporting click fraud rates above 100% in some instances due to this problem. We also found that in other instances, clicks classified as "click fraud" by third-party firms produced sales at the same rate as the "good" clicks. In other words, the identification of click fraud by third-party firms was much worse than imprecise - it was not even in the right ballpark, with nearly all of the "bad" clicks they identified actually being fictitious.
The net result was that advertisers were consistently being given false data in reports they trusted, and acting on that data would actually hurt their advertising campaigns. For example, an advertiser who is told that certain keywords have higher "fraud rates" is likely to shift spending away from those keywords in favor of others. When the information is false, that change hurts the performance of their campaigns. The damage this can do to advertisers' businesses can be quite large.
So is there a solution to this? Yes. Third-party analytics (not click fraud) firms have been aware of the page reload issue for many years, and generally use redirects (rather than web log based tracking) to avoid it. If one is tied to using web site logs (or landing page code generating logs) however, the only solution is to use the AdWords auto-tagging feature. Auto-tagging has been available since 2005, and is a feature which appends a unique ID to the landing page URL for every click, so that the cases of (a) multiple clicks and (b) multiple reloads of the landing page can be easily distinguished.
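As a rough sketch of how a log-based tool could use the unique ID: count distinct IDs rather than landing page hits. The example assumes the ID arrives as a query parameter named gclid, which is the name auto-tagging uses but is not spelled out above, so it is left as a configurable argument. The function name and the sample URLs and ID values are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def count_unique_clicks(logged_urls, id_param="gclid"):
    """Count distinct click IDs among logged landing page URLs.

    Reloads of the same landing page repeat the same ID and are counted once.
    id_param is assumed to be the auto-tagging parameter name.
    """
    seen_ids = set()
    for url in logged_urls:
        query = parse_qs(urlparse(url).query)
        for click_id in query.get(id_param, []):
            seen_ids.add(click_id)
    return len(seen_ids)


# Hypothetical log: three landing page hits, but only two distinct click IDs.
log = [
    "http://example.com/?adwords&gclid=AAA",
    "http://example.com/?adwords&gclid=AAA",  # back-button reload of the first click
    "http://example.com/?adwords&gclid=BBB",
]
unique_clicks = count_unique_clicks(log)
```

Here the log contains three hits but only two clicks, which is the distinction between (a) multiple clicks and (b) multiple reloads that plain landing page counting cannot make.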
Two of the three firms we identified in our report, AdWatcher and ClickFacts, have not made any changes we're aware of. That's discouraging to say the least. ClickForensics claims to have fixed this problem a couple of months ago by requiring their AdWords clients to use auto-tagging, yet despite such a significant change in methodology, their new numbers are nearly the same as their old numbers. Perhaps auto-tagging hasn't yet been fully or correctly utilized, so the significant corrective drop in their numbers is yet to come. Or perhaps their network is heavily skewed toward non-Google advertisers, and thus they cannot correct the problem until Yahoo, MSN and others implement their own versions of auto-tagging. Until then, considering that the total number of clicks they're counting could be off by as much as 40%, and their click fraud estimates could be off by much more, there's very little meaning in a difference of 0.1% from Q2 to Q4 - or in any of their other inferred statistics. But most importantly, the fact that they don't take into account the invalid clicks that Google has already detected, and never charged advertisers for, means that they're not even trying to measure actual click fraud.