Lies, Damned Lies and Google Analytics

Thu, Mar 10, 2011 11:21 AM
Google Analytics (GA) has changed the world of web analytics. It used to be that you only had applications that analyzed logs from your web server. Those were OK for the first few years. But, with bots (and especially ones that lied about being a bot) and more complex web architectures, those logs became less useful for understanding how your users used your web site.

Once such product was Urchin. Years ago, we were users of Urchin, the product and company that Google purchased to create GA. It was the last really good log analyzer out there. We were kind of excited when Google bought them as we were looking at going to their hosted solution which eventually became GA. At the time however they were young and we decided to go with Omniture, the 800 pound gorilla in the space. However, Omniture prices were such that increased traffic meant increased cost with no additional importance in their numbers and no new features from their product. So, we left them. We turned back to GA. In addition we started using Yahoo's recently acquired product, Yahoo Analytics (formerly IndexTools). They were both free so we figured why not use them both. All this was over 2 years ago.

The one big lacking thing for us with all these tools was being able to tie actual user activity in the analytics to revenue. You see, dealnews does some affiliate based business. This means that we get a percentage of sales after the transactions are all closed. So, we may not see revenue for a user action for days or even weeks. All the analytics packages are made for shopping carts. In a shopping cart, you know at the checkout how much the user spent so you can inject that into the system along with their page view. We don't have that luxury. To try and fill this gap, we started keeping our own javascript based session data in logs for various purposes in addition to the two analytics systems.

Recently we decided to try and use the GA API to get page views, visits and unique visitor data about parts of our site so we could, at least at a high level, tie the numbers all together in one report. This gets me to the meat of this post. What we found was quite disturbing. This is a query for a single day for a particular segment of our site. We used the Data Feed Query Explorer to craft our queries for the data we wanted. This was the result.



As you can see, some pages received visitors but no visits. Pages consistently received more visitors than visits which I find quite odd. Sometimes we found that our internal logging would show activity on pages and Google would simply not show any activity for that page for an entire month. Now, I know they have a cutoff. I believe it is still 10,000, the old Urchin number. (C'mon Google, you are Google. You still have this?) They will only store up to 10,000 unique items (page URLs in this case) for a given segment of their reporting data. So, 10k referring domains, 10k pages, etc. etc. Perhaps that is what happened? We dug in and found that to not be the case. And besides, it shows visitors but not visits. What is up with that? Maybe its a date thing. We should look at more days. This is an entire month for the same filters.



Whoa, still have visitors and no visits. In addition, there are lots of multiples of 17 there. 17, 34 both appear a lot. We found that littered throughout our results when filtering the page path. This has to be made up data or something. I don't see how it could be based in any reality.

The odd data was not limited to this apparent missing data. There is apparently extra data in there too. In another discussion about the impact of recent social marketing efforts we went to the analytics to see what kind of traffic social networking sites were sending our way. When comparing GA to YA and our internal numbers, we were left baffled. Sorry, no numbers allowed here - super corporate secrets and all that. But, I can tell you that Google Analytics claims that several large, well known referring sites send 5x to 10x the traffic to us that our other tracking reports. I even wrote custom code to try and follow a user through our logs to see if maybe they were attributing referrers to a visitor on a second visit in the day as belonging to the earlier referrer but could never find any pattern that matched their data. There simply is no way this is right.

There is good news. If you stick with the high level numbers like overall page views, visitors, visits and things like new vs. old visitors, the numbers appear to line up well. In all cases, the differences of all 3 of our resources are within acceptable ranges. But, apparently, drilling down too deep in the data with GA will yield some very unsavory answers.