Don't make development decisions based on mistakes during A/B tests

Jason Cohen's recent presentation about the science of statistics highlights why it's so important for developers to use metrics properly.

One big change that has come with the era of always-connected computers is that software companies can capture lots of metrics from their applications; even traditional applications like Windows and Office collect this information (through the Customer Experience Improvement Program). The metrics companies gather range from the basic (e.g., how many copies have been downloaded, or how many accounts have been signed up for) to the complex (e.g., usage rates of certain features, or A/B testing to compare the conversion rates of different advertisement text).

When I was doing a lot of number crunching and reporting for the pharmaceutical industry, I learned that it is easy for people to take the wrong message from metrics. Sometimes the numbers measure the wrong things to give a clear picture of the situation, and sometimes the people looking at the numbers do not understand what they are looking at. So when I watched Jason Cohen's presentation about metrics at the Business of Software 2012 event, I was all ears. The most important thing I learned that developers can use is to not be tricked into making decisions based on two common mistakes during A/B tests. If you are not familiar with the idea, an A/B test presents two different designs to users and measures which one performs better on key metrics (such as conversion rate, return rate, or usage rate).
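To make the idea concrete, here is a minimal sketch of how an A/B test result is typically compared, using a standard two-proportion z-test. The visitor and conversion counts are hypothetical, not from the presentation.

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Compare two variants with a two-proportion z-test.

    conv_a / conv_b: conversions observed for each variant
    n_a / n_b: visitors shown each variant
    Returns the z-score; |z| > 1.96 suggests a real difference
    at the conventional 95% confidence level.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the "no real difference" hypothesis
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: B converts slightly better than A
z = ab_test_z(conv_a=120, n_a=2400, conv_b=138, n_b=2400)
print(round(z, 2))  # ~1.15: below 1.96, so not significant
```

Even though variant B looks better in raw numbers here, the z-score falls short of the significance threshold, which is exactly the trap the next section describes.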

Programmers often underestimate the impact of the statistical margin of error. When the difference between two competing designs is less than the margin of error, you do not have meaningful data. Say you are comparing two designs: version A improves a key metric by 3%, but the test has a 5% margin of error. You say to yourself, "Great, a 3% improvement, let's use it!" And when you are in the mindset of iterating designs and making incremental improvements, stacking a bunch of 3% gains on top of each other sounds great. But with a 5% margin of error, you stand a good chance of picking the wrong design. Do this often enough and the mistakes average out, and you tread water. You also stand a chance of making the wrong choice often enough that your key metrics eventually go down (though you could luck out and see them improve, too). If your goal is to make real improvements, any measured improvement smaller than the test's margin of error is not significant enough to base decisions on.

Along the same lines, metrics can create a "design by committee" disaster. The example Jason gave was the link color Google uses for its paid ads: Google tested 41 shades of blue for that link. The problem with this is twofold. First, with that many options, what is the likelihood that any one shade will be significantly better than any other? Pretty slim, you would think, and you would be right. Second, when you let metrics dictate design to this extent, you remove human agency to the point of losing all common sense. Do you really think the shade of blue used for a link makes a big difference in how many people click it? Unless the current color is nearly indistinguishable from the surrounding text or background, any shade of blue should be equally likely to spur a user to click. It isn't as if there is a magic hue that triggers clicks.
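There is a second statistical trap hiding in the 41-shades test, worth making explicit: the more variants you compare, the more likely one of them "wins" by pure chance. A quick back-of-the-envelope calculation (assuming 40 shades each compared against a baseline, with the usual 5% false-positive rate per comparison):

```python
# With many identical variants, chance alone will often crown a "winner".
alpha = 0.05          # per-comparison false-positive rate (95% confidence)
comparisons = 40      # 40 shades each compared against one baseline shade
p_any_false_win = 1 - (1 - alpha) ** comparisons
print(f"chance of at least one false 'winner': {p_any_false_win:.0%}")
```

Even if every shade performed identically, there is a high probability that at least one would look significantly better, which is why multi-variant tests need stricter significance thresholds (e.g., a Bonferroni-style correction) than a simple A/B test.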

The lesson here is clear. First, collect your metrics, but learn which differences matter. Second, learn which tests should drive decisions: not just in terms of which metric is being measured, but whether the item being tested can legitimately affect that key metric.




Justin James is the Lead Architect for Conigent.


Whilst this piece is well written and discusses the effects of poor testing, there is no discussion of what makes a good test in the first place. As a professional marketer, I am all too aware that people will "test two things twenty times" instead of "testing twenty things twice." This is crucial. More importantly, the differences are not usually apparent to those setting up the tests; that is how they test the same thing and *think* it is different, when to the reader it appears pretty much the same.

Having said that, if Google tests forty different shades of blue, they have the mass audience to justify it. With a billion searches every few seconds, this kind of metric will make a difference, however slight.

So what does make a difference? What kind of thing should you be looking for as a difference? As I said, it is not an easy thing to grasp, because it deals with the emotional realm. If you don't agree, that is fine by me. If you tell me this is rubbish, that is fine by me. It matters not a whit what you think, because your actions are determined by what you feel *before* you think.

At its simplest: what do your readers most like, and most dislike? Understand this, and within a short while a large discrepancy will emerge. It will be way beyond the realms of "statistical margin of error," usually by a factor of 30-50%. If the margin is less, I know that my split test isn't hitting the right spot and I need to find another aspect to measure. This sort of testing usually makes statistics irrelevant; just looking at the broad figures will tell you which of ten advertisements or questions are working. As I say, the margins will be substantial and obvious. If your split test is narrower than that, you are not performing a real test.


Obviously Microsoft has misread its telemetry data in just the way you describe.
