
Don't make development decisions based on mistakes during A/B tests

Jason Cohen's recent presentation about the science of statistics highlights why it's so important for developers to use metrics properly.

One big change that has come with the era of always-connected computers is that software companies can capture lots of metrics from their applications; even traditional applications like Windows and Office collect this information (through the Customer Experience Improvement Program). The metrics companies gather range from the basic (e.g., how many copies have been downloaded, or how many accounts have been signed up for) to the complex (e.g., usage rates of certain features, or A/B testing to compare conversion rates of different advertisement text).

When I was doing a lot of number crunching and reporting for the pharmaceutical industry, I learned that it is easy for people to get the wrong message from metrics. Sometimes the numbers measure the wrong things to give a clear picture of the situation, and sometimes the people looking at the numbers do not know what they are looking at. So when I watched Jason Cohen's presentation about metrics at the Business of Software 2012 event, I was all ears. The most important thing I learned, and the one I think developers can use, is not to be tricked into making decisions based on two common mistakes during A/B tests. If you are not familiar with the idea, an A/B test presents two different designs to users and measures which one performs better or worse on key metrics (such as conversion rate, return rate, or usage rate).
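To make the mechanics concrete, here is a minimal sketch of an A/B test in Python. Everything in it -- the variant names, the visitor count, and the underlying conversion rates -- is made up for illustration; the point is simply that visitors are split at random between two designs and a key metric is tallied per variant.

```python
import random

# Hypothetical underlying conversion rates, used only to simulate visitors.
TRUE_RATES = {"A": 0.10, "B": 0.11}

def simulate_visit(variant: str) -> bool:
    """Return True if this (simulated) visitor converts."""
    return random.random() < TRUE_RATES[variant]

def run_test(n_visitors: int = 10_000) -> dict:
    """Randomly assign visitors to A or B and tally conversions per variant."""
    counts = {"A": {"shown": 0, "converted": 0},
              "B": {"shown": 0, "converted": 0}}
    for _ in range(n_visitors):
        variant = random.choice(["A", "B"])   # random assignment
        counts[variant]["shown"] += 1
        counts[variant]["converted"] += simulate_visit(variant)
    return counts

if __name__ == "__main__":
    results = run_test()
    for variant, c in results.items():
        rate = c["converted"] / c["shown"]
        print(f"Variant {variant}: {c['converted']}/{c['shown']} = {rate:.2%}")
```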

Programmers often underestimate the impact of the statistical margin of error. When the difference between two competing designs is less than the margin of error, you do not have meaningful data. Let's say that you are comparing two designs, version A shows a 3% improvement in a key metric, and the test has a 5% margin of error. You say to yourself, "Great, a 3% improvement, let's use it!" And when you are in the mindset of iterating designs and making incremental improvements, stacking a bunch of 3% gains on top of each other sounds great. But with the 5% margin of error, you stand a good chance of picking the wrong design. Do this often enough and the mistakes average out, and you tread water. You also stand a chance of making the wrong choice often enough that your key metrics eventually go down (though you could luck out and see them improve, too). If your goal is to make real improvements, any tested improvement that is smaller than the test's margin of error is not significant enough to base a decision on.
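A rough way to sanity-check this is to compute the margin of error for the observed difference yourself. The sketch below uses a standard two-proportion confidence interval; the conversion counts are hypothetical and chosen only to mimic the "3% lift with roughly a 5% margin of error" scenario above.

```python
import math

def two_proportion_margin_of_error(conv_a: int, n_a: int,
                                   conv_b: int, n_b: int,
                                   z: float = 1.96) -> tuple:
    """Return (observed lift of A over B, ~95% margin of error for that lift)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a - p_b, z * se

# Hypothetical counts: A converts 258/600 (43%), B converts 240/600 (40%).
lift, moe = two_proportion_margin_of_error(258, 600, 240, 600)
print(f"Observed lift: {lift:+.1%}, margin of error: +/-{moe:.1%}")
if abs(lift) <= moe:
    print("Difference is within the margin of error -- not meaningful.")
else:
    print("Difference exceeds the margin of error -- worth acting on.")
```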

Along the same lines, metrics can create a "design by committee" disaster. The example Jason gave was the link color that Google uses for its paid ads. Google tested 41 shades of blue for that link. The problem with this is twofold. First, with that many options, what is the likelihood that any one shade will be significantly better than any other? Pretty slim, you'd think, and you'd be right. The other problem is that when you let metrics dictate the design to this extent, you remove human judgment to the point of losing all common sense. Do you really think the shade of blue used for the link will make a big difference in how many people click it? Unless the current color is nearly indistinguishable from the surrounding text or background color, any shade of blue should be roughly equally likely to spur a user to click. It isn't as though there is a magic hue that triggers clicks.
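As an illustration of why a "winner" among 41 near-identical options is mostly noise, the following simulation (not Google's actual data) gives every shade exactly the same true click-through rate and then looks at how far apart the best and worst observed rates land purely by chance.

```python
import random

# All 41 simulated shades share the SAME underlying click-through rate,
# so any observed spread between them is random noise.
TRUE_CTR = 0.05
VISITORS_PER_SHADE = 2000
N_SHADES = 41

random.seed(1)
observed = []
for shade in range(N_SHADES):
    clicks = sum(random.random() < TRUE_CTR for _ in range(VISITORS_PER_SHADE))
    observed.append(clicks / VISITORS_PER_SHADE)

best, worst = max(observed), min(observed)
print(f"Best shade:  {best:.2%}")
print(f"Worst shade: {worst:.2%}")
print(f"Apparent 'lift' of best over worst: {best - worst:+.2%} (all noise)")
```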

The lesson here is clear. First, collect your metrics, but learn which differences matter. Second, learn which tests should be driving decisions, not just in terms of what metric is being measured, but whether the item being tested can legitimately affect that key metric.

J.Ja


About

Justin James is the Lead Architect for Conigent.
