How to use split tests for customer research

Feature flags and split tests are in the early mainstream. Their real power may be in product research.

Terms like the "build-measure-learn" loop and the "pivot" were popularized by Eric Ries in his book The Lean Startup. The core idea behind the lean startup was to validate a market, making sure that customers were actually willing to pay for the product before building it. Ries's term for that is validated learning: you learn that the market is real, or that it is not.

Once the mania around The Lean Startup quieted down, talk of validated learning largely faded. Outside of venture-funded Silicon Valley startups, people simply don't know how to do it, or, if they do, they don't want to invest the resources.

It turns out you can use split tests.

Feature flags and split tests

When feature flags first became popular, it was as a quality measure. A feature could be defined as on or off for a given category of user, such as the software team, employees, power users, and so on. If something went wrong, turning the feature off was a simple configuration change, not a software change to retest and "push." Split tests take the feature flag idea one step further, delivering a different experience to different groups. In the simplest example, you show two advertisements to 1,000 users each; whichever draws more clicks is the ad you serve to a million people. Companies use A/B split tests to choose book titles and advertisements, and to decide which user interface leads to a more engaged audience.
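The bucketing behind a split test can be sketched in a few lines. This is a generic illustration, not any vendor's actual API: hashing the user ID with the experiment name gives each user a stable, pseudo-random variant assignment, so the same visitor always sees the same experience.

```python
import hashlib

def variant_for(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically assign a user to one variant of an experiment.

    Hashing (experiment, user) means assignments are stable across visits
    and independent across experiments, with no state to store.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Example: a two-way ad test
ad = variant_for("user-42", "headline-test", ["ad_a", "ad_b"])
```

Because the assignment is a pure function of the IDs, no database lookup is needed at serving time; a feature flag is just the special case of two variants, "on" and "off."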

A split test that determines which feature to develop, based on real customer behavior rather than guesswork, is a validated learning.

Hypothesis-driven development

Asa Schachar, a developer advocate at Optimizely, suggests that every product decision should be driven by a testable hypothesis: for example, that a new feature will lead to the typical visitor spending 5% more time on the website. Gojko Adzic describes a similar concept in his book Impact Mapping. This sort of outcome-driven planning lets the team state, in a testable way, how a given feature or project will impact the business. When the team builds the feature and the feature has the intended impact, it is on to the next feature. When it does not, it is time to reconsider how the organization makes decisions.

The "painted door" is a way to get those validated learnings before building the feature. It can be as simple as building the affordance for the feature, the button or link, and having it lead to a coming-soon screen. Jon Noronha, the vice president of product at Optimizely, described an online site that wanted to experiment with a "save for later" button. The team built just the button and found that no one was clicking it.

Split tests give you the ability to build 20 of these "painted door" features (with paint on wood but no actual door yet), run each against 1% of the user base, and see which of the 20 is most popular. However, that can create some problems.
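The traffic allocation described above can be sketched as follows. The scheme is illustrative, and the function and door names are hypothetical: each of 20 painted doors gets a 1% slice of a hash-derived number line, and everyone who falls outside those slices sees the unchanged product.

```python
import hashlib

def door_assignment(user_id: str, doors: list[str], share: float = 0.01):
    """Assign a user to one 'painted door' variant, or None for control.

    Each door receives `share` of traffic (1% by default). The hash makes
    the assignment stable, so a returning user sees the same door.
    """
    digest = hashlib.sha256(f"painted-door:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 16**8        # roughly uniform in [0, 1)
    if point < share * len(doors):
        return doors[int(point / share)]       # which slice the user fell in
    return None                                # control: the unchanged UI

# 20 hypothetical candidate features, 1% of traffic each, 80% control
doors = [f"feature-{i}" for i in range(20)]
assignment = door_assignment("user-7", doors)
```

Comparing click-through on each door against its 1% slice is what surfaces the most popular candidate, which is exactly where the multiple-comparisons problem below comes in.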

Risks of split tests

Noronha explains that when you run a large number of split tests, noise begins to enter the system. He references the XKCD jelly bean comic as an example. In that comic, scientists run 20 separate experiments to see whether a particular color of jelly bean is linked to acne, testing to a 95% confidence level, and the green ones "pass." The trick is that each test carries a 5% chance of a false positive, so across 20 tests the odds of at least one are about 64% (1 - 0.95^20). Simply splitting the test 20 ways does not reduce this problem; Noronha suggests a statistical tool called False Discovery Rate control.
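One standard way to control the false discovery rate is the Benjamini-Hochberg procedure; the article does not say which method Optimizely uses, so this is a generic sketch of the idea. Instead of demanding each test clear p < 0.05 on its own, the p-values are ranked and compared against a sliding threshold that tightens for the smallest ones.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[int]:
    """Return indices of tests rejected under FDR control at level q.

    Benjamini-Hochberg: sort p-values ascending and find the largest rank k
    with p_(k) <= (k / m) * q; reject the k hypotheses with the smallest
    p-values. On average at most a fraction q of rejections are false.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            k = rank                     # step-up: keep the largest passing rank
    return sorted(order[:k])

# With 20 jelly-bean-style tests, a single p = 0.04 would pass a naive
# 0.05 cutoff but fails the corrected threshold of (1/20) * 0.05 = 0.0025.
```

Libraries such as SciPy and statsmodels ship tested implementations of this correction, so in practice there is no need to hand-roll it.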

The ideas above require a new layer, on top of the split tests themselves, to design experiments, measure click-through rates, and validate results. Optimizely provides one such service, with both a free tier and an advanced tier offering the metrics and management insights designed to steer development.

The simplest place to start may be with impact mapping and a way to measure it. After all, if your new product features get used exactly how you'd hoped, there is no problem.
