Why sample sizes are key to predictive data analytics

To use big data for predictive analytics, you must take sample sizes seriously and understand the risks built into your sampling assumptions.

Prior to the age of big data, one of the most common questions I would get asked as a Six Sigma Black Belt was, "How much data should we collect?" It scares me that I don't hear this question asked often enough anymore.

There's so much confusion around sampling these days. Clients have told me we shouldn't worry about sample sizes because we're collecting so much data -- it's obvious that our sample size is adequate. One client said sampling isn't necessary because their machines were capable of processing all the data.

The executives who should care about sampling aren't talking to their data scientists, who in turn don't think it's important enough to discuss. If you're a leader trying to use big data for predictive analytics, you must take sample sizes seriously.

It's all in the sample

Errors in judgment about sampling are the easiest to fix, though they cause the biggest problems for the big data strategist.

First of all, lose any notion that you're collecting all the data (i.e., population data) and therefore don't need to worry about sample size. If you're doing predictive analytics (which should be the case if you're trying to build big data into your corporate strategy), all data that you collect is a sample. Even if you collect massive amounts of data every second, part of your population lies in the future, and you cannot collect data on that.

For instance, you may have clickstream data that you're trying to profile for digital behavior. Let's say your powerful machines can process every single click in real time. That's fantastic, but the point of collecting this data is to predict future behavior. There's data about this future behavior that hasn't happened yet, but it's still part of your population data. Furthermore, don't assume that you're collecting enough data just because there's a lot of it; your instincts may be right, but it's better to know the actual statistics than to take them for granted.

I agree that, as a leader, you should leave the heavy-duty analytics to the data scientists. However, the conversation about sample size is one you need to take part in. The trick is to come up with the right language for the dialogue.

Know what you don't know

In order to determine the right sample size, you should talk to your data scientists. There are executive decisions that are buried in the assumptions drawn from sample size characteristics; these decisions are often ignored or left to the data scientists to make. For better or worse, there are rules of thumb that analysts use for many of these values. These are typically represented as defaults in whatever software your data scientists are using; in most cases, the defaults are accepted and never discussed. This is not how you should make strategic decisions.

The two biggest decisions involve how much risk you want to accept in your assumptions. There are two types of risk: the risk that you're going to take some action when you shouldn't (a false positive, which statisticians call a Type I error), and the risk that you're not going to take some action when you should (a false negative, or Type II error).

Let's say you're trying to define a key customer segment based on the digital behavior data collected from your clickstream. The statistics clearly indicate that your target customer segment behaves differently than the rest of your customers. With this information, you'll invest additional resources to make sure this key customer segment stays engaged. But consider these questions:

  • How confident are you in the statistics? Are you willing to take a 5% chance that the statistics are wrong?
  • What if the statistics say your target customer segment doesn't behave any differently than the rest of your customers? Are you willing to take a 10% chance that these statistics are wrong?

These are common defaults for these types of risk; however, if your strategy is at stake, you should be the one who makes these calls.
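
To see why these percentages are worth a conversation, here is a minimal Python sketch. It assumes a two-sided test and a normal approximation, neither of which is specified in this article, and the function name risk_multiplier is hypothetical. It returns the factor that a standard textbook sample size formula scales with, so you can compare how much more data a stricter pair of risk settings would call for relative to the 5% and 10% defaults.

```python
from statistics import NormalDist


def risk_multiplier(alpha: float, beta: float) -> float:
    """Return (z_{1-alpha/2} + z_{1-beta})^2.

    alpha: accepted chance of acting when you shouldn't (false positive).
    beta:  accepted chance of not acting when you should (false negative).
    Assumes a two-sided test and a normal approximation (illustrative only).
    """
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("alpha and beta must be strictly between 0 and 1")
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_beta = standard_normal.inv_cdf(1 - beta)
    return (z_alpha + z_beta) ** 2


# Compare the common defaults with a stricter, hypothetical choice.
default = risk_multiplier(alpha=0.05, beta=0.10)
stricter = risk_multiplier(alpha=0.01, beta=0.05)
print(f"Relative data requirement: {stricter / default:.2f}x the default")
```

With variance and precision held fixed, the required sample size is proportional to this factor, so the ratio printed above is the price of lowering both risks.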

There are a couple of other things you should know about your sample size that aren't as easy to control; they have to do with how much variance is in your sample based on what you're trying to measure, and how precise you want to be with that measurement.

For instance, you may calculate level of engagement on a continuous scale from 1 to 100. As you start building personas, you'll assign a level of engagement to tease out your best customers. There will be variance within each persona, and the amount of variance affects how much data you should collect; if you have a lot of variance, you need more data. You should expect this to be an iterative process for determining sample size. There's no way to understand your variance until you start collecting data.
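
The iterative process described above can be sketched roughly as follows. This is an illustration only: it assumes you want to estimate a persona's average engagement to within a chosen margin, that a normal approximation is reasonable, and that a small pilot sample gives a usable first estimate of the spread. The function name required_sample_size, the pilot scores, and the margin are all made up for the example.

```python
import math
from statistics import NormalDist, stdev


def required_sample_size(pilot_scores, margin, confidence=0.95):
    """Estimate how many observations are needed so the mean engagement
    score is known to within +/- margin at the given confidence level.

    Uses n = (z * s / margin)^2, where s is the sample standard deviation
    of the pilot data (normal approximation; illustrative only).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    s = stdev(pilot_scores)  # spread observed so far
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * s / margin) ** 2)


# Hypothetical pilot: engagement scores (1 to 100) for one persona.
pilot = [42, 55, 61, 38, 70, 49, 66, 58, 45, 73]
print(required_sample_size(pilot, margin=2.0))
```

As more data arrives, you would rerun the calculation with the updated scores; if the observed variance grows, so does the required sample size, and asking for a tighter margin raises it as well.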

Conclusion

The only way to know whether you've collected enough data to make a prediction is to understand your tolerance for the two types of risk and collect information about your sample's variation. You must open up a dialogue with your data scientists and collectively understand the characteristics of the samples. Otherwise, you're just taking an uneducated gamble with a lot of corporate money.

About

John Weathington is President and CEO of Excellent Management Systems, Inc., a management consultancy that helps executives turn chaotic information into profitable wisdom.

2 comments
Barry Goldman

This article is a timely and useful warning, and managers SHOULD be aware of data limitations. However, they are not statisticians; their first objective should be to hire a competent statistician to analyse the data they wish to use (big or small). Next they should ensure that the statistician has the guts to tell them what problems exist in the data, if any. Then they should hope that said statistician is aware that not all data is normally distributed!

Unfortunately, the use of the word 'risk' is a bit misleading. Essentially, he is talking about the probabilities of making a mistake in one or other direction. The 'risk' is that probability times the 'cost' of taking the action. It is this risk that the manager has to weigh up . . .

KrishnaPG

Finding the right quantity and quality of training data is crucial to the success or failure of learning algorithms. However, finding the right quantity and quality for a given problem and domain is still a challenge. One of my earlier papers on this subject discusses this and gives a solution for enumerable configuration systems. Interested readers can check out: https://www.researchgate.net/publication/228567207_Data-dependencies_and_Learning_in_Artificial_Systems?ev=prf_pub