Prior to the age of big data, one of the most common questions I would get asked as a
Six Sigma Black Belt was, “How much data should we collect?” It
scares me that I don’t hear this question asked often enough anymore.
There’s so much confusion around sampling these days. Clients have told me we shouldn’t
worry about sample sizes because we’re collecting so much data — it’s obvious
that our sample size is adequate. One client said sampling isn’t necessary
because their machines were capable of processing all the data.
The executives who should care about sampling aren’t talking to the data scientists,
who in turn don’t think it’s important to discuss. If you’re a leader trying to use big
data for predictive analytics, you must take sample sizes seriously.
It’s all in the sample
Errors in judgment about sampling are the easiest to fix, though they cause the
biggest problems for the big data strategist.
First of all, lose any notion that you’re collecting all the data (i.e., population
data) and therefore don’t need to worry about sample size. If you’re doing
predictive analytics (which should be the case if you’re trying to leverage big
data into your corporate strategy), all data that you collect is a sample. Even
if you collect massive amounts of data every second, part of your population
involves the future, which you cannot collect data on.
For instance, you may have clickstream data that you’re trying to profile for
digital behavior. Let’s say your powerful machines can process every single
click in real time. That’s fantastic, but the point of collecting this data is
to predict future behavior. There’s data about this future behavior that hasn’t
happened yet, but it’s still part of your population data. Furthermore, don’t
assume that you’re collecting enough data just because there’s a lot of it; your
instincts may be right, but it’s better to know the actual statistics than to
take them for granted.
I agree that, as a leader, you should leave the heavy-duty analytics to the data
scientists. However, the conversation about sample size is one discussion you
should be part of. The trick is to come up with the right language for the
dialogue.
Know what you don’t know
In order to determine the right sample size, you should talk to your data
scientists. There are executive decisions that are buried in the assumptions
drawn from sample size characteristics; these decisions are often ignored or
left to the data scientists to make. For better or worse, there are rules of
thumb that analysts use for many of these values. These are typically
represented as defaults in whatever software your data scientists are using; in
most cases, the defaults are accepted and never discussed. This is not how you
should make strategic decisions.
The two biggest decisions involve how much risk you want to accept in your assumptions.
There are two types of risk: the risk that you’re going to take some action
when you shouldn’t, and the risk that you’re not going to take some action when
you should.
Let’s say you’re trying to define a key customer segment based on the digital
behavior data collected from your clickstream. The statistics clearly indicate
that your target customer segment behaves differently than the rest of your
customers. With this information, you’ll invest additional resources to make
sure this key customer segment stays engaged. But consider these questions:
- How confident are you in the statistics? Are you willing to take a 5% chance that the statistics are wrong?
- What if the statistics say your target customer segment doesn’t
behave any differently than the rest of your customers? Are you willing
to take a 10% chance that these statistics are wrong?
The 5% and 10% figures are common defaults for these two types of risk; however, if
your strategy is at stake, you should be the one who makes these calls.
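To make the trade-off concrete, here is a minimal Python sketch of how those two risk choices drive the amount of data you need. It assumes a standard two-sided comparison under a normal approximation, which is my choice for illustration and not necessarily what your data scientists use. The names `alpha` (the chance of acting when you shouldn’t), `beta` (the chance of not acting when you should), and `risk_multiplier` are mine.

```python
from statistics import NormalDist


def risk_multiplier(alpha: float, beta: float) -> float:
    """Return (z_{1-alpha/2} + z_{1-beta})^2.

    Under a normal approximation for a two-sided test, the required
    sample size is proportional to this factor, so it shows how much
    more data a stricter risk tolerance costs.
    """
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2


# Compare the common defaults with a looser and a stricter choice.
for alpha, beta in [(0.05, 0.10), (0.05, 0.20), (0.01, 0.10)]:
    factor = risk_multiplier(alpha, beta)
    print(f"alpha={alpha:.2f}, beta={beta:.2f} -> multiplier {factor:.2f}")
```

Under these assumptions, tightening the first risk from 5% to 1% while holding the second at 10% raises the multiplier from about 10.5 to about 14.9, which means roughly 40% more data for the same question.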
There are a couple of other things you should know about your sample size that aren’t
as easy to control; they have to do with how much variance is in your sample
based on what you’re trying to measure, and how precise you want to be with
that measurement.
For instance, you may calculate level of engagement on a
continuous scale from 1 to 100. As you start building personas, you’ll assign a
level of engagement to tease out your best customers. There will be variance
within each persona, and the amount of variance affects how much data you
should collect; if you have a lot of variance, you need more data. You should expect
this to be an iterative process for determining sample size. There’s no way to
understand your variance until you start collecting data.
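The sketch below shows one way the iteration could look in Python. It assumes you are comparing average engagement between two groups with similar variance and uses the textbook normal-approximation formula for that case; your data scientists may prefer a different test. The pilot scores are invented for illustration, and `delta` (the smallest difference in engagement worth detecting) and `required_n_per_group` are hypothetical names of mine.

```python
import math
from statistics import NormalDist, stdev


def required_n_per_group(sigma: float, delta: float,
                         alpha: float = 0.05, beta: float = 0.10) -> int:
    """Sample size per group to detect a mean difference of `delta`.

    Normal-approximation formula for a two-sided, two-group comparison
    with a common standard deviation `sigma`:
        n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
    """
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(1 - beta)) * sigma / delta) ** 2
    return math.ceil(n)  # round up: you cannot collect a fraction of a customer


# Step 1: estimate variance from a small pilot sample (made-up scores, 1-100 scale).
pilot_scores = [62, 71, 55, 80, 67, 74, 59, 88, 65, 70]
sigma_hat = stdev(pilot_scores)

# Step 2: compute the sample size implied by that estimate and your precision target.
print(f"estimated sigma: {sigma_hat:.1f}")
print(f"needed per group: {required_n_per_group(sigma_hat, delta=5)}")

# Step 3: as more data arrives, re-estimate sigma and repeat.
```

Because the standard deviation enters the formula squared, doubling it roughly quadruples the data you need, which is why the variance estimate is worth revisiting as collection proceeds.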
Conclusion
The only way to know whether you’ve collected enough data to make a prediction is
to understand your tolerance for the two types of risks and collect information
about your sample’s variation. You must open up a dialogue with your data
scientists and collectively understand the characteristics of the samples.
Otherwise, you’re just taking an uneducated gamble with a lot of corporate
money.