Before you determine the big data sample size that makes sense for your company's goals, you need to get rid of the "junk" data. The trick is making sure you don't exclude valuable data.
"Casting your net" is one of the operative acts of big data gathering that enterprises engage in. Why? To ensure that the sample of big data they draw from is the correct size (in particular, large enough) to support analysis and actionable conclusions.
Sizing the net for the optimum big data "catch" isn't easy. Organizations have to make the correct calls on which big data is "junk," and therefore excludable, and which has the potential to contribute to analysis -- and answers -- for the pressing questions the business faces.
Let's talk about getting rid of the junk first.
In 2011, when the term big data began to take off, Walmart was already feeding its databases more than one million customer transactions per hour, for a total of 2.5 petabytes of data, which was 167 times the total amount of data in the books in the US Library of Congress. Projections for internet traffic at that time showed that data flows would exceed 667 exabytes by 2013, and EMC now estimates that big data alone will grow 45x between 2009 and 2020. These burgeoning data stores are affecting data center storage, policies for keeping data under management, database architectures, and information processing.
When this data arrives "unrefined" (i.e., unfiltered), there is a real risk that you will commit storage, database, and processing resources to a lot of useless data. This is why many companies pursuing big data initiatives seek out vendors, or develop their own strategies, for sifting out "junk" data before they commit the refined data to those resources.
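As a rough illustration of sifting out junk before data reaches storage, here is a minimal Python sketch. The field names (customer_id, amount) and both junk rules are hypothetical assumptions made for the example; real criteria depend entirely on the questions the business wants answered.

```python
from typing import Iterable

# Hypothetical required fields; an assumption for this sketch only.
REQUIRED_FIELDS = ("customer_id", "amount")


def is_junk(record: dict) -> bool:
    """Return True if the record fails either illustrative rule."""
    # Rule 1: a required field is missing or empty.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return True
    # Rule 2: the amount is not a positive number.
    amount = record["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return True
    return amount <= 0


def refine(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split incoming records into (kept, rejected) before they are stored."""
    kept: list[dict] = []
    rejected: list[dict] = []
    for record in records:
        (rejected if is_junk(record) else kept).append(record)
    return kept, rejected


if __name__ == "__main__":
    incoming = [
        {"customer_id": "c1", "amount": 19.99},
        {"customer_id": "", "amount": 5.0},    # empty customer_id
        {"customer_id": "c2", "amount": -3},   # non-positive amount
        {"customer_id": "c3"},                 # missing amount
    ]
    kept, rejected = refine(incoming)
    print(f"kept {len(kept)}, rejected {len(rejected)}")
```

The sketch returns the rejected records instead of silently dropping them, so they could be archived cheaply and revisited if the filtering rules turn out to be too aggressive.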
The danger is that you exclude data that could be valuable -- if you only knew how to exploit it! This takes us to an all-inclusive big data approach that excludes virtually nothing. A prime example is natural intelligence applications, which examine all data associatively, mimicking how the human brain associates and then assimilates information.
Last week, I spoke with Ian Hersey, the chief product officer for Saffron Technology, which provides a cognitive computing platform for the Internet of Things. The platform emulates the associative functions of the human brain. "Like the human brain, we look at all the data, and we connect and index it at the entity level," said Hersey. "Through algorithms that emulate human thought association, we discover patterns in this data." Detecting similarities and patterns in the data is what enables the intelligence engine to learn from it and to generate conclusions based on what it has learned. Because no data is left behind, there is less chance of excluding data that seems irrelevant now but could turn out to be relevant later in analysis.
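To make the idea of connecting and indexing data at the entity level more concrete, here is a toy Python sketch that counts how often entities appear together in the same record. It is not Saffron's algorithm, which the interview does not describe; simple co-occurrence counting is swapped in as a stand-in, and the entity names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_associations(records):
    """Count how often each pair of entities appears in the same record."""
    counts = defaultdict(int)
    for entities in records:
        # Sort the unique entities so each pair has one canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def associated_with(counts, entity):
    """Return (other_entity, count) pairs for `entity`, strongest first."""
    linked = {}
    for (a, b), n in counts.items():
        if a == entity:
            linked[b] = n
        elif b == entity:
            linked[a] = n
    # Order by descending count, then by name for a stable result.
    return sorted(linked.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical records, each reduced to the entities it mentions.
    records = [
        ["pump-7", "vibration", "bearing"],
        ["pump-7", "vibration"],
        ["pump-3", "bearing"],
    ]
    counts = build_associations(records)
    print(associated_with(counts, "pump-7"))
```

Nothing is filtered out up front in this sketch: every record contributes to the index, and weak associations simply end up with low counts.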
Filtered and unfiltered big data represent two approaches that enterprises can take to big data harvesting. It's critical for enterprises to determine the appropriate degree of filtering for their needs so that the big data samples they collect are sized to what they need to analyze.
A good analogy is setting the aperture on a camera. How wide do you need to open the lens so that the data sheds the right amount of light on the business problems you want to solve?
Every business has its own big data aims, so there is no one best practice, except to say that if "setting the lens" or determining how wide you are going to "cast your net" isn't an upfront strategic exercise, you could be missing the boat on your big data investment.