As the NSA has found, the more data we collect, the harder it can be to filter out noise to find signal.
It's a truism in big data that you can never have enough data. With the cost of storage declining to unprecedented levels, the mantra now is to store everything... just in case it becomes useful data tomorrow or years from now.
The problem with this approach, however, is that it assumes storage is the only cost of keeping more data. Lost in the calculation is the difficulty of picking out signal amid ever-increasing noise: the more data we store, the harder it becomes to separate the meaningful from the meaningless.
Just ask the NSA.
So much data... what's a spy to do?
Bill Binney is a former high-ranking official, mathematician, and codebreaker at the US National Security Agency (NSA). He resigned after becoming disillusioned with the way the agency was gathering and using intelligence.
While Binney is a severe critic of the NSA's spying on US citizens, one of his most potent critiques goes to the heart of big data:
"[T]he problem...[w]ith this bulk acquisition of data on everybody [is that the NSA has] inundated their analysts with data. Unless they do a very focused attack, they're buried in information, and that's why they can't succeed."
In other words, there's so much data noise that it's increasingly difficult to decipher any signal.
Noted statistician Nate Silver addresses this in his book The Signal and the Noise:
"If the quantity of information is increasing by 2.5 quintillion bytes per day, the amount of useful information almost certainly isn't. Most of it is just noise, and the noise is increasing faster than the signal. There are so many hypotheses to test, so many data sets to mine — but a relatively constant amount of objective truth."
As both Binney and Silver highlight, the bigger the haystack, the harder it is to find the needle. We make this task ever more difficult for ourselves by using Hadoop and other modern data technologies to create "unsupervised digital landfills," as one Fortune 100 IT executive phrased it to me.
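The haystack problem can be made concrete with a small simulation. The sketch below is purely illustrative and is not drawn from Binney or Silver: it assumes a handful of columns that genuinely track a target, then piles on unrelated random columns and counts how many of those clear a correlation cutoff by chance alone. The function names, the 0.3 cutoff, and the sample size of 50 are all hypothetical choices made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_hits(n_noise, n_signal=5, n_samples=50, threshold=0.3, seed=0):
    """Return (signal_hits, noise_hits): how many real and how many purely
    random columns correlate with the target above the cutoff.
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]

    # "Signal" columns: the target plus a little measurement noise.
    signal_hits = 0
    for _ in range(n_signal):
        column = [t + rng.gauss(0, 0.5) for t in target]
        if abs(pearson(column, target)) > threshold:
            signal_hits += 1

    # "Noise" columns: random numbers with no relation to the target.
    noise_hits = 0
    for _ in range(n_noise):
        column = [rng.gauss(0, 1) for _ in range(n_samples)]
        if abs(pearson(column, target)) > threshold:
            noise_hits += 1

    return signal_hits, noise_hits


if __name__ == "__main__":
    for n_noise in (10, 100, 1000, 2000):
        signal_hits, noise_hits = count_hits(n_noise)
        print(f"noise columns={n_noise:5d}  real hits={signal_hits}  "
              f"spurious hits={noise_hits}")
```

Under these assumptions the number of real relationships never changes, while the count of chance "discoveries" can only grow as more unrelated columns are added. That is the same shape as Silver's argument: a roughly constant amount of truth, surrounded by an expanding supply of plausible-looking noise.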
Nate Silver on signal and noise
Not only does it become ever harder to glean insight from mountains of data, but we can also seduce ourselves into believing that more data necessarily translates into more truth. In fact, all data is always processed by highly biased beings. Our prejudices aren't minimized by data.
If anything, they can be amplified by data, as Silver posits:
"[Big data] is sometimes seen as a cure-all, as computers were in the 1970s. Chris Anderson... wrote in 2008 that the sheer volume of data would obviate the need for theory, and even the scientific method....
"[T]hese views are badly mistaken. The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning.... [W]e may construe them in self-serving ways that are detached from their objective reality."
Ultimately, more data doesn't require less thinking, as some would suggest. We don't magically find correlations in mountains of data. We have to search for them, so we must ask the right questions of our data.
The best data scientist is the one you already have
This is why Gartner analyst Svetlana Sicular is dead-on when she suggests that enterprises will find it easier to train existing employees on big data technologies like Hadoop and NoSQL than to bring in a "mythical data scientist" who already knows such technologies but likely won't know your business.
The hard part is figuring out the right questions to ask of your data, not how to use the technologies.
Which brings us back to the NSA. While the NSA may know which questions to ask of its data to figure out what citizens are doing with their time, could it be that the sheer volume of data gathered through mass surveillance actually makes us less susceptible to the NSA's prying into our lives? Share your thoughts in the discussion thread below.