
Big data: Neither snake oil nor silver bullet

Why the recent questioning of the merits of big data analytics is a healthy debate rather than a condemnation of the field.

It's easy to be seduced by hype, to believe someone or something can transcend the constraints and complexities that render everyday life both humdrum and baffling.

Big data is the latest casualty of overcooked promises made in pursuit of a good story. The backlash began after a report cast a shadow over one of big data's shining beacons, Google Flu Trends.

Google Flu Trends is a service that predicts flu infection rates worldwide based on the search terms people are using, parsing a vast number of searches across 29 countries.

A paper published in Nature in 2009 found the service was able to generate predictions with only one day's delay, faster than the week or so it took the US Centers for Disease Control and Prevention (CDC) to make forecasts based on feedback from doctors' surgeries.

Google Flu Trends' early successes led to articles celebrating the triumph of big data and hailing the ability to resolve information from noise through correlation in large datasets. What followed, as pointed out in the Financial Times, were articles that claimed that since all data points could be captured, old statistical sampling techniques were obsolete, and that statistical correlation reveals everything useful there is to know.

But when a paper published in March this year showed that Google Flu Trends had overestimated the spread of flu-like illnesses by almost a factor of two, it prompted reports pointing out the limitations of relying on correlating patterns in big datasets above all else.

A series of articles examining the state of big data analytics followed. In the Financial Times, David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge University, described some claims made by what the newspaper called "cheerleaders" of big data as "bollocks".

As these pieces highlighted, unless you can be certain you have captured 100 percent of the relevant data, and there are limited situations where this is the case, your big dataset will still be plagued by the pitfalls that have dogged data analysis for decades: sample error and sample bias.

These issues can trip you up when extrapolating what these huge datasets can tell you. Scoop up every Tweet from Twitter and you'll capture the prevailing mood of Twitter users, not of the nation. A similar limitation reportedly affects the Boston Street Bump smartphone app, which records locations of potholes by detecting jolts as a car drives along the city streets. The data produced by the app has been called out for providing a selective map of potholes, one that favours those areas where more affluent smartphone owners tend to drive.
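The point that a bigger dataset does not cancel out a skewed one can be illustrated with a toy simulation. The Python sketch below is purely hypothetical: the population size, the 20 percent share of people on a platform, and the 70 percent and 40 percent rates of answering "yes" are invented for illustration and are not drawn from Twitter, Street Bump or any real dataset. It compares an estimate built from every platform user with one built from a far smaller random sample of the whole population.

```python
import random


def simulate(population_size=200_000, random_sample_size=1_000, seed=0):
    """Compare a large biased sample with a small random sample.

    All rates below are made-up assumptions chosen only to show the effect.
    """
    rng = random.Random(seed)

    # Build a hypothetical population of (on_platform, says_yes) pairs.
    population = []
    for _ in range(population_size):
        on_platform = rng.random() < 0.20          # assumed share of platform users
        p_yes = 0.70 if on_platform else 0.40      # assumed opinion rates per group
        population.append((on_platform, rng.random() < p_yes))

    true_rate = sum(yes for _, yes in population) / population_size

    # "Scoop up everything" from the platform: large, but only platform users.
    platform_answers = [yes for on, yes in population if on]
    biased_estimate = sum(platform_answers) / len(platform_answers)

    # A much smaller sample drawn at random from the whole population.
    random_sample = rng.sample(population, random_sample_size)
    random_estimate = sum(yes for _, yes in random_sample) / random_sample_size

    return {
        "true_rate": true_rate,
        "biased_estimate": biased_estimate,
        "biased_n": len(platform_answers),
        "random_estimate": random_estimate,
        "random_n": random_sample_size,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

Under these assumptions the platform-only estimate rests on tens of thousands of records yet lands further from the population-wide rate than the 1,000-person random sample does. Collecting more of the same skewed records would narrow the random noise around the estimate but leave the bias untouched.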

But the attacks on big data drew The Economist's Kenneth Cukier to speak in its defence this week.

"Of course it's not bollocks, it's preposterous to think that it would be," he told the Big Data Week conference in London, citing analytical systems that have seen massive improvements primarily by scaling up their datasets, among them voice recognition, translation, online recommendation and search suggestion engines.

Word processor grammar checkers, he said, became exponentially better after the datasets used to calibrate them were scaled up from half a million to one billion words, demonstrating improvements far greater than would have been possible by hand coding more comprehensive grammatical rules.

And when it comes to the paper that triggered the wave of criticism of big data, "The Parable of Google Flu: Traps in Big Data Analysis", it isn't a condemnation of Google Flu Trends as a whole. The paper claims: "The comparative value of the algorithm as a stand-alone flu monitor is questionable". But as Cukier points out, it also states that the most accurate predictions of flu outbreaks were produced by combining data from Google Flu Trends with other "near real-time health data", such as that from the CDC.

The debate shouldn't be reduced to whether big data is worthless or not: Spiegelhalter's criticisms weren't directed at the field of big data, but rather at some of the more outlandish claims being made for it. It seems to me the loudest criticisms of big data evangelism still stand: while large datasets can illuminate trends that would be invisible with less data, they don't eradicate problems of sample error and bias.

About

Nick Heath is chief reporter for TechRepublic UK. He writes about the technology that IT decision-makers need to know about, and the latest happenings in the European tech scene.
