As I wrote in my introductory column, I’ve been thinking a lot about data over the past several years. 2013 seemed like the year of modifiers: Big data. Private data. Log data. Liquid data. Open data. Personal data. And, of course, metadata, a concept that everyone from my mother to the President of the United States had an unexpected opportunity to learn much more about last summer.  

The rise of these modifiers into general public discourse suggests that people have woken up to both the potential power and pitfalls of the explosion of data in the world, with more than 90% of the world’s data estimated to have been created in the past few years. In this column, I’ll focus on that angle.

Given that my former colleague at O’Reilly Media, Edd Dumbill, wrote one of the best explanations of what big data is years ago, I won’t retrace his steps and will simply quote him:

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”

For anyone who wants to see more evidence-based decision making in government, academia, medicine, and business, this new resource, often compared to oil or gold, is a heady brew.

“In God we trust,” tweeted former New York City Mayor Mike Bloomberg in 2011. “Everyone else, bring data.” That quotation is generally attributed to statistician W. Edwards Deming, who taught businessmen how to improve product design, service, quality, and sales through measurement in the 20th century. In many ways, Deming’s work informs the “lean startup” methodology popularized by entrepreneur Eric Ries today. If something can be measured, it can be improved, so the thinking goes. U2 singer Bono has urged the world to adopt “factivism,” applying data and evidence to every field. Microsoft cofounder Bill Gates has applied a data-driven approach to philanthropy, from fighting disease to improving education.

The excitement and immense hype about what can be done with all of this data has, however, given critics ample material to dine out on for years. Such criticism ranges from mainstream columnists like David Brooks, exploring "data-ism" and telling us what data can't do, to Evgeny Morozov, the voluble contrarian who has highlighted the dangers of solutionism and "Internet-centric" perspectives that divorce writing about technology from the underlying politics and economics that animate its creation, architecture, and uses. Kate Crawford, a researcher at Microsoft, has warned of the hidden biases in big data, cautioning that those who seek to use it must take care to untangle algorithmic illusions from reality. And you could find thousands of stories in 2013 about what the National Security Agency could, or could not, do with the massive amount of metadata gathered from the bulk collection of phone records and Internet communications.

Part of the trouble is that big data has long since become a big buzzword, enabling marketers, vendors, media, academics, and politicians to project whatever they like upon it. That bubble is hard to puncture with criticism, real or otherwise. That reality has been acknowledged by close observers of the phenomenon, like Ken Cukier, The Economist’s data editor, who suggests thinking about it in terms of its features:  

“…we can do things with a huge corpus of data that we are unable to do with smaller amounts, to extract new insights and create new sources of value. This encompasses things like machine learning, in which we have self-driving cars and decent language translation. This is not because we have faster chips or cleverer algorithms, but because we have more data (and the tools to process it at a vast and affordable scale).”

In other words, the romance of a better world through data is over. According to Gartner's hype cycle, big data is now in the "trough of disillusionment," which means it's time to look deeper at who is using it, how they're using it, and for the benefit of whom. Few stories have had as much impact on that count as Charles Duhigg's feature on the application of data to measure shopping habits, which was memorably summarized by Kashmir Hill as the way Target figured out a teen girl was pregnant before her father did. The "creep factor" of this ability was high enough that more people started paying attention.

The existential issues that the collection and use of all of this data pose to civil liberties and privacy have finally attracted the attention of the White House, where President Obama has tasked advisor John Podesta with leading a review of the future of big data and privacy. Here's part of what Podesta wrote about the review on The White House blog:

“We are undergoing a revolution in the way that information about our purchases, our conversations, our social networks, our movements, and even our physical identities are collected, stored, analyzed and used. The immense volume, diversity and potential value of data will have profound implications for privacy, the economy, and public policy. The working group will consider all those issues, and specifically how the present and future state of these technologies might motivate changes in our policies across a range of sectors.”

While it’s clear that there are significant opportunities to apply the increasing volume of data to the betterment of many industries and sectors, it’s worth noting that the notion of better living, working, and fighting through data isn’t a new one. Cukier and Viktor Mayer-Schönberger memorably captured some of the history here in their feature looking back at former U.S. Defense Secretary Robert McNamara, a “hyper-rational executive” who was led astray by numbers in the 20th century:

“The use, abuse, and misuse of data by the U.S. military during the Vietnam War is a troubling lesson about the limitations of information as the world hurtles toward the big-data era. The underlying data can be of poor quality. It can be biased. It can be misanalyzed or used misleadingly. And even more damning, data can fail to capture what it purports to quantify.

We are more susceptible than we may think to the ‘dictatorship of data’—that is, to letting the data govern us in ways that may do as much harm as good. The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting that something is amiss.”

As more institutions and sectors adopt data-driven policies, regulation, commerce, or legislation, leaders, managers, and educators would do well to avoid being seduced by the fantasy that ultimate truth lies in numbers.

One pragmatic approach, combining data analysis with real-world experience, can be found in New York City, where the mayor’s “geek squad” has applied predictive analytics to saving lives and taxpayer dollars.

A team led by Mike Flowers, the city’s first director of analytics, used data mining to find signals that were highly correlated with fire risk and brought the results to inspectors at the New York City Department of Buildings and the fire department. By combining their knowledge of the physical world with the picture painted by the numbers, New York City has been able to reduce deaths from fires through new risk-based inspections generated by data analysis.

Now, New York City is applying its immense cache of regulatory data to other areas, including predictive policing, which raises the same kinds of civil liberties concerns that such analyses by the Department of Homeland Security or the National Security Agency do at the national and international levels. As the administration of New York City Mayor Bill de Blasio works through these issues, the mayor’s office would do well to set a new standard for transparency by being more open about its crime data and algorithms, enabling the media and government watchdogs to audit and understand how decisions are being made.

Our world, awash in data, will require new techniques to ensure algorithmic accountability, leading the next generation of computational journalists to file Freedom of Information requests for code, not just data, enabling them to reverse engineer how decisions and policies are being made by programs in the public and private sectors. To do otherwise would allow data-driven decision making to live inside a black box, ruled by secret code, hidden from the public eye and from traditional methods of accountability. Given that such a condition could prove toxic to democratic governance, and perhaps to democracy itself, we can only hope that these journalists succeed.