Big data: Can it predict the spread of Zika? Cloudera thinks so

A recent hackathon in Austin, TX, tried to model the spread of Zika and predict its path, but is the hackathon model the best way to get results from big data?

"Big data" means different things to different people. Ask the average consumer and they'll probably say it's something to do with the cloud. Ask a business owner and you'll hear about detailed information that can be combed for ways to increase profits, and an IT professional is likely going to get into the nitty-gritty of the newest, fastest ways to process massive amounts of data.

For Cloudera Inc., a data company out of Austin, TX, big data isn't just about profits: it's also valuable for public health projects like fighting the Zika virus. The company recently hosted a hackathon in partnership with the University of Texas to find uses for data generated by the current outbreak, and in just one day participants developed several ways to model and predict Zika's spread.

SEE: 10 ways big data, analytics, and sensors are helping behind the scenes (TechRepublic)

Over 50 local data scientists, programmers, and tech experts came together to scrape data from the CDC, WHO, and ECDC, with some of the results looking quite promising. One programmer used TensorFlow machine learning and satellite images to find standing pools of water, while others focused on designing an app that crowdsources data from those experiencing symptoms. The app would automatically geolocate new cases and track the spread of the disease, enabling health workers to target an area before the outbreak is serious.
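To make the standing-water idea concrete, here is a minimal sketch in plain Python. It is not the hackathon's TensorFlow model; it swaps in the Normalized Difference Water Index (NDWI), a standard remote-sensing heuristic, applied to a tiny synthetic tile with made-up green and near-infrared band values:

```python
# Hypothetical sketch: flag likely standing-water pixels in a satellite tile
# using NDWI = (green - NIR) / (green + NIR). Water reflects green light and
# absorbs near-infrared, so NDWI above zero suggests water.

def water_mask(green, nir, threshold=0.0):
    """Return a boolean grid marking pixels whose NDWI exceeds the threshold."""
    mask = []
    for g_row, n_row in zip(green, nir):
        mask.append([
            (g - n) / (g + n + 1e-9) > threshold  # tiny epsilon avoids divide-by-zero
            for g, n in zip(g_row, n_row)
        ])
    return mask

# Synthetic 4x4 tile: the top-left 2x2 corner mimics water (high green, low NIR);
# the rest mimics vegetation or soil (low green, high NIR).
green = [[0.8, 0.8, 0.2, 0.2],
         [0.8, 0.8, 0.2, 0.2],
         [0.2, 0.2, 0.2, 0.2],
         [0.2, 0.2, 0.2, 0.2]]
nir   = [[0.1, 0.1, 0.6, 0.6],
         [0.1, 0.1, 0.6, 0.6],
         [0.6, 0.6, 0.6, 0.6],
         [0.6, 0.6, 0.6, 0.6]]

mask = water_mask(green, nir)
print(sum(sum(row) for row in mask))  # 4 water pixels, in the top-left corner
```

A real pipeline would run a trained classifier over multispectral imagery, but the principle is the same: score each pixel, threshold the score, and hand the resulting water map to health workers.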

Attendees also learned a lot about how companies like Cloudera crunch data and how to incorporate those methods into public health research. Eddie Garcia, Cloudera's chief security architect, said that a cure for Zika wasn't the goal of the hackathon. "[We want] to build awareness around the disease," he said, as well as to "highlight the challenges to find and create data sets for research and the socialization of open data sets for social good." Garcia and those who worked on the hackathon want to keep the project going, and they envision this as only the first of many events that find socially valuable ways to use data.

Are hackathons the ideal format?

Sitting in a room full of tech professionals with one goal is invigorating, but if you ask Miriam Young, communications director at DataKind, she'll say hackathons aren't the most effective way to get things done. "A lot of great ideas come out of hackathons," she said, "but those ideas rarely lead to a useable product." With the ever-increasing popularity of hackathons, there are problems that need to be addressed, which is exactly what DataKind aims to do.

As opposed to a one-off, loosely organized event, DataKind focuses on long-term projects that it calls data dives. "We partner directly with the organizations we're helping so that we can work with them, not for them," Young said. DataKind staff and project volunteers often spend months researching and collating data before a weekend data dive, and the end result for DataKind has been a multitude of permanent projects that have made a real difference.

DataKind data dives have produced human rights alert filtering systems for Amnesty International, a triage system for Crisis Text Line, and even mobile device software for Nexleaf that prevents vaccine spoilage and maximizes effectiveness. "Big data can be an amazing resource for public health. The only problem is getting the biggest benefit out of the massive amount of data an organization might have," Young said.

Collaboration is the name of the game as far as DataKind founder Jake Porway is concerned. "Without subject matter experts available to articulate problems in advance, you get results ... [that] solve the participants' problems," he wrote in the Harvard Business Review.

"As data scientists, we are well equipped to explain the 'what' of data, but rarely should we touch the question of 'why' on matters we are not experts in," Porway added. Hackathons, he said, are often a free-for-all that simply doesn't address the real needs of the host organization. DataKind's team is in constant communication with subject matter experts, and Porway doesn't think it can work any other way.

SEE: Why 2016 might be the year of citizen data scientists (TechRepublic)

It isn't just the hackathon format that causes problems either: it's also the data itself. Whether the data is gathered from the CDC, Google search results, or any other source, there's an inherent problem that Porway also wants to call attention to: there is no such thing as "raw data."

Cloudera data scientist Juliet Hougland agrees with DataKind, which is in large part why the Zika virus hackathon is the first in a series. "We partnered with the Golden Gate National Parks Conservancy (GGNPC) to track the reintroduction of local plant species, and there's one big reason we succeeded: there was a member of their team at the hackathon with us."

To Hougland and the Cloudera team, there's simply too much data out there to dive in without guidance from someone who knows the material. In the case of the Zika hackathon, multiple events and a close partnership with the University of Texas at Austin are how they'll create results. "It takes time to merge datasets and find relations," Hougland said, "which is why we plan on using UT Austin's computing resources to continue analysis well after our hackathons are over."

Where big data falls short

Most people who follow news about big data are familiar with the Google Flu Trends failure in 2013. Google tried to use search data to predict flu rates in advance of flu season, but the end result was anything but accurate: its estimates ran at more than double the CDC's reported numbers. And 2013 wasn't the first time; it had the same problem in 2011 and 2012 as well.

It's hard to pinpoint just why Google Flu Trends failed, but the likely causes aren't unique to Google: data-gathering algorithms change over time, participants can manipulate the data that's gathered, and biases in both the participants and the data organization can affect what is considered valuable.
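Those failure modes are easy to reproduce in miniature. The sketch below uses toy numbers (not Google's model and not real CDC data) to show how a proxy signal that merely correlates with the target during training can wildly overshoot once user behavior shifts, for instance when media coverage drives searches that no longer track actual illness:

```python
# Hypothetical illustration of proxy-signal drift; every number is made up.
# Training period: search volume genuinely tracks flu cases.
searches_train = [100, 200, 300, 400]   # weekly flu-related searches
flu_train      = [10, 20, 30, 40]       # actual flu cases per 100k people

# Fit a one-parameter least-squares line through the origin:
# slope = sum(s * f) / sum(s * s)
slope = (sum(s * f for s, f in zip(searches_train, flu_train))
         / sum(s * s for s in searches_train))

# Later, heavy media coverage spikes searches without a matching rise in illness.
searches_now = 500   # searches inflated by news stories, not sickness
flu_actual   = 25    # true case rate

prediction = slope * searches_now
print(prediction, flu_actual)  # 50.0 vs 25: the model overshoots by 2x
```

The model isn't "wrong" about the training data; the relationship it learned simply stopped holding, which is exactly the kind of drift that an algorithm update or a news cycle can trigger.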

Data scientist Cathy O'Neil said that, quite simply, humans place too much faith in algorithms. "Algorithms are just as biased as human curators," she said in a recent blog post. She also said that we often trust algorithms more than we trust people, despite the fact that they are created by human programmers. It's here that big data, in almost any form, starts to fall short.

What this means for your organization

If you're collecting, sorting, or using big data, there's a lot to consider. Keep the following points in mind so you don't make Google Flu Trends-level mistakes with your data projects:

  • Collaboration that gets results won't always happen in one weekend. Your subject matter experts need to work with programmers until your data turns into real, usable information.
  • Biases are everywhere, so don't assume your raw data is just data; it's all touched by people at some point. You'll get much better results by being transparent about every step of the data gathering process: the algorithms, the methods, and even the questions being asked all color your data.
  • If you want to host a hackathon, don't expect big results from a single event. Cloudera's event is planned as just one of many, and yours should be too.

Big data has the potential to save lives and change the way we live, but, like anything else, it demands scrutiny. As Jake Porway said, data isn't just numbers: it's the quantification of our world. If you want to capture the beauty of big data, you're going to need to commit.

About Brandon Vigliarolo

Brandon writes about apps and software for TechRepublic. He's an award-winning feature writer who previously worked as an IT professional and served as an MP in the US Army.
