A new online initiative from the U.S. Food and Drug Administration (FDA) may offer a partial antidote to the government tech hangover that the botched launch of HealthCare.gov gave the American people last fall. When openFDA went live on June 2, 2014, announced to great acclaim by U.S. chief technology officer Todd Park at the 5th Health Datapalooza in Washington, D.C., the public gained the ability to search through millions of reports of adverse drug reactions.

“This is only the beginning for openFDA,” he said. “Throughout the summer, the openFDA initiative will be adding APIs for product recalls and product labels as well. We can’t wait to see what [people] will build using this information as fuel.”

Park also announced that the Blue Button now enables personal data access for more than 150 million Americans.

The new open data platform, operated by the FDA’s Office of Informatics and Technology Innovation (OITI), joins a growing number of online efforts that the Obama administration is starting up in its second term.

As I’ve highlighted before, these open government data feeds represent an important class of public infrastructure for researchers, industry, media, and the American people to use. Open data fuels economic activity, enables resilience against climate change, provides insight into healthcare costs and fraud, supports energy efficiency and cost savings, and drives many other outcomes, from transparency and accountability to public participation. The 20 gigabytes of data (compressed) behind openFDA are now part of that list, with a potential outcome that goes beyond efficiency: increased consumer safety.

“Pharmaceutical companies already use this data,” said Sean Herron, a Presidential Innovation Fellow who worked on the openFDA project. “There is vast potential for finding more drug-to-drug interactions. The FDA uses this data right now as a huge part of surveillance activity and already is using it quite a bit internally to get insights. Our hope is by putting it out to the public, it will inform consumers. If people do research or scientific inquiry on it, it will help the FDA. Currently, researchers have a hard time citing events data. Our hope is that by having data in the open, it will help research.”

The story of opening FDA data

How did the new platform come to be? That success story involves many major players: leadership at the FDA that supported the initiative, a Presidential Innovation Fellow dedicated to it, a small team of contractors that used current web technologies and standards, and a technology startup that used the power of cloud computing and crowdsourcing to digitize documents.

OpenFDA was announced in February 2014 at the Safety Datapalooza, after the agency quietly put up open.fda.gov in January 2014. Alexander Gaffney, a sharp-eyed observer at the Regulatory Affairs Professionals Society, noted that the effort to free the data was unprecedented at the agency. That means leadership mattered.

While there was a top-down mandate from a 2013 Presidential Executive Order on open data and momentum from years of the Department of Health and Human Services’ Community Health Data Initiative, the FDA still had to decide to commit to putting this data online and follow through with harmonization, digitization, and publishing. The agency hired Dr. Taha Kass-Hout as its first health informatics officer in March 2013. Dr. Kass-Hout came over from the Centers for Disease Control and Prevention (CDC), where he’d helped the CDC adopt cloud computing. Here’s an excerpt from his introductory blog post on openFDA:

“In the past, these vast datasets could be difficult for industry to access and to use. Pharmaceutical companies, for example, send hundreds of Freedom of Information Act (FOIA) requests to FDA every year because that has been one of the ways they could get this data. Other methods called for downloading large amounts of files encoded in a variety of formats or not fully documented, or using a website to point-and-click and browse through a database — all slow and labor-intensive processes.

OpenFDA will make our publicly available data accessible in a structured, computer-readable format. It provides a ‘search-based’ Application Programming Interface — the set of requirements that govern how one software application can talk to another — that makes it possible to find both structured and unstructured content online.”
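To make the idea of a "search-based" API concrete, here is a minimal sketch of how a developer might compose a request. The endpoint URL, the `search` and `limit` parameters, and the field name are assumptions that do not come from the post quoted above, so check the openFDA documentation before relying on them. The sketch only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed endpoint for adverse event reports; not stated in this article.
BASE_URL = "https://api.fda.gov/drug/event.json"


def build_search_url(field, term, limit=5):
    """Return a URL that searches a single field for a single quoted term."""
    query = {
        "search": f'{field}:"{term}"',  # field:"term" search syntax (assumed)
        "limit": limit,                 # maximum number of records to return
    }
    # urlencode percent-escapes the colon and the quotation marks.
    return f"{BASE_URL}?{urlencode(query)}"


# Example: look for reports that mention aspirin (the field name is an assumption).
url = build_search_url("patient.drug.medicinalproduct", "aspirin", limit=1)
print(url)
```

The returned string can be pasted into a browser or handed to any HTTP client, which is the practical difference from filing a FOIA request or clicking through a database.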

What may not be clear from this post or the FDA’s other public statements is that these open datasets weren’t just hard to access in digital form: there was also a cost involved in acquiring them. Before openFDA went online, the way to get the data was to visit the site AdverseEvents, which acquires the data using the FOIA and resells it for a couple of hundred dollars. Lowering the access, friction, and cost barriers to adverse events and recalls data is quite likely to lead to its use by the public and in more services and applications.

“What we’ve done is create a broad open data initiative,” said Herron. “It was an effort to find existing datasets that might be available only through FOIA, bring them out, polish and shine them, and release them in formats developers would use.”

The approach that the FDA chose to open up more than 3 million adverse event records at openFDA is in some ways the opposite of the one federal and state governments chose to build health insurance marketplaces around the US.

This apples-to-oranges comparison isn’t entirely fair given the scope of the two projects (from security and privacy constraints to traffic demands to the complexity of shopping for health insurance), but given that HealthCare.gov is now a popular yardstick for the federal government’s use of technology, the contrasts are important to draw, from the use, attribution, and publishing of open source code to agile development to the use of Amazon’s cloud computing power.

Instead of contracting with a huge systems integrator around the Beltway to develop a classic enterprise platform, the FDA worked with Iodine, a tiny data science startup in San Francisco. Iodine helped the FDA harmonize the data, create a cutting-edge website, and write and release open source code for a data publishing platform for it.

“If you look at the website, you’ll see a lot of visualizations that start to give a sense of what is possible in terms of extracting value from it,” said Thomas Goetz, a cofounder of Iodine, in an interview. “That’s one of the things we’re really proud of: it’s not just saying ‘here’s API access, go figure it out.’ We’re really trying to take that next step towards showing people, with powerful examples, what this data could be.”

Opening data means cleaning data

Whether you call it data science or data journalism, the hardest part of the process often comes in cleaning and structuring data. That was true for openFDA as well.

“The data was extremely messy,” said Herron. “There were encoding errors, among others. The FDA’s primary job hasn’t been publishing open data. They didn’t necessarily have the skill or the resources to publish open data. By doing this deep effort, with a lot of data clean up, a lot of the fields are now very well documented. We removed duplicate records when we did the harmonization process. There are now 15 identifiers mapped to 85% of the adverse events data. For instance, where you previously just had ‘Tylenol,’ now you’ll find manufacturer, DUNS #, ingredients, where it was produced, et cetera.”
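The two cleanup steps Herron describes, removing duplicate records and mapping a bare drug name to richer identifiers, can be illustrated with a small sketch. This is not the openFDA team's code: the record layout, the `report_id` key, and the lookup table are hypothetical, invented only to show the shape of the work.

```python
def harmonize(records, drug_index):
    """Drop duplicate reports and attach identifiers to recognized drug names."""
    seen_ids = set()
    cleaned = []
    for record in records:
        report_id = record["report_id"]  # hypothetical unique key per report
        if report_id in seen_ids:
            continue  # skip a report we have already kept
        seen_ids.add(report_id)
        # Normalize the free-text name so "Tylenol " and "tylenol" match.
        name = record["drug"].strip().lower()
        # Unrecognized names get an empty mapping rather than a guess.
        identifiers = drug_index.get(name, {})
        cleaned.append({**record, "drug": name, "identifiers": identifiers})
    return cleaned


# Hypothetical sample data; the manufacturer value is a placeholder.
records = [
    {"report_id": "r1", "drug": "Tylenol "},
    {"report_id": "r1", "drug": "Tylenol "},
    {"report_id": "r2", "drug": "unknown brand"},
]
drug_index = {
    "tylenol": {
        "manufacturer": "Example Manufacturer",
        "ingredients": ["acetaminophen"],
    }
}

cleaned = harmonize(records, drug_index)
print(cleaned)
```

Leaving unmatched names with an empty mapping mirrors the point that identifiers cover most, but not all, of the adverse events data.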

When I talked to Kuang Chen, the founder of Captricity, a California-based document digitization startup, he specifically praised the work that Herron and Iodine did to harmonize and publish the data online.

“That dataset is now the shining star of open data,” he said. “The reason why is that it’s extremely thought-through by the Iodine guys, in terms of how to present it.

“If you put PDFs online, it really doesn’t do anything,” said Chen. “If you think it through and use modern web APIs, the way they’ve done, and apply modern web materials to the mix, you get this beautifully usable interface. Those guys really know about data sources and make it work for consumers.”

By developers, for developers

In practice, this means that openFDA was built by developers, for developers, using open standards and open source code. If you visit the FDA’s GitHub account, you’ll find both the openFDA website and the openFDA research project, with the code repository for the software that powers the FDA API. Notably, the Elasticsearch schema for the adverse event open data format and the Node.js API server that powers it are available for anyone to adopt.

“The API is very simple,” said Herron. “Elasticsearch is an open source search platform with a small API proxy on top. We leveraged existing open source work for this. All code for data cleansing is open source and on GitHub. My dream for openFDA is for another agency to pick it up and deploy it on their own. OpenFDA doesn’t care what data it has in it: as long as you do mapping, everything is agnostic.”
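The "small API proxy on top" of Elasticsearch that Herron mentions can be pictured as a translation layer: it accepts a simple search string and turns it into a query body for the search engine. The sketch below is an illustration in Python, not the project's actual proxy code; the `field:term` input format and the function name are assumptions, while `match` and `size` are standard parts of the Elasticsearch query DSL.

```python
def to_elasticsearch_query(search, limit=1):
    """Translate a 'field:term' search string into an Elasticsearch query body."""
    # Split on the first colon only, so terms may contain colons themselves.
    field, _, term = search.partition(":")
    term = term.strip('"')  # allow the term to be wrapped in double quotes
    if not field or not term:
        raise ValueError("expected a search string in the form field:term")
    return {
        "query": {"match": {field: term}},  # full-text match on one field
        "size": limit,                      # cap the number of hits returned
    }


# The field name here is an assumption, used only as an example.
body = to_elasticsearch_query('patient.drug.medicinalproduct:"aspirin"', limit=5)
print(body)
```

Because the translation never inspects what the fields mean, the same layer would work for any dataset once its mapping is defined, which is the agnosticism Herron is describing.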

According to Herron, the development process for the site and APIs over the past year was also guided by a private beta of 50 or so users. “We had quite a few discussions, with substantial changes to how the API returned results,” he said.

The importance of this approach can’t be emphasized enough: if developers are the target audience for your initiative, how a government agency goes about standardizing and publishing data for them really matters, as does outreach to the communities that an agency wants to become part of the healthy ecosystem around that data. A link to openFDA on StackExchange atop the website suggests that the FDA understands where developers hang out.

“I have been at the FDA for almost a year,” Herron explained. “We identified this dataset within the first weeks of my fellowship. Ever since then, we’ve been working on freeing it. From the beginning, there was a big focus on outreach to developers. While we had a great team, we couldn’t put something out there right away. We engaged FDA subject matter experts, to make sure we did this right.”

Goetz was hopeful that the efforts to engage developers will bear fruit by the next Health Datapalooza in 2015.

“Part of this is trying to line up the constituencies, and trying to work the networks here at the Datapalooza, at like-minded organizations like Code for America, and start getting people to know about the existence of the data,” he said.

“We’ve been able to build the APIs with an eye towards the consumer and an eye towards the eventual use around the country.”

Iodine’s work with the FDA and the agency’s data is worth following closely: search through the company’s gorgeous redesign of drug labels and think about how that might apply to the forthcoming dataset and API for labels on openFDA.

Digitizing in the cloud

There’s more to the story of how the FDA unlocked millions of files than the public may realize, however, including a genuinely innovative, important approach to how the agency broke the backlog of thousands of paper reports that had built up. That’s where Captricity is part of the openFDA backstory (PDF).

“There was previously a totally arcane paper-based process,” said Chen. “It was so backed up that they had to warn the general audience about it. There was a big backlog and a need to fix it, and the FDA did it in less than a year, which in the government is light speed.”

While the vast majority of the reports that the FDA receives every year are electronic, more than 10% of them still are submitted using paper records or faxes. The FDA scans those reports in for later data entry, and it was that portion that was backlogged, with an estimated 50,000+ behind as of June 2013.

“Last summer, the FDA contacted us,” said Chen. “Last fall, we got through the hurdles of procurement. We helped them for a concentrated spell, and they caught up.”

In the end, Captricity helped to clear up the FDA’s paper jam over three months using its innovative approach, which combines Amazon’s Mechanical Turk and Web Services. According to Chen, the startup cleared about 5,000 of the total backlogged adverse events reports, with the existing data entry vendor (which appears to be Virginia-based DSI) stepping up its performance to convert the bulk of it.

The interaction is a perfect microcosm of the pressure that new, cloud-native digital startups will put on traditional systems integrators and industry incumbents, like data entry firms, to match their pricing, quality, and speed. In some cases, incumbents won’t be able to compete. In others, enterprise and government technology vendors will adapt, retool and reform. Those that do not will fail, as is the case elsewhere in the business world, although in D.C., they may use the regulatory or the lobbying process to try to protect their businesses.

“There’s a lot of numbers that have been thrown around,” said Chen, “but according to one estimate at the FDA, we were one eighth of the cost. I would feel safe saying less than half. It’s hard to estimate ‘all in,’ in terms of cost. In terms of speed, this is cloud-scale. We’re a cloud-native web service. We deliver in near real-time — in hours, not days, depending on volume. With Captricity, there would never be a backlog again.”

Although he couldn’t comment specifically, Chen said that the FDA has not yet contracted with Captricity to provide more digitization services in 2014.

There’s already an app for that

The data released through openFDA is already being used in existing services, like Epidemico’s MedWatcher, which has integrated adverse events data into its service.

The first dedicated web app to tap into openFDA’s social health insights is already online as well, at openFDA Search. It’s no accident that it was out of the gate so quickly: Brian Norris, the CEO of Social Health Insights, the company behind it, was part of the small group that the FDA invited into the private beta to test the API before launch.

“I looked at [openFDA] and thought, if I’m going to interact with the data, and I’m not a techie to go use the API, what would I want?” asked Norris, when we spoke. “I want a search engine to put in certain parameters, and get back a list of adverse events. I am sure it will spawn and grow, but over the past two weeks, we built this. We got some good feedback this week and added some features already, like linking to other datasets as well, like MedlinePlus, DailyMed, and ClinicalTrials.gov. The idea was to make it easier to learn.”

OpenFDA Search is a great start: expect more services and apps to ingest the data and follow. For instance, if you search for aspirin on Google, you’ll see government information on the top right. Recalls and adverse events might be integrated in the future.

One of the biggest opportunities has yet to be delivered upon: a mobile app that enables consumers to check adverse events and recalls at the point of shopping and sale. The strategy the FDA has chosen to pursue in bringing that app to market is to act as an open data platform for third-party developers to innovate upon, as opposed to contracting with a firm to create a mobile app. (That kind of thinking is part of the missed opportunity of HealthCare.gov.)

“We don’t specialize in making consumer-facing iPhone or Android apps,” said Herron. “If we can provide data we collect on a daily basis in good detail to someone who does, it allows us to extend influence further.”

The time to start the clock for these apps and services to emerge will come later this summer, when all three datasets are online.

“The labels are what I’m excited about,” said Herron. “They include a bar code and image of the product. That dataset is what you can use to tie together adverse events and recalls.”

The openFDA team found that a universal product code is what all consumers want. As a result, they’re now going through and automating the extraction of UPCs from the images of labels so they’ll be part of the dataset.

“With this combination of APIs, you could enable someone to scan a bar code in a grocery store and see all of the recalls for a product,” mused Herron. “We also see huge potential for point of sale, where pharmacists might scan a box, perhaps using Google Glass. They might see an alert that a label has changed, and check to make sure it’s updated, or [to see if there’s] a recall alert not to stock it. We will connect the heartbeat of the FDA between adverse events and the label.”
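The bar code scenario amounts to a join across datasets: the scanned UPC identifies a label, and the label's product identifier links to recalls. Here is a minimal sketch of that lookup over in-memory data. The field names (`upcs`, `product_id`, `reason`) and the sample values are hypothetical, since the labels and recalls APIs were not yet published when this was written.

```python
def recalls_for_upc(upc, labels, recalls):
    """Return the recalls linked to any product label that carries this UPC."""
    # Step 1: find every product whose label lists the scanned code.
    product_ids = {label["product_id"] for label in labels if upc in label["upcs"]}
    # Step 2: keep only the recalls that point at one of those products.
    return [recall for recall in recalls if recall["product_id"] in product_ids]


# Hypothetical sample data standing in for the labels and recalls datasets.
labels = [
    {"product_id": "p1", "upcs": ["000000000017"]},
    {"product_id": "p2", "upcs": ["000000000024"]},
]
recalls = [
    {"product_id": "p1", "reason": "example recall reason"},
    {"product_id": "p3", "reason": "unrelated product"},
]

matches = recalls_for_upc("000000000017", labels, recalls)
print(matches)
```

A scanning app would swap the in-memory lists for API calls, but the two-step lookup would stay the same.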

Knowledge is power, to paraphrase Francis Bacon’s centuries-old truism. When put into the hands of a consumer or a pharmacist, knowledge of adverse effects and food or drug recalls could be a powerful antidote to ignorance, misinformation, and marketing, while saving lives and powering new businesses along the way.