
This open source tool from MIT Media Lab will change how you see big data

Data USA is an atlas that helps businesses, government agencies, and schools make better, faster decisions by visualizing big data in fresh ways.

NYC data, visualized. | Image: Data USA

In the early days of big data, "everyone scrambled to collect and store as much data as they could," said Datawheel co-founder Dave Landry. "In most cases, they didn't develop the tools needed to better understand that data. That's the challenge we are trying to tackle."

The rise of the mobile web, IoT, APIs, and modern databases paved the way for big data innovations. Nearly everything can now be quantified, and those who scraped and logged data early often benefited from first-mover advantage. By making information easier to access and visualize, Landry said, big data can help businesses make faster and more intelligent decisions.

Data USA is a "visual atlas, with a lot of modern web features," Landry said. It was developed by Datawheel, along with its sibling products DataViva, D3plus, and the Observatory of Economic Complexity (OEC). The site is the product of several years of data visualization work by Landry and his partners. It lets users drill down into a diverse range of datasets (census data, demographic information, local salaries, and educational background) and visualize the information in useful charts and maps.


In 2010, along with Cesar Hidalgo of the MIT Media Lab and Datawheel co-founder Alex Simoes, Landry collaborated on a project that visualized the UN's Human Development Index. When the project was complete, Simoes went on to study with Hidalgo at the MIT Media Lab. The duo produced the OEC, as well as the visualization and API framework for Data USA.

The product was popular, and it drew attention from government agencies and the private sector. "[Organizations] have been trying to figure out how to visualize their data," Landry said. "We started Datawheel together to solve the problem of intelligently visualizing data that was previously hidden in databases [and] spreadsheets. The Data USA project was born when Deloitte came to us asking for help using various US open datasets."

The tool "aggregates seven different United States open datasets into one central API and visualization platform," Landry explained. "This enables intelligent crosswalks and interconnectivity that was previously hard to do." The aggregated datasets help governments and businesses make better decisions, he said, by providing timely information. "It also helps illustrate to the general public what data is collected by various government agencies."

The team scrubs and normalizes data using internal Python scripts, with the Pandas library handling mathematical transformations. Data is stored in PostgreSQL, and the back-end API is a Python application built on the Flask framework. Jinja templates power the front end. Landry said data visualizations are built on D3plus, the team's homebrew engine.
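
To give a rough sense of what scrubbing and normalizing data with Pandas can look like, here is a minimal sketch. It is not Datawheel's code: the column names, sample rows, the scrub function, and the choice of min-max scaling as the mathematical transformation are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw extract; the columns and values are made up for this sketch.
raw = pd.DataFrame({
    "geo": [" new york, ny ", "Boston, MA", "Boston, MA", "Chicago, IL"],
    "median_salary": ["61,000", "58,500", "58,500", None],
    "population": [8_400_000, 650_000, 650_000, 2_700_000],
})


def scrub(df: pd.DataFrame) -> pd.DataFrame:
    """Clean an illustrative table: tidy labels, parse numbers, drop bad rows."""
    out = df.copy()

    # Trim stray whitespace and use one casing so identical places compare equal.
    out["geo"] = out["geo"].str.strip().str.upper()

    # Remove thousands separators, then parse; unparseable values become NaN.
    out["median_salary"] = pd.to_numeric(
        out["median_salary"].str.replace(",", "", regex=False),
        errors="coerce",
    )

    # Drop exact duplicate rows and rows with no salary figure.
    out = out.drop_duplicates().dropna(subset=["median_salary"])

    # Assumed transformation: min-max scale population into the range [0, 1].
    pop = out["population"]
    out["population_scaled"] = (pop - pop.min()) / (pop.max() - pop.min())

    return out.reset_index(drop=True)


clean = scrub(raw)
print(clean)
```

In a pipeline like the one described, a cleaned table of this kind would then be loaded into PostgreSQL and served through the API; those steps are left out of the sketch.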

NYC Occupations by share. | Image: Data USA

The result is an accessible, malleable, and functional engine being tested by investors, companies, and academic institutions. "We've been contacted by librarians wanting to list the tool as a resource for research and students," Landry said. "We've also been seeing some local governments embedding the visualizations on their own sites."

The product has been particularly adept at helping government agencies improve how they collect data. One challenge of data collection is verifying the accuracy of the information. The Data USA tool, Landry said, has helped address this problem by making it easier to cross-check freshly collected data against established, similar datasets. Visualizing data, he said, lets the team identify problems in its collection and storage.
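
The kind of cross-check described here can be sketched in a few lines of Python. This is only an illustration of the idea, not how Data USA implements it: the cross_check function, the 10 percent tolerance, and the county labels and figures are all hypothetical.

```python
def cross_check(fresh, reference, tolerance=0.10):
    """Flag fresh values that have no reference or stray too far from it.

    Returns a dict mapping each flagged key to (fresh_value, reference_value).
    The tolerance is a relative threshold chosen arbitrarily for this sketch.
    """
    flagged = {}
    for key, value in fresh.items():
        expected = reference.get(key)
        if expected is None:
            # Nothing to compare against, so the record needs a manual look.
            flagged[key] = (value, None)
        elif abs(value - expected) > tolerance * abs(expected):
            # Deviation exceeds the allowed share of the reference value.
            flagged[key] = (value, expected)
    return flagged


# Made-up figures keyed by hypothetical region labels.
fresh = {"county_a": 1000, "county_b": 910, "county_c": 5150, "county_d": 1200}
reference = {"county_a": 1020, "county_b": 800, "county_c": 5200}

print(cross_check(fresh, reference))
```

Records flagged this way are candidates for review rather than proven errors, since the established dataset may itself be out of date.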

Landry added that in the near future, powerful open source tools like Data USA will be available at little or no cost to businesses, agencies, and educational institutions. "The more data we can access, the more data we can present visually and the more insights we can provide about everyday life."


About Dan Patterson

Dan is a Senior Writer for TechRepublic. He covers cybersecurity and the intersection of technology, politics and government.
