Gathering data at the speed of life can make it hard to discern real information from a large amount of input. One data modeling and mapping project was able to make it work.
Finding a single version of the truth on the epidemiology of COVID-19 has proven elusive during this pandemic. There is no national case registry or medical inventory database. The epidemiological forecasting models used by federal and state governments, such as SIR (Susceptible-Infectious-Recovered) and the model from IHME (Institute for Health Metrics and Evaluation), lack reliable data. Public officials clearly need help discerning and navigating health and economic risks.
"I manage four different data labs throughout the world, and for the first few weeks of COVID-19, we were scrambling," said Eric Haller, executive vice president and global head of Experian DataLabs, which provides advanced data analytics and research. "We had to learn how to shelter in place and to work remotely, but we were driven by a huge sense of responsibility to help government and healthcare providers sort through the data so we could make progress on the pandemic."
The goal of lab efforts was to develop reliable data that could pinpoint and predict virus hot spots.
"Our process took about six weeks to build a core map that tracked COVID-19 outbreaks and responses," Haller said. "We wanted to be able to provide the information to governments and healthcare so they could identify the hot spots and where they needed to double down with efforts for hard-hit communities."
Data streams analyzed
Haller said there were three primary data streams that the analytics looked at.
The first was disease spread as represented by the number of cases and the number of deaths. A second data stream provided co-morbidity rates. For those patients who died during a COVID-19 episode, how many had pre-existing conditions that made them especially vulnerable, such as heart disease or asthma?
"From the correlations of this data, we began to develop a health risk score on a county-by-county basis," Haller said.
A third data stream looked at social determinants and their effect on COVID-19 spread. How many patients had mobility, such as ready access to public transit? How dense was the housing in the areas where these individuals lived?
The team also looked at demographics, such as which age groups were the most vulnerable.
"What we did was blend all three data models into a master model for over 3,000 counties," Haller said. "This made it simple for users to drill down into any particular county that they wanted to in order to see more specific data."
Haller's teams also made creative use of unstructured data such as maps and photos, for example deducing housing density from aerial maps.
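The article does not describe how the three streams were actually combined, so the following is only a minimal sketch of one common approach: rescale each stream to a 0-to-1 range and take a weighted sum per county. The `CountyInputs` fields, the `WEIGHTS` values, the county labels, and all of the numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CountyInputs:
    county: str
    spread: float        # e.g. cases or deaths per 100,000 residents
    comorbidity: float   # e.g. share of residents with a pre-existing condition
    social: float        # e.g. a housing-density or transit-access index


# Hypothetical weights; the article does not say how the streams were weighted.
WEIGHTS = {"spread": 0.5, "comorbidity": 0.3, "social": 0.2}


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def blend(counties):
    """Return {county: composite score in [0, 1]} as a weighted sum of scaled streams."""
    scaled = {
        name: min_max_scale([getattr(c, name) for c in counties])
        for name in WEIGHTS
    }
    return {
        c.county: sum(WEIGHTS[name] * scaled[name][i] for name in WEIGHTS)
        for i, c in enumerate(counties)
    }


# Made-up inputs for three placeholder counties.
counties = [
    CountyInputs("county_a", 120.0, 0.30, 0.8),
    CountyInputs("county_b", 40.0, 0.10, 0.2),
    CountyInputs("county_c", 80.0, 0.20, 0.5),
]
scores = blend(counties)
for county, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{county}: {score:.2f}")
```

Because each stream is scaled before weighting, a composite score like this only ranks counties relative to one another within the same data set; it is not an absolute measure of risk.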
For those responsible for data modeling and analytics development, there are three key takeaway points from this project:
1. Obtaining quality data is harder than data modeling
"When we compiled data from different states and localities, there were inconsistencies in data that we had to reconcile," Haller said. "For instance, in New York State, they were reporting the number of COVID-19 deaths but also the number of 'probable' COVID-19 deaths. Some of this data was subjective, and we didn't have a method to scrub that data."
2. Using big data is good if you can eliminate the noise
For an item such as population density, the analytics team used available GPS data, but mapping was still inconsistent because GPS data continuously changes. "When there were questions, we had to use our own perspective to determine what was happening," Haller said.
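Both data-quality takeaways come down to keeping uncertain values visible rather than silently merging them. As a minimal sketch of the reconciliation problem from the first takeaway, assume that some localities report a single death count while others report confirmed and probable deaths separately. The field names (`deaths`, `confirmed_deaths`, `probable_deaths`), the region labels, and the figures below are hypothetical, not Experian's schema or real data.

```python
def normalize_record(raw):
    """Map one locality's report onto a shared schema without merging uncertain counts."""
    # Some sources use a single "deaths" field; treat it as the confirmed count.
    confirmed = raw.get("confirmed_deaths", raw.get("deaths"))
    probable = raw.get("probable_deaths")  # None when the locality does not report it
    if confirmed is None:
        upper_bound = None
    else:
        upper_bound = confirmed + (probable or 0)
    return {
        "region": raw["region"],
        "confirmed_deaths": confirmed,
        "probable_deaths": probable,
        "deaths_upper_bound": upper_bound,
        "reports_probable": probable is not None,
    }


# Made-up reports in two different shapes.
raw_reports = [
    {"region": "region_x", "deaths": 10},
    {"region": "region_y", "confirmed_deaths": 100, "probable_deaths": 25},
]
for row in map(normalize_record, raw_reports):
    print(row)
```

Keeping the probable count in its own column, with a flag for whether it was reported at all, leaves the subjective figure available for later review instead of baking it into the totals.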
3. The project can move faster than you think
"We found that we could quickly adjust to having to work and collaborate remotely. The seriousness of the situation also helped us to move faster than we might have in a non-emergency mode," Haller said. "When you work under emergency conditions like these, the smaller issues that can disrupt projects tend to disappear."