Poor coronavirus data quality in states like Georgia and Florida holds lessons for how we use data analytics.
News has surfaced that states are using different systems for reporting COVID-19 infections. The most glaring examples of suspect data have occurred in Georgia, where officials apologized for publishing erroneous data suggesting that cases in the state were decreasing, and in Florida, where the accuracy of the data has also come under scrutiny.
Data flaws in pandemic reporting are nothing new.
In 1968-69, the Hong Kong flu was infecting the world. The World Health Organization (WHO) was used by countries as a global reporting source. Compared with now, the advantage of unifying behind WHO was that everyone around the world was using the same data. The disadvantage in terms of data truth was that data analytics in 1968 was virtually non-existent.
During the late 1960s Hong Kong flu epidemic, the WHO was forced to rely on daily output from one English-language and four Chinese-language newspapers that were published in Hong Kong, where the disease was believed to have originated. There were no ways to scan this unstructured big data into systems and analyze it. Instead, data from media sources was manually reviewed and then entered into the WHO's weekly Epidemiological Record for use around the world. The method was hardly scientific, but it was all we had in 1968.
Fifty years later in an era of big data and analytics, we are still struggling with data accuracy in pandemics.
At its best, the COVID-19 data that is being gathered is "best effort" but may not be comprehensive enough to give us an accurate picture of what is going on. At its worst, the data may be getting weaponized by government agencies to advance their own agendas. Either way, the lack of consistency from state to state in how data is gathered and analyzed is a disaster if we want data that expresses a single version of the truth. Within each state, there are also concerns about how accurate the data really is.
The good news is that efforts are being made to improve data quality so we can better combat the pandemic.
After significant pressure, Arizona agreed to release COVID-19 nursing home data, and New York agreed to release COVID-19 data that was broken down by zip code.
In King County, Washington, which includes Seattle, a Bill Gates-funded partnership is tracking the results from COVID-19 home test kits. The goal is to give local health officials a clearer understanding of how far COVID-19 has penetrated the community, including cases that previously would have gone unreported but are now visible because of home testing.
All are efforts at greater data transparency and accuracy. All hold lessons for those in charge of analytics in their organizations, who face similar challenges in reaching a single version of data truth, with assurance that the data being worked with is as accurate as it can be.
Here are three ways we can pursue these elusive goals.
1. Aim for a single version of the data
Data throughout enterprises is as dissimilar and diverse as the data that states are struggling to manage during the coronavirus pandemic. There are data normalization, cleaning, and screening techniques that can be applied to data before it's entered into analytics data repositories. These techniques should be used. In short, the data repository should be regarded as a kind of data "clean room," with all of the garbage eliminated before any data is admitted to the repository.
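As a rough illustration of what normalization and screening can look like before data reaches a repository, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (state, date, cases), the alias table, the accepted date formats, and the rule that a repeated state and date pair counts as a duplicate. A real pipeline would define these from its own sources.

```python
from datetime import datetime

# Hypothetical alias table: maps source-specific spellings to one canonical code.
STATE_ALIASES = {"georgia": "GA", "ga": "GA", "florida": "FL", "fla.": "FL", "fl": "FL"}

# Hypothetical set of date layouts that the different sources are assumed to use.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_record(raw):
    """Return a cleaned record, or None if the record fails screening."""
    state = STATE_ALIASES.get(str(raw.get("state", "")).strip().lower())
    if state is None:
        return None  # unknown or missing state

    report_date = None
    for fmt in DATE_FORMATS:
        try:
            report_date = datetime.strptime(str(raw.get("date", "")).strip(), fmt).date()
            break
        except ValueError:
            continue
    if report_date is None:
        return None  # date is missing or in an unrecognized layout

    try:
        cases = int(str(raw.get("cases", "")).replace(",", "").strip())
    except ValueError:
        return None  # count is missing or not a whole number
    if cases < 0:
        return None  # negative counts are treated as garbage

    return {"state": state, "date": report_date.isoformat(), "cases": cases}


def clean(records):
    """Split raw records into (accepted, rejected), dropping duplicates."""
    seen = set()
    accepted, rejected = [], []
    for raw in records:
        rec = normalize_record(raw)
        key = (rec["state"], rec["date"]) if rec else None
        if rec is None or key in seen:
            rejected.append(raw)  # kept aside for review, not silently lost
            continue
        seen.add(key)
        accepted.append(rec)
    return accepted, rejected


if __name__ == "__main__":
    sample = [
        {"state": "Georgia", "date": "05/11/2020", "cases": "1,234"},
        {"state": "GA", "date": "2020-05-11", "cases": "1234"},   # duplicate after normalization
        {"state": "Fla.", "date": "2020-05-11", "cases": "-5"},   # fails screening
    ]
    accepted, rejected = clean(sample)
    print(accepted)
    print(len(rejected), "records held back for review")
```

Rejected records are returned rather than discarded, so someone can inspect why a source is producing data that does not pass screening.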
2. Eliminate data bias
Data bias occurs when the data being analyzed is not representative of the population or phenomenon being studied. Sometimes the data collected is insufficient. In other cases, companies, agencies, or individuals want to see a certain outcome from the data and deliberately or inadvertently inject bias. In all cases, biased data skews the results of the analytics that operate on it and is likely to compromise the quality of the business decisions based on it. No organization can afford this when it is seeking answers to the real situations that it faces.
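One simple way to spot data that is not representative is to compare each group's share of the collected data with its known share of the population. The sketch below assumes you already have trusted population shares to compare against; the group names, the numbers, and the 5-percentage-point tolerance are all hypothetical. It flags only one kind of bias (over- or under-representation) and is not a substitute for a proper statistical review.

```python
def representation_gaps(sample_counts, population_shares, tolerance=0.05):
    """Return groups whose share of the sample differs from their share of
    the population by more than `tolerance` (as a fraction, so 0.05 = 5 points).

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")

    gaps = {}
    for group, expected_share in population_shares.items():
        observed_share = sample_counts.get(group, 0) / total
        diff = observed_share - expected_share
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    tests_collected = {"urban": 800, "suburban": 150, "rural": 50}
    population = {"urban": 0.55, "suburban": 0.25, "rural": 0.20}
    print(representation_gaps(tests_collected, population))
```

A group that appears in the population table but not in the sample at all is treated as having a zero share, which is often the most important gap to surface.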
3. Think about elements that your data might be missing
There is always the chance that you're missing some important data sources that would add variety to your data. A good example is local hospitals and healthcare clinics. They track the diseases and treatment outcomes for patients in their systems, but if they had a more comprehensive understanding of underserved communities in their areas, or even of disease profiles, treatment choices, and outcomes elsewhere in the world, their scientific and therapeutic capabilities would improve.