Data scientists spend as much as 80% of their time cleaning and preparing big data. Here's how that can change.
The phrase "garbage in, garbage out" dates back to the 1950s and 1960s, the earliest days of computing. It was quickly abbreviated to "GIGO" and represented the idea that if you used poor data for your reporting and computing, the output wasn't going to be worth much, either.
Now, 60 years later, GIGO remains a significant adverse force on data quality. I saw it first-hand last month when I toured a manufacturing company and passed a room full of what looked like data entry operators.
The CIO explained that the company was preparing to launch a new system but first needed to repair and fix all of the data. The only way they could do that, he said, was through manual data entry "from the folks who really know what the data should look like."
I'd like to think a better way exists to improve data quality, but industry reports tell us that even highly compensated data scientists spend as much as 80% of their time cleaning and preparing big data.
This isn't good news for data scientists, nor for the companies that employ them.
Big data health
"One thing we find when we talk with companies is that they don't understand how big the data health problem really is," said Katie Horvath, CEO of Naveego, which provides data quality tools.
Horvath referred to an IBM study that found that while fixing data cost only $10 per record, leaving it unfixed cost companies $100 per record in bad reporting and decisions. "This quickly becomes a serious situation when as much as 47% of all data records have problems in them," she said.
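To see how quickly those per-record figures add up, here is a minimal back-of-the-envelope sketch in Python. The $10, $100, and 47% values come from the figures quoted above. The one-million-record dataset and the function name `bad_data_costs` are assumptions made up for illustration, and 47% is treated as a worst case, not a typical rate.

```python
# Figures quoted in the article: $10 to fix a record, $100 if it goes unfixed,
# and as much as 47% of records affected (used here as a worst case).
FIX_COST_PER_RECORD = 10
UNFIXED_COST_PER_RECORD = 100
MAX_FLAWED_PERCENT = 47


def bad_data_costs(total_records, flawed_percent=MAX_FLAWED_PERCENT):
    """Return (flawed_records, cost_to_fix, cost_if_unfixed) in whole dollars."""
    flawed = total_records * flawed_percent // 100
    return flawed, flawed * FIX_COST_PER_RECORD, flawed * UNFIXED_COST_PER_RECORD


# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
flawed, fix_cost, unfixed_cost = bad_data_costs(1_000_000)
print(f"{flawed:,} flawed records: ${fix_cost:,} to fix vs. ${unfixed_cost:,} if left alone")
```

Under these assumptions, 470,000 flawed records would cost $4.7 million to fix and $47 million to ignore.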
A measurable example of the real cost of bad data occurs in medical appointment scheduling, where an estimated $150 billion is lost annually because of missed appointments due to data errors, or to a lack of analytics that identifies patients who are most likely to miss appointments.
Fixing your data quality
Below are three steps your company can take to improve the quality of its data.
1. Understand what you want from your data
Some data is more important than other data. A first priority should be a visit with key decision makers in your organization to determine which data is most crucial. Some of that data will be in structured records, and some will be unstructured, or big data.
2. Standardize your data
Companies run dozens of systems. These systems need to talk to each other, but every vendor names data differently. Most companies want a master database that consolidates data in one place. To build one, you need to locate every name by which a single piece of data is known, agree on one standard name for it, and then link all of the variant names to that standard name. This is the only way to ensure that every system user is relating to the same data in the same way.
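The linking step described above can be sketched in a few lines of Python. This is only an illustration: the field names (`cust_id`, `CustomerNo`, `zip`, and so on), the sample record, and the helpers `build_lookup` and `standardize` are all hypothetical, and a real master data effort would keep the mapping in a governed catalog, not in code.

```python
# Hypothetical variant names from different systems, grouped under the
# standard name the organization has agreed on.
ALIASES = {
    "customer_id": ["cust_id", "CustomerNo", "client_number"],
    "postal_code": ["zip", "ZipCode", "post_code"],
}


def build_lookup(aliases):
    """Invert {standard: [variants]} into {lowercased variant: standard}."""
    lookup = {}
    for standard, variants in aliases.items():
        for name in [standard, *variants]:
            key = name.lower()
            # Refuse ambiguous mappings instead of silently picking one.
            if key in lookup and lookup[key] != standard:
                raise ValueError(f"{name!r} maps to two standard names")
            lookup[key] = standard
    return lookup


def standardize(record, lookup):
    """Rename known fields to their standard name; leave unknown fields untouched."""
    return {lookup.get(field.lower(), field): value for field, value in record.items()}


LOOKUP = build_lookup(ALIASES)
crm_row = {"CustomerNo": 1042, "zip": "49684", "notes": "prefers email"}
print(standardize(crm_row, LOOKUP))
```

Matching is case-insensitive here, which is a design choice and not a requirement. Fields with no agreed standard name pass through unchanged so that nothing is lost while the mapping is still being built.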
3. Find an automated toolset for your data prep work
Whether you are cleaning broken data or standardizing data so that every user understands it the same way, these tasks are too daunting to be done manually. The best way to tackle them is with automated tools (and there are many available) that apply the data cleaning rulesets you define, so that the tools do the cleaning and standardization for you.
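To make the idea of a user-defined ruleset concrete, here is a minimal sketch in plain Python. It is not how any particular commercial tool works. The field names, the three rules, and the `clean` function are assumptions chosen for illustration.

```python
import re


# Illustrative cleaning rules; each takes a string and returns a cleaned string.
def strip_whitespace(value):
    return value.strip()


def title_case(value):
    return value.title()


def digits_only(value):
    return re.sub(r"\D", "", value)


# The ruleset: (field name, rule) pairs, applied in order.
RULES = [
    ("name", strip_whitespace),
    ("name", title_case),
    ("phone", digits_only),
]


def clean(record, rules=RULES):
    """Apply each rule in order to a copy of the record.

    Fields that are missing or not strings are left as they are.
    """
    cleaned = dict(record)
    for field, rule in rules:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = rule(value)
    return cleaned


print(clean({"name": "  jane DOE ", "phone": "(231) 555-0100", "plant": 7}))
```

Keeping the rules in a list separate from the code that applies them means the people who know what the data should look like can review and extend the ruleset without touching the logic that runs it.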