Data quality: The ugly duckling of big data?

Mary Shacklett explores the realities of dirty data and the challenges of cleaning it, especially when that might mean delays in work.


Image: Wikimedia Commons/University of Liverpool Faculty of Health

TEKsystems (a subsidiary of Allegis Group, a private talent management firm) performed a big data survey in 2013 that revealed 60 percent of IT leaders believed their organizations lacked accountability for data quality, and more than 50 percent of IT leaders questioned the validity of their data.

A big data cloud vendor told me in early 2014 that it routinely runs its product on client-uploaded data that is less than clean. To upload data to the cloud for big data analytics, the client is shown a screen of data fields, highlights the fields to be uploaded, and presses an Upload button. Data from the input file is then matched to the fields the client has selected for analytics. Within minutes, the client receives a full set of analytics, with both summary charts and drilldown capabilities. In the process, however, the client might not get everything requested. The primary reason is that there are invariably requested fields that the furnished data cannot fill. The situation is symptomatic of data that is not "clean enough" to fully populate the requested fields for an analytics query.
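To make the idea of "requested fields the data cannot fill" concrete, here is a minimal sketch in Python. It is not the vendor's product or method; the function names, field names and sample records are all hypothetical, and it assumes a field counts as "filled" whenever its value is present and not blank.

```python
def is_filled(value):
    """Treat None, empty strings and whitespace-only strings as unfilled (an assumption)."""
    return value is not None and str(value).strip() != ""


def field_fill_rates(records, requested_fields):
    """Return, for each requested field, the share of records that can fill it."""
    records = list(records)
    rates = {}
    for field in requested_fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        rates[field] = filled / len(records) if records else 0.0
    return rates


# Hypothetical uploaded records: one blank region, one missing region,
# one whitespace-only order total, and no signup_date anywhere.
records = [
    {"customer_id": "1001", "region": "West", "order_total": "250.00"},
    {"customer_id": "1002", "region": "", "order_total": "99.50"},
    {"customer_id": "1003", "order_total": " "},
]
requested = ["customer_id", "region", "order_total", "signup_date"]

rates = field_fill_rates(records, requested)
unfillable = [field for field, rate in rates.items() if rate == 0.0]

for field in requested:
    print(f"{field}: {rates[field]:.0%} of records filled")
print("Fields the data cannot fill at all:", unfillable)
```

A report like this does not clean anything. It only shows which requested fields can be populated, and how fully, before the analytics run.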

"The old concept of garbage in, garbage out (GIGO) still reigns," acknowledged the cloud provider. "And when we provide analytics to our clients, we don't pretend to have a magic bullet that can clean everything in their data that needs to be cleaned. But the value proposition we provide to companies that have difficulty cleaning all of their data is that at least they can begin to receive some value from big data that will benefit them. In other words, these clients are not in a 'yes or no' situation when it comes to having all of their data clean as a prerequisite before they can start using big data analytics. There is still value to be had from data that is not 100 percent clean."

To a data purist, the philosophy at first glance is difficult to accept. On the other hand, what the TEKsystems study revealed is reality for many organizations.

"'There's no need to clean the data--just extract it from the source systems and load it into the warehouse. Our data is already clean…' I wonder how many times some poor extract, transform, load (ETL) consultant has heard those words, smiled diplomatically and then scanned the room for a sturdy wall to bang their head against repeatedly," observed Andy Hogg, a Microsoft SQL consultant. Hogg goes on to say that while it seems straightforward to just pull data from source systems, the trouble starts when all of this multifarious data is amalgamated into the vast numbers of records needed for analytics: this is where "the dirt really shows."

Dirty data is also an organizational challenge for companies. You can "job out" ETL tasks to data cleaners, but nobody knows your own "dirt" better than you do, because it's hard to get data squeaky clean without knowing what it really should look like within the context of your business. This is where data cleaning can become a painstakingly manual task. There is also the difficulty of recruiting a business leader with enough organizational authority to assume responsibility for this "back office" task. At the end of the day, cleaning data can be hard to justify on ROI grounds, because you have yet to see what clean data is going to deliver to your analytics and what the analytics will deliver to your business.

There are certainly business cases that support clean data, and big data analytics providers that want their clients to focus on it, but in the face of so many other organizational demands on C-level executives and those reporting to them, getting to "clean" is seldom a practical pursuit.

This brings me full circle to the original remarks of the cloud provider: that it is better for companies to "get going" with big data than to delay work until their data inputs are pristine. As a proponent of data preparation and cleansing, I railed against this point of view when I first heard it. Nevertheless, the approach seems sensible, given the realities of today's heavy workloads. Some companies have adopted it, and while their analytics don't answer every question they want to ask, it is a start.

What's your take?

Do you hold back work until your data inputs are sparkly and dirt-free? Share your philosophy about data quality in the comments.