TEKsystems (a subsidiary of Allegis Group, a private talent management firm) conducted a big data survey in 2013 that revealed that 60 percent of IT leaders believed their organizations lacked accountability for data quality, and that more than 50 percent of IT leaders questioned the validity of their data.

A big data cloud vendor told me in early 2014 that it routinely ran its product on client data that was less than clean. To upload data to the cloud for big data analytics, the client is presented with a screen of available data fields, highlights the fields to upload, and presses an Upload button. Data from the input file is then matched against the analytics fields the client has selected. Within minutes, the client receives a full set of analytics with both summary charts and drilldown capabilities. In the process, however, the client might not get everything requested: invariably there are fields the client asked for that the furnished data cannot fill. The situation is symptomatic of data that is not “clean enough” to fully populate the requested fields for an analytics query.
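To make that field-matching step concrete, here is a minimal sketch of such an upload check, assuming a CSV upload and hypothetical field names. It is not the vendor’s actual implementation, only an illustration of how requested analytics fields can go unfilled when the furnished data is incomplete.

```python
import csv

def match_fields(upload_path, requested_fields):
    """Report which requested analytics fields the uploaded file can and cannot fill."""
    with open(upload_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        available = set(reader.fieldnames or [])

    usable, unfillable = [], []
    for field in requested_fields:
        # A field only counts as fillable if the column exists and is not entirely blank.
        values = [(row.get(field) or "").strip() for row in rows]
        if field in available and any(values):
            usable.append(field)
        else:
            unfillable.append(field)
    return usable, unfillable

# Hypothetical file and field names, for illustration only.
usable, missing = match_fields("client_upload.csv",
                               ["customer_id", "region", "lifetime_value"])
print("Analytics will be generated for:", usable)
print("Requested but unfillable from this data:", missing)
```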

“The old concept of garbage in, garbage out (GIGO) still reigns,” acknowledged the cloud provider, “and when we provide analytics to our clients,
we don’t pretend to have a magic bullet that can clean everything in their data
that needs to be cleaned. But the value proposition we provide to companies
that have difficulty cleaning all of their data is that at least they can begin
to receive some value from big data that will benefit them. In other words, these
clients are not in a ‘yes or no’ situation when it comes to having all of their
data clean as a prerequisite before they can start using big data analytics. There
is still value to be had from data that is not 100 percent clean.”

To a data purist, this philosophy is difficult to accept at first glance. On the other hand, what the TEKsystems study revealed is reality for many organizations.

“There’s no need to clean the data–just extract it from the source systems and load it into the warehouse. Our
data is already clean… I wonder how many times some poor extract,
transform, load (ETL)
consultant has heard those words, smiled diplomatically and
then scanned the room for a sturdy wall to bang their head against repeatedly,” observed Andy Hogg, a Microsoft SQL consultant. Hogg goes on to say that while it seems straightforward to simply pull data from source systems, it is when all of this multifarious data is amalgamated into the vast numbers of records needed for analytics that “the dirt really shows.”
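As a purely illustrative sketch of what Hogg is describing, consider two hypothetical source systems that each look reasonable on their own; the conflicts only become visible once the records are amalgamated (the names, formats, and matching rule below are invented for the example):

```python
# Toy records from two hypothetical source systems.
crm_rows = [
    {"customer": "ACME Corp", "signup": "2013-06-01", "revenue": "1200"},
    {"customer": "Beta LLC",  "signup": "2013-07-15", "revenue": "950"},
]
billing_rows = [
    {"customer": "Acme Corp.", "signup": "06/01/2013", "revenue": "1,200.00"},
    {"customer": "BETA LLC",   "signup": "",           "revenue": "950"},
]

def normalize(name):
    # Naive key normalization; real ETL work needs far more careful matching.
    return name.lower().rstrip(".").strip()

combined = {}
for row in crm_rows + billing_rows:
    combined.setdefault(normalize(row["customer"]), []).append(row)

for key, rows in combined.items():
    # The same customer shows up with conflicting date and number formats:
    # the "dirt" only becomes obvious once the sources are merged.
    if len({r["signup"] for r in rows}) > 1 or len({r["revenue"] for r in rows}) > 1:
        print(f"Conflicting records for {key!r}:", rows)
```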

Dirty data is also an organizational challenge for companies. You can “job out” ETL tasks to data cleaners, but nobody knows your “dirt” better than you do, because it is hard to get data squeaky clean without knowing what it should look like within the context of your business. This is where data cleaning can become a painstakingly manual task. There is also the difficulty of recruiting a business leader with enough organizational authority to assume responsibility for this “back office” task. At the end of the day, cleaning data can be hard to justify on ROI grounds, because you have yet to see what clean data will deliver to your analytics, and what the analytics will deliver to your business.
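To give one minimal, hypothetical example of why business context matters: nothing in the raw data says that an order cannot predate the customer relationship; only someone who knows the business would write that rule (the field names and dates below are invented):

```python
from datetime import date

def order_is_plausible(order, customer):
    # A business rule the raw data alone can't tell you:
    # an order should fall between the customer's signup date and today.
    return customer["signup_date"] <= order["order_date"] <= date.today()

customer = {"id": 42, "signup_date": date(2013, 3, 1)}
order = {"customer_id": 42, "order_date": date(2012, 12, 24)}  # looks valid in isolation

if not order_is_plausible(order, customer):
    print("Flag for manual review: order predates the customer relationship.")
```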

There are certainly business cases that
support clean data,
and big data analytics providers that want their clients to focus on it, but
in the face of so many other organizational demands on C-level executives and those
reporting to them, getting to “clean” is seldom a practical pursuit.

This brings me full circle to the original remarks of the cloud provider: that it is better for companies to “get going” with big data than to delay work until their data inputs are pristine. As a proponent of data preparation and cleansing, I railed against this point of view when I first heard it. Nevertheless, the approach seems sensible given the realities of today’s heavy workloads. Some companies have adopted it, and while their analytics don’t answer every question they want to ask, it is a start.

What’s your take?

Do you hold back work until your data inputs are sparkly and dirt-free? Share your philosophy about data quality in the comments.