Poor data quality is estimated to cost US companies $600 billion per year. The cost lies not only in the serious mistakes that poor data can cause, but also in the painstaking time and human effort it takes to fix that data.
In the big data world, data quality issues can multiply as fast as the data itself. Web-generated data is known for its unreliability and quality issues, and the machine-generated data that characterizes the Internet of Things (IoT) adds to the problem: it can contain as much useless gibberish as invaluable status information.
The issue was highlighted by data consultant Thomas C. Redman in a 2013 Harvard Business Review article. Redman used the example of a product management executive who was preparing a key report for her firm’s senior team and noticed that the market share numbers in the report didn’t make sense. The executive asked an assistant to verify the figures, and the assistant found and corrected an error in the data supplied by the market research department. The good news was that the assistant caught the mistake in time. The not-so-good news was that studies cited in the same article revealed that “knowledge workers were wasting up to 50 percent of their time hunting for data, identifying and correcting errors, and seeking confirmatory sources for data they do not trust.”
There are data preparation tools that abstract the extract-transform-load (ETL) formula for big data cleaning so that even end business users can use them, but there are still data quality issues that arise when data is created. It can be as simple as someone in business unit A hastily punching through transactions just to get them into the system, without realizing that someone in business unit B who is at the other end of this data deluge is going to need clean and accurate data in order to apply analytics and to make sound business judgments.
Or, it could be a machine indiscriminately spewing out bucketfuls of bits and bytes as it operates, leaving the user at the end of the process to weed through the data garbage in order to isolate the useful pieces of information.
Also in this mix are the “clean data champions” who understand the cost of working with dirty data, and then try to fix it. However, anyone who has ever served in one of these clean data champion functions knows that the task goes unappreciated, and that few executives and/or corporate reward systems even recognize it.
There are three steps companies can take to improve the quality of their data.
1: Link key data flows to business processes
If web-generated data and orders in the internal order entry system are utilized by marketing for analytics, marketing should be closely linked with the sources of this data (e.g., the order entry group and the provider of web-generated data) so that everyone works together on a data pipeline team that begins when the data is generated and ends when it is consumed. In this way, end-to-end data quality problems can be detected and corrected before they become too overwhelming to address.
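One way to picture this kind of shared responsibility is a single validation step that the order entry group runs when records are created and that marketing runs again before analysis. The Python sketch below is illustrative only: the `Order` fields, the allowed `source` values, and the function names are hypothetical and not taken from any particular system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical list of data sources the pipeline team has agreed on.
KNOWN_SOURCES = {"order_entry", "web"}


@dataclass
class Order:
    order_id: str
    amount: Optional[float]
    source: str  # where the record was generated


def validate_order(order: Order) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not order.order_id.strip():
        problems.append("missing order_id")
    if order.amount is None:
        problems.append("missing amount")
    elif order.amount <= 0:
        problems.append("non-positive amount")
    if order.source not in KNOWN_SOURCES:
        problems.append(f"unknown source: {order.source}")
    return problems


def split_by_quality(
    orders: List[Order],
) -> Tuple[List[Order], List[Tuple[Order, List[str]]]]:
    """Separate clean records from rejected ones, keeping the reasons."""
    clean, rejected = [], []
    for order in orders:
        problems = validate_order(order)
        if problems:
            rejected.append((order, problems))
        else:
            clean.append(order)
    return clean, rejected
```

Because the rejected records keep their reasons, they can be routed back to the group that generated them rather than silently dropped by the group that consumes them.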
2: Set data standards and usage guidelines for your data preparation tools
More companies are adopting self-service data preparation tools that enable end user departments to work with their data quality without having to consult with IT; the problem is that inconsistencies of data practice arise. A standard procedure for proper use of these tools should be defined by IT, which has the historical experience with data cleaning and preparation.
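As a rough illustration of what a shared standard might look like in practice, the sketch below keeps cleaning rules in one place so that every department normalizes values the same way. It is a minimal example under stated assumptions: the country mapping, the ten-digit phone format, and the function names are all hypothetical, and real self-service tools would express such rules through their own configuration.

```python
import re

# Hypothetical company-wide mapping, maintained centrally by IT.
COUNTRY_CODES = {
    "us": "US",
    "usa": "US",
    "u.s.": "US",
    "united states": "US",
}


def standardize_country(raw: str) -> str:
    """Map free-text country entries onto one agreed code."""
    key = raw.strip().lower()
    if key not in COUNTRY_CODES:
        # Fail loudly so unknown values are reviewed, not guessed at.
        raise ValueError(f"no standard defined for country: {raw!r}")
    return COUNTRY_CODES[key]


def standardize_phone(raw: str) -> str:
    """Normalize a phone number to the assumed NNN-NNN-NNNN format."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip a leading country code of 1
    if len(digits) != 10:
        raise ValueError(f"cannot standardize phone number: {raw!r}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

Raising an error on unrecognized input, instead of passing it through, is a deliberate choice here: it surfaces gaps in the standard so they can be fixed once rather than patched differently by each department.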
3: Move greater responsibility for clean data to the end business
IT is the custodian of most corporate data, but it is neither its creator nor its primary user. If the quality of transactional and big data is going to improve, it will be because managers within the business finally get fed up with the daily mistakes and the time lost trying to rectify situations created by the data being wrong the first time.