As a CIO, you’re responsible for producing a product called information, which you manufacture from a constituent raw material called data. In your information “factory,” as in any factory, the quality of your raw material directly affects the quality of your finished goods. When initiating your business-intelligence project, you’re likely to be surprised at how bad your raw material—data—really is, and you’ll discover that if you’re going to be serious about business intelligence, you’re going to have to get very serious about data quality as well.
In this article, I’ll define data quality, describe the high costs associated with poor data quality, and explain why data-quality issues tend to snowball when a data mart or warehouse enters the picture.
First of three articles
This article is the first of three in Dan Pratte’s series on data quality. In the next two articles, Pratte looks at what actions might be taken to improve data quality, and he concludes the series with a discussion of Total Quality Management (TQM).
Data quality defined
Data warehousing expert and author Ralph Kimball has said, “Quality data is the truth, the whole truth, and nothing but the truth.” Accordingly, a good measure of data quality is the extent to which the “truth” represented by your information systems matches the “truth” in the real world. As Kimball points out in The Data Warehouse Lifecycle Toolkit, five standards collectively define data quality. Conformance to these standards is what you should be aiming for in your data mart or warehouse data.
- Accuracy: The correctness of values contained in the various “fields” of the database record. Is my name spelled correctly? Are dollar amounts recorded properly?
- Completeness: Users must understand the data’s scope and be absolutely clear as to what comprises a particular data element, such as “total revenue.”
- Consistency: Aggregated or summarized information is in agreement with underlying atomic-level detail.
- Uniqueness: One “thing” (entity) in the real world must correspond to one and only one thing in your data. One of the two database records, Dan Pratte and D. A. Pratte, should be eliminated, as they represent the same entity in the real world.
- Timeliness: Data must be current with respect to the needs of the business, and users should be made aware of any deviation by a standard “update” schedule.
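To make the consistency standard concrete, here is a minimal Python sketch that recomputes totals from atomic-level detail and compares them with a summary. The customer names, amounts, and the find_inconsistencies helper are all hypothetical, invented purely for illustration; a real check would run against your own detail and summary tables.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical atomic-level detail: (customer, sale amount)
detail_rows = [
    ("Acme", Decimal("100.00")),
    ("Acme", Decimal("250.50")),
    ("Globex", Decimal("75.25")),
]

# Hypothetical summary figures as they might appear in a report
summary_totals = {
    "Acme": Decimal("350.50"),
    "Globex": Decimal("80.00"),  # deliberately disagrees with the detail
}

def find_inconsistencies(detail, summary):
    """Return {customer: (summary_total, detail_total)} where the two disagree."""
    recomputed = defaultdict(Decimal)  # Decimal() starts each total at 0
    for customer, amount in detail:
        recomputed[customer] += amount

    mismatches = {}
    # Check customers that appear on either side, not just in the summary
    for customer in set(recomputed) | set(summary):
        summary_total = summary.get(customer, Decimal("0"))
        detail_total = recomputed.get(customer, Decimal("0"))
        if summary_total != detail_total:
            mismatches[customer] = (summary_total, detail_total)
    return mismatches

mismatches = find_inconsistencies(detail_rows, summary_totals)
print(mismatches)
```

In this made-up data, only the Globex summary figure disagrees with its underlying detail, so it is the only customer reported.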
Where have these issues been hiding?
Poor data quality isn’t always apparent in processing your day-to-day business transactions. The billing department, for example, may not see the difference between entering “Amoco” or “Amoco Oil” or “Amoco Oil Corp.” in a database. All of these seem to get the job done, so in most organizations, multiple database entities (in this case, company names) often describe a single real-world entity. This is a violation of the “uniqueness” standard mentioned above, meaning that your data doesn’t truly represent the real world. Generally, this happens because nothing in the organization’s day-to-day use of the data demands any greater conformance with reality.
These sorts of data issues begin to snowball as soon as you begin to design and populate a data mart or warehouse as part of your business-intelligence solution. Why? Because you’re creating a system that relies heavily on the uniqueness of text-type identifiers in order to group data properly. To illustrate, if it’s important that you know the nationwide sales total for Amoco Corporation, then you’d better link all of the constituent sales facts to a unique company name.
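The Amoco example can be sketched in a few lines of Python. The sales figures, the CANONICAL_NAMES mapping, and the total_by_company helper below are hypothetical and exist only to illustrate the grouping problem; a hand-maintained lookup table is just one simple way to resolve name variants, not a recommendation of a particular cleansing method.

```python
from collections import defaultdict

# Hypothetical sales facts, keyed by the company name as billing typed it
sales_facts = [
    ("Amoco", 1200),
    ("Amoco Oil", 800),
    ("Amoco Oil Corp.", 500),
]

# Hypothetical mapping from known name variants to one canonical name
CANONICAL_NAMES = {
    "amoco": "Amoco Corporation",
    "amoco oil": "Amoco Corporation",
    "amoco oil corp.": "Amoco Corporation",
}

def total_by_company(facts, canonical=None):
    """Sum sales per company, optionally resolving names through a mapping."""
    totals = defaultdict(int)
    for name, amount in facts:
        key = name
        if canonical is not None:
            # Unknown names fall through unchanged so nothing is silently dropped
            key = canonical.get(name.strip().lower(), name)
        totals[key] += amount
    return dict(totals)

raw_totals = total_by_company(sales_facts)
clean_totals = total_by_company(sales_facts, CANONICAL_NAMES)

print(raw_totals)    # three separate "companies"
print(clean_totals)  # one company with the combined total
```

Grouped on the raw text, the data reports three customers with partial totals. Grouped on the canonical name, it reports a single customer with the full nationwide figure.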
Other effects of poor data quality
Poor data quality adversely affects your organization in three key ways:
- Poor data quality causes inefficiencies in business processes that depend on data: reports, ordering products, voter registration, and just about everything else for which facts are required. These inefficiencies result in very expensive rework efforts to “fix” the data in order to meet the requirements of various processes.
- Poor data quality gives rise to poor decisions. A decision can be no better than the information upon which it’s based, and critical decisions based on poor-quality data can have very serious consequences. This is another reason why you should make sure that your data actually represents reality.
- Poor data quality creates mistrust. Poor data quality can reflect adversely on your organization, lowering customer confidence. If the data’s wrong, time, money, and reputations can be lost.
While you, as CIO, may be surprised at the state of your data, other executives are simply in the dark. The problem, cost, and challenge of poor data quality are traditionally not visible to senior managers because the data is most often cleaned up before they see it. They don’t routinely see that the monthly sales-by-customer report takes three days because of the fiddling and rework that’s required to get the data right. That’s what IT departments do, right? Executives routinely tolerate high levels of waste and rework in their information factories, something they would prohibit with quality initiatives and cost constraints in their brick-and-mortar factories. But as CIOs begin to address these issues, a paradigm shift will occur across the organization, causing people to begin to look at data differently and to respect it as the asset it truly is.
As the usefulness of data extends beyond ordinary business transactions to the support of business-intelligence initiatives, data-quality issues are going to become increasingly evident. In the final analysis, the CIO must lead the charge to inspire the dawn of a new view of data across the organization. As we’ll discuss in future articles on this topic, this new view will require discarding some old paradigms so that you can recognize and exploit the full value of your data.