If you Google the term “data
dictionary,” you will get a hundred or more definitions
that generally mean the same thing. A data dictionary is data about data. It
defines the items in your database(s) and is where you look for definitions,
structures, usage, allowable content, and sometimes the business rules associated
with the data. It is THE roadmap to the data in your database.
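
To make that concrete, here is a minimal sketch in Python of what a single data dictionary entry might hold. The class, the field names, and the sample entry are all hypothetical and invented for illustration; a real dictionary would follow whatever standards your organization sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary record: the 'data about data' for a single field."""
    name: str                          # field name as it appears in the database
    definition: str                    # plain-language meaning of the field
    data_type: str                     # structure, e.g. "float" or "text"
    unit: Optional[str] = None         # unit of measure, if any
    min_value: Optional[float] = None  # allowable content: lower bound
    max_value: Optional[float] = None  # allowable content: upper bound
    business_rule: str = ""            # free-text rule associated with the field

    def allows(self, value: float) -> bool:
        """Return True if value falls within the documented allowable range."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


# Hypothetical entry: the name, unit, and bounds are made up for illustration.
co_level = DictionaryEntry(
    name="co_level",
    definition="Carbon monoxide concentration averaged over one hour",
    data_type="float",
    unit="ppm",
    min_value=0.0,
    max_value=1000.0,
    business_rule="Readings are recorded once per hour per monitoring site.",
)
```

Even an entry this small answers the questions a newcomer to the data would ask: what the field means, how it is stored, and which values are acceptable.
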
I was recently asked the question: Does the ROI on the data
dictionary that my organization religiously maintains justify its cost? I
answered with a resounding yes! Then I thought, why would anyone ask that
question? After all, aren't the inherent benefits of a well-thought-out and
maintained data dictionary obvious to all? Perhaps not, if the question is being
raised.
Having spent a number of years of my career on the
database/applications development side of the house, I find that a data dictionary has
significant meaning. Not only do I know what it is, but I also know how
important it is to an organization and how difficult life can be without one
when you need to combine data from multiple sources.
Based on the above, you might conclude that a data
dictionary:
- Is common sense
- Is important
- Is worth the effort to create and maintain
- Is part of every organization
Based on these assumptions, you might be surprised at how
many organizations do not have data dictionaries, do not staff a dedicated data
administration/metadata management team, do not update existing ones (they were
created at the time an application was built), or depend on the knowledge of
one or more individuals to maintain this information about the data in their
heads as institutional knowledge.
So what's the big deal with data dictionaries if so many
organizations forgo them?
The big deal is the value that they bring when you have to
share data, whether internally or across organizations. Having one makes life
so much easier, while not having one can result in chaos and misinformation.
Imagine this situation: Organization O has three
departments. Organization O collects information about air
quality so that it can make rulings, set standards, and influence legislation
concerning the quality of air in the environment. Department A in the
organization is responsible for monitoring outdoor air quality. Department B is
responsible for monitoring air quality in indoor environments, while Department
C is responsible for monitoring air quality underground, such as in mines and
sewers.
Each department is made up of brilliant scientists who are
extremely familiar with the chemistry of air. Not all of the scientists
come from the same discipline, but all KNOW air. Now these scientists
begin to construct spreadsheets and databases to capture and manipulate data
about air quality. What do you think the odds are that in each and every
spreadsheet and database that is created, the terminology used to name the
fields is the same? Probably moderately high. Now what are the odds that these fields
with the same names across databases are defined the same and capture data in
the same way? That's right: pretty low. Now, try to combine the data from the
disparate databases in the three departments into a data warehouse or central
repository so that you can do statistics using a larger data set, and boy, do
you have trouble (and a lot of work on your hands).
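
The mismatch described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the department labels, the field names, the definitions, and the units are invented, and the check is one simple way to surface conflicts, not a prescribed method.

```python
from collections import defaultdict

# Hypothetical per-department dictionaries: field name -> (definition, unit).
# All names, definitions, and units are invented for illustration.
DEPARTMENT_DICTIONARIES = {
    "A (outdoor)": {
        "co_level": ("Carbon monoxide, 8-hour average", "ppm"),
        "site_id": ("Monitoring station identifier", None),
    },
    "B (indoor)": {
        "co_level": ("Carbon monoxide, 1-hour average", "ppm"),
        "site_id": ("Monitoring station identifier", None),
    },
    "C (underground)": {
        "co_level": ("Carbon monoxide, instantaneous reading", "mg/m3"),
        "site_id": ("Monitoring station identifier", None),
    },
}


def find_conflicts(dictionaries):
    """Return fields whose name is shared but whose definition or unit differs."""
    seen = defaultdict(dict)  # field name -> {department: (definition, unit)}
    for department, fields in dictionaries.items():
        for field, meaning in fields.items():
            seen[field][department] = meaning
    # Keep only the fields that carry more than one distinct meaning.
    return {
        field: by_department
        for field, by_department in seen.items()
        if len(set(by_department.values())) > 1
    }


conflicts = find_conflicts(DEPARTMENT_DICTIONARIES)
for field, by_department in conflicts.items():
    print(f"'{field}' is defined inconsistently:")
    for department, (definition, unit) in by_department.items():
        print(f"  Department {department}: {definition} [{unit}]")
```

In this made-up case all three departments use the field name co_level, yet each means something different by it, and that is precisely what a check like this would flag before the data is combined. Note that the check is only possible because each department's definitions were written down in the first place.
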
Unfortunately, many of us are not lucky enough to be in
an organization at the time that these databases are first developed, and so we
can’t step in with good data administration practices in time to prevent such messes
from happening. In fact, many of us get brought in to deal with the train wreck
of disparate data sources under the guise of creating a data warehouse or
having to define the data to be used in a service as part of an SOA effort.
The good news is that it is never too late to create AND
MAINTAIN data dictionaries, along with the standards, administrative policies, and
procedures governing data. The bad news is that it takes a considerable amount
of time and effort to create them “after the fact,” particularly when
it is long after the fact and the people who actually know what the data was
intended to mean have left the organization. However, the effort always pays
for itself OVER TIME. Creating data dictionaries will not boost profits overnight,
nor will they suddenly allow you to do more with less or make your company 100
percent more efficient. But by working steadily on the project, you will aid in
decision-making in your organization by improving the quality of the data and
exposing the data to its users (because they now know where it is and what it
means), thus putting your organization in a position to build top-quality data
warehouses or SOA services.
All of these goals are worthwhile and can add to
the bottom line by helping you work smarter and faster. But how do you measure
this as ROI? That's difficult, because many of the benefits are hard to
measure, are intangible, or have to be measured over a very long period of time.
Yet despite all this, maintaining a data dictionary just makes good strategic sense for an
organization, and strategic ROI is measured over a period of many years.
So, going back to the question that started this article:
yes, by golly, it is worth it! Can I measure it in terms of ROI well enough to
convince someone stuck on pure numbers that they should give the project a go? Maybe
not (lots of intangibles there), but I feel confident that I can argue the
point that every penny spent on building and maintaining an accurate data
dictionary is well spent. Don't believe me? Ask the poor guy working on the
“new” data warehouse for a company that doesn't have a data dictionary. He will
tell you how much not having one is costing you.