If you Google the term data dictionary, you will get a hundred or more definitions that generally mean the same thing. A data dictionary is data about data. It defines the items in your databases and is where you look for definitions, structures, usage, allowable content, and sometimes the business rules associated with the data. It is THE roadmap to the data in your database.
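
To make that concrete, here is a minimal sketch of what a single data dictionary entry might hold. The class name, the field names, and the sample values are all hypothetical and chosen only for illustration; a real dictionary would use whatever attributes your organization standardizes on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One item in a data dictionary (hypothetical layout)."""
    name: str                            # field name as it appears in the database
    definition: str                      # what the field means
    data_type: str                       # structure of the stored value
    unit: Optional[str] = None           # unit of measure, if any
    allowed_min: Optional[float] = None  # allowable content: lower bound
    allowed_max: Optional[float] = None  # allowable content: upper bound
    business_rule: str = ""              # optional rule tied to the field

    def allows(self, value: float) -> bool:
        """Return True if the value falls inside the allowable range."""
        if self.allowed_min is not None and value < self.allowed_min:
            return False
        if self.allowed_max is not None and value > self.allowed_max:
            return False
        return True


# Hypothetical entry; the definition, range, and rule are made up.
pm25 = DictionaryEntry(
    name="pm25",
    definition="Fine particulate matter concentration, 24-hour mean",
    data_type="float",
    unit="ug/m3",
    allowed_min=0.0,
    allowed_max=1000.0,
    business_rule="Readings flagged as instrument faults are excluded",
)

print(pm25.allows(12.5))   # inside the allowable range
print(pm25.allows(-1.0))   # below the lower bound
```

Even an entry this small records the definition, structure, allowable content, and business rule in one place, instead of in someone's head.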

I was recently asked the question: Does the ROI on the data dictionary that my organization religiously maintains justify its cost? I answered with a resounding yes! Then I thought, why would anyone ask that question? After all, aren’t the inherent benefits of a well-thought-out and well-maintained data dictionary obvious to all? Perhaps not, if the question is being raised.

Having spent a number of years of my career on the database/applications development side of the house, I find that a data dictionary has significant meaning to me. Not only do I know what it is, but I also know how important it is to an organization and how difficult life can be without one when you need to combine data from multiple sources.

Based on the above, you might conclude that a data
dictionary:

  • Is common sense
  • Is important
  • Is worth the effort to create and maintain
  • Is part of every organization

Based on these assumptions, you might be surprised at how many organizations do not have data dictionaries, do not staff a dedicated data administration/metadata management team, do not update the dictionaries they have (which were created when an application was built), or depend on one or more individuals to carry this information about the data in their heads as institutional knowledge.

So what’s the big deal with data dictionaries if so many
organizations forgo them?

The big deal is the value that they bring when you have to share data, whether internally or across organizations. Having one makes life so much easier, while not having one can result in chaos and misinformation.

Imagine this situation: Organization O has three departments. The role of Organization O is to collect information about air quality so that it can make rulings, set standards, and influence legislation concerning the quality of air in the environment. Department A is responsible for monitoring outdoor air quality. Department B is responsible for monitoring air quality in indoor environments, while Department C is responsible for monitoring air quality underground, such as in mines and sewers.

Each department is made up of brilliant scientists who are extremely familiar with the chemistry of air. Not all of the scientists come from the same discipline, but all KNOW air. Now these scientists begin to construct spreadsheets and databases to capture and manipulate data about air quality. What do you think the odds are that the terminology used to name the fields is the same in each and every spreadsheet and database that is created? Probably moderately high. Now what are the odds that these fields with the same names across databases are defined the same way and capture data in the same way? That’s right: pretty low. Now try to combine the data from the disparate databases in the three departments into a data warehouse or central repository so that you can run statistics on a larger data set, and boy, do you have trouble (and a lot of work on your hands).
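
As a rough illustration of that trouble, the sketch below assumes each department had written down a definition and a unit for its fields, then checks which shared field names disagree. The departments' field names, definitions, and units are invented for the example, and `find_conflicts` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Hypothetical per-department dictionaries: field name -> (definition, unit).
department_dictionaries = {
    "A (outdoor)": {
        "co2": ("hourly mean concentration", "ppm"),
        "temp": ("ambient air temperature", "C"),
        "humidity": ("relative humidity", "percent"),
    },
    "B (indoor)": {
        "co2": ("instantaneous reading", "ppm"),
        "temp": ("ambient air temperature", "C"),
        "humidity": ("relative humidity", "percent"),
    },
    "C (underground)": {
        "co2": ("hourly mean concentration", "percent by volume"),
        "temp": ("ambient air temperature", "F"),
        "humidity": ("relative humidity", "percent"),
    },
}


def find_conflicts(dictionaries):
    """Return the fields whose (definition, unit) differs between departments.

    The result maps field name -> {(definition, unit): [departments]}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for department, entries in dictionaries.items():
        for field_name, spec in entries.items():
            seen[field_name][spec].append(department)
    # A field is in conflict when more than one distinct spec exists for it.
    return {
        name: dict(variants)
        for name, variants in seen.items()
        if len(variants) > 1
    }


conflicts = find_conflicts(department_dictionaries)
for name, variants in sorted(conflicts.items()):
    print(name)
    for (definition, unit), departments in variants.items():
        print(f"  {definition} [{unit}]: {', '.join(departments)}")
```

In this made-up data, `co2` and `temp` share a name across all three departments but not a meaning or unit, while `humidity` agrees everywhere. Without written definitions to compare, the same discovery has to be made by hand, one field at a time.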

Unfortunately, many of us are not in an organization at the time these databases are first developed, so we can’t step in with good data administration practices in time to prevent such messes. In fact, many of us get brought in to deal with the train wreck of disparate data sources under the guise of creating a data warehouse or defining the data to be used in a service as part of a SOA effort.

The good news is that it is never too late to create AND MAINTAIN data dictionaries, with standards, administrative policies, and procedures governing data. The bad news is that it takes a considerable amount of time and effort to create them “after the fact,” particularly when it is long after the fact and the people who actually know what the data was intended to mean have left the organization. However, the effort always pays for itself OVER TIME. Data dictionaries will not boost profits overnight, nor will they suddenly allow you to do more with less or make your company 100 percent more efficient. But by working steadily on the project, you will aid decision-making in your organization by improving the quality of the data and exposing the data to its users (because they now know where it is and what it means), thus putting your organization in a position to build top-quality data warehouses or SOA services.

All of these goals are worthwhile and can add to the bottom line by helping you work smarter and faster. But how do you measure this as ROI? That’s difficult, because many of the benefits are hard to measure, are intangible, or have to be measured over a very long period of time. Yet despite all this, maintaining a data dictionary just makes good strategic sense for an organization, and strategic ROI is measured over a period of many years.

So going back to the question that started this article: yes, by golly, it is worth it! Can I measure it in terms of ROI well enough to convince someone stuck on pure numbers that they should give the project a go? Maybe not (lots of intangibles there), but I feel confident that I can argue the point that every penny spent on building and maintaining an accurate data dictionary is well spent. Don’t believe me? Ask the poor guy working on the “new” data warehouse for a company that doesn’t have a data dictionary. He will tell you how much not having one is costing you.