Has anyone else ever noticed that DBAs and database philosophers have no clue what reality is like? I sure have. Every time I encounter a database that I had nothing to do with, I notice one of two things: either the data is total garbage and nearly impossible to code with but meets the end user’s needs, or the data is pristine and perfect and has nothing to do with what the end user wants. Odd, is it not?

There is definitely a weird dichotomy between reality and database design. For example, some of my customers use a particular ID number for their clients and the people that the sales reps deal with. Everyone keys in this data as zero pads it. The end users insist that it be zero padded. As a result, this number field is a character field and zero padded about half of the time I encounter it, when it is a data set that an end user has been working heavily with. And then I get to the database that a DBA put together, and there is a view that zero pads the data for actual use, but the raw, unpadded data gets dumped on me. The end result? I waste a ton of time trying to reconcile the two data sets.

Another lovely story of database disaster: constraints and foreign key checks. As much as the DBAs love these things, they simply do not work with reality! Every time you encounter a piece of software that insists that you have to do something truly bizarre for those special cases where you are working outside of the process, chances are that you are bumping up against a database constraint or foreign key relationship under the hood. Ticketing systems for help desks are particularly bad about this. eCommerce Web sites are also really bad about this. They enforce some constraint that makes perfect sense in UML, but as soon as the customer wants something special or unusual, especially in the case of a manager override on the standard process, the application cannot handle it and the system throws a temper tantrum.

The worst part is though, the DBAs are right. There is a process in place. It needs to be followed. It is impossible to write an application that allows users to invent their own workflows and adjust to it. And the database needs to be set up according to certain principles, or else it will be extremely slow or not even work at all. I mean, let’s be honest; if the database matched reality, we would be doing FULL OUTER JOIN on everything, and then using some database regex to attempt to figure out where everything goes.

How do I know this? Because I spent nearly all of last week working on Perl code to attempt to match individual physicians with the hospitals they work at, because it institutional accounts get one set of identifying numbers and physicians get another set, and the ID numbers have nothing to do with each other, and a lot of physicians have five or six different addresses they use (a few physical addresses that they work at and a few mailing addresses) and the addresses in the hospital databases do not always precisely match the physician’s addresses. Indeed, sometimes they will have the same address but different ZIP codes, or the same ZIP code but a different city name (think “Brooklyn‿ versus “New York City‿). Or the same place will have multiple addresses for it (New York City is famous for this). So I get to rip apart the addresses into tokens, convert a lot of things (“Street‿ becomes “St.‿, but “Saint‿ also becomes “St.‿ for example), and then figure out a match likelihood (numbers are weighted more heavily than words, because two addresses being on the same street is more likely than two addresses having the same number in them, and so on), and then figure out what match likelihood numbers are the cutoff points between “good‿, “somewhat good‿ and “not good.‿ It is a nightmare.

All because good database design frequently has no bearing on reality.