Data Management

Reality, meet database


Has anyone else ever noticed that DBAs and database philosophers have no clue what reality is like? I sure have. Every time I encounter a database that I had nothing to do with, I notice one of two things: either the data is total garbage and nearly impossible to code with but meets the end user’s needs, or the data is pristine and perfect and has nothing to do with what the end user wants. Odd, is it not?

There is definitely a weird dichotomy between reality and database design. For example, some of my customers use a particular ID number for their clients and for the people that the sales reps deal with. Everyone who keys in this data zero pads it, and the end users insist that it be zero padded. As a result, this number field is a character field, and it is zero padded about half of the time I encounter it, typically when it comes from a data set that an end user has been working with heavily. And then I get to the database that a DBA put together, where there is a view that zero pads the data for actual use, but the raw, unpadded data is what gets dumped on me. The end result? I waste a ton of time trying to reconcile the two data sets.
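Reconciling the two data sets mostly comes down to normalizing every ID to one canonical form before comparing. A minimal sketch in Python (not the author's actual tooling; the field width of 10 is a hypothetical example value):

```python
def normalize_id(raw, width=10):
    """Canonicalize a client ID: strip whitespace and leading zeros,
    then re-pad to a fixed width so padded and unpadded data sets
    compare equal. The width is a made-up example."""
    return str(raw).strip().lstrip("0").zfill(width)

# One data set arrives zero padded, the other raw; after
# normalization they reconcile on the same key.
padded_ids = ["0000012345", "0000067890"]
raw_ids = ["12345", "67890"]
assert [normalize_id(i) for i in padded_ids] == [normalize_id(i) for i in raw_ids]
```

The same function applied on the way in and on the way out also sidesteps the view-versus-raw-table mismatch described above.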

Another lovely story of database disaster: constraints and foreign key checks. As much as the DBAs love these things, they simply do not work with reality! Every time you encounter a piece of software that insists that you have to do something truly bizarre for those special cases where you are working outside of the process, chances are that you are bumping up against a database constraint or foreign key relationship under the hood. Ticketing systems for help desks are particularly bad about this. eCommerce Web sites are also really bad about this. They enforce some constraint that makes perfect sense in UML, but as soon as the customer wants something special or unusual, especially in the case of a manager override on the standard process, the application cannot handle it and the system throws a temper tantrum.

The worst part is, though, the DBAs are right. There is a process in place, and it needs to be followed. It is impossible to write an application that lets users invent their own workflows and adjusts to them on the fly. And the database needs to be set up according to certain principles, or else it will be extremely slow or will not work at all. I mean, let's be honest: if the database matched reality, we would be doing a FULL OUTER JOIN on everything, and then using some database regex to attempt to figure out where everything goes.

How do I know this? Because I spent nearly all of last week working on Perl code to attempt to match individual physicians with the hospitals they work at. Institutional accounts get one set of identifying numbers and physicians get another set, and the ID numbers have nothing to do with each other. A lot of physicians have five or six different addresses they use (a few physical addresses that they work at and a few mailing addresses), and the addresses in the hospital databases do not always precisely match the physicians' addresses. Indeed, sometimes they will have the same address but different ZIP codes, or the same ZIP code but a different city name (think "Brooklyn" versus "New York City"). Or the same place will have multiple addresses for it (New York City is famous for this). So I get to rip apart the addresses into tokens, convert a lot of things ("Street" becomes "St.", but "Saint" also becomes "St.", for example), and then figure out a match likelihood (numbers are weighted more heavily than words, because two addresses being on the same street is more likely than two addresses having the same number in them, and so on), and then figure out which match likelihood numbers are the cutoff points between "good", "somewhat good", and "not good." It is a nightmare.
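The tokenize-normalize-score approach can be sketched in a few lines. This is a Python illustration, not the author's Perl; the abbreviation table and the weights are hypothetical stand-ins for a much longer real list:

```python
import re

# Hypothetical abbreviation table; a real one would be far longer.
ABBREV = {"street": "st", "saint": "st", "avenue": "ave", "road": "rd"}

def tokens(address):
    """Split an address into lowercase alphanumeric tokens and
    collapse known variants to one canonical form."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return [ABBREV.get(w, w) for w in words]

def match_likelihood(a, b, number_weight=2.0, word_weight=1.0):
    """Score shared tokens; numbers weigh more than words, since two
    addresses on the same street are likelier than two addresses
    sharing the same house number."""
    shared = set(tokens(a)) & set(tokens(b))
    return sum(number_weight if t.isdigit() else word_weight for t in shared)

# "Saint" and "St." collapse to the same token, so these score high:
score = match_likelihood("123 Saint Marks Street", "123 St. Marks St")
# score == 4.0: "123" (2.0) + "st" (1.0) + "marks" (1.0)
```

The remaining (and hardest) step is exactly what the post describes: picking the score thresholds that separate "good" from "somewhat good" from "not good."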

All because good database design frequently has no bearing on reality.

J.Ja

About

Justin James is the Lead Architect for Conigent.

76 comments
sconyers
sconyers

We have a huge delivered database that has also been incredibly customized, so of course there are no accurate ERDs. Anywho, it's so normalized that reporting becomes incredibly complicated. It currently takes 8 JOINs and nearly a full page of other SQL to get one particular kind of address, say physical. If you want a mailing address, it takes that same code with a few different pieces in WHERE clauses. This definitely seems to be a case where the DBAs went a little overboard.

Deadly Ernest
Deadly Ernest

I've seen plenty of problems like that in databases, and in every case, it was due to the database designer NOT doing a proper analysis of the business case and work flows of the work that the database was being designed for - in essence, the database designer made a design fault and someone let them get away with passing it off onto the users. I've designed a database where I made mistakes like that, so I threw my first db out and did it again from scratch and got it right, and I've done several since then. The key is the business case and work process analysis - get that right and the rest is easy.

Tony Hopkinson
Tony Hopkinson

Whoever analyses the business requirement needs to know database and application design, or they should be implementation agnostic. Just a minor omission, like the fact that an order number isn't always a number, has an impact on an implementation, after all. A 'pure' database design generally falls over at the time you write an application. A design done from the application end is usually lacking in data integrity, inefficient, and hard to maintain. Whether either design meets the requirements, and whether the requirements are actually the requirements, is a perennial problem. The real mistake is treating these views of a function as isolated from each other; not doing so would be nice, but it doesn't happen in the real world.

henkelr
henkelr

The opening quote - "the data is total garbage and nearly impossible to code with but meets the end user's needs, or the data is pristine and perfect and has nothing to do with what the end user wants" - indicates to me that the db design for the business requirements was not done properly in the first place. If the data is total garbage, then the business is not precisely defined or deals in garbage; that quality of information will be reflected in the business. If the data is pristine and has nothing to do with what the end user wants, then the database does not even resemble the business requirements and is useless to all except those who constructed it. The db design is not a job done by a programmer, dba or developer. The process of converting business requirements into a db design is done by an analyst/designer. The information engineering design skills required to transform business rules into a data model are not being applied by this role. I've seen dozens of databases (commercial and home grown), and NONE of them have been designed with any of the tools or skills necessary to perform the transformation from user requirements to a db design. Even off-the-shelf third party products are designed as if using flat files to store data. The RDBMS technology advantages are not being leveraged to deal with the issues identified. Many of the technical issues raised are eliminated if the design is done with the disciplined skills required of the analyst/designer role in the first place.

Tony Hopkinson
Tony Hopkinson

Wrong in practice. Most of these systems, the initial design is done by a business head. They are little in house projects that then become seen as useful over a wider area. You need someone not only with enough skills, but with a remit to take ownership of data collection applications. Just something as obvious as adding a product or customer id recognised by the 'main' systems will sort out 99% of the problems.

fjfish
fjfish

(Assuming they aren't all being coerced to upper case.) McLaughlin and Mclauglin: both legitimate spellings, but different from each other. I once tried to write a PL/SQL function to properly capitalize names and then realised the capitalisation depended upon what some 19th century census clerk had written down. Forcing the query name to all upper case works here, but then what about MacLaughlin maybe being the same person as well, depending upon the 21st century clerk entering the data over the phone...

artful
artful

I'm hoping that some reader of this list can answer a question that has puzzled me for years. I am a woefully inadequate player of a musical instrument called tabla, and I have perhaps 100 CDs of music featuring tabla. One of the greatest players in the world is called Zakir Hussain. His father is called Alla Rahka. How does this work? There is apparently no visible trace of the father's name in the son's.

Justin James
Justin James

That is another one of my pet peeves. I know *why* DBAs and application developers like to uppercase data... it makes searching much faster when you uppercase it once going into the DB and uppercase any WHERE clauses. Too bad the data loses meaning when you uppercase it. It's like deciding to take a "quiet scenic drive," jumping in the car, heading out to the interstate, and speeding down it at 80 MPH. Is the drive quiet and scenic? I suppose so. But you lost part of the point by doing it quickly and efficiently... Bad analogy. I keep forgetting my rule about not replying to posts before coffee... J.Ja
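One common compromise keeps both the search speed and the meaningful casing: store a case-folded shadow column, index it, and search on it while returning the original. A sketch using Python's built-in sqlite3 (table and column names here are invented for illustration):

```python
import sqlite3

# Keep the original casing; search on an indexed case-folded shadow column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, name_folded TEXT)")
conn.execute("CREATE INDEX idx_folded ON people (name_folded)")

for name in ["McLaughlin", "MacLaughlin", "O'Brien"]:
    conn.execute("INSERT INTO people VALUES (?, ?)", (name, name.casefold()))

# The WHERE clause hits the indexed folded column; the returned
# value keeps its meaningful capitalization.
rows = conn.execute(
    "SELECT name FROM people WHERE name_folded = ?",
    ("McLaughlin".casefold(),),
).fetchall()
# rows == [("McLaughlin",)]
```

The cost is one extra column per searchable text field, which is usually cheaper than losing the data's meaning.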

DanLM
DanLM

This really doesn't have anything to do with the topic of the blog, but it does with regard to MySQL. I see that you use MySQL a lot in your cleansing. Is it 5.0? If so, are you doing most of your SQL in stored procedures, which came out in 5? I ask because, well... lol, I'm playing with MySQL at home because they have some small back end apps at work that use it. Right now, there is only one person who codes to MySQL, and he does it in Java. There are some seriously good SQL coders where I work right now, and I suggested that the Java coder start utilizing stored procedures so that he can utilize these people while he works on the cleansing part of the code. Just wondered if you were doing the same? Dan

Justin James
Justin James

Dan - Because of the nature of what I am doing (long story), I write SQL scripts and run them through the console. That being said, I recommend stored procedures for any situation where you have multiple people working on the same project, a more-than-short development cycle (say, any project that takes more than a work week), any project where the same functions will need to be accessed by multiple applications, or where the same function will need to be run by multiple portions of the same application. Stored procedures are a great tool for maintaining a canonical "this is how we do it" system. J.Ja

sentdata
sentdata

Always the database, never the application! We go over it time and time again. But in the end, a complex problem gets a complex design, which gets a complex table structure, which gets complex table relationships. The only true workaround starts after you know (or at least think you know) what the users want. Developers and DBAs need to work together to find a level of normalization that will work for all players. If my database holds up your application, then I have not done my job. If your application makes my database too complex, then you have not done yours!

Tony Hopkinson
Tony Hopkinson

I didn't use to, but I went schizophrenic arguing with myself (I wear both hats on occasion).

pmetcalf
pmetcalf

When I get a request for a project at this level (multi-disparate data), I gather all parties and have them agree (on the SLA or SOW) to match their own records to a Master List with a "hidden" primary key. *never ever allow a user to change that* Your first question to yourself is "Who is the sponsor of this project?". Was it some health care committee? An administrator? You need to think up a way to solve this where you don't have to touch the data. Design a xref table, give it to the users and have them populate it. Or have each client add a new field to their Doctor table with your PK in it, then give them an interface to "go get" the next PK when a new doctor is added. What I'm saying is, solve this with getting the sponsor's help in mind. They may actually re-think the problem in a database state of mind and ask you for something a little more do-able.

swstephe
swstephe

Many of those issues could have been solved by technology and procedure, but technology lags due to tradition. Your example of zero-padded numbers, for example, is a hold-over from fixed-width columns and COBOL. Modeling the real world is still conceivable, though. Postal codes (for the whole world) could be supported by vendor-supplied code, plugged into every application, and supported with upgrades, so nobody ever has to debug that code again. People are notoriously difficult to code around. How do you identify a human being? We have very vague approximations and generalizations that don't apply to the real world, conceptually (try pinning down someone's gender, race, nationality, health). It is possible, but it has to come from the industry instead of from individual application developers.

Tony Hopkinson
Tony Hopkinson

A people type, with output based on locale? That could get messy, like any generic solution. What about presenting more than two gender types to certain fundamentalist types?
Race - Human
Religion - Jedi
Nationality - Yorkshireman
Health - MYOB
Sex - Please
I see difficulties... Too much tailoring is where we are now; with rigid standardisation you are going to offend people without meeting their requirements.

Wayne M.
Wayne M.

My Mother supported a medical system that did provide more than two genders. As I recall, the system included:
* Male
* Female
* Male previously Female
* Female previously Male
* Both
* Neither

Justin James
Justin James

swstephe - You are so right about the old COBOL connection. My mind totally blanked that memory out ("Counter PIC 00000", if memory serves). The zero padding is also because the users get so used to zero padding in display (it looks tidy and pretty to them) that they stick data back in with zero padding. Like, someone will do a spreadsheet with raw, un-padded data, format it with padding, and then give the spreadsheet to someone else who loads it into a database with zero padding. Poof, instantaneous data problems! Postal codes are indeed an issue, but one could store them un-padded and re-pad on output, I suppose. You are definitely right about trying to code the concept of people! For example, with ethnicity, most databases treat it as a single code, when in reality, you need many codes. Coding arrays as values, particularly arrays of keys into another table, is a pain to do neatly; you end up building a separate table for that one field, replicating your key field a bazillion times just to do an array. Databases are really lousy for anything that would be represented by a data structure like a hash of hashes or an array of arrays. I would love to see the industry step up to the plate on this stuff, but it is nearly impossible. Half of the world's population puts the family name first when written, for example. If we can't get cultures to standardize, how are we going to get the computers to standardize? J.Ja
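The "separate table for that one field" approach described above is the standard relational answer: a junction table turns one multi-valued field into rows. A small sqlite3 sketch (table and column names are invented for illustration):

```python
import sqlite3

# A junction table models "one person, many ethnicity codes" without
# stuffing an array into a single column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ethnicity (code TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE person_ethnicity (
        person_id INTEGER REFERENCES person(id),
        code TEXT REFERENCES ethnicity(code),
        PRIMARY KEY (person_id, code)
    );
""")
conn.execute("INSERT INTO person VALUES (1, 'Alice')")
conn.executemany("INSERT INTO ethnicity VALUES (?, ?)",
                 [("A", "Group A"), ("B", "Group B")])
conn.executemany("INSERT INTO person_ethnicity VALUES (1, ?)",
                 [("A",), ("B",)])

codes = [r[0] for r in conn.execute(
    "SELECT code FROM person_ethnicity WHERE person_id = 1 ORDER BY code")]
# codes == ["A", "B"]
```

It works, but as the post complains, the person's key gets replicated once per value, which is exactly the awkwardness of representing array-like data relationally.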

Dr_Zinj
Dr_Zinj

It's simply amazing that the business doesn't get analyzed thoroughly before a database application is developed for it. On the other hand, there's a butt-load of businesses out there that don't even use consistent business rules, yet expect analysts, designers, and coders to make sense of the whole mess, and then throw a hissy-fit because our reports and analysis are less than perfect. When you do a business analysis and the owners/users can't tell you WHY they make an exception, it's time to take a very strong look at that exception. The analyst needs to understand the business. If the business owners/users don't understand their own business, then it's probable that they are in desperate need of a process improvement in those gray areas.

Justin James
Justin James

Dr. Zinj - That is really spot on. Too many businesses barely have business rules, or worse, make them up as they go along, with individual managers or analysts just inventing things with no regard to past precedent or process. This creates total havoc at the application and database layers, as it destroys most code reuse, fosters the development of data islands and silos (at best, loosely connected), and makes the idea of re-running old code with new data a joke. J.Ja

Tony Hopkinson
Tony Hopkinson

That assumes the company has a centralised IT function. I did a lot of time in manufacturing, so I usually ended up with the leavings of a graduate engineer. One of my favourites was a program to collect the results from a testing machine. The product ID was a six digit number, and they wanted me to link the results into the main system. The data was in a structured binary file, and the graduate had left. After some decoding, I ended up with stuff like this:

666666 500.34
.....7 500.56
.....9 500.67
....71 500.25

It's the ditto function. Clever, that graduate, wasn't he? Oh yes, very clever, I thought!

onbliss
onbliss

I thought you said the field was a character field. So when the raw data was given to you, did it not contain the padded zeroes as well?

Justin James
Justin James

I *usually* get raw CSV files, some padded with zeros, some without. But I also get a lot of Excel files, where it is total chaos, because even in the same column, the data is sometimes zero padded and sometimes just using an Excel cell format to appear zero padded. I have one customer who has raw, unedited data in the database, so they run a reporting service (I believe Cognos) to produce an Excel file, which they dump on me, logos and goofy headers and "proprietary and confidential" messages and all... and I have the "pleasure" of turning this into usable data. :) J.Ja

onbliss
onbliss

It almost looks like you need some kind of ETL functionality. Once I was writing data onto an Excel file using VB6/ADO. It was a little frustrating when Excel treated all-numeric character fields as numbers.

Justin James
Justin James

I end up using FoxPro to handle a lot of the initial data massaging, and then for bigger or more lengthy projects, I dump the stuff FoxPro handled back out to a delimited file and load it into MySQL or SQL Server (depending upon my mood, what I will be using to do the report, and whether I need to do additional processing in a particular language) from there. Big ETL tools (i.e., anything that bills itself as "Enterprise Class") just do not have the quick turnaround time I require for these doofy one-day, one-off projects that fill much of my time. J.Ja

gsquared
gsquared

The most important thing I've found in database design (software design in general, too) is: build on the assumption that the business "rules" are merely suggestions, and are subject to change with or without notice. The number of times that a new product line has been launched where I work, with very strict rules laid out by top management - "this is how it will work, we won't allow orders that don't follow these rules" kind of stuff - and the very first order breaks five of the six "set in concrete" rules, is uncountable. The same goes for existing product lines, when I have to write exception code in the pricing calculations because some salesperson came up with a unique new deal with a customer. So I write my code to allow for exceptions to be retrofitted easily. As to the padded numbers: store the data in two columns, one numeric and one varchar, so you can search by numeric values (much faster), but have the actual stored value as a string. That might solve a number of your problems in that particular regard. (If you have enough control over the database to implement that.)
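The two-column suggestion is easy to sketch. Using Python's sqlite3 as a stand-in database (table and column names invented; the pad width of 10 is a hypothetical rule):

```python
import sqlite3

# Two columns: a numeric one for fast searching, a varchar one holding
# the zero-padded value the end users actually want to see.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id_num INTEGER, id_text TEXT)")

for raw in ["0000012345", "67890"]:          # mixed padded/unpadded input
    conn.execute("INSERT INTO clients VALUES (?, ?)",
                 (int(raw), raw.zfill(10)))  # width 10 is a made-up rule

# Numeric lookup; the padded display value is preserved.
row = conn.execute(
    "SELECT id_text FROM clients WHERE id_num = ?", (12345,)).fetchone()
# row == ("0000012345",)
```

Loading with both forms up front means neither the report code nor the end users ever have to re-derive the other representation.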

Justin James
Justin James

"The number of times that a new product line has been launched where I work, with very strict rules laid out by top management, "this is how it will work, we won't allow orders that don't follow these rules" kind of stuff, and the very first order breaks five of the six "set in concrete" rules, is uncountable." That's exactly the kind of thing I am talking about: the DBA builds constraints and keys based upon business rules, and business rules are too frequently superseded by "the customer is always right" to actually apply 100% of the time. All it takes (as one commenter pointed out) is for a salesperson to give someone a special-case discount, and your constraints just blew up. "That might solve a number of your problems on that particular regard. (If you have enough control over the database to implement that.)" Typically, I am running reports from raw data that they dump on me. Internally, I get control over the data, but when I give things back to them, it needs to meet their requirements. J.Ja

Wayne M.
Wayne M.

Sorry, Justin, I gotta go with the dark side on this. I feel the database should apply the minimal constraints possible on its data. The reason is that the data constraints will either be too loose or too tight for business operations. It really comes down to whose time you use up due to the imprecision. I would prefer to have the developers take extra time to handle difficult data once than to have users be required to deal with it periodically. It is really much more flexible to keep the rules in application code. If an application rule is preventing data entry, a DBA can patch in the data, and an application update is usually much less intrusive to users than changing a database constraint. Besides, this way it keeps the DBA's job interesting.

Tony Hopkinson
Tony Hopkinson

Of course the programmers and the dba should do the work not the user. I thought that was a given. They need to work together, or you are faced with poor data collection or poor data storage, neither one of those will give you a happy smiling customer.

Wayne M.
Wayne M.

I find architectural constraints in databases useful, but feel data validation needs to be pushed as close to the user as possible. Constraints that support the structure of the database need to be mandatory; this prevents orphan table rows. Primary keys need to be non-nullable, and parent-child relationships need to be enforced. These are proper restrictions for the database to enforce back onto the application. Data validation, however, should be applied as close to the user as possible. Furthermore, the cost of an overly restrictive data validation implementation requires a bypass as a mitigation. I believe there is consensus that an application programmer should not rely on a database constraint to detect data errors. I would restate this as saying that the database constraint should never be more restrictive than the user interface. I will also argue that data validation in the database should be much less restrictive than that applied at the user interface. If we could assure a perfect set of data validation rules, then I would agree that duplicate rules in the database and user interface do no harm to operations. If, however, the validation rule is overly restrictive, then there are basically three options: one, do not capture some of the data; two, roll back deployment of the application; three, provide a workaround to enter the exception data. There would be business reasons behind the selection of any of the three options, but by allowing operations staff to bypass the user interface and make a "rules-free" entry into the database, I have a mitigation if the analysis or implementation is less than perfect. By explicitly defining this option, I also take the responsibility to define the process to "patch" data - also a mandatory step. There are lots of follow-on implications to this choice, but in general I take the approach that I would prefer to make extra work for the programmers and DBAs rather than prevent the users from completing their tasks.
Please do not interpret this as any sort of programmer versus DBA argument; one of my ongoing frustrations is programmers and DBAs who do not respect one another (another discussion). To be very explicit, I expect DBAs to delegate data validation to the programmers, with the full understanding that, should the programmers fail, the DBAs will have to go in and bail them out.

Justin James
Justin James

Sending garbage data to the database and just throwing an error when the constraint fails is lazy at best. That being said, if the developer has enough information to properly handle these oddball cases, then the DBA should as well. Otherwise, the DBA is just going to make some junk tables with little data integrity, which is fine when you have one or two developers writing one application, but as soon as a second application starts hitting that same table, "wacky hijinks ensue." J.Ja

Tony Hopkinson
Tony Hopkinson

I like to put as many constraints in the database as possible myself; it cuts out a lot of silly mistakes on my part. You need good cooperation between the DBA and the application developer, though - none of this throw-it-over-the-partition crap.

gsquared
gsquared

In order to achieve consistent formatting, etc., on street addresses, I implemented a CASS ("Coding Accuracy Support System" - it's a Post Office abbreviation) program in my ERP/CRM database. It looks up addresses based on Post Office standards and formats them using the Post Office standard. Some addresses, mainly ones that can't be mailed to, can't pass the CASS certification tests, but the majority can. This has massively cleaned up the address formatting. The one I use is from Satori Systems. It can work in Office (has a toolbar for Word, Access, Excel), in SQL, or as a server. They're planning on releasing a Web service version that can accept XML. I don't remember what it cost, but it has been of tremendous use in standardizing address input.

Justin James
Justin James

Labeled "Capital expenses vs. ROI". J.Ja

gsquared
gsquared

Happens all the time. I was able to get the CASS software because it's necessary for our products (we do direct mail marketing), so I was able to piggy-back the other uses onto it. Without that, yeah, they have to pay you to build systems to validate addresses, which probably costs them more than a prebuilt piece of software. I get a bunch of things through budget by calculating in-house vs. purchase/license costs: it'll take X hours at my salary (Y) to produce, so $XY > $(license). (I can also add in lost opportunities: "in the same amount of time, I can produce Z code that will accomplish these things, which cannot be purchased commercially.") That works sometimes. Your situation sounds like you have less reasonable managers/budgetary controls. All I can do in that case is offer sympathy.

Bob.Grimes
Bob.Grimes

With regard to incorrect addresses, mismatched ZIP codes, and incorrect street types, take a look at a product by MapInfo called MapMarker. MapMarker does a splendid job of geocoding despite nearly all street name misspellings, and handles street type and ZIP+4 corrections. The latest version (12) does delivery point validation: if the mail can't be delivered, it's not a valid address. The geobase is taken from the USPS postal database. Don't waste time fixing addresses and looking up street types and ZIP codes. If this is going to be an ongoing project, MapMarker will pay for itself many times over.

JohnnySacks
JohnnySacks

Their address validation worked well for me 4 years ago. It costs real money, though, so that's a deal killer for 90% of these cases.

carnac
carnac

RE CASS processing, see www.semaphorecorp.com - I've been using them for years. Not as slick as Satori, but fast, accurate, programmatically accessible from other apps (lots of utilities and coding examples), and a fraction of the cost.

Justin James
Justin James

Unfortunately, I have zero budget for tools or software, regardless of what kind of ROI they may provide. In a nutshell, I paid for my favorite text editor & WinZip out of my own pocket, my company is willing to pay for an MSDN subscription, and that is it. A while back, I looked at ESDI for mapping/address/geocoding solutions, and I was told to use MapPoint because it was included with the MSDN subscription, even though it did not meet our needs; the money for ESDI was "too much" despite the fact that the ROI just in saving me time would have paid off. J.Ja

Tony Hopkinson
Tony Hopkinson

I've only seen one firm that was prepared to pay for the UK equivalent, and they didn't do a brilliant job of using it anyway. In the UK, the facility had to be licensed, so the cost was passed onto the end user of the software. That was 2003; I haven't looked at it since. Most firms muddle through without it.

camainc
camainc

I couldn't read this article, because I couldn't get past this sentence: "Every time I encounter a database that I had nothing to do with, I notice one of two things: either the data is total garbage and nearly impossible to code with but meets the end user’s needs, or the data is pristine and perfect and has nothing to do with what the end user wants." Every database that you encounter, that you didn't design or help with the design, is total crap? What companies have you worked for, man? Are you saying that in all the places you've worked, every DBA before you was an idiot? Sounds to me like you are either working for some total losers, or you haven't been working very long, or you have just a ~wee bit~ too high an opinion of yourself.

artful
artful

I agree with you. I've seen my share of awful designs, but I have also seen some awesome designs whose subtle nuances totally blew me away. I have learned some important lessons working on such projects. There's a guy called Dejan Sunderic who hired me on a project a few months ago. (See his book "SQL Server 2005 Stored Procedure Programming in T-SQL and .NET".) I feel sooo fortunate to have worked with him and to have seen his architecture for a massive project. There's elegance there that I could not have dreamed up myself. Having seen it, though, I now can dream of it.

MadestroITSolutions
MadestroITSolutions

I have to agree with Justin. I don't think he is being arrogant. In my personal experience, I have yet to find a database that meets customer demands and follows good rules at the same time. I think your comment only reflects YOUR level of knowledge about the subject. Now, I am not saying you don't know what you are doing; I just think you don't have experience in the subject at hand. A very simple and common example of this is the relationship between an entity that represents the user accounts and the rest of the tables in the database. Ideally, you want to create referential constraints for records that have been created and/or modified by someone, to guarantee that a record is always stamped with the ID of the person who touched it. There, good design, right? In reality, you cannot do this, because at some point you may have to delete the user from the database, and what are you going to do then? Remove the constraint? Change the IDs? Or perhaps leave the account in a "disabled" state and eventually get to a point where the system does not work because of the size of the tables? These are real life factors that are not evident in UML, but come up as you go along.

Justin James
Justin James

"Every database that you encounter, that you didn't design or help with the design, is total crap? What companies have you worked for, man? Are you saying that in all the places you've worked, every DBA before you was an idiot?" I would have to say that something like 90% of my customers have databases which simply do not meet their users' needs and the principles of good database design at the same time. Are their DBAs "idiots"? Not at all. It is mostly a corporate culture problem. You have business analysts with write access to the database, for one thing. You have data which is purchased from third party sources, which is in a format that makes sense for what it is, but never properly transformed into something that can be used by the business very well or merged into the corporate data set. You have data from a multitude of third party vendors that should relate to each other, but because of differences in how each company handles data, you get differences in how the data is presented. And so on and so on. In fact, I would say that for most of my customers, their biggest pain point is the third party data. The amount of effort needed to transform that data into something usable by the business is incredible. Heck, I design garbage databases on many occassions, and I will freely admit that. Someone will hand me 7 CSV files and ask me to whip up some report based on that data in four hours... beleive me, that database is not going to be rigorously designed. And two weeks later, the "throwaway" database gets used again. And then, the next thing you know, what was a throwaway database for use in a one-off report is suddenly being used to generate all sorts of reports, and to change the database to fix it just isn't happening anytime soon because too many things rely on it being exactly how it is. And a new nightmare is born. I think that tight deadlines are the cause of most of this misery. 
If you look at the comments from Tony and Mark, both of whom I have the utmost respect for, they know what the solution is, which I also wrote about in the post: do a billion and one checks, transformations, etc. on the data as it goes in, whether that input source is an end user, a third-party data vendor, a Web service, or something else. But all of that takes an extraordinary amount of time, effort, and testing, which few IT people have available to them. IT is a field in which results can be shown quite quickly. Will they be great results, produced by a smooth process and clean code? Absolutely not. But it builds false expectations in users. They do not care that the code was cranked out, did not follow proper practices, and is completely unmaintainable; they want their results, and they want them now. So as a result, you get garbage code that uses garbage databases and produces mostly acceptable results. J.Ja

Tony Hopkinson

The difference between a newby's 'bad' database and a professional's is we know why we made it bad :D I've denormalised for application efficiency many times. I didn't want to, but that was just the way it had to be. I'm sure another person looking at it thought it was designed by some twit who once wrote a macro. :)

One of the biggest problems is what I call data islands. Some manager somewhere puts together a database (Access!) and produces a report. The report's useful, becomes popular, and then another person says wouldn't it be nice if it could be linked to this data on another island. I see difficulties! A company policy where that initial design gets vetted, and things like, say, Product ID or Customer ID get rationalised, can turn these from a nightmare into a definite asset for your 'in-house' IT resource.

Justin James

"some MBA who once did a spreadsheet macro" In fact, that describes a large portion of the people I interface with on a regular basis. I was helping one of those types (personal friend, not business) earlier today. I asked her, "did you declare any of these variables, or are they dynamically typed?" She was baffled. She can't figure out why subtracting one time value (as opposed to a combined date/time) from another does not work so well when the code gets to midnight for one of the values. She's a great person and smart, but has zero experience with writing code outside some "Computers 101" type courses, and is now building a help desk ticketing system in bloody VB 6 with an Access backend, and then throwing data for reports out to a text file. It reminds me of a project I did 10 years ago... J.Ja
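The midnight bug described here is easy to reproduce. A sketch (in Python rather than VB 6, with made-up times): subtracting time-of-day values gives a nonsense negative duration once the clock wraps past midnight, while combined date/time values subtract correctly because they carry the day boundary.

```python
from datetime import datetime

# Time-of-day only, the way the ticketing code stored it: seconds
# since midnight. Subtraction goes negative once the clock wraps.
opened_tod = 23 * 3600 + 50 * 60   # 23:50
closed_tod = 0 * 3600 + 10 * 60    # 00:10 the next day
naive_delta = closed_tod - opened_tod   # negative -- wrong answer

# Combined date/time values carry the day boundary, so the
# subtraction just works: 20 minutes.
opened = datetime(2008, 3, 1, 23, 50)
closed = datetime(2008, 3, 2, 0, 10)
real_delta = (closed - opened).total_seconds()
```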

Tony Hopkinson

Let them prototype in it, then write it with something proper. By itself it's bad enough; when some MBA who once did a spreadsheet macro gets hold of it, it's a nightmare.

Justin James

I spent 5 days doing what should have been 1 day, primarily because the raw data was in Access. Access will not even output a CSV properly: it decides on a row-by-row basis whether or not to quote a field. It took me a day just to see that it was creating a garbage CSV, so I had to export to a pipe-delimited file with no quoted character fields. And yeah, a lot of their mid-level managers and business analysts love Access. Access makes me queasy. It's too "do it yourself" to be usable by the average employee, and too Office-esque "let's do this for you without you knowing that you need to do it" to work for a pro. J.Ja
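For comparison, most CSV libraries let you pin the quoting policy for the whole file instead of deciding per row. A small Python sketch of both workarounds mentioned above (quote every field, or switch to a pipe delimiter), with made-up data:

```python
import csv
import io

rows = [["id", "name"], ["1", "Smith, John"], ["2", "Jones"]]

# Quote every field on every row -- no per-row guessing.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
quoted = buf.getvalue()

# Or sidestep quoting entirely with a pipe delimiter, as in the
# workaround described above (safe here: no field contains a pipe).
buf2 = io.StringIO()
csv.writer(buf2, delimiter="|", quoting=csv.QUOTE_NONE).writerows(rows)
piped = buf2.getvalue()
```

Either way, a downstream parser sees one consistent format for the entire file.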

julieferguson

As a professor who teaches both database design and programming, I am compelled to discuss theory vs. real-world applications in the classroom. Many textbooks discuss the use of primary keys and how to establish the attribute or attributes that will make up the primary key; but in the real world I often see the use of a computer-generated number as the primary key. These "autonumber" fields are really identifiers and not primary keys from the pure theoretical viewpoint of relational database design. Yet, I see their validity from a business standpoint. So, in my classroom we discuss the theory of database design and then discuss the difficulty of implementing some of those theories in the business world. Unlike Justin, when I do consulting I seem to only find the databases that are poorly designed -- there is redundant data, no foreign key constraints, etc. Yet, when I talk to management about training, they want to focus on how to use the database management software and not on design. I explain that a solid design theory should be in place first. Unfortunately, management's response is to train the users on the software and to skip the design theory.

artful

I have been in this business for 20+ years and visited this topic in theory and practice many times. I have had ongoing debates with esteemed experts such as Joe Celko, who is on the opposite end of the spectrum from me. His argument is (approximately, and in case you happen to read this, Joe, please forgive any oversimplifications) that any column that does not model a real-world attribute is a falsification of reality, and therefore ought not to be in the database, which is a model of a domain in the real world. My argument is threefold:

1. Suppose I'm selling eggs, in cartons and crates. Do I really want to model the particular location of an egg? An egg is pretty much the same as another egg. I don't even care which chicken emitted it. If you are manufacturing automobiles, then you could perhaps use the VIN# as the PK, but what about the screws that hold the interior door panel to the door? Do you really want to individuate them?

2. Intelligent keys typically comprise several columns to make them unique. Any related table therefore must carry that entire key, and its related tables must carry that entire key plus the unique item that individuates rows in the child table. This gets unwieldy very quickly.

3. The purpose of a PK, Celko and that school aside, is to uniquely identify a row. No meaning should be ascribed to its value. It has exactly one purpose: to uniquely identify a row.

Admittedly, using autonumbers or identity columns or GUIDs also means that you must do some other checks, so that a user cannot add Afghanistan to the Countries table twice, but at the end of the day I would prefer that all tables containing a Country column use the ISO list of countries and their numbers.

Arthur
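Arthur's point 3, plus the "other checks" caveat, fits in a few lines. A sketch using SQLite and a hypothetical `countries` table: the surrogate key exists only to identify the row, while a UNIQUE constraint on the natural attribute supplies the duplicate check so Afghanistan cannot be added twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate key identifies the row; no meaning is ascribed to its value.
# The UNIQUE constraint on the business attribute is the "other check".
conn.execute("CREATE TABLE countries ("
             "id INTEGER PRIMARY KEY, "      # surrogate key
             "name TEXT NOT NULL UNIQUE)")   # natural-attribute check
conn.execute("INSERT INTO countries (name) VALUES ('Afghanistan')")

try:
    conn.execute("INSERT INTO countries (name) VALUES ('Afghanistan')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

count = conn.execute("SELECT COUNT(*) FROM countries").fetchone()[0]
```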

Tony Hopkinson

I like GUIDs, particularly for moving data from one database to another; autonumbers are a damn nightmare for that. Not to mention only a total nipplehead would expose one, which I've seen done on several occasions. The other big advantage is that when you suddenly discover the unique business key is no longer, or never really was, unique, you don't have to redesign everything. Primary key on a GUID, index on a business key is my first choice, after twenty years of experience as well.
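The merge advantage can be sketched with plain dictionaries standing in for tables (the row data is made up): autonumber keys assigned independently by two databases collide on merge, while rows keyed by GUIDs can simply be copied across without rekeying.

```python
import uuid

# Two databases that each assigned autonumber keys independently:
auto_a = {1: "widget", 2: "gadget"}
auto_b = {1: "sprocket"}            # key 1 collides with db A on merge
collisions = auto_a.keys() & auto_b.keys()

# The same rows keyed by GUIDs at creation time: merging is a plain
# copy, with no rekeying and (practically) no chance of collision.
guid_a = {uuid.uuid4(): v for v in auto_a.values()}
guid_b = {uuid.uuid4(): v for v in auto_b.values()}
merged = {**guid_a, **guid_b}
```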

Tony Hopkinson

One classic is where you do have a primary key, but not yet: Social Security number, for instance. It fits all the requirements, but you can't identify a person by it until you've got it. I do not like autonumbers at all; the number of times you see people passing them to downstream systems where they mean absolutely nothing... I've become a big fan of GUIDs though, for scalability and the facility to move data between systems without having to rekey everything. However, there has to be another key for 'external' use. A lot of database layout ah features are usually down to design conflicts. Like organise your data for high speed collection and then use it to drive an MIS Reporting system. Those two usually conflict badly.

Justin James

"A lot of database layout ah features are usually down to design conflicts. Like organise your data for high speed collection and then use it to drive an MIS Reporting system. Those two usually conflict badly." I am so familiar with this one... the totally relational, transactional database which people are trying to report off of. Even worse is when people are reporting at the same time as people doing input, so it jams the whole thing up... people gotta learn to have a reporting database that is totally flat, optimized for lengthy, sequential reads, and replicate the transactional database to it periodically. J.Ja
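A toy version of that replication step, assuming a hypothetical `customers`/`orders` schema in SQLite: join the normalized live tables once, then dump the result into a flat table that the reporting side can scan sequentially without touching the transactional database.

```python
import sqlite3

# The live, normalized, transactional database (schema is made up).
live = sqlite3.connect(":memory:")
live.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     total REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 0.5), (12, 2, 20.0);
""")

# Periodic replication: join once, write a flat reporting table.
report = sqlite3.connect(":memory:")
report.execute("CREATE TABLE sales_flat (customer TEXT, "
               "order_total REAL)")
rows = live.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id").fetchall()
report.executemany("INSERT INTO sales_flat VALUES (?, ?)", rows)

# Reports now run lengthy sequential reads against the flat copy.
acme_total = report.execute(
    "SELECT SUM(order_total) FROM sales_flat "
    "WHERE customer = 'Acme'").fetchone()[0]
```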

Tony Hopkinson

Not only that, it's easy to do. A wee trick I've used in manufacturing is to use triggers to bash together a daily and shift-based estimate, just for a current trend. Then the reporting system back-populates the week-to-dates in the live one. Got a load of anal-retentive managers off my back who were unhappy at the two percent difference in the figures over a week.
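A rough sketch of the trigger trick, in SQLite syntax with made-up table names: each insert into the live production table bumps a running per-shift total, so a current trend is always available without re-summing the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE production (id INTEGER PRIMARY KEY, shift TEXT,
                         units INTEGER);
CREATE TABLE shift_totals (shift TEXT PRIMARY KEY,
                           units INTEGER DEFAULT 0);
INSERT INTO shift_totals VALUES ('day', 0), ('night', 0);

-- Bash together a running per-shift estimate as detail rows arrive.
CREATE TRIGGER bump_total AFTER INSERT ON production
BEGIN
    UPDATE shift_totals SET units = units + NEW.units
    WHERE shift = NEW.shift;
END;

INSERT INTO production (shift, units)
VALUES ('day', 40), ('day', 25), ('night', 30);
""")

day_total = conn.execute(
    "SELECT units FROM shift_totals WHERE shift = 'day'").fetchone()[0]
```

The totals table is only an estimate for trend-watching; the reporting system can later back-populate the authoritative figures.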

techrepublic

I've got to go with Julie here. A tool is useless if you don't know how to use it properly. However, if you use a tool to build the wrong thing or to build the thing wrong, then that's worse. One can be an expert hammer user and still build a bad house. And, yes, the DBAs *are* right. Good theory is only the first step. For usable, practical software of any kind, incorrect and inflexible design are the culprits. There is no such thing as "special cases", just design deficiencies. Sometimes the only recourse to solving these types of problems is going back to step one and revisiting the requirements. This almost always involves a great deal of time and money. It would have been cheaper in the long run to have done the job right in the first place. I'm sure I'm preaching to the choir here. Cheers!
