I have worked for several clients on DB2 legacy systems - every one, without exception looks like a conversion from a VSAM shop. The result that the databases MIGHT be 3NF. I used to bring up normalization to clean up the databases, and to improve performance, but I would see nothing but glazed over eyes.
99 percent of the people that see the title of this article will stop reading when they realize the subject of this article.
And my point is? Perhaps, you should answer that question for yourself.
Good luck, really, I mean that.
Discussion on:
View:
Show:
Perhaps if people understood what the results of normalization are in simple terms "saving tons of space" and most importantly "having a more reliable speedier data" base, they would subscribe. Let?s face it; communicating these ideas is not an easy task. One would have to understand the fundamental concepts of a database in order to make sense of the concept of normalization. Good luck..:)
The thing is that with more normalization, more joins are required to satisfy queries. So, if you decide that you want to fully normalize to 3rd NF, then you are saving some space, but are giving away tons in terms of performance. Sure, there will be some data redundancy, but space is cheap these days, and your customers want fast results.
Tim,
Admittedly, many joins cause a performance hit. But you seem to have ignored my point about multi-valued dependencies. Either you escalate the NF number or you place a burden on the front end to cover this lack of integrity.
I have a list of approximately 55K cities in North America in a database normalized as I recommended (i.e. State is in the Cities table, etc.). The front end (which I also wrote) creates a dropdown list, initially unpopulated, which waits for 3 keystrokes before populating itself. So you type in "Spr" and it populates itself, already drastically reducing the choices, and enabling you to distinguish very quickly between the various Springfields.
With all due respect, I think that you have erected a straw man.
Arthur
Admittedly, many joins cause a performance hit. But you seem to have ignored my point about multi-valued dependencies. Either you escalate the NF number or you place a burden on the front end to cover this lack of integrity.
I have a list of approximately 55K cities in North America in a database normalized as I recommended (i.e. State is in the Cities table, etc.). The front end (which I also wrote) creates a dropdown list, initially unpopulated, which waits for 3 keystrokes before populating itself. So you type in "Spr" and it populates itself, already drastically reducing the choices, and enabling you to distinguish very quickly between the various Springfields.
With all due respect, I think that you have erected a straw man.
Arthur
and less often a DBA. I value enforcing as much data integrity as I can.
Unfortunately many DBAs can't or don't program and many programmers couldn't design a decent table with four sticks and a flat bit.
I've seen very complex database designs where the DBA's actually got scared of changing it and were unmoved by the application developers complaints when the database no longer met the requirements.
I've spent much time enforcing rules in code to have a bug or another function leave me looking like a complete arse.
Normalisation for integrity is always a good starting point, for space rarely.
Sacrificing using the data efficiently in favour of storing it efficiently is a damn bad idea.
The worst designs come from simply DBA's and programmers not communicating.
When I'm wearing both hats I talk to myself often. I've never gone past 3NF intentionally, my course stopped there
. I have reversed 3NF intentionally when it was the most practical overall solution.
Unfortunately many DBAs can't or don't program and many programmers couldn't design a decent table with four sticks and a flat bit.
I've seen very complex database designs where the DBA's actually got scared of changing it and were unmoved by the application developers complaints when the database no longer met the requirements.
I've spent much time enforcing rules in code to have a bug or another function leave me looking like a complete arse.
Normalisation for integrity is always a good starting point, for space rarely.
Sacrificing using the data efficiently in favour of storing it efficiently is a damn bad idea.
The worst designs come from simply DBA's and programmers not communicating.
When I'm wearing both hats I talk to myself often. I've never gone past 3NF intentionally, my course stopped there
If one is to interrogate the OLTP database, then I might -- just might -- be willing to acknowledge your point. But when I speak of interrogation, I'm thinking OLAP not OLTP.
In an OLAP database, it makes abundant good sense to de-normalize, as Kimball and his company have proved many times over.
So the real issue here is, when does the data move from OLTP to OLAP. My preference is to do it once a day, but that requires a sign-off from Those On High and the attendant acknowledgment that the data included on reports will be a day old.
20 joins in a single query can be painful, no question about it. But in an OLTP app, what's the problem? It is not meant to be the foundry of answers to questions deemed important by Those On High. That's what OLAP is for.
And finally, the OLAP database should live on a different server than the OLTP database. At today's prices, even Mom 'n' Pop convenience stores can afford this technology (not that they would need it, but you catch my drift).
A.
In an OLAP database, it makes abundant good sense to de-normalize, as Kimball and his company have proved many times over.
So the real issue here is, when does the data move from OLTP to OLAP. My preference is to do it once a day, but that requires a sign-off from Those On High and the attendant acknowledgment that the data included on reports will be a day old.
20 joins in a single query can be painful, no question about it. But in an OLTP app, what's the problem? It is not meant to be the foundry of answers to questions deemed important by Those On High. That's what OLAP is for.
And finally, the OLAP database should live on a different server than the OLTP database. At today's prices, even Mom 'n' Pop convenience stores can afford this technology (not that they would need it, but you catch my drift).
A.
Your example table population script doesn't look like it will work with your example table structure. The population script refers to a column called "NumberOfLanguages" in the table called "Certifications", but that column is actually in the table called "CertificationLanguages".
The data population is missing inserts for CertificationLanguages and ProgrammerCertifications altogether.
Without checking my SQL, I will assume that you are correct, in which case, OOPS. It makes no sense in the CertificationLanguages table; it belongs in the Certifications table.
I apologize for the inconvenience, and after I've verified the correctness of your assertion, I will post corrected code.
Arthur
I apologize for the inconvenience, and after I've verified the correctness of your assertion, I will post corrected code.
Arthur
I believe that the UK Dept of Health and Social Services has gone from five sexes (Male, Female, Male born Female, Female born Male, Undetermined) to seven; I'm not sure what the other two are.
Having a table for sex may seem extreme but to support flexibility all Categories should be in tables.
BTW, I'm glad to see that Sex is used to categorise people and not Gender which strictly speaking categorises nouns.
Having a table for sex may seem extreme but to support flexibility all Categories should be in tables.
BTW, I'm glad to see that Sex is used to categorise people and not Gender which strictly speaking categorises nouns.
The ISO sex codes are 0 = unknown, 1 = male, 2 = female, 9 = lawful person (organizations, corporations, etc.) Perhaps it is time to extend them, but those choice you gave are more for medical than legal use.
Joe,
What an honour to be commented by the esteemed Mr. Celko. I have all your books and have quoted and cited you more than a few times in various pieces.
I totally LOVE the idea that we need an ISO standard for Sex definitions. But I must ask, given your take on NULLs, what about that zero = unknown? And further, 9 = lawful person... it could be a legalistic thing private to particular countries, but in Canada these are known not as legal persons but as fictional persons.
Thanks for writing, Joe. I feel honoured and privileged.
Arthur
What an honour to be commented by the esteemed Mr. Celko. I have all your books and have quoted and cited you more than a few times in various pieces.
I totally LOVE the idea that we need an ISO standard for Sex definitions. But I must ask, given your take on NULLs, what about that zero = unknown? And further, 9 = lawful person... it could be a legalistic thing private to particular countries, but in Canada these are known not as legal persons but as fictional persons.
Thanks for writing, Joe. I feel honoured and privileged.
Arthur
In the old days, a blank (unpunched) card column was read as a zero by FORTRAN. The COBOL people also had a convention of using all 9's to make "Miscellaneous" values sort to the end of the file.
And every country has a different legal terms for things GbmH in Germany, SA in Italy, etc. But the idea is the same.
I like NULLs, but only when they have a single meaning in context. The zero can be counted, so it is fine for something with an absolute scale. I do not pepper the world with NULLs needlessly.
And every country has a different legal terms for things GbmH in Germany, SA in Italy, etc. But the idea is the same.
I like NULLs, but only when they have a single meaning in context. The zero can be counted, so it is fine for something with an absolute scale. I do not pepper the world with NULLs needlessly.
We are in violent agreement on this, Joe... although I must point out that NULLs can be counted as easily as zeroes.
As for peppering the world with NULLs needlessly, I am a big fan of pointless complexity.
Arthur
As for peppering the world with NULLs needlessly, I am a big fan of pointless complexity.
Arthur
A fellow traveller! Thank you for noticing. I despise the use of the word Gender in contexts where Sex is the correct word. In Latin, as I recall through the dim fog of decades passed since I studied it last, there are six genders, but only two sexes.
Regarding your other point, perhaps you are right and there are justifications for a Sexes table. Already I can imagine a few additions:
female -> male -> changed her/his mind
and vice-versa. This could turn out to be an article on recursive sex changes.
Arthur
Regarding your other point, perhaps you are right and there are justifications for a Sexes table. Already I can imagine a few additions:
female -> male -> changed her/his mind
and vice-versa. This could turn out to be an article on recursive sex changes.
Arthur
I did a piece on transition constraints for www.dbazine.com which is a better way to view this than recursion. You have a table with (prior_state, current_state) for the allowed transitions and 2-column FOREIGN KEY in the other table. In the case of the Canadian problem, we can use a 2-digit string, so 00= unknown, 01 = male, 02=female, 12 = male to female, 21= female to male. Of course, things like 91 = corporation to male are disallowed.
Is it not possible that 'address' is not the logically most pertinent property of a typical record, but location? Location could have a table, and address could be a property of location and further qualified by different postal formats. Additional properties of location could be various coordinate systems and so forth. Location could also be an area, a volume, a list, a route (be time bound) etc.; and any calculated point or set of points from these of course...
In other words, should the database designer try to achieve semantic idealism within the domain (which should help future proof locations if your entity suddenly becomes mobile, like a programmer with a laptop), and achieve higher abstraction that way?
In other words, should the database designer try to achieve semantic idealism within the domain (which should help future proof locations if your entity suddenly becomes mobile, like a programmer with a laptop), and achieve higher abstraction that way?
You make an interesting point about how the address is different from the location. I've been working with GIS geodatabases, in which each record contains the x, y, and z coordinates of the spatial location. Since these records typically represent physical assets with a life history (joined by id to a separate work order database), it has become necessary to include datetimestamps in the geodatabase as well. I imagine eventually geodatabases will incorporate x, y, z, and t for the space-time location of each record.
That was one of my big complaints in a few ESRI GIS geoDB courses--they never discussed normalizing the data, which was very frustrating.
Normalization should be used to the highest extent feasible for the particular application. Sticking with the address sample, there are some systems where it's not reasonable to include every city in every state. Normalization tends to scare people because the scope usually isn't well defined.
Normalization should be used to the highest extent feasible for the particular application. Sticking with the address sample, there are some systems where it's not reasonable to include every city in every state. Normalization tends to scare people because the scope usually isn't well defined.
If we want to normalize tables in GIS, then the SDE feature classes would consist of only the shape, length, and area fields, with perhaps foreign id keys for associated asset inventory tables.
All other attributes should be maintained by the SQL Server or Oracle DBMS in separate tables, with spatial views created in SDE (ESRI's spatial database engine) to tie the attributes to the feature classes.
Unfortunately, our SDE feature classes at the city are not normalized, but have all attribute data stored in every record, making record updates and queries inefficient and subject to a buggy SDE.
All other attributes should be maintained by the SQL Server or Oracle DBMS in separate tables, with spatial views created in SDE (ESRI's spatial database engine) to tie the attributes to the feature classes.
Unfortunately, our SDE feature classes at the city are not normalized, but have all attribute data stored in every record, making record updates and queries inefficient and subject to a buggy SDE.
How much of the bugs do you think are a legacy issue, inherited from the previous electronic tracking method?
Clearly, not everything can be normalized, but I considered it critical to identify everything that SHOULD be normalized and then pare that list down to what COULD be normalized without impairing the GeoDB. That meant that all my data (I was bringing years worth of data on streams and lakes that were electronic into a geoDB that created the 'map' of all the fatures) had to be as clean as possible, and it was a huge fight between me and the users--they didn't understand how it would be used or beneficial to normalize. Finally I just said I'm going to do this, test it and tell me if you can tell what changed or if you can't find the data. They couldn't, so I didn't worry after that.
Clearly, not everything can be normalized, but I considered it critical to identify everything that SHOULD be normalized and then pare that list down to what COULD be normalized without impairing the GeoDB. That meant that all my data (I was bringing years worth of data on streams and lakes that were electronic into a geoDB that created the 'map' of all the fatures) had to be as clean as possible, and it was a huge fight between me and the users--they didn't understand how it would be used or beneficial to normalize. Finally I just said I'm going to do this, test it and tell me if you can tell what changed or if you can't find the data. They couldn't, so I didn't worry after that.
As an old friend of mine once said, "The reason they call it real estate is because it's REAL."
Your distinction between Location and Address, while valid, seems not quite relevant to the point I was attempting to make. I do recognize the validity of your distinction, to be sure.
A little more on the issue of Addresses. This is a notoriously difficult problem due to local lexicons, etc. This place has cantons, that place has counties, the other has provinces, the still other has states -- not to mention the whole problem of rural routes (mailboxes a mile from the house and 50 miles from the nearest town). I don't pretend to have solutions to all these variations. Many brighter bulbs than I have thrown megawatts at the problem and as far as I know, there is no universal solution.
I once did an app for a (now) prominent politician and his party in Toronto. This loops back to the notion of "real estate". We decided that in this case, reality means "buildings that are physically standing now, or under construction and with designated street names and numbers". The party itself had all the addresses (from voting polls). The boundaries between ridings (that's what we call them in Canada, I think they are districts in USA but I'm not sure) are elastic, subject to the whim of the ruling party, which can typically be gauranteed to adjust the boundaries in its favour. Whatever the adjustment, however, it ultimately reduces to: buildings A, B and C on street D move from Riding X to Riding Y, and so on.
That political app was made more interesting because I had to deliver lists that distinguished the odd from even street numbers by street. (This was because the guys who bang signs into lawns work only one side of the street.)
The point being, in such a relatively narrow scope, it made sense to have all the street names stored in a lookup table. In many larger-scope applications, this would be absurd. In many smaller-scope apps, storing all the street names for the city is equally absurd.
I can't tell you or any other reader what granularity your application deserves. My point was to raise questions, not provide answers.
Apparently, judging by the responses thus far, I managed to do that. So I'm pleased.
Arthur
Your distinction between Location and Address, while valid, seems not quite relevant to the point I was attempting to make. I do recognize the validity of your distinction, to be sure.
A little more on the issue of Addresses. This is a notoriously difficult problem due to local lexicons, etc. This place has cantons, that place has counties, the other has provinces, the still other has states -- not to mention the whole problem of rural routes (mailboxes a mile from the house and 50 miles from the nearest town). I don't pretend to have solutions to all these variations. Many brighter bulbs than I have thrown megawatts at the problem and as far as I know, there is no universal solution.
I once did an app for a (now) prominent politician and his party in Toronto. This loops back to the notion of "real estate". We decided that in this case, reality means "buildings that are physically standing now, or under construction and with designated street names and numbers". The party itself had all the addresses (from voting polls). The boundaries between ridings (that's what we call them in Canada, I think they are districts in USA but I'm not sure) are elastic, subject to the whim of the ruling party, which can typically be gauranteed to adjust the boundaries in its favour. Whatever the adjustment, however, it ultimately reduces to: buildings A, B and C on street D move from Riding X to Riding Y, and so on.
That political app was made more interesting because I had to deliver lists that distinguished the odd from even street numbers by street. (This was because the guys who bang signs into lawns work only one side of the street.)
The point being, in such a relatively narrow scope, it made sense to have all the street names stored in a lookup table. In many larger-scope applications, this would be absurd. In many smaller-scope apps, storing all the street names for the city is equally absurd.
I can't tell you or any other reader what granularity your application deserves. My point was to raise questions, not provide answers.
Apparently, judging by the responses thus far, I managed to do that. So I'm pleased.
Arthur
These examples and supplementary explanations are very helpful. I guess that my next question would be: are you talking primarily about designing databases to support a specific application, or about databases that function as generic data stores within an enterprise/organization/sector? What if, after an application was developed for a purpose like canvassing, you then wanted to integrate this with another database from a mapping or mailshot or mileage application? Would a higher degree of normalization in both databases help, hinder or make no difference?
I'm glad that you find the examples and explanations useful.
Without saying so (my perennial problem in writing -- lack of specificity), I meant what I wrote originally to refer to OLTP systems. OLAP systems, on the other hand, benefit enormously from de-normalization.
Since you are familiar with the term "canvassing", I can only assume that you've walked various political-database paths in your past. A fellow-traveller, so to speak.
I know that everyone says that normalization impacts upon performance, but I think that it in a well-designed system it does so a lot less than people think. Mind you, a 20-join query is bound to hurt, no matter how well designed. For that reason, I might simplify (i.e. de-normalize) some of the lookups, but most definitely NEVER the core transaction tables.
My general rule of thumb is, de-normalization belongs in the static tables while normalization belongs in the dynamic tables. It's only a rule of thumb, and I will break said rule from time to time. Break it, perhaps; discard it, NEVER. The risks are simply not worth the potential reward, in my opinion (i.e. the multi-valued problem: Albany, Ontario, Ireland = nonsense).
Arthur
Without saying so (my perennial problem in writing -- lack of specificity), I meant what I wrote originally to refer to OLTP systems. OLAP systems, on the other hand, benefit enormously from de-normalization.
Since you are familiar with the term "canvassing", I can only assume that you've walked various political-database paths in your past. A fellow-traveller, so to speak.
I know that everyone says that normalization impacts upon performance, but I think that it in a well-designed system it does so a lot less than people think. Mind you, a 20-join query is bound to hurt, no matter how well designed. For that reason, I might simplify (i.e. de-normalize) some of the lookups, but most definitely NEVER the core transaction tables.
My general rule of thumb is, de-normalization belongs in the static tables while normalization belongs in the dynamic tables. It's only a rule of thumb, and I will break said rule from time to time. Break it, perhaps; discard it, NEVER. The risks are simply not worth the potential reward, in my opinion (i.e. the multi-valued problem: Albany, Ontario, Ireland = nonsense).
Arthur
I disagree. I think that if you are going to denormalize, you will want to do it in your transaction tables, because that is where the bulk of your queries are going to come from, not lookup or static ones.
Well, Tim, I agree that we disagree. To my mind, transaction tables contain the last month's data; everything else is in the OLAP database not the OLTP database. Granulate to suit, but I see utterly no point in cluttering the transaction database with history beyond a given event horizon.
Arthur
Arthur
Generally data change transactions are understood and can be managed and designed for.
It's output and analysis where the impact of joining hits you.
That's why I favour driving reporting databases from transactional ones. Deriving denormalised data from quality data is easy, the other way round a class one nightmare.
I think he meant static as in yesterday's production or call time figures, not a country code table.
Neither is static as such, just not as dynamic.
It's output and analysis where the impact of joining hits you.
That's why I favour driving reporting databases from transactional ones. Deriving denormalised data from quality data is easy, the other way round a class one nightmare.
I think he meant static as in yesterday's production or call time figures, not a country code table.
Neither is static as such, just not as dynamic.
Increased normalization creates the need for increased joins in SQL statements which degrades application performance. (I heard that there is a limit to the number of joins supported but can't recall the number.)
Accounting software databases are especially "de-normalized" because of the huge number of tables they commonly hold, and therefore the high requirement for joins to do effective work.
Accounting software databases are especially "de-normalized" because of the huge number of tables they commonly hold, and therefore the high requirement for joins to do effective work.
I agree with Jeffm. Producing tables to 3NF was a requirement in the 70s and 80s when the cost of storage was very high. Now-a-days storage cost is low and many companies also look to 'de-normalize' tables to improve efficiency and return datasets on queries in milliseconds as a response to customer queries. The design of the database might look good on paper but if a customer has to wait 4 or 5 seconds before you can provide them with confirmation of an order or sufficient credit limit on their account before processing their order they may just go elsewhere.
I agree that speed of order confirmation is of paramount import in an on-line order application, but at best you are arguing only for de-normalized temp tables, which are then resolved into normalized tables.
Otherwise you have two choices:
1. Take the risks of multi-valued dependencies that I described in the article (i.e. Chicago, Ontario, Serbia)
2. Move all the logic to prevent the above into the front end.
Where speed is paramount, I would go with choice 1. In other words, the order entry table could be de-normalized, but after that, I would immediately normalize the contents of said table. Even here, going back to the case of City, State and Country, I would de-normalize the lookup table at most, so I still had the three Springfields mentioned, but ultimately all I would store is the CityID.
I have built a large number of fairly substantial databases (initial footprint 500GB, expected growth 500GB per year), and they work quickly. Perhaps this changes the discussion to appropriate use of indexes, vertical partitioning, hash indexing, and so on.
But I am not afraid of large quantities of data. I am afraid, however, of bad retrieval and storage strategies. Most of all, I am afraid of bad designs implemented on flimsy grounds.
Arthur
Otherwise you have two choices:
1. Take the risks of multi-valued dependencies that I described in the article (i.e. Chicago, Ontario, Serbia)
2. Move all the logic to prevent the above into the front end.
Where speed is paramount, I would go with choice 1. In other words, the order entry table could be de-normalized, but after that, I would immediately normalize the contents of said table. Even here, going back to the case of City, State and Country, I would de-normalize the lookup table at most, so I still had the three Springfields mentioned, but ultimately all I would store is the CityID.
I have built a large number of fairly substantial databases (initial footprint 500GB, expected growth 500GB per year), and they work quickly. Perhaps this changes the discussion to appropriate use of indexes, vertical partitioning, hash indexing, and so on.
But I am not afraid of large quantities of data. I am afraid, however, of bad retrieval and storage strategies. Most of all, I am afraid of bad designs implemented on flimsy grounds.
Arthur
requires you to cope with conditioning the data.
That may not break the system, but resolving any incompaitbility issues in teh temporary data is another cost.
All design is compromise and when all is said and done databases were designed to store data to use, not to produce an ER diagram our teachers would have give 10 out of 10 to.
That may not break the system, but resolving any incompaitbility issues in teh temporary data is another cost.
All design is compromise and when all is said and done databases were designed to store data to use, not to produce an ER diagram our teachers would have give 10 out of 10 to.
I agree.
Occasionally I have used temp tables to assist with the speed issue, but overall I dislike this approach, and I prefer to stick with the late Dr. Codd. Normalize, normalize, normalize. It can become absurd, as in the case of the Sexes table (which is better handled by a check constraint rather than a lookup, IMO). But I have worked with some fairly substantial databases and the speed of lookup has not been an issue -- given smart indexes etc.
Arthur
Occasionally I have used temp tables to assist with the speed issue, but overall I dislike this approach, and I prefer to stick with the late Dr. Codd. Normalize, normalize, normalize. It can become absurd, as in the case of the Sexes table (which is better handled by a check constraint rather than a lookup, IMO). But I have worked with some fairly substantial databases and the speed of lookup has not been an issue -- given smart indexes etc.
Arthur
big databases, is inertia. As the systems gets more settled, as more and more rules are encapsulated in the storage layer, abstraction to the application degrades.
Reworking the database for new application requirements has more and more impact and gets scarier and scarier.
Database designs just like application ones need constant re-factoring as the environment changes.
All depends on your definition of 'broke' and 'fix'
Reworking the database for new application requirements has more and more impact and gets scarier and scarier.
Database designs just like application ones need constant re-factoring as the environment changes.
All depends on your definition of 'broke' and 'fix'
The reason for normal forms is data integrity, so speed does not matter in one sense. Hey, if it does not have to be right, the answer is 42! Doug Adams rules!
The reason is simple; most queries are based on the relationships in the real data. You ask about the Orders and their details, but not about the Orders and the employee bowling league scores. RDBMS was meant to do joins, so do not fear them!
Going to 5NF is usually not that hard and it makes the **average** query much faster than a de-normalized schema. Therefore the system as a whole runs much better. It is damn hard for "Tony the Cowboy" to think beyond his own part of the system. These guys brag about how fast their stuff runs because they violated tiered architecture, denormalized, used proprietary code, etc. But they never tell you they screwed up everything else and what a bitch it is to maintain their code.
Mother Celko's rule for 5NF is "look for a 3-legged stool" -- a natural n-way relationship where (n >2) such as buyer-seller-lender = Mortgage.
The reason is simple; most queries are based on the relationships in the real data. You ask about the Orders and their details, but not about the Orders and the employee bowling league scores. RDBMS was meant to do joins, so do not fear them!
Going to 5NF is usually not that hard and it makes the **average** query much faster than a de-normalized schema. Therefore the system as a whole runs much better. It is damn hard for "Tony the Cowboy" to think beyond his own part of the system. These guys brag about how fast their stuff runs because they violated tiered architecture, denormalized, used proprietary code, etc. But they never tell you they screwed up everything else and what a bitch it is to maintain their code.
Mother Celko's rule for 5NF is "look for a 3-legged stool" -- a natural n-way relationship where (n >2) such as buyer-seller-lender = Mortgage.
If you are Joe Celko the author-extraordinaire, I truly respect your vast experience and professional achievements. I can?t sit at that table but I am a professional programmer that had 8 years of experience at a company that produces construction accounting software, which is hugely complicated data-wise. My experiences there and since prompted my original post.
Back in the mid-1990's that company decided to do a complete redesign of it's product and the design team (I wasn't on it) decided to reduce the normalization of the data structure due to concerns about two things. First was that the number of joins allowed by the ANSI specs at the time was not sufficient for the queries we needed to construct. The second was that the performance of the product would be degraded by a highly normalized data structure that created hugely complex queries. Since then my work has exposed me to another accounting company's data structures which are similarly de-normalized.
You can press that normalizing to 5NF ?is usually not that hard? and that it ?makes the **average** query much faster than a de-normalized schema?. I can ask back what is an **average** query and is there a limit in more complex structures that makes the non-average (ie, complex) query slower in a maximally-normalized schema.
What?s the limit?
Back in the mid-1990's that company decided to do a complete redesign of it's product and the design team (I wasn't on it) decided to reduce the normalization of the data structure due to concerns about two things. First was that the number of joins allowed by the ANSI specs at the time was not sufficient for the queries we needed to construct. The second was that the performance of the product would be degraded by a highly normalized data structure that created hugely complex queries. Since then my work has exposed me to another accounting company's data structures which are similarly de-normalized.
You can press that normalizing to 5NF ?is usually not that hard? and that it ?makes the **average** query much faster than a de-normalized schema?. I can ask back what is an **average** query and is there a limit in more complex structures that makes the non-average (ie, complex) query slower in a maximally-normalized schema.
What?s the limit?
Your situation sounds pretty common. Often designers try to normalize to a specific form (typically 3rd), then later realize that it is taking too many joins to retrieve data, and performance starts to wane. At that point you start to do some selective denormalization. While normalization is a great idea in theory, strictly adhering to it is generally not all that great of an idea. My approach is to normalize somewhere close to 3rd normal form, look at some of the main queries on the system, and denormalize as necessary to produce faster results. Often, I will gladly trade space for performance gains.
>> First was that the number of joins allowed by the ANSI specs at the time was not sufficient for the queries we needed to construct. The second was that the performance of the product would be degraded by a highly normalized data structure that created hugely complex queries.
I have found the performance issue to be mostly false, A denormalized table often has to be normalized before it can be used. A common example is a table to mimic a timesheet. There is a column for a (sign_in/sign_out) flag and another for the time. In order to get the durations, you have to match the times for each event and it is a bear of a query. The correct normalization was a sign-in time column and a nullable sign-out time column with a proper constraint on the pair.
The real problem is that people do not know how to design data and how to design an RDBMS.
I have found the performance issue to be mostly false, A denormalized table often has to be normalized before it can be used. A common example is a table to mimic a timesheet. There is a column for a (sign_in/sign_out) flag and another for the time. In order to get the durations, you have to match the times for each event and it is a bear of a query. The correct normalization was a sign-in time column and a nullable sign-out time column with a proper constraint on the pair.
The real problem is that people do not know how to design data and how to design an RDBMS.
If the query that provides the content to drive a web page defined by the application requirements takes five seconds, then something has to change.
That's either optimisation, denormalisation or redesign based on how the data is used.
Given the last option in most cases requires a considerable amount of re-work at the application level, it does not get chosen very often.
If you design a database soley on the data, you have a generic solution, there will always be some situations in a complex system where the cost of that abstraction is unacceptable.
I agree with the last sentence, with the proviso that the RDBMS is not an isolated standalone system.
Yippeee Ky eh.
That's either optimisation, denormalisation or redesign based on how the data is used.
Given the last option in most cases requires a considerable amount of re-work at the application level, it does not get chosen very often.
If you design a database soley on the data, you have a generic solution, there will always be some situations in a complex system where the cost of that abstraction is unacceptable.
I agree with the last sentence, with the proviso that the RDBMS is not an isolated standalone system.
Yippeee Ky eh.
As Fred Brooks phrased it so exquisitely, "Any program worth writing is worth writing twice -- once to understand the problem and once to solve it."
We have all written bad database designs. That comes with the territory we have chosen. But that does NOT mean that the late Dr. Codd got his marbles in the wrong slots.
I have read Codd's stuff very carefully, including all 333 rules in "Version Two". I do have one beef with his rules, which amounts only to adding one more rule. My rule of thumb is, "When in doubt, go with Codd." This rule has never failed me.
Arthur
We have all written bad database designs. That comes with the territory we have chosen. But that does NOT mean that the late Dr. Codd got his marbles in the wrong slots.
I have read Codd's stuff very carefully, including all 333 rules in "Version Two". I do have one beef with his rules, which amounts only to adding one more rule. My rule of thumb is, "When in doubt, go with Codd." This rule has never failed me.
Arthur
Who say the design was 'wrong'.
Is it practical ?
Is it implementable ?
Is it still right ?
Is it affordable ?
Is it maintainable ?
Was it ever right ?
Were the requirements you designed it from right ?
Were they interpreted correctly?
.....
Has someone important changed their mind?
No developer is happy with their first code, probably not their second or third.
Introduce an enhancement or an add on, there are 'broken windows' all over the place.
That's whether you develop databases or applications or both.
The only evidence of omniscience in our industry was the guy who said.
If you work real hard on getting it right, it won't go wrong as quickly,and if you make this clear to the bean counters they will sack your arse for not being able to quantify exactly how much money you wll save then in the future now, based on past results.
Is it practical ?
Is it implementable ?
Is it still right ?
Is it affordable ?
Is it maintainable ?
Was it ever right ?
Were the requirements you designed it from right ?
Were they interpreted correctly?
.....
Has someone important changed their mind?
No developer is happy with their first code, probably not their second or third.
Introduce an enhancement or an add on, there are 'broken windows' all over the place.
That's whether you develop databases or applications or both.
The only evidence of omniscience in our industry was the guy who said.
If you work real hard on getting it right, it won't go wrong as quickly,and if you make this clear to the bean counters they will sack your arse for not being able to quantify exactly how much money you wll save then in the future now, based on past results.
"The real problem is that people do not know how to design data and how to design an RDBMS."
I couldn't have said it better myself, Joe.
Arthur
I couldn't have said it better myself, Joe.
Arthur
In practice, AFAIK, the limit is about 20 joins. But that statement ignores several factors. Since I'm pretty much wedded to MS SQL (though I confess to an ongoing affair with MySQL), I will confine my comments to my virtual wife.
Table-UDFs are a godsend. They enable you to create a virtual table that combines columns from N tables, and to treat the result set as if it were a table. You can even join a table UDF to other tables or other table UDFs. (c.f. my TechRepublic article on same.)
Without table-UDFs, I believe that the limit is 20 joins, but I have never had to go anywhere near the limit. In part, it's because I use the atomic/molecular approach (atomic means one table's columns of interest, expressed as a view or UDF; molecular means built from said atoms). C.f. my TechRepublic article on same.
Arthur
Table-UDFs are a godsend. They enable you to create a virtual table that combines columns from N tables, and to treat the result set as if it were a table. You can even join a table UDF to other tables or other table UDFs. (c.f. my TechRepublic article on same.)
Without table-UDFs, I believe that the limit is 20 joins, but I have never had to go anywhere near the limit. In part, it's because I use the atomic/molecular approach (atomic means one table's columns of interest, expressed as a view or UDF; molecular means built from said atoms). C.f. my TechRepublic article on same.
Arthur
Right on, Joe! That's the point I keep trying (apparently in vain) to make.
I'm reminded of a friend of mine who always answers internet forms with "yes" -- and it makes sense, although the data is less than useful:
Name: yes
Address: yes
City: yes
Sex: occasionally
More seriously, I'm with you 100% on this, Joe. I have worked with DBs comprising several terabytes and performance has not been a problem. Admittedly, I use tricks like hash indexes but so what? I end up with RI and performance too.
Arthur
Arthur
I'm reminded of a friend of mine who always answers internet forms with "yes" -- and it makes sense, although the data is less than useful:
Name: yes
Address: yes
City: yes
Sex: occasionally
More seriously, I'm with you 100% on this, Joe. I have worked with DBs comprising several terabytes and performance has not been a problem. Admittedly, I use tricks like hash indexes but so what? I end up with RI and performance too.
Arthur
Arthur
"Sex: occasionally" would be wonderful at my age! It has been so long I cannot even remember who gets tied up first.
People forget that indexing, hashing, pointer chains, et al are not part of the SQL language. They are implementation devices that provide a physical locator. The physical locator can be hidden (Dr. Codd's definition of a surrogate key) or exposed (a clear violation of 1NF, since it is not an attribute of the entity or relationship modeled by the table).
People forget that indexing, hashing, pointer chains, et al are not part of the SQL language. They are implementation devices that provide a physical locator. The physical locator can be hidden (Dr. Codd's definition of a surrogate key) or exposed (a clear violation of 1NF, since it is not an attribute of the entity or relationship modeled by the table).
- Keyboard Shortcuts:
- Prev
- Next
- Toggle









































