Normalization is one of the commandments of IT database pros, whether you are a data modeler, DBA, or SQL developer. Arthur Fuller defines each normal form and explores why most database designers don't go beyond the Third normal form.
Author's note: Thanks to the astute comments from TechRepublic members slawrence and SirPssd, I realized that the column NumberOfLanguages was in the wrong table and that I did not include the scripts to populate the associative tables. All four code listings have been updated with the correct code. I apologize for any inconvenience this may have caused.
Normalization is one of the commandments of IT database pros, whether you are a data modeler, DBA, or SQL developer. We learn its principles early, as well as something about normal forms.
But take a look at most production databases and at best you will find that the Third normal form (3NF) has been implemented. Very few databases reflect higher normal forms, such as Boyce-Codd normal form (BCNF), the Fourth normal form (4NF), and the Fifth normal form (5NF). So, why don't most database designers go beyond 3NF?
The normal forms
In order to answer this question, it's important to understand the differences between 3NF, BCNF, 4NF, and 5NF. Here are concise definitions of each normal form.
Each table must have a primary key, i.e., a minimal set of attributes that can uniquely identify a record. Eliminate repeating groups (categories of data that would seem to be required a different number of times on different records) by defining keyed and non-keyed attributes appropriately. Atomicity: Each attribute must contain a single value, not a set of values.
Second normal form (2NF)
The database must meet all the requirements of the 1NF. In addition, if a table has a composite key, all attributes must be related to the whole key. And, data that is redundantly duplicated across multiple rows of a table is moved out to a separate table.
Third normal form (3NF)
Data stored in a table must be dependent only on the primary key and not on any other field in the table. The database must meet all the requirements of the 2NF. Any field that is dependent not only on the primary key but also on another field is moved out to a separate table.
Fourth normal form (4NF)
There must be no non-trivial multi-valued dependencies of attribute sets on something other than a superset of a candidate key. A table is said to be in 4NF if and only if it is in the BCNF and multi-valued dependencies are functional dependencies. The 4NF removes unwanted data structures: multi-valued dependencies.
There must be no non-trivial join dependencies that do not follow from the key constraints. A table is said to be in the 5NF if and only if it is in 4NF and every join dependency in it is implied by the candidate keys.
Single-valued and multi-valued dependencies
I also want to make sure you are crystal-clear on the two types of dependency: single-valued and multi-valued.
An example of a single-valued dependency is an employee who works for just one department in the organization. The employee might conceivably be transferred between departments, but cannot work for two departments simultaneously.
You can find a common example of a multi-valued dependency in almost every database that concerns addresses. Typically, one sees in the table Programmers the columns City, State, and Country. These might be text or better, numeric integers that do lookups into the next higher level. City looks in the table Cities, State in the table States, and Country into the table Countries. The problem with this arrangement is that it exposes the risk of nonsense addresses, such as Chicago, New York, Canada. This is because the dependencies are multi-valued. A fully normalized version would move the State column into the Cities table and the Country column into the States table, leaving only the City column in the Programmers table. We could then create a query that joins the three tables and present this query in the front-end, so the user can select the appropriate city, seeing Springfield, IL, Springfield, MA, and Springfield, OR.
Let's look at a trickier example of the multi-valued dependency problem. A given programmer may be proficient in several languages and also possess certifications. Each certification requires proficiency in one or more languages, and each language may be relevant to one or more certifications. As the programmer develops proficiency in a new language, she may become eligible for one or more new certifications. What we would like to do is determine which programmers are eligible for which certifications.
Listings A and B create the tables and populate them with a few sample rows. Listings C and D are scripts that query the tables, finding those programmers who are eligible for certifications. The only difference between the two queries is their order—Listing C orders the results by certification, while Listing D orders the results by programmer.
I encourage you to run Listings A and B, and play around a bit before looking at the two queries. See if you can figure out how to produce the list of programmers eligible for the certifications. After looking at my solutions, see if you have a better way; if so, post it in the article discussion or send it to me at firstname.lastname@example.org.
How far should you go with normalization?
To be sure, each progressive step may impact upon overall performance. I have seen normalization taken to absurd lengths. In a project I worked on recently, there was even a table for the sexes, as if this list were likely to change anytime soon!
Where you draw the line in the sand is ultimately up to you, but you will be better equipped to draw it with a sound understanding of the various normal forms and the risks of not going far enough.
Miss a tip?
Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller's column.