Author’s note: Thanks to the astute comments from TechRepublic members slawrence and SirPssd, I realized that the column NumberOfLanguages was in the wrong table and that I did not include the scripts to populate the associative tables. All four code listings have been updated with the correct code. I apologize for any inconvenience this may have caused.
Normalization is one of the commandments of IT database
pros, whether you are a data modeler, DBA, or SQL developer. We learn its
principles early, as well as something about normal forms.
But take a look at most production databases and at best you
will find that the Third normal form (3NF) has been implemented. Very few databases
reflect higher normal forms, such as Boyce-Codd
normal form (BCNF), the Fourth normal form (4NF), and the Fifth normal form (5NF). So, why don’t most database
designers go beyond 3NF?
The normal forms
In order to answer this question, it’s important to
understand the differences between 3NF, BCNF, 4NF, and 5NF. Here are concise
definitions of each normal form.
Each table must have a primary key, i.e., a minimal set of attributes that
can uniquely identify a record.
Eliminate repeating groups (categories of data that would seem to be required a
different number of times on different records) by defining keyed and non-keyed
attributes appropriately. Atomicity: Each attribute must contain a single
value, not a set of values.
Second normal form (2NF)
The database must meet all the requirements of the 1NF. In addition, if a table
has a composite key, all attributes must be related to the whole key. And, data
that is redundantly duplicated across multiple rows of a table is moved out to
a separate table.
Third normal form (3NF)
Data stored in a table must be dependent only on the primary key and not on any
other field in the table. The database must meet all the requirements of the
2NF. Any field that is dependent not only on the primary key but also on
another field is moved out to a separate table.
normal form (BCNF)
There must be no non-trivial functional dependencies of attributes on something
other than a superset of a candidate key (called a superkey).
Fourth normal form (4NF)
There must be no non-trivial multi-valued dependencies of attribute sets
on something other than a superset of a candidate key. A table is said to be in
4NF if and only if it is in the BCNF
and multi-valued dependencies are functional dependencies. The 4NF removes
unwanted data structures: multi-valued dependencies.
There must be no non-trivial join dependencies that do not follow from the key
constraints. A table is said to be in the 5NF if and only if it is in 4NF and
every join dependency in it is implied by the candidate keys.
Single-valued and multi-valued dependencies
I also want to make sure you are crystal-clear on the two
types of dependency: single-valued and multi-valued.
An example of a single-valued dependency is an employee who
works for just one department in the organization. The employee might
conceivably be transferred between departments, but cannot work for two
You can find a common example of a multi-valued dependency
in almost every database that concerns addresses. Typically, one sees in the
table Programmers the columns City, State, and Country. These might be text or
better, numeric integers that do lookups into the next higher level. City looks
in the table Cities, State in the table States, and
Country into the table Countries. The problem with this arrangement is that it
exposes the risk of nonsense addresses, such as Chicago, New York, Canada. This
is because the dependencies are multi-valued. A fully normalized version would
move the State column into the Cities table and the Country column into the
States table, leaving only the City column in the Programmers table. We could
then create a query that joins the three tables and present this query in the
front-end, so the user can select the appropriate city, seeing Springfield, IL,
Springfield, MA, and Springfield, OR.
Let’s look at a trickier example of the multi-valued
dependency problem. A given programmer may be proficient in several languages and
also possess certifications. Each certification requires proficiency in one or
more languages, and each language may be relevant to one or more
certifications. As the programmer develops proficiency in a new language, she
may become eligible for one or more new certifications. What we would like to
do is determine which programmers are eligible for which certifications.
Listings A and
B create the tables and populate them
with a few sample rows.
and D are scripts that query the
tables, finding those programmers who are eligible for certifications. The only
difference between the two queries is their order—Listing C orders the results
by certification, while Listing D orders the results by programmer.
I encourage you to run Listings A and B, and play around a
bit before looking at the two queries. See if you can figure out how to produce
the list of programmers eligible for the certifications. After looking at my
solutions, see if you have a better way; if so, post it in the article
discussion or send it to me at email@example.com.
How far should you go with normalization?
To be sure, each
progressive step may impact upon overall performance. I have seen normalization
taken to absurd lengths. In a project I worked on recently, there was even a
table for the sexes, as if this list were likely to change anytime soon!
Where you draw the line in the sand is ultimately up to you,
but you will be better equipped to draw it with a sound understanding of the
various normal forms and the risks of not going far enough.
Miss a tip?
Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller’s column.