Data Management

Database normalization starts with the keys

Database normalization protects data integrity by removing ambiguous data relationships. Eric Roland explains the roles that primary, foreign, and composite keys play in normalization.

Database normalization is a fundamental premise in database design because it removes ambiguous relationships between data, helping protect data integrity. Before diving into the first three of the five normal forms that constitute normalization, I need to explain the role of keys in databases.

During the design phase of database schema development, architects delineate entities (customers, invoices, products, etc.) and how these entities relate to one another. Delineation provides the basis for all Relational Database Management Systems (RDBMS) design. Generally speaking, each entity should have its own unique identifier, which is known as the primary key. In Figure A, the emp_id field in the employee table is the primary key. Foreign keys establish relationships between entities. The foreign key contains primary key values for another table. In Figure A, the employee entity has a job_id that relates to the jobs table.

Primary and foreign key relationships are the foundation of referential integrity, namely that nonnull foreign keys must exist as an actual primary key value in the referring table.

Figure A
The employee table

In Figure A, note that many employees can share the same job type. Cardinality is the term used to express this type of relationship. Cardinality describes the numeric relationships between occurrences of entities in the primary/foreign key table. A one-to-one relationship exists when exactly one instance in the primary key table can be related to precisely one row in the foreign key table. The last type of cardinality is a many-to-many relationship. In our sample tables, the titleauthor table in Figure B models the fact that one title can have many authors, and an author may have many titles.

Figure B
The titleauthor table

Figure C shows a SQL statement that returns the ids of all titles with multiple authors.

Figure C
SQL statement

Candidate and composite primary keys
Entities may need more than one field to make them unique, so we create what is called a composite primary key. For example, in Figure B, the titleauthor table’s primary key consists of both an au_id and title_id.

I’ve added a Social Security Number (ssn) field to the employee table to demonstrate that a table can have other fields that act as primary keys, but for design reasons I chose not to make it a primary key. In Figure B, the ssn can be encrypted and so, not easily utilized by the application. In this case, ssn is what is known as a candidate key, which is any key or combination of keys that can uniquely act to identify a record. While a table can have only one primary key, it can have many candidate keys.

Is this normal?
Normalization is the process of reviewing the relationships among entities so that potential problems don’t arise during development, testing, and production. At the heart of this statement is the notion that not all relationships are equal; thus, we are making a normative or value-based judgment on which relationships are most appropriate for modeling our system. Database normalization emerged as an industry standard because database architects quickly realized the shortcomings of working with nonnormalized databases.

Currently, there are five normal forms, plus the Boyce-Codd normal form. A database schema moves to the Boyce-Codd normal form by removing repeating groups and redundant data, eliminating partial dependencies, ensuring that all nonkey attributes are fully dependent on the primary key, and making certain that all determinates are candidate keys.

The forms progress from least to most restrictive. Typical business applications don’t enforce past the third form because it would take a great deal of time and money to implement. In addition, it would require too many joins to pull back application data, which would decrease performance.

The first three normal forms can be summarized with the following phrase, “The key, the whole key, and nothing but the key.” I’ll look at each in turn.

The first normal form: The key
In order to make a table meet the criteria of the first normal form, the database designer must ensure that every row is unique, that every cell in a column utilizes the same data type, and that each cell only holds one value. Typically, the designer needs to focus on removing any repeating groups to move a database schema into the first normal form.

To illustrate repeating groups, I’ve added building information to the employee table in Figure D. You can see that the employee’s building number, building name, and building location would be repeated for each new employee entered. If we wish to alter a building’s name, we have to update every employee record containing that building information, and other anomalies could occur as well. Consolidating this repeated information into a new table and establishing a foreign key relationship will yield a better design for the application.

Figure D
Repeating groups

Real-world example
I recently discussed an integration bridge between an application I’m building and an application a customer will be migrating to. I was told that the client intends to create primary keys for the sales force by concatenating the salesperson’s name with his or her territory_id, so that you’d get something like Figure E.

Figure E
Primary keys

As you can see, if a salesperson is assigned to more than one territory, we need to duplicate all of the person’s data, with the exception of a new key. This is the signature of a design that doesn’t meet the first normal form because we have redundant data.

The second normal form: The whole key
For the second normal form, you must ensure that all nonkey columns are dependent on the whole key. This is targeted at composite key tables and dictates that all nonkey columns must be dependent on the entire key. If you refer to Figure B, you can see that the titleauthor table has a composite primary key. In this table, the designers chose to store the royalty that the author earned on that title. This meets the criteria of the royalty being dependent on the “whole key” as it requires both the author information and the title information to make a valid relationship.

The third normal form: Nothing but the key
The third normal form is concerned with removing what are known as transitive dependencies, which occur when nonkey columns are actually dependent on other nonkey columns. I’ve altered the employee table to contain a job_name column (Figure F). With this addition, the employee table is clearly not in the third normal form because job_name is dependent on the job_id. Also, Figure E has an example of a transitive dependency that I see in many database designs, namely that the city/state/zip code data should reside in another table.

Figure F

In this article, I‘ve reviewed the key concepts relating to database normalization. In the next article, I’ll discuss the advantages and disadvantages of this design methodology and provide detailed examples.

Keeping it normal
What do you think of the first three forms of database normalization? Drop us an e-mail or post a comment below.


Editor's Picks

Free Newsletters, In your Inbox