
Relational databases: Applying the First Normal Form

Just knowing the normal forms isn't enough; you need to apply them to your database design. Find out how to apply First Normal Form as our database design series rolls on.

The origins of relational database theory can be traced to Dr. E. F. Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks." That paper and the work that followed it produced a set of abstract rules, known as normal forms, for constructing a well-designed database. The first four of these rules are First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), and Boyce-Codd Normal Form (BCNF); in most cases, they provide adequate guidance.

The normal forms are so abstract that some developers have trouble figuring out how to apply them. Perhaps the best way to understand the normal forms is to start applying them to data, since rules tend to make more sense when you actually have data to divide. In this article, I’ll work through applying the 1NF rule, which is initially the most complicated to apply, to a sample book catalog database.

All about relational database theory
This is the third article in a weekly Builder.com series on relational database design theory. The previous articles are:
"Relational databases: The inspiration behind the theory"
"Relational databases: Using normal forms to create databases"


As you may recall, the 1NF requirements are:
  • Multivalued fields must be removed to another table.
  • Each field must be atomic, or as small as possible.
  • Each table must have a key.
  • Repeating values must be removed to another table.

The simple table I’ll be working with is meant to store information about a collection of books. So far, the Books table has the following fields:
 
{Title, Author, ISBN, Price, Publisher, Category}
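To make the starting point concrete, here's the flat table expressed as a SQL CREATE TABLE statement; the column types are my own assumptions, since the article doesn't prescribe any:

CREATE TABLE Books (
    Title     VARCHAR(200),
    Author    VARCHAR(200),   -- trouble ahead: may need to hold several names
    ISBN      VARCHAR(20),    -- stored as text, since ISBNs contain hyphens and check characters
    Price     DECIMAL(8,2),
    Publisher VARCHAR(100),   -- repeats for every book from the same publisher
    Category  VARCHAR(100)    -- trouble ahead: a book can fit several categories
);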

Remove multivalued fields to another table
The first step in applying 1NF is to make sure the table contains no multivalued fields, which by definition can store more than one value in a single record. Our initial list has two possible violations: Author and Category. Many books have more than one author, so the Author field presents a problem. Likewise, a book could match more than one category; Treasure Island, for instance, could be categorized as children's, adventure, classic, and so on.

The only way to correct this problem is to move the fields that violate the rule to another table. You may find that a field works better in another existing table, but that's rarely the case; most of the time, you'll create a new table for each field you remove. Meeting the first 1NF requirement leaves me with two new tables, one each for the Author and Category fields, and the following fields remaining in the Books table:
 
{Title, ISBN, Price, Publisher}

Each field must be atomic
The next 1NF requirement states that each field must be atomic, meaning each field must store the smallest data element possible. This rule facilitates searching and sorting. Again, the Author field presents a problem, because a name can be divided into multiple pieces of information. We need a field for the first name and a field for the last name, which will make searching for an author by name much easier.

At this point, I'll divide the Authors table's single name field into at least two fields, FirstName and LastName, and my database layout is as follows:
 
Books: {Title, ISBN, Price, Publisher}
Authors: {FirstName, LastName}
Categories: {Category}
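As a quick illustration of why atomic name fields matter for searching, compare these two approaches; the author name is invented:

-- With atomic fields, finding an author is a simple, indexable equality test:
SELECT FirstName, LastName
FROM Authors
WHERE LastName = 'Stevenson'
ORDER BY LastName, FirstName;

-- With a single Name field, you'd be reduced to pattern matching, such as
-- WHERE Name LIKE '%Stevenson%', which is slower and prone to false matches.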

Each table must have a key
It's hard to search a table that has duplicate records; in fact, the relational model forbids a table from containing them. So the values in at least one column, or combination of columns, must be unique across the table. That column or combination is a key; a key built from more than one column is called a composite key.

There are a number of different key types:
  • Super key: A column or set of columns that uniquely identifies a record within a table.
  • Candidate key: A super key that contains only the minimum number of columns necessary to determine uniqueness.
  • Primary key: The candidate key used to uniquely identify records within the table.
  • Alternate key: A candidate key that isn't chosen to be the primary key.
  • Foreign key: A column or set of columns within a table that matches a candidate key (usually the primary key) in the same or another table. Foreign keys let you relate a record in one table to data in another table.

The types of keys in this list aren't mutually exclusive; a key can fit into more than one category at a time. By definition, every table must have at least one key: the primary key.
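To see how these key types look in practice, here's a generic SQL sketch; the tables and columns are hypothetical, not the final design for our catalog:

CREATE TABLE Publisher (
    PublisherID INTEGER PRIMARY KEY,      -- primary key: the chosen candidate key
    Name        VARCHAR(100) UNIQUE       -- alternate key: also unique, but not chosen
);

CREATE TABLE Book (
    ISBN        VARCHAR(20) PRIMARY KEY,  -- primary key; also a candidate key and super key
    PublisherID INTEGER REFERENCES Publisher (PublisherID)
                                          -- foreign key: relates each book to a publisher
);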

To determine the primary keys for our sample tables, let’s start by finding the super keys. The super keys for the Authors and Categories tables are easy to find since these have so few fields, but Books is a little more difficult. Although there's little chance of having two books with the same name, it isn't impossible, so I can’t rely on Title to be a key for the Books table. It's less likely that I'll encounter two books with the same name from the same publisher or with the same ISBN, so I can create a few super keys from those possibilities. Table A identifies super keys for all three tables:
Table A

Books          Authors               Categories
Title          FirstName, LastName   Category
Title, ISBN
ISBN

Super keys for the sample database

The Books table actually has a few more super keys, such as the combination of Title, ISBN, and Publisher (any column set that contains a key is itself a super key), but listing them all would be overkill.

Finding the candidate keys
Now it's time to narrow down the lists of super keys to find the candidate keys, which contain only the minimum number of columns necessary for uniqueness, for each table. The Categories table is no problem at this point, because there's only one field. The Authors table has only one super key as well, so it's the obvious choice for a candidate key.

The Books table, however, is a little trickier. At first glance, the ISBN field appears to be the best candidate key: ISBNs should be unique, but since publishers assign these numbers, it's possible to find two different books sharing the same ISBN. The truth is, you could probably use the ISBN values by themselves and never run into a problem, but it takes only one glitch to render your database useless. For that reason, I've decided to use the composite super key Title, ISBN as the Books table's candidate key.

Determine the primary key
The primary key is simply the candidate key you end up using to uniquely identify each record in the table. After all this work, it's easy to assign a primary key to each table. Now I have the following table definitions for the sample database (asterisks indicate primary key fields):
 
Books: {*Title, *ISBN, Price, Publisher}
Authors: {*FirstName, *LastName}
Categories: {*Category, Description}
 

Notice that Categories contains a new field, Description. A single-field table would be acceptable, but including a field for the description text will help explain each category more fully.
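Expressed in SQL, with the asterisked fields declared as primary keys (the column types remain my assumptions), the design so far looks like this:

CREATE TABLE Books (
    Title     VARCHAR(200) NOT NULL,
    ISBN      VARCHAR(20)  NOT NULL,
    Price     DECIMAL(8,2),
    Publisher VARCHAR(100),
    PRIMARY KEY (Title, ISBN)           -- the composite candidate key chosen above
);

CREATE TABLE Authors (
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    PRIMARY KEY (FirstName, LastName)
);

CREATE TABLE Categories (
    Category    VARCHAR(100) PRIMARY KEY,
    Description VARCHAR(255)            -- the new explanatory field
);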

Counter fields as primary keys
Most RDBMSs offer a type of counter or autonumbering data type that automatically assigns a consecutive numeric value to each record. Although these counter fields will guarantee that you have a candidate key, the key will be meaningless for searching. If you use a counter field as your primary key, you’ll still need at least one other field to find records in a meaningful way. For more on this topic, see "The great primary-key debate."
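The syntax for counter fields varies by product: SQL Server uses IDENTITY, MySQL uses AUTO_INCREMENT, and Access calls the type AutoNumber. A SQL Server-flavored sketch shows both the convenience and the limitation:

CREATE TABLE AuthorsWithCounter (
    AuthorID  INT IDENTITY(1,1) PRIMARY KEY,  -- guarantees uniqueness...
    FirstName VARCHAR(50),
    LastName  VARCHAR(50)                     -- ...but you still search on fields like these
);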

What about the Authors table? The chance that you'll duplicate both the first and last name in the Authors table might seem remote, especially if your library is small, but it certainly isn't impossible. You could assign a counter field to the table, which would resolve the uniqueness issue, but that wouldn't help you differentiate between authors.

The simplest solution here seems to be adding some kind of contact information, such as an e-mail address or the author's state or region of residence, but even that isn't specific enough: you could end up with two authors with the same first and last name living in the same state. It's unlikely but not impossible, and it's best to err on the side of caution. You could add each author's entire mailing address, but for the sake of keeping the example simple, I'll add just the state and ZIP code and extend the primary key to include both new fields:
 
Authors: {*FirstName, *LastName, *State, *ZIP}
 

For now, this is the only practical way to guarantee uniqueness in the Authors table.
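In SQL, the extended composite primary key might look like this (the types, as before, are assumptions):

CREATE TABLE Authors (
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    State     CHAR(2)     NOT NULL,   -- two-letter state abbreviation
    ZIP       VARCHAR(10) NOT NULL,   -- text, to preserve leading zeros and ZIP+4 hyphens
    PRIMARY KEY (FirstName, LastName, State, ZIP)
);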

Remove repeating values
The last requirement we have to meet for 1NF is that there be no repeating groups in our data. It’s pretty difficult to discern whether you meet this requirement when you don’t have real data to review, but we’ve come this far, so we should try anyway. Later on, if I see a value repeated numerous times, I can consider moving that field to a new table. If the tables in my database are otherwise properly normalized, a bit of remodeling won't be a big deal. In fact, many database applications seem to stay in a constant state of growth.

The most obvious violator of this last rule is the Publisher field in the Books table. Although there are a lot of publishers, I will no doubt find myself entering the same publishers again and again. For Books to meet 1NF, I’ll have to move that field to another table. Now the sample database looks like this:
 
Books: {*Title, *ISBN, Price}
Authors: {*FirstName, *LastName, *State, *ZIP}
Categories: {*Category, Description}
Publishers: {*Publisher}
 

You probably won’t run into duplicate Publisher names, but to be safe, you could decide to use an autonumber field as the primary key. If so, your table design might resemble the following:
 
Publishers: {*PublisherID, Name}
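A sketch of that surrogate-key version, again using SQL Server-flavored autonumber syntax; how Books will relate to Publishers is left for the upcoming installments, so it appears here only as a comment:

CREATE TABLE Publishers (
    PublisherID INT IDENTITY(1,1) PRIMARY KEY,  -- autonumber surrogate key
    Name        VARCHAR(100) NOT NULL           -- publisher name, now stored once
);
-- Books would eventually carry a PublisherID foreign key to relate the two tables.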
 

Finally, the tables in the sample database all seem to meet 1NF: each field is as small as possible, there are no repeating groups or multivalued fields, and each table has a key. You might be wondering what's left. After all, I started out with one table, and applying just one form has increased the table count to four; we still have three more forms to apply. Don't be surprised if the above structure changes as I work through applying 2NF, 3NF, and BCNF.

We'll probably revisit 1NF after applying 2NF and 3NF, because problems sometimes slip through even the best designs. In fact, I've intentionally left one problem in the sample database. See if you can figure out where it is while waiting for the next installment in this series, which will apply 2NF and 3NF to the sample database.

About

Susan Sales Harkins is an IT consultant, specializing in desktop solutions. Previously, she was editor in chief for The Cobb Group, the world's largest publisher of technical journals.

