10 tips for choosing between a surrogate and natural primary key

The superiority of surrogate versus natural keys is a hotly debated topic among database developers. Susan Harkins offers some insights into when -- and why -- you might choose one over the other.

According to relational database theory, a properly normalized table must have a primary key. But database developers disagree about whether surrogate or natural primary keys are best. Data comprises a natural key. A surrogate key is a meaningless value, usually generated by the system. Some developers use both, allowing the application and data to guide their decision, while others use one or the other religiously.

The following tips lean heavily toward supporting surrogate keys (as I do), but I recommend that you not make a religion out of choosing a key type. Be practical, be reasonable, be realistic, and use the key you can apply with the most confidence. Just remember that you have to live with your choice, and others will have to maintain it down the road.

Note: This article is also available as a PDF download.

1: A primary key value must be unique

A primary key uniquely identifies each record within a table and relates records to additional data stored in other tables. A natural key might require several fields to accomplish a unique identity for each record. A surrogate key is unique in and of itself.

2: The primary key should be as compact as possible

In this case, compact refers to the number of fields required to uniquely identify each record. Relying on data can require several fields. Developers who strongly support the superiority of natural keys insist that working with a multiple-field primary key is no harder than working with a single-field primary key. It certainly can be just as easy, but it can also make a grown man cry.

A primary key should be compact and contain the fewest possible fields. A natural key might require many fields. A surrogate key requires only one field.

3: There may be a natural single-field candidate

Sometimes, the data presents a single-field primary key candidate. In-house codes, part numbers, and ISO standardized items are just a few examples. When this is the case, adding a surrogate key does seem superfluous, but be thoughtful when making your final decision. Even though the data seems stable now, it only seems that way. Data and rules change (see #4).

4: Primary key values should be stable

A primary key must be stable. You're not supposed to change a primary key value. Unfortunately, for that rule, data isn't stable. In addition, natural data is subject to business rules and other influences you can't control. Developers know that and accept the condition.

A surrogate key is a meaningless value with no relationship to the data whatsoever, so there's no reason to ever change it. In fact, if you have to change a surrogate key value, you've done something wrong.

5: You must know the primary key value to create the record

Primary key values can't be Null. In other words, you must know the primary key value to create a record. Should you create a record before knowing the primary key value? Theoretically, no; realistically, sometimes you must.

The system creates surrogate key values when you create the new record, so the primary key value exists as soon as the record does.

6: No duplicate records are allowed

A normalized table can't contain duplicate records. Mechanically it can, but to comply with relational theory, it can't. A primary key can't contain duplicate values either, and a unique index prevents duplicates. These two rules complement one another and are often an argument for a natural key. Natural key advocates argue that a surrogate key allows duplicates. If you want to use a surrogate primary key, apply an index to the appropriate fields to prevent duplicate records -- problem solved.

7: Users like to see the primary key

There's a misunderstanding about the users' need to be familiar with the primary key value. There's no reason, theoretical or otherwise, for users to see a record's primary key value. In fact, users don't even need to know such a value exists. It works behind the scenes and means nothing to the user entering and updating data, running reports, and so on. There is no need to associate the primary key value to the record itself. Once you forsake the notion that users actually need the primary key value, you can more readily consider a surrogate key.

8: Surrogates add an unnecessary field

Using a surrogate key does require an extra field, which some argue is a waste of space. After all, everything needed to uniquely identify the record and relate it to data in other tables already exists within the record. Why add an additional column of data to do what the data itself can do?

The overhead for an auto-generating value field is minimal and it requires no maintenance. Alone, this reason just isn't enough to recommend a natural key, but it is a valid point.

9: Don't systems make mistakes?

Not everyone trusts system-generated values. Systems can make mistakes and... wait a minute. I've never seen this happen, but it is possible. On the other hand, a system that's prone to this kind of corruption might end up with a problematic natural value as well. Let's not confuse issues. The best way to protect your entire database, not just your primary key values, is to back it up regularly. Natural data isn't any more dependable than a system-generated value.

10: Circumstances may seem to require a natural key

The only reason I've encountered for forcing a natural key involved records from integrated systems. In other words, sometimes applications sharing similar tables create new records independently of one another. Unless you plan for it, the two databases probably will generate the same values. A natural key, in this circumstance, would definitely eliminate any possibility of duplicate primary key values.

There are simple workarounds that will accommodate a surrogate key. You can give each system a different seed value, but even that can still get messy. GUIDs work but often with a hit to performance. A combined field of the record's system-generated field and a source code, used only when the databases are connected, is another alternative. There are others, even though a natural key might be the most reasonable choice in this situation.