After Hours

10 tips for choosing between a surrogate and natural primary key

The superiority of surrogate versus natural keys is a hotly debated topic among database developers. Susan Harkins offers some insights into when -- and why -- you might choose one over the other.

According to relational database theory, a properly normalized table must have a primary key. But database developers disagree about whether surrogate or natural primary keys are best. Data comprises a natural key. A surrogate key is a meaningless value, usually generated by the system. Some developers use both, allowing the application and data to guide their decision, while others use one or the other religiously.

The following tips lean heavily toward supporting surrogate keys (as I do), but I recommend that you not make a religion out of choosing a key type. Be practical, be reasonable, be realistic, and use the key you can apply with the most confidence. Just remember that you have to live with your choice, and others will have to maintain it down the road.

Note: This article is also available as a PDF download.

1: A primary key value must be unique

A primary key uniquely identifies each record within a table and relates records to additional data stored in other tables. A natural key might require several fields to accomplish a unique identity for each record. A surrogate key is unique in and of itself.

2: The primary key should be as compact as possible

In this case, compact refers to the number of fields required to uniquely identify each record. Relying on data can require several fields. Developers who strongly support the superiority of natural keys insist that working with a multiple-field primary key is no harder than working with a single-field primary key. It certainly can be just as easy, but it can also make a grown man cry.

A primary key should be compact and contain the fewest possible fields. A natural key might require many fields. A surrogate key requires only one field.

3: There may be a natural single-field candidate

Sometimes, the data presents a single-field primary key candidate. In-house codes, part numbers, and ISO standardized items are just a few examples. When this is the case, adding a surrogate key does seem superfluous, but be thoughtful when making your final decision. Even though the data seems stable now, it only seems that way. Data and rules change (see #4).

4: Primary key values should be stable

A primary key must be stable. You're not supposed to change a primary key value. Unfortunately, for that rule, data isn't stable. In addition, natural data is subject to business rules and other influences you can't control. Developers know that and accept the condition.

A surrogate key is a meaningless value with no relationship to the data whatsoever, so there's no reason to ever change it. In fact, if you have to change a surrogate key value, you've done something wrong.

5: You must know the primary key value to create the record

Primary key values can't be Null. In other words, you must know the primary key value to create a record. Should you create a record before knowing the primary key value? Theoretically, no; realistically, sometimes you must.

The system creates surrogate key values when you create the new record, so the primary key value exists as soon as the record does.

6: No duplicate records are allowed

A normalized table can't contain duplicate records. Mechanically it can, but to comply with relational theory, it can't. A primary key can't contain duplicate values either, and a unique index prevents duplicates. These two rules complement one another and are often an argument for a natural key. Natural key advocates argue that a surrogate key allows duplicates. If you want to use a surrogate primary key, apply an index to the appropriate fields to prevent duplicate records -- problem solved.

7: Users like to see the primary key

There's a misunderstanding about the users' need to be familiar with the primary key value. There's no reason, theoretical or otherwise, for users to see a record's primary key value. In fact, users don't even need to know such a value exists. It works behind the scenes and means nothing to the user entering and updating data, running reports, and so on. There is no need to associate the primary key value to the record itself. Once you forsake the notion that users actually need the primary key value, you can more readily consider a surrogate key.

8: Surrogates add an unnecessary field

Using a surrogate key does require an extra field, which some argue is a waste of space. After all, everything needed to uniquely identify the record and relate it to data in other tables already exists within the record. Why add an additional column of data to do what the data itself can do?

The overhead for an auto-generating value field is minimal and it requires no maintenance. Alone, this reason just isn't enough to recommend a natural key, but it is a valid point.

9: Don't systems make mistakes?

Not everyone trusts system-generated values. Systems can make mistakes and... wait a minute. I've never seen this happen, but it is possible. On the other hand, a system that's prone to this kind of corruption might end up with a problematic natural value as well. Let's not confuse issues. The best way to protect your entire database, not just your primary key values, is to back it up regularly. Natural data isn't any more dependable than a system-generated value.

10: Circumstances may seem to require a natural key

The only reason I've encountered for forcing a natural key involved records from integrated systems. In other words, sometimes applications sharing similar tables create new records independently of one another. Unless you plan for it, the two databases probably will generate the same values. A natural key, in this circumstance, would definitely eliminate any possibility of duplicate primary key values.

There are simple workarounds that will accommodate a surrogate key. You can give each system a different seed value, but even that can still get messy. GUIDs work but often with a hit to performance. A combined field of the record's system-generated field and a source code, used only when the databases are connected, is another alternative. There are others, even though a natural key might be the most reasonable choice in this situation.

About

Susan Sales Harkins is an IT consultant, specializing in desktop solutions. Previously, she was editor in chief for The Cobb Group, the world's largest publisher of technical journals.

36 comments
Alvaro C
Alvaro C

I enjoyed the article a lot! I use a mixed scheme when it comes to primary keys: I use natural ones because...well they are natural and in my opininion they reflect better the nature of the data and makes it easier to browse data when outside the system (in some utility)


I use surrogate when the number of keys are four or more, or when the are no combination that would make the row unique (don´t kill me!, it happens!)


2: The primary key should be as compact as possible: true, but what does "as compact as possible" mean?. If , as Susan states: the number of fields required to uniquely identify each record, then surrogate keys win, but if refers to storage space, they lose since two o more indexes would have to be defined, one for the surrogate key and other for the unique columns 


4: Primary key values should be stable

A surrogate key is a meaningless value with no relationship to the data whatsoever. I guess this is the base of all debates. For example, the plate on your car should not (in theory)  carry any information about it. In mexico however they carry some kind of. Also does the Tax Id (RFC) and the Unique Polulation Registry Code (CURP, http://en.wikipedia.org/wiki/Unique_Population_Registry_Code) which are derived from your name and birthday with some hash collision solving algorithm.


Even using  a single integer for a key carries information: key 12 was inserted after key 5. Not much but it has information.about the data. Perhaps a random number should be used.


Anyway, I´ll try a pure surrogate scheme on my next database design. One fellow and I have a bet about which one is easier and this page has dropped gasoline to our debate :D

emyl_79
emyl_79

despite being in favor of using surrogate keys, I think that natural key could be used as primary key in some particular use cases. As an example, Oracle Index-Organized-Tables could give huge performance benefits when doing range scans, but rows must be organized by primary key. So if range scans are done on the natural key (or the first part of it) it could be useful to have it also as primary key.

Tony Hopkinson
Tony Hopkinson

Especially from one of the OPs main reader base, access developers. Your entire post was founded on the assumption that what is perceived to be a natural key is a key... Until you rethink that, every thing I've said is going right over your head... Besides what about the order line example, surely you've seen someone do that...

stupid user name
stupid user name

Never rely on an external system to provide your unique key. NEVER.

larry
larry

I love an article that confirms my thinking, Thanks

dportas
dportas

There is too much fudging of issues and vague use of terminology in this article to correct every wrong assumption in it. Users and consumers of data require a way to identify the real world facts represented by database tuples. The identifiers they use (ie natural keys) are of prime importance to the correct use of the data and to data integrity generally. Surrogate keys, if they are used at all, serve a very different purpose and are by definition optional and dispensable as far as business requirements are concerned. So the premise of the artcile - that there is a choice to be made between either a surrogate or a natural key - is an entirely false dilemma. Natural keys are fundamental to successful database implementations that meet business requirements and achieve data integrity. You may or may not decide to create surrogates as well as natural keys but to present this as a choice between one or the other is just arbitrary and misleading.

alvin.c.steele
alvin.c.steele

The point of constraints (domain, foreign-key, unique) is to use the database itself to keep the data as clean as possible. If you have defined your entity in such a way that you can identify a Natural key candidate (as you should be able to in a 3rd normal form database), you should give serious consideration to implementing that Natural key as a unique constraint. NOT necessarily as the primary key (for the reasons cited in the article), but as an unique key. This provides 2 benefits: 1) when and if the database is updated outside of the original application (adhocs, new systems, et al.), the data quality is better preserved. 2) you are providing the database engine with useful information. Tom Kyte of Oracle says that the more information you provide the database engine, the more effectively it can process your SQL queries.

doug
doug

Doesn't the fake keys make the whole normalization idea meaningless?

Jeterdawg
Jeterdawg

I am 100% in favor of using surrogate keys, which is much validated by the article. They are cleaner, easier, and consistent. I would only add this to tip #7: yes, users don't care what the primary key is for a record, but they might care what that surrogate ID is. What if that ID is used in the web address (i.e. www.flowers.com?flowerid=5), and users like being able to just change it to go from record to record instead of relying on links all the time? I would not recommend going out of your way just to hide the surrogate key (at least the value).

don.gulledge
don.gulledge

Visual Foxpro had a unique file name (8 characters based on Date Time). An index is created so a loop that test its existence is used before it is accepted for use. You can add a prefix character to identify different record types of a system or prefix that identifies its creator making concurrent systems possible. The key marries the relationships between the parent and child tables and is not reliant upon the user entered data. Users make mistakes and a subordinate key puts separation between these mistakes and the structure of the system. Makes it a lot easier to build a complex system with complex relationships. Takes the complexity out of keying a system and the user never knows. The key is meaningless for human reading, but I can look up a key and find the human information it references so why do I need to read it or understand it. It worked great. It's easy to make a three tiered relational system using three keys. Takes up 30 characters field space at most. You can wipe that out in one text field, so where's the high cost of storage space. We're not working in the old days of cobol and packed fields to fit it all in.

Spitfire_Sysop
Spitfire_Sysop

Couldn't you use a high precision time stamp in a field called "Creation date" as your primary key? This is functional, natural and "system generated unique". Best of both worlds? It does limit you to making one new record per nano-second and perhaps that is a limitation that is unacceptable. I await a database engineer to come and shoot holes in my theory. Proceed.

dportas
dportas

On the other hand, you can choose which column(s) to define as your "PRIMARY KEY" (PK constraint I mean) and the IOT will be based on that. A UNIQUE constraint on any other keys is functionally equivalent in other ways. So the fact that Oracle IOTs are based on the PRIMARY KEY constraint does not need to influence your choice of key(s) in any important way. That is an example of why Susan Harkins' article is so pointless and misguided. Read no further than her first "tip" that "a primary key value [sic] must be unique"! Um, yes. But that's a requirement for *every* key and not just one particular key you designate to be the "primary" one. So as far as uniqueness is concerned it makes absolutely no difference which of the keys are natural and which are surrogate. Similar remarks apply to the other "tips" in the article (stability, simplicity, etc also being desirable characteristics for any keys). No-one should take the advice in this article seriously.

dportas
dportas

by adding PRIMARY KEY, NOT NULL, UNIQUE and other constraints to ensure that column(s) are unique and non-nullable in a table. There is no room for "perception" about what the keys are when they are plain to see in DDL and enforced by the DBMS. A key is what is constrained to be unique and irreducible (ie a minimal superkey). "Besides what about the order line example" What about it? You didn't say anything interesting. Are you suggesting that we shouldn't make line numbers unique within an order and that you'd rather allow duplicates? What would be the point of that? "Shuffling up" line numbers is a very easy operation to do whether the columns are key or not. And yes I have done it. I don't recall ever being asked to allow duplicate line numbers on anything. You appear to be hijacking this thread rather than discussing the article in question. I asked if there was any part of the article that you would consider was sound advice. I'd like to know because I struggle to see how anyone could take away anything useful from it.

hobbes
hobbes

... I can't tell if this is sarcastic or not.

Tony Hopkinson
Tony Hopkinson

Use a surrogate or a natural as your PRIMARY key. So with that in mind might be time to revise your post, before a potential employer gets the impression you don't know what you are talking about. Now if you'd have said using surrogates was basically a no-brainer, I could have lived with that...

Tony Hopkinson
Tony Hopkinson

If nothing else because surrogate keys are not exposed the UI will be driven by the natural key, so not having an index on it would be 'erm foolish. If it happens to be unique as well, that's a bonus... They key point between natural and surrogate keys, is in most cases the natural key is outside of the control of the software using it, whereas you are always in charge of the surrogate. One can support your business logic the other can destroy it on a whim. It's not hard choice really....

jrborgeson
jrborgeson

Normalization is based on the natural key, which may or may not be used as the primary (aka physical) key of a table. The natural (aka logical) key is what defines a logically unique record from the business point of view. As the article mentions, when the primary key is defined as a surrogate, one should put a unique index on the natural key. If this is not done, then the database is NOT normalized. You could create logically duplicate records. Using a simplified example, we define a table of stock positions held by various customers,with a logical key of customer-id and stock-id. If two records existed for Jim B and IBM, the data would not be normal. I haven't said how the physical key is defined for that table because it doesn't matter in terms of normalizing the data.

AMusnikow
AMusnikow

As far as I have been able to determine, Christopher J. Date is still alive.

Tony Hopkinson
Tony Hopkinson

If you keyed persons by personid instead of SSN. SSN still only exists once, it's just added to the data you've normalised out into the persons table.

gedwards
gedwards

In a multithreaded, multi processor machine, couldn't more than 1 thread actually insert the same timestamp? I'd rather depend on the database engine than timestamp to generate a unique key. Unique Key choices might also change depending on if you do any normalizing the database. As always, sometimes it just depends. Not every answer works well for every situation.

Tony Hopkinson
Tony Hopkinson

First of all CreationDate is this the time the thing you are storing was created, or the time the record was created? The implication of meaning then makes you ask what time zone, daylight saving. Adjusting the system clock. DateTimes are 8 byte floating points, not an ideal choice for an index. The index will either be very sparse and therefore inefficient, or if you are hammering, hashing will result in numerous collisions and so be inefficient.... BigInt identity column, better choice.... . Not sure I'd big myself up with a title of database engineer though, read chapter one would be nearer.

suchitraj
suchitraj

The clear visibillity of a primary is also important .Imagine searching a record with created on stamp as "31/03/2011 12:30:11".And in concurrent systems there is a probability of getting a duplicate record also.

Tony Hopkinson
Tony Hopkinson

Don't use natural keys as your primary key. Whether that's a key that does follow the rules stable, known and unique is irrelevant. Renumbering is an easy operation?. Yes and no. If you Ignore the fact that changing a key by purist rules means it isn't one, which is not my default option. Think about what doing it means. Update OrderLines set line_number = 3 where line_number = 4 and Order_Number = 1 Yeah that's easy. As long as there isn't already a 3 and there is a 4. So you have to lock the entire order for the length of all the renumbering actions. If as is likely in a shuffle up, you are rekeying more than one record, there is an explicit ordering of the update statements. You could get round that with Update Order_lines set line_number = line_number =1 Where Line_Number > 3 but you are still hugely dependant on the state of the affected records. I personaly wouldn't put line_number in the table at all. If it was required on the output becaseu that is the natural way users think about lines on an order, I;d do some sort of calculated field. If it was legacy system knocked up by someone unaware of the issues and I couldn't get rid of line number easily then I'd simply update to be a consecutive integer after any deletes, inserts etc were complete. None of this even considers the problems surrounding communication of changes to orders and how they impact say invoices and vice versa in a synchronous system, never mind an asynchronous one... No hijacking involved, merely some reasons why Ms Harkins advice does turn out to be sound if you look further than a mere database schema for one aspect of some system.

dportas
dportas

All the points in the article except paras 4 and 8 apply equally to ANY keys whether a table has a surrogate or not (and point 4 is wrong anyway, or at least overly prescriptive and dogmatic). The author even says "The only reason Ive encountered for forcing [sic] a natural key involved records from integrated systems"! As for implying that I don't know what I'm talking about, well if you think that a "primary" key means anything different from any other key then I would certainly question what you are talking about. A key is a key. Appending the word "primary" in front of it doesn't alter the meaning of the article and can't possibly support any of the claims that the author makes.

yogi_john
yogi_john

As I learned about six weeks ago, SSNs may not be unique. When adding a new employee to our health insurance, the company rejected the application because the SSN was in their system for someone else. I verified the SSN with the employee's card. I recognize the other person's SSN might have been wrong, but in researching the issue I did discover that mistakes can happen such that an SSN is assigned to different people.

dportas
dportas

"US's SSN, Britain's NI Number..." I can't think why anyone would want to use an SSN or NI number as a key. In every organisation I worked for employees were identified by an employee number / payroll number, which was certainly required to be unique. In HR and payroll systems I have worked on the payroll number was implemented as a key in the "employee" table. It would be extraordinary in my opinion to think of designing an HR system without a natural key for employees. You aren't really suggesting that you would are you? "Your argument ... is proven bollocks though" You have destroyed a straw man, that's all. "We seem to be talking past each other" Yes, the topic is the original article and you haven't said anything much about it or my criticisms of it.

Tony Hopkinson
Tony Hopkinson

What is a natural key? US's SSN, Britain's NI Number, a phone number, a part id, ISBN number. First of all we aren't in control of them, so even if as far as we can see, they are unique within our data domain, there is absolutely no guarantee that uniqueness will be maintained. Therefore they cannot be considered stable. In your system do you wish to constrain it to so the only valid operations on the data require the natural key. Even if that currently meets the need, the likelihood of it changing is high, therefore the natural key could for a set of valid operations, such as get the natural key, be null. So they arenlt stable and they are nullable therefore they are not valid candidates for a key on the tuple. So if you use a surrogate key to meet first normal form and then use it to enforce relations between tables. If you wish to relax the uniqueness contraint to allow nullable values you can easily. If you wish to expand the natural key to include say Country code for phone number when you go international you can easily. If you'd made the natural key the primary key, and then enforced all your relations on it then you are in a world of hurt, and that's without having any code or any downstream consumers! Natural key = database key is an initial assumption and later fallacy because it's not a key as per database theory. Could the article have put that better, perhaps. Your argument that surrogates are an unnecessary overhead because there is a natural "key" is proven bollocks though. I can come up with so many examples in my career where the foolish assumption that a natural key is unique is database terms caused hideous problems. Here's a simple one. Line item number on an order or invoice. Lost count of how many times I've seen some newbie key the OrderLines table by OrderNumber and LineNumber. Delete a line in the middle of the order, you have to shuffle them up. Don't even mention if you have item by item stock receipts, production fulfilment or invoicing or allow line by line remittance. My rule is don't use a natural key as your primary key ever. If there is a viable unique key constraint within the system, fine use it, data integrity is always good, and should that chnage all you do is drop the thing, job done, aside from basic operational changes. My approach to database applications has become (took me a while to learn this one as well) is I put as much data integrity in as I can affiord to back stop my application, not to drive it. When the applications breaks the rules, it falls over. Far too often I've seen the two opposites. You can't change the thing without being severely beaten by the bean counters, whio won't underatnd crap about why. Or all too often, no rules at all, and everthing in code, as though it was a hierachical database written by an idiot. Sod the theory, yoir rukle might be true, now, ther's no rule that says it has tio stay that way. If your application is founded on it, it will founder...

dportas
dportas

Keys are about data integrity, irrespective of how the application uses them. The reason why we enforce keys in the database is precisely so that they are independent of the application. I don't believe the author is thinking of how keys might be used by an application but if she is then she should say so. Even if one accepted your unconventional explanation it wouldn't make much sense of the article. Eg. Point 4. Stability is a desirable characterstic of ALL keys not just ones used by the application but it's not an absolute requirement and isn't always wanted. In fact one of the benefits of a surrogate is that you can change it with only minimal impact. 5 "you must know the key" but that applies equally to every key and if you wanted only a surrogate key value and nothing else then you might as well use a key generator that's independent of the table in question. 6. "apply an index" And how is that any different from what you would do with a surrogate key? etc, etc ... but as I say, there isn't enough space to refute everything that's wrong here. You said: "Natural keys do not have to be unique" but all keys are required to be unique and not nullable - that after all is the definition of a key and the reason we implement them in the database is to make sure they remain unique! Keys are an important topic that deserve much better than this article. If you really think there is any useful advice here then I'd love to know what it is.

Tony Hopkinson
Tony Hopkinson

you build you data access logic round. It's the one the code uses. It's the default one the DBMS will use unless told otherwise (implicitly or explicitly). You don't put primary in front of it for a laugh! Alternate keys as in Unique Key, no problem, but that should be the natural one not the surrogate, and that's if you can afford it seeing as it cannot be null in most implementations. Index, well that's simply an optmisation, and completely dependant on design and implementation and naff all to do with the logical schema. If you want to go purist and implementaion agnostic discounting the primary directive, then you'd have a point given you could gaurantee that the natural key would be static, or you are prepared to pay the huge potential price when that turns out to be not exactly correct.... So come again? Without the word primary, the article would be a waste of pixels. Natural keys do not have to be unique, they could initially be unknown.... PrImary and Unique Keys have to be unique, natural keysn do not.... Your entire argument on it's arse that isn't it?

jrborgeson
jrborgeson

People DBs might be one of the few cases where there is no natural key. Therefore surrogate keys must be used, but notice that this won't help you prevent multiple records for the same person. SSN doesn't work for the reasons noted. Name and Address doesn't work because the same person can have two addresses, not to mention that both are unstructured text data. In this case, human intervention is often needed to identify and merge dups.

Tony Hopkinson
Tony Hopkinson

It's a classic example of why you need surrogate keys. and in this case simply used to illustrate that surrogate keys don't violate normalisation principles. That's not even counting the fact that it might not be unique. That just means it's not a candidate for a key, no matter how you lean on the issue.

stupid user name
stupid user name

Another case. An infant under the age of 1 year is not required to have a SSN. The SSN is, however, required to file an exemption for the IRS.

Editor's Picks