Data Management

Code concepts: Database normalization and de-normalization

Developer Justin James provides an overview of the basics of normalization and de-normalization, with an emphasis on the Third Normal Form.

 

One of the central ideas in relational databases is the concept of normalization. Normalization and de-normalization play a significant role in your ability to write applications that perform well and tolerate data changes without creating inconsistencies. In this article, I will talk about the basics of normalization and de-normalization, so that you can get a handle on this topic.

Database theory defines quite a few normalization levels, or "Normal Forms"; however, when the typical developer or DBA says that a database is "normalized," they usually mean the Third Normal Form, or 3NF. That is what I will focus on in this column.

In 3NF, each table has a key column, and the only data stored in the table are things that belong directly to that key's concept; this is very similar to object-oriented theory. Table A is an example dataset that is not properly in 3NF.

Table A: Not normalized.
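As a minimal sketch of the problem (the column names here are illustrative assumptions, not the exact ones in the image), an un-normalized invoice table carries a full copy of the customer's details on every single row:

    CREATE TABLE Invoices (
        InvoiceNumber   VARCHAR(20),
        InvoiceDate     DATE,
        InvoiceTotal    DECIMAL(10, 2),
        CustomerName    VARCHAR(100),   -- customer details repeated on every invoice row
        CustomerAddress VARCHAR(200),
        CustomerPhone   VARCHAR(20)
    );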

Because Table A is not normalized, we create the following problems for ourselves:

  • If a customer record ever changes, we need to make the change in every invoice record in the table, which leaves a lot of room for error (see the sketch after this list).
  • If another table needs to refer to customers, it will either need to refer to invoices (which is confusing and does not create a canonical record for each customer) or replicate the customer data elsewhere, creating two sources of customer data.
  • Making a change to a customer record involves locking every row with that customer, which can be problematic.
  • There's no way of uniquely identifying any given customer.
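To make the first problem concrete, here is a minimal sketch, using the assumed columns from the table above, of what a simple customer change costs in the un-normalized design:

    -- The customer moves: every invoice row that mentions them must be rewritten.
    UPDATE Invoices
    SET    CustomerAddress = '42 New Street'
    WHERE  CustomerName = 'Acme Corp';   -- matches on a non-unique name

Any row where the name was entered slightly differently keeps the old address, and the data is now inconsistent.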
To normalize this table, we identify each unique record type; in this example, we have two record types (invoice and customer) within each actual record. Next, we split those records into separate tables and give each table a unique ID. This ID should be a value that is wholly independent of the record data because, even if the business logic says that the data cannot change, the business logic could change in the future. Additionally, the ID value (a "primary key") should be one that the database can search efficiently and that allows easy partitioning of data between servers if needed; as such, an auto-incrementing integer is ideal for primary keys. After normalization, we will have two tables (Table B and Table C).

Table B: Normalized invoice table.

Table C: Normalized customer table.
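As a rough sketch of what these two tables might look like in SQL (the column names and the SQLite-style auto-increment syntax are assumptions, not taken from the screenshots):

    CREATE TABLE Customers (
        CustomerID      INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, independent of the data itself
        CustomerName    VARCHAR(100),
        CustomerAddress VARCHAR(200),
        CustomerPhone   VARCHAR(20)
    );

    CREATE TABLE Invoices (
        InvoiceID     INTEGER PRIMARY KEY AUTOINCREMENT,
        InvoiceNumber VARCHAR(20),
        InvoiceDate   DATE,
        InvoiceTotal  DECIMAL(10, 2),
        CustomerID    INTEGER REFERENCES Customers (CustomerID)  -- points back to the one canonical customer row
    );

Other databases spell the auto-increment differently (SERIAL in PostgreSQL, IDENTITY in SQL Server), but the idea is the same: the key is generated by the database and carries no business meaning.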

As you can see, we now have two tables, each containing only the data relevant to its own concept. As a result, if another table needs to deal with customers, it can refer to the independent customer table, which eliminates data duplication; also, each customer only needs to be updated in one place. In addition, we can now uniquely identify customers. As a bonus, we save a ton of space in the database.

This all sounds great, right? It sure does. But as we know, there are rarely any free lunches in the world of software development. In this case, the cost of this tasty meal is performance, particularly for read and aggregate operations. Searching goes quite quickly, thanks to all of those convenient primary keys. But let's pretend that we want to make a sales report at the end of the month that looks like our pre-normalization table; we'll need to execute a long-running query (after all, we're grabbing a month's worth of data) that performs JOIN operations on all of these tables. In the process, we will need to lock all of those rows and generally jam up the database. There are good reasons why reports get run at night, and this is one of them.
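As a hedged sketch using the assumed tables above, the month-end report now has to stitch the customer data back onto the invoices at query time:

    -- Month-end sales report: re-assembles the pre-normalization shape with a JOIN.
    -- June 2010 is used as an example month.
    SELECT   c.CustomerName,
             SUM(i.InvoiceTotal) AS MonthlySales
    FROM     Invoices  i
    JOIN     Customers c ON c.CustomerID = i.CustomerID
    WHERE    i.InvoiceDate >= '2010-06-01'
      AND    i.InvoiceDate <  '2010-07-01'
    GROUP BY c.CustomerName;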

We resolve this problem with a data warehouse. (I am not going to go far into data warehouse theory, which is a deep topic.) A data warehouse is designed for pure read speed and typically looks like a de-normalized table. In fact, the way a data warehouse usually works is to periodically take a "snapshot" of the online, relational database used for the day-to-day work and create a variety of de-normalized tables, each designed to meet a particular reporting need. From there, reports are run against the data in the warehouse. This way, the long-running, resource-hungry reports can be run against tables that are optimized for those reports' needs without tying up the online database. The drawbacks are the significant expense of extra database servers and storage, and the fact that you are reporting against data that are not real-time accurate.
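One simple way to picture the "snapshot" step, again as a sketch with assumed names rather than a real warehouse design, is a nightly job that rebuilds a flat reporting table from the live tables:

    -- One-time setup of the flat reporting table in the warehouse.
    CREATE TABLE SalesReportFlat (
        InvoiceNumber   VARCHAR(20),
        InvoiceDate     DATE,
        InvoiceTotal    DECIMAL(10, 2),
        CustomerName    VARCHAR(100),
        CustomerAddress VARCHAR(200)
    );

    -- Nightly refresh: empty the table and reload it from the live, normalized tables.
    DELETE FROM SalesReportFlat;

    INSERT INTO SalesReportFlat (InvoiceNumber, InvoiceDate, InvoiceTotal,
                                 CustomerName, CustomerAddress)
    SELECT i.InvoiceNumber, i.InvoiceDate, i.InvoiceTotal,
           c.CustomerName, c.CustomerAddress
    FROM   Invoices  i
    JOIN   Customers c ON c.CustomerID = i.CustomerID;

Reports then read SalesReportFlat without touching the live tables; the trade-off is that the figures are only as fresh as the last refresh.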

I hope this helps you get on your way to understanding databases better.


J.Ja

Disclosure of Justin's industry affiliations: Justin James has a contract with Spiceworks to write product buying guides. He is also under contract to OpenAmplify, which is owned by Hapax, to write a series of blogs, tutorials, and other articles.


About

Justin James is the Lead Architect for Conigent.

13 comments
ssampier

I understand this now. In my DB class in college, the professor told us about one-to-many relationships and the rest in a standard monotone chant (of course, the accent was still there, since the professor is Polynesian). Most of us repeated the information back without understanding it. Since normalization is a complex thing, I would imagine the user should be shielded from it. Assuming your database was in Access, would you design a form for data entry?

Vladas Saulis

I'd redefine the question of normalized vs. de-normalized database design as input-oriented vs. report-oriented application design. The sole purpose of most modern applications is to make reports, possibly as fast as possible. In the real world, data input takes only a little longer when the database is de-normalized, so little that it's really unnoticeable. If your data input takes 0.1-0.2 sec longer with a de-normalized database (with some more effort in programming), does it really matter when report creation speed changes dramatically, from minutes to seconds? From my practice, after a little de-normalization of the database, report creation speed increased 60 times (real fact) with unnoticeable changes to data input speed. Most modern databases have multi-indexes and even dynamic indexing, so it's no problem to create a report-oriented app design anymore. Additionally, the use of a 4GL instead of SQL helps to make faster unwound JOINs. So, I'm for fine de-normalization as a final step of your application database design. And this should be done at an early development stage.

Justin James

Do you have any tips, advice, or suggestions for other developers when it comes to data normalization? J.Ja

Tony Hopkinson

Simple scenario: customers and orders. One way to do the UI and to normalise would be a customers table keyed by customerid, with orders keyed by orderid and having a foreign key to customers. So you'd have a form to create and maintain customers, choose a customer, then use the key in a master-detail scenario to choose or add orders for that customer. Then you get this request: to be able to either look up an existing customer with a dropdown, or enter some basic details along with the order in one go if it's a new customer. Doable, but a lot more work. If at that point you say F*** it and go for a CustomerOrders table, you risk having the same customer more than once, among other things. You can combat that with even more code, but the real killer is that a simple coding error will break your data in a possibly hard-to-find way. The biggest single advantage of normalisation in terms of coding is that imposing the rules traps all sorts of daft-arse mistakes, like having an order with no customer, before they get anywhere near customer land. It also falls quite naturally into a better design in terms of coupling and cohesion.
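A minimal sketch of the point about the rules trapping mistakes, using assumed table and column names for the customers/orders scenario described above:

    CREATE TABLE customers (
        customerid INTEGER PRIMARY KEY,
        name       VARCHAR(100)
    );

    CREATE TABLE orders (
        orderid    INTEGER PRIMARY KEY,
        customerid INTEGER NOT NULL REFERENCES customers (customerid),
        orderdate  DATE
    );

    -- With the foreign key in place, an "order with no customer" is rejected
    -- by the database itself instead of silently corrupting the data.
    INSERT INTO orders (orderid, customerid, orderdate)
    VALUES (1001, 99999, '2010-06-15');  -- fails: no customer with customerid 99999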

knudson

I was developing an in-house app for our DBA group, and another DBA was doing the DB design. It was fully over-normalized (4th normal form), to the point that I could not figure out how to get the data out. I was told, "I designed the DB; you figure out how to get the data out." But a proper level of normalization is required for data integrity. And always remember the 5th normal form, ABBYNormal: a DB that's been in place a long time and modified by those who don't have a good understanding of normalization.

Tony Hopkinson

The cost is essentially data integrity, so you can easily end up with a very fast report of total rubbish. That may not be a big issue on the sort of data volumes Access can manage, but on big systems it's a serious issue. Denormalising into a data warehouse specifically optimised for your reporting needs tends to be a better way to go at that point.

Vladas Saulis

How many schoolchildren does it take to design a database? :)

Tony Hopkinson

If you were writing it out with pen and paper: poring through a thousand cards looking for an ID number... maybe if you ordered them by ID, it would be quicker. Looking through them for all people with a first name of Justin? Perhaps an index. JJ, Mrs JJ, and son of JJ move house: change their address in one place or three... 2NF. The local town council goes PC and decides to rename your home town from Hicksville to Rusticthorpe: potential 3NF... When I learnt normalisation, up to 3NF, it was all appallingly obvious to me, but I was a recording clerk in a manual system. Denormalisation is an optimisation strategy, with some potentially very expensive trade-offs in return for performance. Like any optimisation strategy, don't start with it too early. Go at it too quickly and you could end up solving the problem another way after you've done it, but too late to undo it, so you end up with a poor, high-cost, low-quality design for no reason except you were too quick off the mark. It should be your last available choice, not the first.

Shellbot

I think most people slip into a coma when they first start looking at normalisation :) I find it interesting... but then, I'm a DBA :) As you say, denormalising too quickly can become a massive problem. I'm working with a very denormalised db at the minute and it's a nightmare; on the other side, I've also got a highly normalised one which is just painful to work with. The creator was a bit overzealous, and writing queries for it can be incredibly tedious :) Moderation can be a good thing.
