
To be or not to be normal: That is the database question

Normalization is a foundation of database design, but is it worth the performance hit? Discover the advantages and disadvantages of normalization and how to deal with performance concerns.

There are advantages and disadvantages to normalizing database schemas. I’ll provide specific examples of when and why you should normalize or denormalize your final design based on performance requirements.

Learn more about database normalization
Read "Database normalization starts with the keys."

Advantages of normalization
Provide indexing
When the developer creates a primary key constraint, SQL Server automatically creates a unique clustered index on the column(s) covered by the constraint, provided no other clustered index has already been defined for the table. In most cases, the clustered index speeds up data access and can also help update and delete performance because rows are located more quickly; indexes do add maintenance overhead, though, which is why vendors frequently remove them for benchmarking.

When you're creating primary keys, Microsoft recommends integer keys because SQL Server is optimized to sort and index them. Likewise, creating nonclustered indexes on the foreign key column(s) will enhance the performance of joins between the tables.
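As a rough T-SQL sketch of both points (the table and index names here are hypothetical, not taken from the article's figures):

-- Integer primary keys: the PRIMARY KEY constraints below cause SQL Server
-- to build a unique clustered index on each table automatically, because no
-- other clustered index exists on either table yet.
CREATE TABLE dbo.product (
    product_id   INT IDENTITY(1,1) NOT NULL,
    product_name VARCHAR(50)       NOT NULL,
    CONSTRAINT PK_product PRIMARY KEY (product_id)
);

CREATE TABLE dbo.order_item (
    order_item_id INT IDENTITY(1,1) NOT NULL,
    product_id    INT               NOT NULL,
    qty           INT               NOT NULL,
    CONSTRAINT PK_order_item PRIMARY KEY (order_item_id),
    CONSTRAINT FK_order_item_product
        FOREIGN KEY (product_id) REFERENCES dbo.product (product_id)
);

-- Foreign key columns are not indexed automatically; a nonclustered index
-- on the foreign key speeds up joins between order_item and product.
CREATE NONCLUSTERED INDEX IX_order_item_product_id
    ON dbo.order_item (product_id);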

Minimize modification anomalies
Modification anomalies arise when data is inserted, updated, or deleted, and information is lost in unexpected ways. In the database table in Figure A, rightsizing (deleting) an employee could delete information about a corporate building unless you store building information in a separate table.

Figure A
Database table


To create a new building entry, you would need to insert or update an employee record. This design also lends itself to a common problem in which you could have buildings with the same name spelled differently (e.g., Warehouse1, warehouse1, warehouse 1, etc.).
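A minimal sketch of the two designs in T-SQL (hypothetical column names, since Figure A isn't reproduced here):

-- Denormalized: building details repeat on every employee row. Deleting the
-- last employee in a building also deletes the building's information, and
-- free-text names drift (Warehouse1, warehouse 1, ...).
CREATE TABLE dbo.employee_flat (
    employee_id   INT         NOT NULL PRIMARY KEY,
    last_name     VARCHAR(50) NOT NULL,
    building_name VARCHAR(50) NOT NULL
);

-- Normalized: each building is stored once, in its own table, and employees
-- simply reference it, so no employee change can lose or misspell a building.
CREATE TABLE dbo.building (
    building_id   INT         NOT NULL PRIMARY KEY,
    building_name VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.employee (
    employee_id INT         NOT NULL PRIMARY KEY,
    last_name   VARCHAR(50) NOT NULL,
    building_id INT         NOT NULL REFERENCES dbo.building (building_id)
);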

Reduce table/row size
By removing duplicate data, we conserve disk space and increase the amount of row space available for other fields. Row size matters because the fundamental storage unit in SQL Server is the 8-KB page (in SQL Server 7.0/2000), and a row must fit on a single page; a row can therefore contain at most 8,060 bytes of data, excluding the text and image data types. Normalizing data into related tables joined on primary/foreign keys also reduces the number of null or empty values in columns and thus reduces the size of the database.

Enforce referential integrity
I’ve seen many database problems that were caused by a simple lack of referential integrity. These problems usually manifest themselves as an inability to extract important data or relate information from one table with data in another table because there is nothing on which to join the two tables. For example, if no referential integrity existed in an e-commerce application, an order could exist without an associated customer. This is clearly a problem. Some designers might put logic in the GUI or middle tier to prevent this, but it would still be possible for someone to enter this information directly into the database via SQL statements. Developers typically use both front-end logic and referential integrity in the database to eliminate these concerns.
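As a sketch of that e-commerce example (the customer and orders tables here are hypothetical), the constraint itself blocks the bad data no matter which tier the statement comes from:

CREATE TABLE dbo.customer (
    customer_id   INT          NOT NULL PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.orders (
    order_id    INT   NOT NULL PRIMARY KEY,
    customer_id INT   NOT NULL REFERENCES dbo.customer (customer_id),
    order_total MONEY NOT NULL
);

-- Even an ad hoc INSERT that bypasses the GUI and middle tier cannot create
-- an orphan order; SQL Server rejects it with a foreign key violation (error 547).
INSERT INTO dbo.orders (order_id, customer_id, order_total)
VALUES (1, 42, 19.99);   -- fails unless a customer row with customer_id = 42 exists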

The disadvantage of normalization
Joins
Database normalization comes with a price. The only real drawback of a highly normalized database structure is that you may need a large number of joins to pull back the records the application needs. The fact that the database is usually an application's performance bottleneck only aggravates the problem, and it becomes especially apparent in environments where many concurrent users access the database. To help alleviate this issue, application designers denormalize the design after the normalization process to meet the application's performance requirements, usually by adding summary and/or redundant data back into the normalized tables.

For instance, a data architect might keep a running total of employees’ accumulated earnings for the year in the employee table rather than recalculating the data through a series of joins. It might also be reasonable to duplicate attributive table information into the primary tables (e.g., copy an employee’s job description into the employee table).
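A short sketch of both ideas (hypothetical job and payroll tables; ytd_earnings and the copied job_desc column are the denormalized additions):

CREATE TABLE dbo.job (
    job_id   INT          NOT NULL PRIMARY KEY,
    job_desc VARCHAR(100) NOT NULL
);

CREATE TABLE dbo.payroll_employee (
    employee_id  INT          NOT NULL PRIMARY KEY,
    job_id       INT          NOT NULL REFERENCES dbo.job (job_id),
    job_desc     VARCHAR(100) NOT NULL,            -- redundant copy of job.job_desc
    ytd_earnings MONEY        NOT NULL DEFAULT (0) -- running total of payroll detail
);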

Another reason to carefully review the degree of normalization in a database design is the database’s intended use. Specifically, you should consider whether the database will be used primarily for reporting or transaction activity. However, with smaller databases, and for companies that are unable or unwilling to have separate databases, the system may be required to do both. In online transactional processing (OLTP) databases, normalization is vital for ensuring data integrity and increasing the speed with which records can be inserted, updated, and deleted. Conversely, in online analytical processing (OLAP) databases, read performance is most important, and therefore database designers implement a much higher degree of denormalization. This is why most companies of any size implement data warehouses or data marts on top of their production OLTP databases.

To illustrate when denormalization may be appropriate, Figure B shows the relationship between authors, their titles, and the sales of those titles.

Figure B
The relationship between authors, their titles, and sales


Suppose the application we’re creating will require frequent summing of all sales for individual authors. I’ve included a SELECT statement and the associated execution plan from Query Analyzer in Figure C and Figure D, respectively.

Figure C
SELECT statement


Figure D
Execution plan from Query Analyzer
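The figures themselves aren't reproduced here, but the kind of multi-join aggregate involved would look roughly like this, assuming the pubs sample schema (authors, titleauthor, titles, sales) suggested by Figure B; this is an illustration, not necessarily the exact statement in Figure C:

-- Sum each author's sales by joining from authors through titleauthor and
-- titles down to the sales detail rows.
SELECT   a.au_id, a.au_lname, a.au_fname,
         SUM(s.qty) AS total_sales
FROM     dbo.authors     AS a
JOIN     dbo.titleauthor AS ta ON ta.au_id    = a.au_id
JOIN     dbo.titles      AS t  ON t.title_id  = ta.title_id
JOIN     dbo.sales       AS s  ON s.title_id  = t.title_id
GROUP BY a.au_id, a.au_lname, a.au_fname;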


To avoid this join and speed up the query, we could automatically update a total_sales field in the authors table as individual titles are sold, possibly by using a trigger. The query could then be executed without any joins, making it much more efficient, as you can see in Figure E and Figure F.

Figure E
SELECT statement


Figure F
Executing the query without any joins
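A minimal sketch of that approach, again assuming the pubs-style schema; the total_sales column and the trigger name are hypothetical additions, and the new column would also need to be initialized from existing sales data:

-- Add the summary column to authors, then keep it current as sales arrive.
ALTER TABLE dbo.authors ADD total_sales INT NOT NULL DEFAULT (0);
GO

CREATE TRIGGER trg_sales_author_totals
ON dbo.sales
FOR INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Roll the newly inserted quantities up to every author of each title sold.
    UPDATE a
    SET    a.total_sales = a.total_sales + x.qty
    FROM   dbo.authors AS a
    JOIN  (SELECT ta.au_id, SUM(i.qty) AS qty
           FROM   inserted AS i
           JOIN   dbo.titleauthor AS ta ON ta.title_id = i.title_id
           GROUP BY ta.au_id) AS x
          ON x.au_id = a.au_id;
END;
GO

-- The summary query now touches a single table and needs no joins at all.
SELECT au_id, au_lname, au_fname, total_sales
FROM   dbo.authors;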


This is an example of storing summary data. At other times, you may find it advantageous to store redundant data so that it isn't necessary to join back to attribute tables to pull back required data. Another place to review normalization is when the schema has supertypes and subtypes. Subtypes are created when the rows of a supertype (the employee table, for example) do not all share the same fields, so the optional attributes are split into their own tables. For example, only some employees hold a forklift operator's license, so normalization dictates that we move the license number into a subtype table. Denormalizing this data by rolling subtypes back into their supertypes can increase application performance.
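A small sketch of both shapes (hypothetical table and column names):

-- Normalized supertype/subtype pair: only employees who actually hold a
-- forklift license get a row in the subtype table.
CREATE TABLE dbo.staff (
    staff_id  INT         NOT NULL PRIMARY KEY,
    last_name VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.staff_forklift_operator (
    staff_id       INT         NOT NULL PRIMARY KEY REFERENCES dbo.staff (staff_id),
    license_number VARCHAR(20) NOT NULL
);

-- Denormalized: roll the subtype back into the supertype and accept a NULL
-- for staff with no license, so queries never need the extra join.
CREATE TABLE dbo.staff_flat (
    staff_id                INT         NOT NULL PRIMARY KEY,
    last_name               VARCHAR(50) NOT NULL,
    forklift_license_number VARCHAR(20) NULL
);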

You must carefully and thoroughly review database normalization for each new database that is developed. The decisions you make during the design phase will have a fundamental impact on the overall cost, usability, and efficiency of your final application. You should also review the level of denormalization to fine-tune performance as the application grows and usage increases.

Not so normal
What do you think about database normalization and performance? Drop us an e-mail or post a comment below.

 