Data Management

Handling NULL values in SQL Server 2005

If you are a developer, you will almost certainly have to deal with NULL values at some point, even if the database is overly normalized. The more knowledge you have about dealing with these NULL values, the more headaches you will avoid in future projects.

In the simplest terms, a NULL value represents an unknown value. It's unknown in the sense that the value is: missing from the system, may not be applicable in the current situation, or might be added later. NULL values are different than any other value and are sometimes hard to compare and handle.

I think there will always be a debate as to whether NULL values should exist in a normalized OLTP environment. (Read a previous article for database normalization tips.) Academics typically argue that you should always normalize your schema to 3rd normal form, and this should take care of your data redundancies and missing values. However, it is typically impractical to normalize your SQL Server database all the way to 3rd normal form. It looks good on paper and would work great in an ideal world, but it usually doesn't perform well because of the extra joins involved in accessing data you need.

So it begins to make sense to have a little bit of data redundancy and placeholders for missing data in the SQL Server database. (Using data redundancy to increase database performance is outside the scope of this article.) The issues with using these NULL values are: You have to handle them a little bit differently than other values; and there are some small performance implications.

A closer look at handling NULLs

Here's a run-through of different scenarios that you will likely encounter in a production environment where NULL values are allowed.

Note: All NULL comparisons in this article are done under the context of the ANSI_NULLS ON setting, which is a database option that determines how NULL comparisons are handled. Under ANSI_NULLS ON, two NULL values will never be equal to each other because the two separate values are unknown. With ANSI_NULLS OFF, two separate NULL values will evaluate to equal values.

The following script loads the SalesHistory table that I will use for the examples. Take special notice of the CASE statement in the WHILE loop. The CASE statement assigns the value NULL for every other record inserted into the SalesHistory table where the product to be inserted is Computer. This is allowed because the Product column in the SalesHistory table allows NULL values. Also notice that NULL values are assigned to variables using the assignment (=) operator. While NULL values are assigned this way, NULL values are not interrogated in such a way.

IF OBJECT_ID('SalesHistory')>0     

DROP TABLE SalesHistory;

GO

CREATE TABLE [dbo].[SalesHistory]

(           

      [SaleID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,           

      [Product] [varchar](10) NULL,                 

      [SaleDate] [datetime] NULL,                 

      [SalePrice] [money] NULL

)

GO

DECLARE @i SMALLINT, @Product VARCHAR(10)

SET @i = 1

WHILE (@i <=100)

BEGIN                     

      SET @Product = CASE WHEN @i%2 = 0 THEN 'Computer' ELSE NULL END

      INSERT INTO SalesHistory(Product, SaleDate, SalePrice)                       

VALUES (@Product, DATEADD(mm, @i, '3/11/1919'), DATEPART(ms, GETDATE()) + (@i + 57))     

      INSERT INTO SalesHistory(Product, SaleDate, SalePrice)           

VALUES('BigScreen', DATEADD(mm, @i, '3/11/1927'), DATEPART(ms, GETDATE()) + (@i + 13))                 

     

INSERT INTO SalesHistory(Product, SaleDate, SalePrice)               

VALUES('PoolTable', DATEADD(mm, @i, '3/11/1908'), DATEPART(ms, GETDATE()) + (@i + 29))                 

      SET @i = @i + 1

END

GO

Now that I have data in my SalesHistory table, I can run queries to look at how NULL values are handled.

This query will return all rows from the SalesHistory table, where the Product column contains the value NULL by use of the IS NULL statement.

SELECT *

FROM SalesHistory

WHERE Product IS NULL

This query will not return any rows. This is due to the handling of NULL values in SQL Server. A NULL value is not equal to any other value, so I cannot use the equality operator here. This can be a bit confusing until you get used to it. Remember to use IS NULL when you are testing to see if a column or variable contains the value NULL.

SELECT *

FROM SalesHistory

WHERE Product = NULL

This query returns all rows from the SalesHistory table where the Product column does not contain the value NULL by using the IS NOT NULL statement. This is the direct inverse of the IS NULL statement.

SELECT *

FROM SalesHistory

WHERE Product IS NOT NULL
This query returns all records from the SalesHistory table. The query uses the ISNULL function on the Product column. For every NULL value in the Product column, the ISNULL function replaces that value with the literal string 'Missing'. I use this function a lot when I need to compare columns with possible NULL values to other values. The value that you are substituting must be of the same data type as the field that may contain the NULL value.
SELECT ISNULL(Product, 'Missing')
FROM SalesHistory

In this query, I am aggregating the number of products sold and the sale price of the products by the Product column. NULL values are distinct values but are grouped together when aggregations occur on them, which can be a bit confusing. Notice the single record in the resultset for the NULL product.

SELECT Product, COUNT(*) AS ProductCount, SUM(SalePrice) AS ProductSales

FROM SalesHistory

GROUP BY Product

Any time a NULL value is present in a column that an aggregation is performed on, such as a COUNT, AVG, or SUM function, those values will be ignored and therefore not included in the functions result.

The following query will not compile because the fields that are referenced are missing from the SalesHistory table; however, it will work fine for showing the functionality of the COALESCE function. The COALESCE function returns the first non-NULL value in the field list that it accepts. The COALESCE function is very useful in a join operation or in queries where you are comparing a single value from a list of possible fields to a single value.

SELECT COALESCE(Product, ProductDescription, Product, 'Missing Info')
FROM SalesHistory

This query uses the NULLIF function, which returns the value NULL if the two values passed into the function are the same value.

SELECT *, NULLIF(Product, NULL)
FROM SalesHistory

Operations on NULL values

Since NULL values are unknown, operations on them typically require some extra processing. The following query returns a unique list of Products from the SalesHistory table, including the NULL record. As the list of values are returned, I am concatenating the literal string 'Sold' to the end of the Product value. On the record that returns the NULL value, this concatenation will not work correctly.

SELECT Product + ' Sold'

FROM SalesHistory

GROUP BY Product

I can use the ISNULL function from above to replace the NULL value with an empty string so that I can concatenate the 'Sold' value to the end. This example probably wouldn't be that useful in a production situation, but it does illustrate that you need to take special care when you encounter NULLs in string operations. You should also take care of calculations on numeric fields that allow NULL values.

SELECT ProductType = ISNULL(Product,'Unknown') + ' Sold:' , COUNT(*) AS ProductCount

FROM SalesHistory

GROUP BY Product

Consider this when defining table structure and constraints

In SQL Server, if a UNIQUE constraint is defined upon a NULLABLE column, only one NULL value will be allowed in that column. It makes sense to me that the column should allow more than one NULL value because NULL values are distinct values. This is something to consider when you are defining your table structure and constraints. Perhaps Microsoft will fix this in a future version of SQL Server.

The more you know

If you are a developer, you will almost certainly have to deal with NULL values at some point, even if the SQL Server database is overly normalized. The more knowledge you have about dealing with these NULL values, the more headaches you will avoid in future projects.

Tim Chapman is a SQL Server database administrator who works for a bank in Louisville, KY, and has more than 7 years of IT experience. He is also Microsoft certified in SQL Server 2000 and SQL Server 2005. If you would like to contact Tim, please e-mail him at chapman.tim@gmail.com.

-----------------------------------------------------------------------------------------

Get SQL tips in your inbox

TechRepublic's free SQL Server newsletter, delivered each Tuesday, contains hands-on tips that will help you become more adept with this powerful relational database management system. Automatically subscribe today!

About

Tim Chapman is a SQL Server MVP, a database architect, and an administrator who works as an independent consultant in Raleigh, NC, and has more than nine years of IT experience.

8 comments
PK_Sean
PK_Sean

I think the academics have blown in on this one and have really made working with databases much more complicated. A NULL value simply represents the absence of any value for that field. All interaction with this field value of NULL should be in that context. By taking the NULL concept a step farther and saying it represents an "UNKNOWN" value is simply an intellectual mistake. If I don't place a value in a field, whether you think it means that it unknown to me or not - it means in the end that there is no value in that field. Hence, the concept of IS NULL being required to perform a comparison vs using = NULL is a silly differentiation.

r937
r937

quote: "However, it is typically impractical to normalize your SQL Server database all the way to 3rd normal form. It looks good on paper and would work great in an ideal world, but it usually doesn?t perform well because of the extra joins involved in accessing data you need." this is absurdly simplistic, if not flat out wrong it totally ruined the credibility of the article

Tony Hopkinson
Tony Hopkinson

Aside from the confusion of perception of using = for an equality test and a nullability test, it would require a lot more logic in the parser. Introducing syntactic garbage like a null integer equals a null date for instance.

BOUND4DOOM
BOUND4DOOM

Sorry just now getting time from some projects to read some stuff, anyway, I think this table could be in 3rd normal form, Or it couldn't it is going to depend on your usage of it and it's context in the database. The Table in this example itself is just a plain bad example. Now you could very easily use this table in a database, as long as you are not building a very big or useful application that you are planning on distribution, like say maybe this is a simple cheap system that maybe your mom and pop shops might use that runs on Access. Since it is so small, you really aren't going to have performance problems anyway. However the minute you think large scale or you throw in a Global Scalable Company and this goes to pieces quickly. So it is going to depend really on your needs. Take for example if this was a supply chain database, for a large global company, well other countries do not have states. So your going to have a ton of nulls in there. Second if you are using global, your going to be using Unicode, hence quickly doubling the size of your database on disk and affecting performance more that way than anything else especially if you are going to query on the extra columns, to query them effectively you need to set up some indexes, hence causing more work on the database for no reason, where going down to 3rd normal form not only allow you flexibility but rigid rules as well while keeping your database as compact as it can be on disk size. Another example why this example wouldn't work or that could cause you other problems as well. Suppose you were a bun supplier to McDonalds restaurants only in Chicago or any other major City, you could easily have 70-80 McDonald's in the same Zip code. Which suddenly doesn?t make this work too well. Especially since you would want only one record per restaurant for everything. This one record would be used for billing, shipping, accounting, etc. Anyway Bad example of 3rd normal form. Now suppose you do a database in 3rd normal form and then do this in one of the tables for speed, this might be good for only one or two queries, I mean seriously how many queries are you going to have on this data. In reality, you would be better off taking advantage of several of the Data Caching methods in most modern databases because doing something like this just to speed up queries now all updates Inserts and deletes now have to apply to multiple tables, hence scaling up the complexity in your databases greatly which with locking and transactions and triggers and everything else you get going on this will hurt performance more. So anyway Large scale global, best to go to 3rd normal form or as close as you can get to it. I may still have a nullable column or two in a table, however they do make sense for that specific table. Like say an optional description field or something along that nature. Something not critical to the data integrity but something a user might want.

chapman.tim
chapman.tim

apparently you've not done too much real world development.

Editor's Picks