Networking

Understand the effects of NULLs on inner and outer JOINs

In this SQL Server tip, Arthur Fuller explains the effects of NULLs, and the different results obtained by inner and outer JOINs.

SQL Server developers must understand two things: the difference between inner and outer JOINs and the effect of NULLs upon these choices. This tip addresses both subjects.

Get SQL tips in your inbox
TechRepublic's SQL Server newsletter, delivered each Tuesday, contains hands-on tips that will help you become more adept with this powerful relational database management system.
Automatically sign up today!

Let's take it one step at a time -- first, the differences between inner and outer JOINs, and then the implications of NULLs on our queries. In order to examine the difference between inner and outer JOINs, we'll need some sample data.

Suppose you have two tables, T1 and T2, each of which contain a column called TestJoin. The column contains the values 1, 2, and 3 in table T1, while the column contains NULL, 2, and 3 in table T2. Inner JOINing these tables on the column TestJoin returns two rows, since you cannot join the value 1 to the NULL. An outer JOIN is the opposite. For instance, the following yields only those rows where Customers have placed Orders.

SELECT CustomerID, OrderID FROM Customers Inner Join Orders
On Customers.CustomerID = Orders.CustomerID

(In many businesses, this may be valid; other businesses may define a Customer as one who has placed at least one Order. That is a business rule that ought to be reflected in the database definition, not the middle tier, in my opinion, though some would differ.)

An outer join would read:

SELECT CustomerID, OrderID FROM Customers LEFT OUTER JOIN Orders
OnOrders.CustomerID = Customers.CustomerID

which would result in a list of all customers, irrespective of whether they placed orders.

Run the simple script in Listing A to create some test tables.

Here is the SQL to compare the inner and outer JOINs:

SELECT  InnerOuter.T1.T1ID, InnerOuter.T1.NameAS Name1, 
        InnerOuter.T2.T2ID, InnerOuter.T2.NameAS Name2
FROM    InnerOuter.T1 LEFT OUTER JOIN
               InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID

When you run both queries, you will confirm that the inner JOIN returns two rows, while the outer JOIN returns three. This principle holds true even if you add a third table, as you can see from Listing B.

If you join the third table in the SELECT statements, you find the same results: two rows from the inner JOIN and three from the outer JOIN. See Listing C.

Be aware of the effect of NULLs on aggregate functions. To observe this effect, add a NULL money column to T2 called Amount. Add a few more rows to T2 something like this:

T2ID  T1ID  Name  Amount
1     1     T2Text1     NULL
2     NULL  T2Text2     NULL
3     3     T2Text3     120.0000
4     1     T2Text4     123.0000
5     1     T2Text5     234.0000
6     3     T2Text6     345.0000
NULL  NULL  NULL  NULL

The T1ID value 1 has three rows, one of whose Amount is NULL. The T2ID value 3 has two rows, neither of which are NULL. (You can add as many rows as you wish, but ensure that there are different numbers of rows with NULL values in the Amount column.)

Now run this query:

SELECT  InnerOuter.T1.T1ID, 
        Sum(InnerOuter.T2.Amount)AS TotalAmount,
        Count(InnerOuter.T2.T1ID)AS NumberOfRows,
        Count(InnerOuter.T2.Amount)AS NumberOfAmounts,
        Avg(InnerOuter.T2.Amount)AS AverageAmount
FROM    InnerOuter.T1 LEFT OUTER JOIN
        InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID
GROUP BY
        InnerOuter.T1.T1ID

The above data produces this result set:

1     357.00      3     2     178.50
2     NULL  0     0     NULL
3     465.00      2     2     232.50

The first returned row indicates that if you count the T1ID column, you get the correct number of rows, but if you count the Amounts, you get the number of non-NULL values. Note that the averages reflect the number of non-NULL values, not the number of rows.

You might think these examples are artificial, since given the relationships between the tables, it would seem unlikely that the foreign key columns allow NULLs. However, I see this sort of thing all the time in databases. HR hires a new employee but has yet to assign her to a department. Or, a new customer is added to a database before a sales rep is assigned to that customer.

There are various ways to work around the absence of such information. In my opinion, the worst approach is to add what is known colloquially as the 0th row, in which a "fake" row is added to every lookup table, and this zero value is used whenever a foreign key is unknown—thus, the foreign keys will never be NULL. This approach has two significant problems: (1) It conceptually falsifies reality by introducing a new type of data into the table, when relational theory insists that a table holds exactly one type of data; (2) It complicates queries because you will always have to eliminate such rows from standard queries.

Miss a tip?

Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller's column.

3 comments
david.fournier
david.fournier

Hi. First off I thought it was a generally well written article. Must say I disagree with the blanket statement that 0th values are bad. I've found time and time again, that real world performance is dramatically increased by getting rid of outer joins wherever possible. Keeping outer joins in order to uphold relational theory is generally a luxury I don't have.

stevea
stevea

I agree that using a fake foreign key, the 0th row, introduces additional problems into the situation. Do you feel that allowing the NULL foreign key and fixing all the stored procs to deal with NULLs is the best solution for this problem?

chapman.tim
chapman.tim

For reader clarification, the reason you cannot join the value 1 with the value NULL is becuase the value NULL is unknown. You, and the system for that matter, will not know the value NULL until an explicit value is entered into that field. Also, when you are complying to ANSI rules concerning NULLs, a NULL value will never be equal to another NULL value. So, a statement along the lines of SELECT CASE WHEN NULL = NULL THEN 'Same' ELSE 'Different' END will return Different, whereas when ANSI NULLS are OFF, the value will return Same. Tim