Understand the effects of NULLs on inner and outer JOINs

In this SQL Server tip, Arthur Fuller explains the effects of NULLs, and the different results obtained by inner and outer JOINs.

SQL Server developers must understand two things: the difference between inner and outer JOINs and the effect of NULLs upon these choices. This tip addresses both subjects.

Get SQL tips in your inbox
TechRepublic's SQL Server newsletter, delivered each Tuesday, contains hands-on tips that will help you become more adept with this powerful relational database management system.
Automatically sign up today!

Let's take it one step at a time — first, the differences between inner and outer JOINs, and then the implications of NULLs on our queries. In order to examine the difference between inner and outer JOINs, we'll need some sample data.

Suppose you have two tables, T1 and T2, each of which contain a column called TestJoin. The column contains the values 1, 2, and 3 in table T1, while the column contains NULL, 2, and 3 in table T2. Inner JOINing these tables on the column TestJoin returns two rows, since you cannot join the value 1 to the NULL. An outer JOIN is the opposite. For instance, the following yields only those rows where Customers have placed Orders.

SELECT CustomerID, OrderID FROM Customers Inner Join Orders
On Customers.CustomerID = Orders.CustomerID

(In many businesses, this may be valid; other businesses may define a Customer as one who has placed at least one Order. That is a business rule that ought to be reflected in the database definition, not the middle tier, in my opinion, though some would differ.)

An outer join would read:

SELECT CustomerID, OrderID FROM Customers LEFT OUTER JOIN Orders
OnOrders.CustomerID = Customers.CustomerID

which would result in a list of all customers, irrespective of whether they placed orders.

Run the simple script in Listing A to create some test tables.

Here is the SQL to compare the inner and outer JOINs:

SELECT  InnerOuter.T1.T1ID, InnerOuter.T1.NameAS Name1, 
        InnerOuter.T2.T2ID, InnerOuter.T2.NameAS Name2
               InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID

When you run both queries, you will confirm that the inner JOIN returns two rows, while the outer JOIN returns three. This principle holds true even if you add a third table, as you can see from Listing B.

If you join the third table in the SELECT statements, you find the same results: two rows from the inner JOIN and three from the outer JOIN. See Listing C.

Be aware of the effect of NULLs on aggregate functions. To observe this effect, add a NULL money column to T2 called Amount. Add a few more rows to T2 something like this:

T2ID  T1ID  Name  Amount
1     1     T2Text1     NULL
2     NULL  T2Text2     NULL
3     3     T2Text3     120.0000
4     1     T2Text4     123.0000
5     1     T2Text5     234.0000
6     3     T2Text6     345.0000

The T1ID value 1 has three rows, one of whose Amount is NULL. The T2ID value 3 has two rows, neither of which are NULL. (You can add as many rows as you wish, but ensure that there are different numbers of rows with NULL values in the Amount column.)

Now run this query:

SELECT  InnerOuter.T1.T1ID, 
        Sum(InnerOuter.T2.Amount)AS TotalAmount,
        Count(InnerOuter.T2.T1ID)AS NumberOfRows,
        Count(InnerOuter.T2.Amount)AS NumberOfAmounts,
        Avg(InnerOuter.T2.Amount)AS AverageAmount
        InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID

The above data produces this result set:

1     357.00      3     2     178.50
2     NULL  0     0     NULL
3     465.00      2     2     232.50

The first returned row indicates that if you count the T1ID column, you get the correct number of rows, but if you count the Amounts, you get the number of non-NULL values. Note that the averages reflect the number of non-NULL values, not the number of rows.

You might think these examples are artificial, since given the relationships between the tables, it would seem unlikely that the foreign key columns allow NULLs. However, I see this sort of thing all the time in databases. HR hires a new employee but has yet to assign her to a department. Or, a new customer is added to a database before a sales rep is assigned to that customer.

There are various ways to work around the absence of such information. In my opinion, the worst approach is to add what is known colloquially as the 0th row, in which a "fake" row is added to every lookup table, and this zero value is used whenever a foreign key is unknown—thus, the foreign keys will never be NULL. This approach has two significant problems: (1) It conceptually falsifies reality by introducing a new type of data into the table, when relational theory insists that a table holds exactly one type of data; (2) It complicates queries because you will always have to eliminate such rows from standard queries.

Miss a tip?

Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller's column.

Editor's Picks

Free Newsletters, In your Inbox