SQL Server developers must understand two things: the
difference between inner and outer JOINs and the
effect of NULLs upon these choices. This tip
addresses both subjects.
Let’s take it one step at a time — first, the differences
between inner and outer JOINs, and then the
implications of NULLs on our queries. In order to
examine the difference between inner and outer JOINs,
we’ll need some sample data.
Suppose you have two tables, T1 and T2, each of which contain a column called TestJoin.
The column contains the values 1, 2, and 3 in table T1, while the column contains NULL,
2, and 3 in
table T2. Inner JOINing these tables on the column TestJoin returns two rows, since you cannot join the value
1 to the NULL. An outer JOIN is the opposite. For instance, the following yields only those rows where
Customers have placed Orders.
SELECT CustomerID, OrderID FROM Customers Inner Join Orders
On Customers.CustomerID = Orders.CustomerID
(In many businesses, this may be valid; other
businesses may define a Customer as one who has placed at least one Order. That
is a business rule that ought to be reflected in the database definition, not
the middle tier, in my opinion, though some would differ.)
An outer join would read:
SELECT CustomerID, OrderID FROM Customers LEFT OUTER JOIN Orders
OnOrders.CustomerID = Customers.CustomerID
which would result in a list of all
customers, irrespective of whether they placed orders.
Run the simple script in
Listing A to create some test tables.
Here is the SQL to compare the inner and outer JOINs:
SELECT InnerOuter.T1.T1ID, InnerOuter.T1.NameAS Name1,
InnerOuter.T2.T2ID, InnerOuter.T2.NameAS Name2
FROM InnerOuter.T1 LEFT OUTER JOIN
InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID
When you run both queries, you will confirm that the inner
JOIN returns two rows, while the outer JOIN returns three. This principle holds
true even if you add a third table, as you can see from Listing B.
If you join the third table in the SELECT statements, you
find the same results: two rows from the inner JOIN and three from the outer
JOIN. See Listing C.
Be aware of the effect of NULLs on
aggregate functions. To observe this effect, add a NULL money column to T2
called Amount. Add a few more rows to T2 something like this:
T2ID T1ID Name Amount
1 1 T2Text1 NULL
2 NULL T2Text2 NULL
3 3 T2Text3 120.0000
4 1 T2Text4 123.0000
5 1 T2Text5 234.0000
6 3 T2Text6 345.0000
NULL NULL NULL NULL
The T1ID value 1 has three rows, one of whose Amount is
NULL. The T2ID value 3 has two rows, neither of which are NULL. (You can add as
many rows as you wish, but ensure that there are different numbers of rows with
NULL values in the Amount column.)
Now run this query:
SELECT InnerOuter.T1.T1ID,
Sum(InnerOuter.T2.Amount)AS TotalAmount,
Count(InnerOuter.T2.T1ID)AS NumberOfRows,
Count(InnerOuter.T2.Amount)AS NumberOfAmounts,
Avg(InnerOuter.T2.Amount)AS AverageAmount
FROM InnerOuter.T1 LEFT OUTER JOIN
InnerOuter.T2 ON InnerOuter.T1.T1ID = InnerOuter.T2.T1ID
GROUP BY
InnerOuter.T1.T1ID
The above data produces this result set:
1 357.00 3 2 178.50
2 NULL 0 0 NULL
3 465.00 2 2 232.50
The first returned row indicates that if you count the T1ID
column, you get the correct number of rows, but if you count the Amounts, you
get the number of non-NULL values. Note that the averages reflect the number of
non-NULL values, not the number of rows.
You might think these examples are artificial, since given
the relationships between the tables, it would seem
unlikely that the foreign key columns allow NULLs.
However, I see this sort of thing all the time in databases. HR hires a new
employee but has yet to assign her to a department. Or, a new customer is added
to a database before a sales rep is assigned to that
customer.
There are various ways to work around the absence of such
information. In my opinion, the worst approach is to add what is known
colloquially as the 0th row, in which a “fake” row is
added to every lookup table, and this zero value is used whenever a foreign key
is unknown—thus, the foreign keys will never be NULL. This approach has two
significant problems: (1) It conceptually falsifies reality by introducing a new
type of data into the table, when relational theory insists that a table holds
exactly one type of data; (2) It complicates queries because you will always
have to eliminate such rows from standard queries.
Miss a tip?
Check out the SQL Server archive, and catch up on the most recent editions of Arthur Fuller’s column.