You can’t teach an old dog new tricks—at
least that’s what the cliché says. Many developers intentionally try to be the “old
dog” so they don’t have to learn new tricks. Some developers, even those
who have been developing client server, Web-based, and smart client
applications, still approach problems with a procedural mindset. They seek to
find code solutions for problems that can often be more efficiently solved with
the application of set-based thinking in SQL.
In this article, you’ll learn just a few
tricks that you can use to improve the performance of your processes and better
leverage your SQL server.
In the absence of data
The NOT IN clause is a good example of
how set-based thinking can be used to solve one of the classic data problems. I
often challenge prospective employees to solve the following problem.
You have two tables. The first table, the
Customer table, contains a customer id field and a customer name. The second
table, the Order Headers table, contains an order id, a customer id, and an
order date. In real life, the tables would, of course, have more fields. I ask
the recruits to create a list of all customers who don’t have a record in the
Orders table. This operation is a common one because it is often used to clean
up database tables by removing unneeded records.
The solution is NOT IN. We’re looking for
all customers from the Customers table where the customer id is not present, or
not in, the orders table. At first glance, you might construct a SQL statement
that looks something like this:
SELECT Customers.ID, Customers.Name
WHERE Customers.ID NOT IN SELECT Orders.CustomerID FROM Orders
This is the most basic solution. It
returns what you’re looking for. It gets a bit trickier when there’s more than
one field involved because you can’t use the NOT IN syntax.
For this, we’ll need to add another table
to our scenario. In addition to the Customer and Orders table, we also have an
Order Lines table, which contains fields for the order’s identifier, the
product identifier, and a quantity.
In this case, we’re looking for customers
who have not ordered the company’s new flagship product so they can be sent a
marketing letter espousing its benefits. That product ID is 1000. We need to
find the situations where there are records in the Customer table that have no
orders with an order line containing the 1000 product. The way that we put this
together is to first look at what will generate a list of orders with the new
SELECT DISTINCT Orders.ID
FROM Orders INNER JOIN OrderLines ON Orders.ID = OrderLines.OrderID
This causes the two tables to be joined
such that only orders that have order lines with the new product will be
returned. The DISTINCT part of the SQL statement causes the SQL server to
eliminate potential duplicates. This may occur if the same new product exists
on multiple lines. In this case, we could have put the product criteria (OrderLine.ProductID = 1000) in a WHERE
clause; however, it was left as a part of the JOIN syntax since it is integral
to how the results should be returned. When we put it together with the rest of
the problem, having the new product criteria as a part of WHERE presents a further
problem, as we’ll discuss. In the meantime, it’s time now to couple this with a
different way of testing for existence, an outer (left) join.
SELECT DISTINCT Customers.ID, Customers.Name
FROM Customers LEFT JOIN Orders ON Customers.ID = Orders.CustomerID
INNER JOIN OrderLines ON Orders.ID = OrderLines.OrderID AND
OrderLines.ProductID = 1000
WHERE Orders.CustomerID IS NULL
An outer join, in this case, a left join,
says that you want all records from one side of the join, in this case, all
customers, since that is on the left of the join statement. As a result, it’s
possible to get all of the customers, regardless of whether they have orders or
not. When the customer doesn’t have orders, all of the order (and, in this
case, order line) fields will be null. Putting a filter at the bottom to return
only records where the Orders.CustomerID
IS NULL causes the query to look for a situation that is nonexistent unless
there are no orders. Since the Orders table is joined on the Orders.CustomerID field, it can only be
null when there is no record.
The reason that this will work for
identifying orders that have order lines for the new product is because of the
inner join, which is connecting the Orders and Order Lines. If the correct
records are not in the Order Lines table, then the records in the Orders table
won’t be returned either. The net result is that, if there’s not a line for the
new product, the order won’t be returned and the customers with no orders will
Above we mentioned that the OrderLines.ProductID = 1000 had to be a
part of the JOIN instead of being a part of a WHERE clause. This is because we’re
changing from “what we want.” to something that “we don’t want.”
If the OrderLines.ProductID = 1000
were moved down to the WHERE clause, no records would be returned. This is
because, for an OrderLines.ProductID
to be equal to 1000, it must exist. The order line must be present in the table.
On the other hand, if the order line exists, then the order will exist and the
other part of the WHERE clause, insisting that Orders.CustomerID be null, will always be false. In other words,
you’re requiring that the order line be present and that the order itself not
be present. This will never work. By including the OrderLines.ProductID = 1000 in the JOIN, it is executed from the
context of the join and yields the right results.
Thinking in SQL
By using set-based logic, in other words,
thinking in SQL, you can solve some particularly thorny problems that would be
exceedingly difficult to solve with traditional procedural based programming.