Data Management

Thinking SQL: Set-based logic can improve query performance

Improve the performance of your processes and better leverage your SQL server with set-based thinking.

You can't teach an old dog new tricks—at least that's what the cliché says. Many developers intentionally try to be the "old dog" so they don't have to learn new tricks. Some developers, even those who have been developing client server, Web-based, and smart client applications, still approach problems with a procedural mindset. They seek to find code solutions for problems that can often be more efficiently solved with the application of set-based thinking in SQL.

In this article, you'll learn just a few tricks that you can use to improve the performance of your processes and better leverage your SQL server.

In the absence of data

The NOT IN clause is a good example of how set-based thinking can be used to solve one of the classic data problems. I often challenge prospective employees to solve the following problem.

You have two tables. The first table, the Customer table, contains a customer id field and a customer name. The second table, the Order Headers table, contains an order id, a customer id, and an order date. In real life, the tables would, of course, have more fields. I ask the recruits to create a list of all customers who don't have a record in the Orders table. This operation is a common one because it is often used to clean up database tables by removing unneeded records.

The solution is NOT IN. We're looking for all customers from the Customers table where the customer id is not present, or not in, the orders table. At first glance, you might construct a SQL statement that looks something like this:

SELECT Customers.ID, Customers.Name
   FROM Customers
  WHERE Customers.ID NOT IN SELECT Orders.CustomerID FROM Orders

This is the most basic solution. It returns what you're looking for. It gets a bit trickier when there's more than one field involved because you can't use the NOT IN syntax.

JOINing together

For this, we'll need to add another table to our scenario. In addition to the Customer and Orders table, we also have an Order Lines table, which contains fields for the order's identifier, the product identifier, and a quantity.

In this case, we're looking for customers who have not ordered the company's new flagship product so they can be sent a marketing letter espousing its benefits. That product ID is 1000. We need to find the situations where there are records in the Customer table that have no orders with an order line containing the 1000 product. The way that we put this together is to first look at what will generate a list of orders with the new product ordered.

SELECT DISTINCT Orders.ID
FROM Orders INNER JOIN OrderLines ON Orders.ID = OrderLines.OrderID
AND OrderLines.ProductID=1000

This causes the two tables to be joined such that only orders that have order lines with the new product will be returned. The DISTINCT part of the SQL statement causes the SQL server to eliminate potential duplicates. This may occur if the same new product exists on multiple lines. In this case, we could have put the product criteria (OrderLine.ProductID = 1000) in a WHERE clause; however, it was left as a part of the JOIN syntax since it is integral to how the results should be returned. When we put it together with the rest of the problem, having the new product criteria as a part of WHERE presents a further problem, as we'll discuss. In the meantime, it's time now to couple this with a different way of testing for existence, an outer (left) join.

SELECT DISTINCT Customers.ID, Customers.Name
 FROM Customers LEFT JOIN Orders ON Customers.ID = Orders.CustomerID
   INNER JOIN OrderLines ON Orders.ID = OrderLines.OrderID AND
                                        OrderLines.ProductID = 1000
  WHERE Orders.CustomerID IS NULL

An outer join, in this case, a left join, says that you want all records from one side of the join, in this case, all customers, since that is on the left of the join statement. As a result, it's possible to get all of the customers, regardless of whether they have orders or not. When the customer doesn't have orders, all of the order (and, in this case, order line) fields will be null. Putting a filter at the bottom to return only records where the Orders.CustomerID IS NULL causes the query to look for a situation that is nonexistent unless there are no orders. Since the Orders table is joined on the Orders.CustomerID field, it can only be null when there is no record.

The reason that this will work for identifying orders that have order lines for the new product is because of the inner join, which is connecting the Orders and Order Lines. If the correct records are not in the Order Lines table, then the records in the Orders table won't be returned either. The net result is that, if there's not a line for the new product, the order won't be returned and the customers with no orders will be returned.

Above we mentioned that the OrderLines.ProductID = 1000 had to be a part of the JOIN instead of being a part of a WHERE clause. This is because we're changing from "what we want." to something that "we don't want." If the OrderLines.ProductID = 1000 were moved down to the WHERE clause, no records would be returned. This is because, for an OrderLines.ProductID to be equal to 1000, it must exist. The order line must be present in the table. On the other hand, if the order line exists, then the order will exist and the other part of the WHERE clause, insisting that Orders.CustomerID be null, will always be false. In other words, you're requiring that the order line be present and that the order itself not be present. This will never work. By including the OrderLines.ProductID = 1000 in the JOIN, it is executed from the context of the join and yields the right results.

Thinking in SQL

By using set-based logic, in other words, thinking in SQL, you can solve some particularly thorny problems that would be exceedingly difficult to solve with traditional procedural based programming.

Editor's Picks