Data Management

Filtering data with SQL Server

In SQL Server, data analysis begins with the GROUP BY statement. This powerful SQL feature filters data to return the exact information you need.

Relational data is typically stored in a normalized format, which means that you repeat as little data as possible; typically just keys that tie one table to another. Furthermore, normalization specifies that you don’t store calculated values in the database; instead, you calculate them on the fly when needed.

It’s often important to perform some analysis on your data. For example, you might want to know which orders are for the most products, or which orders are for the most dollars. These questions require you to create formulas in your SQL statement. One of the most important portions of these statements is the GROUP BY clause.

Looking at orders in Northwind
The Northwind database is one of the two sample databases included with a SQL Server installation. It's not perfect, but it's fine for the purpose of discussing the GROUP BY statement. One of the reasons it works well is because it has a standard set of tables, including Customers, Orders, Order Details, and Products, for dealing with orders. You can see the schema for these tables in Figure A.

Figure A
Table schema

If you wanted to see the OrderID and the ProductID for each order, the following SQL command would work just fine:
SELECT o.OrderID, od.ProductID
FROM Orders o, [Order Details] od
WHERE o.OrderID = od.OrderID

Here, you join Orders and Order Details on the OrderID field. This gives you a list that shows the OrderID and ProductID for each line item ordered.

You can scroll though this list and try to find the order with the largest number of items. However, it‘s much easier to simply ask the database to count the number of items for you. If you don’t care about the individual items and just want to find the orders with the largest number of items ordered, you might be tempted to try the following SQL statement:
SELECT o.OrderID, Count(od.ProductID) as NumItems
FROM Orders o, [Order Details] od
WHERE o.OrderID = od.OrderID

Here, you’re attempting to count the number of products and display that count in a new field named NumItems. However, if you execute this statement, you’ll get the following error:
Server: Msg 8118, Level 16, State 1, Line 1
Column 'o.OrderID' is invalid in the select list because
it is not contained in an aggregate function and there is no GROUP BY clause.

In this case, you’re trying to count ProductID, but OrderID is not being counted or summed, or having any other operation performed on it.

What you’re counting here is not the total quantity of items ordered but the number of unique items ordered. In other words, you’ll see that a particular order might include three items, but not that the customer ordered five of each. What you want is a count of products by order. You can use the GROUP BY clause to see the total number of products ordered.

The way to think of GROUP BY is to ask, “How do I want to see the data?” If part of your answer is “by” something, you might end up using GROUP BY. In this case, you want to see the number of products by order, so you’ll group by the OrderID field. In addition, you’ll use an ORDER BY clause to make it easier to find the orders with the largest number of items. The new query will look like this:
SELECT o.OrderID, Count(od.ProductID) as NumItems
FROM Orders o, [Order Details] od
WHERE o.OrderID = od.OrderID

Now you have the answer to your question. As you can see by the partial result set shown in Figure B, order number 11077 has 25 products, while the next largest order has only six items.

Figure B
Partial result set

Understanding the rules
GROUP BY is rather flexible, as long as you follow the rules. For example, you can have multiple columns in the ORDER BY clause. If you want to see how many of each type of product has been ordered by each customer, you’ll have to create a query that joins the customer to the products, which is done through the orders. Figure A showed the four-table join that will be used. You then group by the customer and product while summing up the Quantity column in the Order Details table. The query looks like this:
SELECT c.CompanyName, p.ProductName, Sum(od.Quantity) as TotalBought
FROM Customers c, Products p, Orders o, [Order Details] od
c.CustomerID=o.CustomerID AND
o.OrderID=od.OrderID AND
GROUP BY c.CompanyName, p.ProductName
ORDER BY CompanyName, TotalBought DESC

The results of this query will show the total number of each product bought by each customer for all records in the database, as shown in Figure C.

Figure C
Query results

It’s also possible to put multiple aggregate columns in the query. For example, assume that you want to see a list of the orders showing the largest quantity for a single item as well as the total number of items for that order. The following statement uses the Max function to show the largest quantity for a single item. It also sums all the quantities for the total number of items bought.
Max(od.Quantity) as TopItem,
Sum(od.Quantity) as TotalBought
Orders o,
[Order Details] od,
Products p
o.OrderID = od.OrderID AND

As you can see in Figure D, you have the order, the quantity for the top item ordered, and total items for the order. Now for an easy contest: How do you see the name of the product that matches the TopItem column? The first correct answer sent to wins a copy of my book, SQL Server 2000 Web Application Developer’s Guide.

Figure D
Query results

The GROUP BY is a powerful construct for filtering data. For aggregating values in your SQL tables, it’s an invaluable construct. Use it when you want to see data grouped “by” certain conditions.

Editor's Picks