How to Use GROUP BY with Distinct Aggregates and Derived Tables

By Jeff Smith on 31 July 2007 | Tags: Queries, SELECT

In How to Use GROUP BY, we worked on a simple report request and covered the basics of GROUP BY and the issue of duplicate rows caused by JOINs. Today we'll finish up that report while examining SUM(Distinct), and see just how crucial derived tables are when summarizing data from multiple tables.

The Problem with SUM(Distinct)

We previously learned that we can use COUNT(Distinct) to count columns from the duplicated table, so what about SUM(Distinct)? It seems like that should do the trick, since we only want to sum distinct shipping cost values, not all the duplicates. Let's give it a try:

    select
      o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
      count(distinct o.OrderID) as OrderCount,
      sum(distinct o.ShippingCost) as TotalShipping
    from
      Orders o
    inner join
      OrderDetails od on o.OrderID = od.OrderID
    group by
      o.Customer

    Customer   ItemCount   OrderAmount           OrderCount  TotalShipping
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        6           725.0000              3           95.0000
    DEF        2           350.0000              1           10.0000

    (2 row(s) affected)

And there it is! We seem to have solved our problem: looking back to our Orders table, we can see that the TotalShipping cost per Customer now looks correct.

But wait ... It is actually wrong!

This is where many people have problems. Yes, the data looks correct. And, for this small sample, it just randomly happens to be correct. But SUM(DISTINCT) works exactly the same as COUNT(DISTINCT): It simply gets all of the values eligible to be summed, eliminates all duplicate values, and then adds up the results. But it is eliminating duplicate values, not duplicate rows based on some primary key column! It doesn't care that shipping cost 40 belonged to OrderID #1 and that shipping cost 30 belonged to OrderID #2; it simply doesn't separate them that way.

The expression SUM(Distinct ShippingCost) is basically evaluated like this:
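The evaluation can be sketched in three steps: collect the values in each group, drop duplicate values, add up what is left. This is only an illustration, not a description of how any database engine implements it. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for SQL Server, and the OrderDate, DetailID and Item column names, the dates, and the individual line amounts are assumptions chosen so that the totals match the results shown above.

```python
import sqlite3

# In-memory SQLite database standing in for the article's SQL Server tables.
# Assumed: OrderDate/DetailID/Item column names, dates, individual line amounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Orders (OrderID int primary key, Customer text,
                         OrderDate text, ShippingCost int);
    create table OrderDetails (DetailID int primary key, OrderID int,
                               Item text, Amount int);
    insert into Orders values
        (1, 'ABC', '2007-01-01', 40), (2, 'ABC', '2007-01-02', 30),
        (3, 'ABC', '2007-01-03', 25), (4, 'DEF', '2007-01-03', 10);
    insert into OrderDetails values
        (1, 1, 'Item A', 100), (2, 1, 'Item B', 150),
        (3, 2, 'Item C', 125), (4, 2, 'Item D', 125), (5, 2, 'Item E', 125),
        (6, 3, 'Item F', 100),
        (7, 4, 'Item G', 150), (8, 4, 'Item H', 200);
""")

# Step 1: after the join, each order's ShippingCost repeats once per detail row.
joined = {}
for customer, cost in conn.execute("""
        select o.Customer, o.ShippingCost
        from Orders o
        inner join OrderDetails od on o.OrderID = od.OrderID
        order by o.Customer, o.OrderID"""):
    joined.setdefault(customer, []).append(cost)

# Step 2: DISTINCT removes duplicate values within each group.
# OrderID plays no part in deciding what counts as a duplicate.
distinct_values = {c: sorted(set(v), reverse=True) for c, v in joined.items()}

# Step 3: the surviving values are added up.
totals = {c: sum(v) for c, v in distinct_values.items()}

# The same figures straight from SUM(DISTINCT) in SQL.
sql_totals = dict(conn.execute("""
    select o.Customer, sum(distinct o.ShippingCost)
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer"""))

print(joined)           # {'ABC': [40, 40, 30, 30, 30, 25], 'DEF': [10, 10]}
print(distinct_values)  # {'ABC': [40, 30, 25], 'DEF': [10]}
print(totals)           # {'ABC': 95, 'DEF': 10}
print(sql_totals)       # {'ABC': 95, 'DEF': 10}
```

Step 2 is the important one: the de-duplication looks only at the shipping cost values themselves.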
If you aren't getting the concept, you still might not see the problem. In fact, at this point, many people never do. They see that SUM(x) returns huge numbers that cannot be right, so they tweak it and try SUM(DISTINCT x), and the values look much more reasonable, and they might even initially tie out perfectly, so off to production it goes. Yet, the SQL is incorrect; it is relying on the fact that currently no two orders for a customer have the same shipping cost.

Let's demonstrate by adding another order:

    insert into Orders values (5, 'DEF', '2007-01-04', 10)
    insert into OrderDetails values (9, 5, 'Item J', 125)

Running that simply adds another Order for Customer DEF, shipping cost of $10, with one OrderDetail item for $125. Now, let's execute that same SELECT again to see how this new Order affected our results:

    select
      o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount,
      count(distinct o.OrderID) as OrderCount,
      sum(distinct o.ShippingCost) as TotalShipping
    from
      Orders o
    inner join
      OrderDetails od on o.OrderID = od.OrderID
    group by
      Customer

    Customer   ItemCount   OrderAmount           OrderCount  TotalShipping
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        6           725.0000              3           95.0000
    DEF        3           475.0000              2           10.0000

    (2 row(s) affected)

The ItemCount, OrderAmount and OrderCount columns look great. But the TotalShipping cost for DEF still shows $10! What happened!?

Can you figure it out? Remember how SUM(Distinct) works! It just takes distinct values passed to the function and eliminates duplicates. Both orders for DEF had a shipping cost of $10, and SUM(Distinct ShippingCost) doesn't care that the two $10 values are for different Orders, it just knows that the 10 is duplicated for the Customer, so it only uses the 10 once to calculate the SUM. Thus, it returns a value of 10 as the total shipping cost for those two orders, even though it should be 10+10=20. Our result is now wrong.

The long and short of it is this: Never use SUM(Distinct)!
It rarely makes logical sense; there may be a time and place for it, but it is definitely not here.

Summarizing Derived Tables

So, how do we fix this? Well, like many SQL problems, the answer is simple: do it one step at a time. Don't try to join all of the tables together and just add SUM() and GROUP BY and DISTINCT almost randomly until things work; break it down logically, step by step.

So, before worrying about totals per Customer, let's step back and focus on returning totals per Order. If we can return totals per Order first, then we can simply summarize those Order totals by Customer and we'll have the results we need. Let's summarize the OrderDetails table to return 1 row per Order, with the ItemCount and the total Order Amount:

    select
      orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from
      orderDetails
    group by
      orderID

    orderID     ItemCount   OrderAmount
    ----------- ----------- ---------------------
    1           2           250.0000
    2           3           375.0000
    3           1           100.0000
    4           2           350.0000
    5           1           125.0000

    (5 row(s) affected)

Nice and simple, easy to verify, things look good. Because we are grouping on OrderID, we can say that these results have a virtual primary key of OrderID; that is, there will never be duplicate rows for the same Order. In fact, here's another basic rule to always remember:

The virtual primary key of a SELECT with a GROUP BY clause will always be the expressions stated in the GROUP BY.

We can now take that SQL statement and those results and encapsulate them in their own derived table.
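The virtual primary key rule is easy to check mechanically. The sketch below again uses Python's built-in sqlite3 module as a stand-in for SQL Server, with assumed DetailID and Item column names and assumed individual line amounts (only the per-order totals come from the results above). It runs the per-order summary and confirms that no OrderID appears more than once.

```python
import sqlite3

# Assumed column names (DetailID, Item) and individual amounts; the per-order
# item counts and totals match the results shown in the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table OrderDetails (DetailID int primary key, OrderID int,
                               Item text, Amount int);
    insert into OrderDetails values
        (1, 1, 'Item A', 100), (2, 1, 'Item B', 150),
        (3, 2, 'Item C', 125), (4, 2, 'Item D', 125), (5, 2, 'Item E', 125),
        (6, 3, 'Item F', 100),
        (7, 4, 'Item G', 150), (8, 4, 'Item H', 200),
        (9, 5, 'Item J', 125);
""")

# One row per OrderID: the GROUP BY expression is the virtual primary key.
summary = conn.execute("""
    select OrderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from OrderDetails
    group by OrderID
    order by OrderID""").fetchall()

# The key is unique if no OrderID is repeated in the summarized rows.
order_ids = [row[0] for row in summary]
has_unique_key = len(order_ids) == len(set(order_ids))

print(summary)         # [(1, 2, 250), (2, 3, 375), (3, 1, 100), (4, 2, 350), (5, 1, 125)]
print(has_unique_key)  # True
```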
If we join from the Orders table to the previous SELECT as a derived table, we get:

    select
      o.orderID, o.Customer, o.ShippingCost, d.ItemCount, d.OrderAmount
    from
      orders o
    inner join
    (
      select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
      from orderDetails
      group by orderID
    ) d
      on o.orderID = d.orderID

    orderID     Customer   ShippingCost          ItemCount   OrderAmount
    ----------- ---------- --------------------- ----------- ---------------------
    1           ABC        40.0000               2           250.0000
    2           ABC        30.0000               3           375.0000
    3           ABC        25.0000               1           100.0000
    4           DEF        10.0000               2           350.0000
    5           DEF        10.0000               1           125.0000

    (5 row(s) affected)

Let's examine those results. No row from Orders has been duplicated; there is exactly one row per Order. This is because our derived table has a virtual primary key of OrderID, so joining from Orders to our derived table will never produce duplicates. This is a very useful and simple technique to avoid duplicates when relating a parent table to a child table: summarize the child table by the parent's primary key first in a derived table, and then join it to the parent table. The parent table’s rows will then never be duplicated and can be summarized accurately.

Now we have our total ItemCount per order, as well as our total OrderAmount per order. And we can see that if we sum these results up, our ShippingCost column will be fine, since it is never duplicated. No need for distinct. In fact, we can even use a regular COUNT(*) expression to get the total number of orders per customer!

So, we can simply add "GROUP BY Customer" to the previous SQL, calculate what we need with aggregate functions, and remove any columns (like OrderID) that we will not be summarizing. You might also notice that at this point, the total ItemCount per Customer is no longer a COUNT(*) expression; it is a simple SUM() of the ItemCount value returned from our derived table.
Here's the result:

    select
      o.Customer, count(*) as OrderCount,
      sum(o.ShippingCost) as ShippingTotal,
      sum(d.ItemCount) as ItemCount,
      sum(d.OrderAmount) as OrderAmount
    from
      orders o
    inner join
    (
      select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
      from orderDetails
      group by orderID
    ) d
      on o.orderID = d.orderID
    group by
      o.customer

    Customer   OrderCount  ShippingTotal         ItemCount   OrderAmount
    ---------- ----------- --------------------- ----------- ---------------------
    ABC        3           95.0000               6           725.0000
    DEF        2           20.0000               3           475.0000

    (2 row(s) affected)

And there you have it! We examined our data, logically considered the implications of our JOINs, broke the problem down into smaller parts, and ended up with a fairly simple solution that we know will be quick, efficient and accurate.

Adding More Tables to a Summarized SELECT

To finish things up, suppose our schema also has a table of Customers:

    Create table Customers
    (
      Customer varchar(10) primary key,
      CustomerName varchar(100) not null,
      City varchar(100) not null,
      State varchar(2) not null
    )

    insert into Customers
    select 'ABC','ABC Corporation','Boston','MA' union all
    select 'DEF','The DEF Foundation','New York City','NY'

... and we wish to also return each customer's name, city and state in our previous results. One way to do this is to simply add the Customers table to our existing join, and then add the customer columns to the SELECT clause. However, unless you add all of the customer columns to the GROUP BY as well, you will get an error message indicating that you need to either group or summarize all columns you wish to display. We aren't trying to calculate a COUNT() or a SUM() of Name, City and State, so it doesn't make sense to wrap those columns in an aggregate expression.
So, it appears that we must add them all to our GROUP BY clause to get the results we need:

    select
      o.Customer, c.customerName, c.City, c.State,
      count(*) as OrderCount,
      sum(o.ShippingCost) as ShippingTotal,
      sum(d.ItemCount) as ItemCount,
      sum(d.OrderAmount) as OrderAmount
    from
      orders o
    inner join
    (
      select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
      from orderDetails
      group by orderID
    ) d
      on o.orderID = d.orderID
    inner join
      customers c on o.customer = c.customer
    group by
      o.customer, c.customerName, c.City, c.State

    Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
    ---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
    ABC        ABC Corporation      Boston          MA    3           95.0000       6         725.0000
    DEF        The DEF Foundation   New York City   NY    2           20.0000       3         475.0000

    (2 row(s) affected)

Technically, that works, but it seems silly to list all of those customer columns in the GROUP BY ... After all, we are just grouping on Customer, not on each of the customer's attributes, right?

What's interesting is that the solution is something we already talked about and the same technique applies: Since Customers has a one-to-many relation with Orders, we know that joining Customers to Orders will result in duplicate rows per Customer, and thus all columns in the Customers table are duplicated in the results. You might notice that this is exactly the same scenario that applies when joining Orders to OrderDetails. So, we handle this situation the same way! We simply summarize our Orders by Customer first, in a derived table, and then we join those results to the Customers table. This means that no columns from the Customers table will be duplicated at all, and there is no need to add them all to our GROUP BY expression. This keeps our SQL clean, organized, and logically sound.
So, our final results now look like this:

    select
      c.Customer, c.customerName, c.City, c.State,
      o.OrderCount, o.ShippingTotal, o.ItemCount, o.OrderAmount
    from
    (
      select
        o.customer, count(*) as OrderCount,
        sum(o.ShippingCost) as ShippingTotal,
        sum(d.ItemCount) as ItemCount,
        sum(d.OrderAmount) as OrderAmount
      from
        orders o
      inner join
      (
        select orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
        from orderDetails
        group by orderID
      ) d
        on o.orderID = d.orderID
      group by
        o.customer
    ) o
    inner join
      customers c on o.customer = c.customer

    Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
    ---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
    ABC        ABC Corporation      Boston          MA    3           95.0000       6         725.0000
    DEF        The DEF Foundation   New York City   NY    2           20.0000       3         475.0000

    (2 row(s) affected)

Conclusion

I hope this two-part series helps a little bit with your understanding of GROUP BY queries. It is vital to identify and understand what the virtual primary key of a result set is when you join multiple tables, and to recognize which rows are duplicated and which are not. In addition, remember that COUNT(Distinct) can be useful, but SUM(Distinct) should very rarely, if ever, be used.

In general, if you find that values you need to SUM() have been duplicated, summarize the table causing those duplicates separately and join it in as a derived table. This will also allow you to break down your problem into smaller steps and test and validate the results of each step as you go.

GROUP BY is a very powerful feature, but it is also misunderstood and abused, and the easiest way to leverage it is to carefully build your SQL from smaller, simpler parts into larger, more complicated solutions.
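As one way to test and validate the finished query, the sketch below rebuilds the five-order sample data and runs the final SELECT next to the SUM(Distinct) version. It is an illustration under stated assumptions: Python's built-in sqlite3 module and an in-memory SQLite database stand in for SQL Server, and the OrderDate, DetailID and Item column names, most of the dates, and the individual line amounts are made up so that the per-order totals match the results shown earlier.

```python
import sqlite3

# In-memory SQLite stand-in for the article's SQL Server schema.
# Assumed: OrderDate/DetailID/Item names, most dates, individual line amounts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Customers (Customer varchar(10) primary key,
                            CustomerName varchar(100) not null,
                            City varchar(100) not null,
                            State varchar(2) not null);
    create table Orders (OrderID int primary key, Customer text,
                         OrderDate text, ShippingCost int);
    create table OrderDetails (DetailID int primary key, OrderID int,
                               Item text, Amount int);
    insert into Customers values
        ('ABC', 'ABC Corporation', 'Boston', 'MA'),
        ('DEF', 'The DEF Foundation', 'New York City', 'NY');
    insert into Orders values
        (1, 'ABC', '2007-01-01', 40), (2, 'ABC', '2007-01-02', 30),
        (3, 'ABC', '2007-01-03', 25), (4, 'DEF', '2007-01-03', 10),
        (5, 'DEF', '2007-01-04', 10);
    insert into OrderDetails values
        (1, 1, 'Item A', 100), (2, 1, 'Item B', 150),
        (3, 2, 'Item C', 125), (4, 2, 'Item D', 125), (5, 2, 'Item E', 125),
        (6, 3, 'Item F', 100),
        (7, 4, 'Item G', 150), (8, 4, 'Item H', 200),
        (9, 5, 'Item J', 125);
""")

# The article's final query: summarize per order, then per customer,
# then join the one-row-per-customer result to Customers.
final_rows = conn.execute("""
    select c.Customer, c.CustomerName, c.City, c.State,
           o.OrderCount, o.ShippingTotal, o.ItemCount, o.OrderAmount
    from (
        select o.Customer, count(*) as OrderCount,
               sum(o.ShippingCost) as ShippingTotal,
               sum(d.ItemCount) as ItemCount,
               sum(d.OrderAmount) as OrderAmount
        from Orders o
        inner join (
            select OrderID, count(*) as ItemCount, sum(Amount) as OrderAmount
            from OrderDetails
            group by OrderID
        ) d on o.OrderID = d.OrderID
        group by o.Customer
    ) o
    inner join Customers c on o.Customer = c.Customer
    order by c.Customer""").fetchall()

# The SUM(Distinct) version on the same data, for comparison.
distinct_shipping = dict(conn.execute("""
    select o.Customer, sum(distinct o.ShippingCost)
    from Orders o
    inner join OrderDetails od on o.OrderID = od.OrderID
    group by o.Customer"""))

for row in final_rows:
    print(row)
print(distinct_shipping)  # {'ABC': 95, 'DEF': 10}
```

On this data the derived-table query reports 20 for DEF's shipping total while SUM(Distinct) reports 10, which is the discrepancy the article describes.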