How to Use GROUP BY with Distinct Aggregates and Derived tables

By Jeff Smith on 31 July 2007 | Tags: Queries , SELECT

In How to Use GROUP BY, we worked on a simple report request and covered the basics of GROUP BY and the issue of duplicate rows caused by JOINs. Today we'll finish up that report while examining SUM(Distinct), and see just how crucial derived tables are when summarizing data from multiple tables.

The Problem with SUM(Distinct)

We previously learned that we can use COUNT(Distinct) to count columns from the duplicated table, so what about SUM(Distinct)? It seems like that should do the trick, since we only want to sum distinct shipping cost values, not all the duplicates. Let's give it a try:

select 
  o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount, 
  count(distinct o.OrderID) as OrderCount, 
  sum(distinct o.ShippingCost) as TotalShipping

from 
  Orders o
inner join 
  OrderDetails od on o.OrderID = od.OrderID
group by 
  o.Customer

Customer   ItemCount   OrderAmount           OrderCount  TotalShipping         
---------- ----------- --------------------- ----------- --------------------- 
ABC        6           725.0000              3           95.0000
DEF        2           350.0000              1           10.0000

(2 row(s) affected)

And there it is! We seem to have solved our problem: looking back to our Orders table, we can see that the TotalShipping cost per Customer now looks correct.

But wait ... It is actually wrong!

This is where many people have problems. Yes, the data looks correct. And, for this small sample, it just randomly happens to be correct. But SUM(DISTINCT) works exactly the same as COUNT(DISTINCT): It simply gets all of the values eligible to be summed, eliminates all duplicate values, and then adds up the results. But it is eliminating duplicate values, not duplicate rows based on some primary key column! It doesn't care that shipping cost 40 belonged to orderID #1 and that shipping cost 30 belonged to OrderID #2; it simply doesn't separate them that way.

The expression SUM(Distinct ShippingCost) is basically evaluated like this:

After Joining from Orders to OrderDetails, each group has the following Shipping cost values:

Customer ABC: 40,40,30,30,30,25
Customer DEF: 10
Since DISTINCT was asked for, it eliminates duplicate values from those lists:

Customer ABC: 40,40,30,30,30,25
Customer DEF: 10
And now it can evaluate the SUM() by adding up the remaining values:

Customer ABC: 40+30+25 = 95
Customer DEF: 10 = 10

If you aren't getting the concept, you still might not see the problem. In fact, at this point, many people never do. They see that SUM(x) returns huge numbers that cannot be right, so they tweak it and try SUM(DISTINCT x), and the values look much more reasonable, and they might even initially tie out perfectly, so off to production it goes. Yet, the SQL is incorrect; it is relying on the fact that currently no two orders for a customer have the same shipping cost.

Let's demonstrate by adding another order:

insert into Orders values (5, 'DEF', '2007-01-04', 10)
insert into OrderDetails values (9, 5, 'Item J', 125)

Running that simply adds another Order for Customer DEF, shipping cost of $10, with one OrderDetail item for $125. Now, let's execute that same SELECT again to see how this new Order affected our results:

select 
  o.Customer, count(*) as ItemCount, sum(od.Amount) as OrderAmount, 
  count(distinct o.OrderID) as OrderCount, 
  sum(distinct o.ShippingCost) as TotalShipping
from 
  Orders o
inner join 
  OrderDetails od on o.OrderID = od.OrderID
group by 
  Customer

Customer   ItemCount   OrderAmount           OrderCount  TotalShipping         
---------- ----------- --------------------- ----------- --------------------- 
ABC        6           725.0000              3           95.0000
DEF        3           475.0000              2           10.0000

(2 row(s) affected)

The ItemCount, OrderAmount and OrderCount columns look great. But the TotalShipping cost for DEF still shows $10! What happened!?

Can you figure it out? Remember how SUM(Distinct) works! It just takes distinct values passed to the function and eliminates duplicates. Both orders for DEF had a shipping cost of $10, and SUM(Distinct ShippingCost) doesn't care that the two $10 values are for different Orders, it just knows that the 10 is duplicated for the Customer, so it only uses the 10 once to calculate the SUM. Thus, it returns a value of 10 as the total shipping cost for those two orders, even though it should be 10+10=20. Our result is now wrong. The long and short of it is this: Never use SUM(Distinct) ! It doesn't usually make logical sense in most situations; there may be a time and place for it, but it is definitely not here.

Summarizing Derived Tables

So, how do we fix this? Well, like many SQL problems, the answer is simple: Do it one step at a time, don't try to join all of the tables together and just add SUM() and GROUP BY and DISTINCT almost randomly until things work; break it down logically step by step.

So, before worrying about totals per Customer, let's step back and focus on returning totals per Order. If we can return totals per Order first, then we can simply summarize those Order totals by Customer and we'll have the results we need. Let's summarize the OrderDetails table to return 1 row per Order, with the ItemCount and the total Order Amount:

select 
  orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
from 
  orderDetails
group by 
  orderID

orderID     ItemCount   OrderAmount           
----------- ----------- --------------------- 
1           2           250.0000
2           3           375.0000
3           1           100.0000
4           2           350.0000
5           1           125.0000

(5 row(s) affected)

Nice and simple, easy to verify, things look good. Because we are grouping on OrderID, we can say that these results have a virtual primary key of OrderID -- that is, there will never be duplicate rows for the same Order. In fact, here's another basic rule to always remember:

The virtual primary key of a SELECT with a GROUP BY clause will always be the expressions stated in the GROUP BY.

We can now take that SQL statement and those results and encapsulate them in their own derived table. If we join from the Orders table to the previous SELECT as a derived table, we get:

select 
  o.orderID, o.Customer, o.ShippingCost, d.ItemCount, d.OrderAmount
from
  orders o
inner join
(

    select 
      orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from 
      orderDetails
    group by 
      orderID
) d
  on o.orderID = d.orderID

orderID     Customer   ShippingCost          ItemCount   OrderAmount           
----------- ---------- --------------------- ----------- --------------------- 
1           ABC        40.0000               2           250.0000
2           ABC        30.0000               3           375.0000
3           ABC        25.0000               1           100.0000
4           DEF        10.0000               2           350.0000
5           DEF        10.0000               1           125.0000

(5 row(s) affected)

Let's examine those results. There are no duplicate rows or values anywhere; there is exactly one row per Order. This is because our derived table has a virtual primary key of OrderID, so joining from Orders to our derived table will never produce duplicates. This is a very useful and simple technique to avoid duplicates when relating a parent table to a child table: summarize the child table by the parent's primary key first in a derived table, and then join it to the parent table. The parent table’s rows will then never be duplicated and can be summarized accurately.

Now we have our total ItemCount per order, as well as our total OrderAmount per order. And we can see that if we sum these results up, our ShippingCost column will be fine, since it is never duplicated. No need for distinct. In fact, we can even use a regular COUNT(*) expression to get the total number of orders per customer!

So, we can simply add "GROUP BY Customer" to the previous SQL, calculate what we need with aggregate functions, and remove any columns (like OrderID) that we will not be summarizing. You might also notice that at this point, the total ItemCount per Customer is no longer a COUNT(*) expression; it is a simple SUM() of the ItemCount value returned from our derived table.

Here's the result:

select 
  o.Customer, count(*) as OrderCount, 
  sum(o.ShippingCost) as ShippingTotal, sum(d.ItemCount) as ItemCount, 
  sum(d.OrderAmount) as OrderAmount
from
  orders o
inner join
(
    select 
      orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from 
      orderDetails
    group by 
      orderID
) d
  on o.orderID = d.orderID
group by 
  o.customer

Customer   OrderCount  ShippingTotal         ItemCount   OrderAmount           
---------- ----------- --------------------- ----------- --------------------- 
ABC        3           95.0000               6           725.0000
DEF        2           20.0000               3           475.0000

(2 row(s) affected)

And there you have it! We examined our data, logically considered the implications of our JOINS, broke the problem down into smaller parts, and ended up with a fairly simple solution that we know will be quick, efficient and accurate.

Adding More Tables a Summarized SELECT

To finish things up, suppose our schema also has a table of Customers:

Create table Customers
(
	Customer varchar(10) primary key,
	CustomerName varchar(100) not null,
	City varchar(100) not null,
	State varchar(2) not null
)

insert into Customers
select 'ABC','ABC Corporation','Boston','MA' union all
select 'DEF','The DEF Foundation','New York City','NY'

... and we wish to also return each customers' name, city and state in our previous results. One way to do this is to simply add the Customers table to our existing join, and then add the customer columns to the SELECT clause. However, unless you add all of the customer columns to the GROUP BY as well, you will get an error message indicating that you need to either group or summarize all columns you wish to display. We aren't trying to calculate a COUNT() or a SUM() of Name, City and State, so it doesn't make sense to wrap those columns in an aggregate expression. So, it appears that we must add them all to our GROUP BY clause to get the results we need:

select 
  o.Customer, c.customerName, c.City, c.State,
  count(*) as OrderCount, 
  sum(o.ShippingCost) as ShippingTotal, sum(d.ItemCount) as ItemCount, 
  sum(d.OrderAmount) as OrderAmount
from
  orders o
inner join
(
    select 
      orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from 
      orderDetails
    group by 
      orderID
) d
  on o.orderID = d.orderID
inner join
  customers c on o.customer = c.customer
group by 
  o.customer, c.customerName, c.City, c.State

Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
ABC        ABC Corporation      Boston          MA    3           95.0000               6    725.0000
DEF        The DEF Foundation   New York City   NY    2           20.0000               3    475.0000

(2 row(s) affected)

Technically, that works, but it seems silly to list all of those customer columns in the GROUP BY ... After all, we are just grouping on Customer, not on each of the customer's attributes, right?

What's interesting is that the solution is something we already talked about and the same technique applies: Since Customer has a one-to-many relation with Orders, we know that joining Customers to Orders will result in duplicate rows per Customer, and thus all columns in the Customer table are duplicated in the results. You might notice that this is exactly the same scenario that applies when joining Orders to OrderDetails. So, we handle this situation the same way! We simply summarize our Orders by Customer first, in a derived table, and then we join those results to the Customer table. This means that no columns from the Customer table will be dupicated at all, and there is no need to add them all to our GROUP BY expression. This keep our SQL clean, organized, and logically sound.

So, our final results now look like this:

select 
  c.Customer, c.customerName, c.City, c.State,
  o.OrderCount, o.ShippingTotal,
  o.ItemCount, o.OrderAmount
from
(  
  select 
    o.customer, count(*) as OrderCount, 
    sum(o.ShippingCost) as ShippingTotal, sum(d.ItemCount) as ItemCount, 
    sum(d.OrderAmount) as OrderAmount
  from
    orders o
  inner join
  (
    select 
      orderID, count(*) as ItemCount, sum(Amount) as OrderAmount
    from 
      orderDetails
    group by 
      orderID
  ) d
    on o.orderID = d.orderID
  group by o.customer
) o
inner join
  customers c on o.customer = c.customer

Customer   customerName         City            State OrderCount  ShippingTotal ItemCount OrderAmount
---------- -------------------- --------------- ----- ----------- ------------- --------- -----------
ABC        ABC Corporation      Boston          MA    3           95.0000               6    725.0000
DEF        The DEF Foundation   New York City   NY    2           20.0000               3    475.0000

(2 row(s) affected)

Conclusion

I hope this two part series helps a little bit with your understanding of GROUP BY queries. It is vital to identify and understand what the virtual primary key of a result set is when you join multiple tables, and to recognize which rows are duplicated or not. In addition, remember that COUNT(Distinct) can be useful, but SUM(Distinct) should very rarely, if ever, be used.

In general, if you find that values you need to SUM() have been duplicated, summarize the table causing those duplicates separately and join it in as a derived table. This will also allow you to break down your problem into smaller steps and test and validate the results of each step as you go.

GROUP BY is a very powerful feature, but is also misunderstood and abused, and the easiest way to leverage it is to carefully build your SQL from smaller, simpler parts into larger, more complicated solutions.