Counting Parents and Children with Count Distinct

By Mark Caldwell on 10 January 2006 | Tags: SELECT


The aggregate functions in SQL Server (MIN, MAX, SUM, COUNT, AVG, etc.) are great tools for reporting and business analysis. But sometimes you need to tweak them just a little to get exactly the results you need. For example, if your manager asked for a report on how many sales have been made to your clients and how large they were, would you know how to get that data efficiently? Mark ran into something like this recently, and here's the approach he took to solve the problem.

A typical, simple table structure for tracking sales orders includes a Customer table with name and address information, an OrderHeader table with the Customer's ID, Order Date and other details about the order as a whole, and an OrderDetail table with the detail line items for the order. To simplify our example, we will eliminate the separate Customer table and just focus on the OrderHeader and OrderDetail tables.

To set the stage, run the BuildSampleTables.sql script. Now that we have some sample data to work with, let's discuss our options. We're looking for three measurements. First, how many orders have been placed by each client? Second, how many total line items have been ordered by each client? And third, how much money do all those orders add up to?
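
If the BuildSampleTables.sql script is not to hand, a minimal stand-in along these lines will let you follow along. The table and column names come from the queries in this article, but the data types, the OrderDate values and every row are assumptions made up for illustration, so the results will not match the output shown in this article. The multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical stand-in for BuildSampleTables.sql (types and rows are assumed).
CREATE TABLE OrderHeader (
	OrderID      int         NOT NULL PRIMARY KEY,
	CustomerName varchar(25) NOT NULL,
	OrderDate    datetime    NOT NULL
);

CREATE TABLE OrderDetail (
	DetailID   int           NOT NULL PRIMARY KEY,
	OrderID    int           NOT NULL REFERENCES OrderHeader (OrderID),
	LineAmount decimal(9, 2) NOT NULL
);

-- Two orders for Customer A, one for Customer B.
INSERT INTO OrderHeader (OrderID, CustomerName, OrderDate) VALUES
	(1, 'Customer A', '20060102'),
	(2, 'Customer A', '20060105'),
	(3, 'Customer B', '20060106');

-- Order 1 has two lines, order 2 has one, order 3 has three.
INSERT INTO OrderDetail (DetailID, OrderID, LineAmount) VALUES
	(1, 1, 10.00),
	(2, 1, 20.00),
	(3, 2,  5.00),
	(4, 3,  7.50),
	(5, 3,  2.50),
	(6, 3,  1.00);
```

With these made-up rows, the correct answers are 2 orders, 3 lines and 35.00 for Customer A, and 1 order, 3 lines and 11.00 for Customer B.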

You could run separate queries on the OrderHeader and OrderDetail tables, but that would be more work than is necessary, and besides, it makes you look like a rookie. It also does not lend itself to use in a SQL Server Reporting Services report, which you know they're going to ask for sooner or later. So you figure there must be a way to answer it all in one query. It briefly crosses your mind that you could probably get the results using a subquery, but you know that can run into performance problems. So let's see what else we can come up with.
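
For comparison, the subquery route mentioned above might look roughly like the sketch below. It assumes only the table and column names used in this article's queries, and it is shown to illustrate the idea, not as a recommendation; how it actually performs depends on the data and on the plan the optimizer chooses.

```sql
-- Sketch of the subquery alternative: one correlated subquery per measurement.
SELECT
	C.CustomerName,
	(SELECT COUNT(*)
	 FROM OrderHeader H
	 WHERE H.CustomerName = C.CustomerName) AS OrderCount,
	(SELECT COUNT(*)
	 FROM OrderHeader H
	 JOIN OrderDetail D ON H.OrderID = D.OrderID
	 WHERE H.CustomerName = C.CustomerName) AS LineCount,
	(SELECT SUM(D.LineAmount)
	 FROM OrderHeader H
	 JOIN OrderDetail D ON H.OrderID = D.OrderID
	 WHERE H.CustomerName = C.CustomerName) AS TotalAmount
FROM
	(SELECT DISTINCT CustomerName FROM OrderHeader) C
```

As written, each measurement refers to the tables separately, which is the repetition a single joined query avoids.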

You immediately recognize that you are looking for some record counts and sums, based on two related tables and organized by customer, so you're probably looking at a JOIN and a GROUP BY clause. Let's start with the basics and see how close we get to our goal. Our initial attempt might look like the following:

SELECT
	H.CustomerName,
	COUNT(H.OrderID) AS OrderCount,     -- intended: orders per customer
	COUNT(D.DetailID) AS LineCount,     -- line items per customer
	SUM(D.LineAmount) AS TotalAmount    -- total value of those line items
FROM
	OrderHeader H
	JOIN OrderDetail D ON H.OrderID = D.OrderID
GROUP BY
	H.CustomerName

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

CustomerName              OrderCount  LineCount   TotalAmount 
------------------------- ----------- ----------- ----------- 
ABC Manufacturing                   9           9      387.93 
ACME Supplies                       2           2      115.81 
SQLTeam                            12          12      862.77

This is a good start, but looking at the results, we see that OrderCount and LineCount are identical. That would imply that each order had only one line item, which seems highly unlikely, and a quick review of the data proves it is not so. On top of that, the OrderCount looks unreasonably high. The reason is that when you join a child table to its parent in a one-to-many relationship, the columns of the parent table are repeated for each child row, so the result has one row per child, with repeating OrderIDs. COUNT(H.OrderID) simply counts every non-NULL OrderID in those rows, duplicates included. You can confirm this by removing the GROUP BY and the aggregate functions.
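
A minimal way to see the repetition, assuming the same two tables, is to list the joined rows directly. The ORDER BY is only there to keep each order's lines together.

```sql
-- Each OrderHeader row appears once per matching OrderDetail row.
SELECT
	H.CustomerName,
	H.OrderID,
	D.DetailID,
	D.LineAmount
FROM
	OrderHeader H
	JOIN OrderDetail D ON H.OrderID = D.OrderID
ORDER BY
	H.CustomerName,
	H.OrderID,
	D.DetailID
```

An order with three line items shows up as three rows carrying the same OrderID, and a plain COUNT(H.OrderID) counts all three.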

Seeing the repeating OrderIDs gives you an idea -- if only there were some way to count only the distinct OrderHeader record IDs. And sure enough, that is exactly what we're going to do. We can insert the word DISTINCT inside the COUNT function of the header record OrderID, and we end up with:

SELECT
	H.CustomerName,
	COUNT(DISTINCT H.OrderID) AS OrderCount,  -- each order counted once
	COUNT(D.DetailID) AS LineCount,           -- every joined detail row
	SUM(D.LineAmount) AS TotalAmount
FROM
	OrderHeader H
	JOIN OrderDetail D ON H.OrderID = D.OrderID
GROUP BY
	H.CustomerName

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

CustomerName              OrderCount  LineCount   TotalAmount 
------------------------- ----------- ----------- ----------- 
ABC Manufacturing                   3           9      387.93 
ACME Supplies                       1           2      115.81 
SQLTeam                             2          12      862.77 

And now we see that the OrderCount is accurate, and the other numbers remained the same. A simple and clean solution, just the way we like them. Without an ORDER BY, SQL Server does not guarantee the order of the rows, so for added neatness you could add an ORDER BY H.CustomerName clause, or at least do some sorting in your final report.
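
One way to double-check the OrderCount column is to count the header rows on their own, as in this sketch against the same OrderHeader table. The two counts should agree as long as every order has at least one detail line. Because the join in the queries above is an inner join, an order with no line items would be missing from the joined result but would still be counted here.

```sql
-- Cross-check: count orders straight from the parent table, no join involved.
SELECT
	CustomerName,
	COUNT(*) AS OrderCount
FROM
	OrderHeader
GROUP BY
	CustomerName
ORDER BY
	CustomerName
```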

