Reader Challenge #1 Solutions (Part II)

By Bill Graziano on 4 June 2001 | Tags: Reader Challenges

This article is the second summary of responses to the Reader Challenge. I'm covering solutions by Michael Price (again), Ilya Zaltsman, John Nowak and Stephen Morris. These are the fastest and cleanest solutions. I'll also cover the first entry that returned the actual values. Thanks again for your entries and I'm busy looking for another query for the next Reader Challenge.

It might be helpful to review the original challenge and the first correct response I received. I had numerous people submit articles that were pretty similar to what I already published. They are (in order of submission) sriram raghunathan, JP, Marc Yabroff, Jon Scott-Sheldon, Allison Anderson, Balachandar Ganesan, Justin Bigelow (who got me started on this), Eugene Okhrits and John Henry. I also had late entries from sqlnr, Pedro Cardoso and Tim Graham that weren't radically different from the other entries. Merkin also submitted a partial solution but he's written an article for us so he wasn't "officially" eligible.

A Clean Solution

This is a pretty subjective category. I tried to identify a solution was easy for a novice SQL developer to read and understand. I think in the future I'll present this one first and work my way up to the more complex. I thought the "cleanest" solution came from Ilya Zaltsman. He writes:

Hey, Graz!

First of all, thanks for the site -- to me it is the best source of SQL information anywhere!!!

I thought I'd try my hand at answering the reader challege: Statistics... The subject I've been trying to forget (especially since I blew $350 in a casino last night)...

I can't comment on Ilya's gambling skills but his SQL skills seem to be pretty good. Here's the query he submitted:

DECLARE @Mean float 
DECLARE @StDev float 
DECLARE @LowerBound float 
DECLARE @UpperBound float 

--Get Mean and StDev
SELECT @Mean = AVG(Ind_Ret), 
@StDev = STDEV(Ind_Ret) FROM Samples 

----Get Upper and Lower bounds of the range 
SELECT @LowerBound = @Mean - (@StDev * 3), 
@UpperBound = @Mean + (@StDev * 3) 

----Count the outliers 
SELECT 
  (SELECT Count(Ind_Ret) as Outliers_Below 
    FROM Samples 
    WHERE Ind_Ret<@LowerBound) as Outliers_Below , 
  (SELECT Count(Ind_Ret) as Outliers_Below 
    FROM Samples 
    WHERE Ind_Ret>@UpperBound) as Outliers_Above

I think this lays out the steps that the query has to go through pretty thoroughly. Ilya also gets bonus points for using comments. This query generates three table scans which isn't all that bad. I've got solutions with two table scans but I've also got solutions with five table scans.

Returning the Values

John Nowak sent in the first solution that returned the actual values in a single SELECT statement. Here's his solution:

SELECT  so.ind_ret AS OverYakLimit, 
        su.ind_ret AS UnderYakLimit 
FROM samples s 
LEFT JOIN samples so 
ON s.ind_ret = so.ind_ret 
AND so.ind_ret >(SELECT ISNULL((AVG(Ind_Ret) + STDEV(Ind_Ret) * 3),0) 
                AS CountstdCountOverLimit 
                FROM Samples) 
LEFT JOIN samples su 
ON s.ind_ret = su.ind_ret 
AND su.ind_ret <(SELECT ISNULL((AVG(Ind_Ret) - STDEV(Ind_Ret) * 3),0) 
                AS CountstdCountOverLimit 
                FROM Samples) 
WHERE   (ISNULL(so.ind_ret,0) + ISNULL(su.ind_ret,0)) > 0 
ORDER BY so.ind_ret, su.ind_ret

It generates five table scans but does return all the values. Plus he gets bonus points for a few Yak references.

Two Table Scans

I don't think this query can be written in a single table scan. You need one full scan to determine the standard deviation and average. You need a second scan to identify records that are more than three standard deviations from the average. I got two solutions with two table scans from Michael Price and Ilya Zaltsman. Michael was first and writes (in his second email):

Well, when driving home last night I was thinking about this some more, and came up with a qry that only uses 2 table scans . . . Now maybe I can get this out of my head <g>, I don't think a single pass on the table is possible, but look forward to the other users responses to see if they came up with a way.

And here's the solution he came up with:

Select Sum( case 
              when Ind_Ret > NPLimitTop Then 1 
              Else 0 End ) As AboveNPLimit,
       Sum( case 
              when Ind_Ret < NPLimitBottom Then 1 
              Else 0 End ) As BelowNPLimit
  From Samples, (
                  Select Avg(Ind_Ret) + (Stdev(Ind_Ret) * 3) AS NPLimitTop,
                         Avg(Ind_Ret) - (Stdev(Ind_Ret) * 3) AS NPLimitBottom
                  From Samples
                ) RangeData

You'll notice his tricky use of a cartesian join to pull NPLimitTop and NPLimitBottom into his case statement. This is very similar to what he did in the first article I wrote. This solution only uses two table scans which is as fast as I think it can be done.

Ilya's solution with two table scans was different and I thought I'd publish it too.

SELECT 
  SUM(Below) as Outliers_Below,  -- Count outliers 
  SUM(Above) as Outliers_Above 
FROM 
  (SELECT 
     -- Break up high-end and low-end outliers into separate cols
     CASE  
       WHEN a.Ind_Ret > b.UpperBound THEN 1 
       ELSE 0 
     END as Above,  
     CASE 
       WHEN a.Ind_Ret < b.LowerBound THEN 1 
       ELSE 0 
     END as Below 
   FROM Samples as a, 
      -- Obtain range boundaries
     (SELECT AVG(Ind_Ret)+(3*STDEV(Ind_Ret)) as UpperBound,   
             AVG(Ind_Ret)-(3*STDEV(Ind_Ret)) as LowerBound 
      FROM Samples) as b 
      -- Eliminate non-outlier values
      WHERE (a.Ind_Ret > b.UpperBound OR a.Ind_Ret < b.LowerBound)) as Yaks

An Index?

Stephen Morris submitted an interesting solution. He built an index on the table. Stephen writes

hi Graz

got sucked in to your Reader Challenge. Added an index to get rid of the table scans & wrote a single select solution - see below

got to get back to work

Here's his statement:

CREATE CLUSTERED INDEX IX_Samples ON Samples
	(
	Ind_Ret
	) ON [PRIMARY]


SELECT 
	CASE 
		WHEN ind_ret - average > 0 
		THEN 'Above' 
		ELSE 'Below' 
	END as Direction
      , isnull(cast(ind_ret as varchar(20)), 'TOTAL') as Ind_ret
      , count(*) as CountofSamples
FROM SAMPLES
CROSS JOIN 
(
	SELECT  
	  avg(s.ind_ret) average
	, STDEV(s.Ind_Ret) stddev
	FROM SAMPLES s
) as AggFnResults
WHERE abs(ind_ret - average) > stddev * 3
GROUP BY 
	  CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END
	, ind_ret 
WITH ROLLUP
ORDER BY 1,2 DESC

His query only generated two tables scans before adding the index so it was already pretty efficient. He does have some extra GROUP BY statements that generate some additional overhead though. After adding the index the Table Scans were converted to Clustered Index Scans. They seemed to take the exact same amount of processing time that the table scan did which is what I'd expected.

Obviously in a real environment there would have been multiple tests stored in the same table and indexes would make a huge difference. Of course, in a real environment proper indexes would have been built from the start and all the queries would have used them. So in this case using an index didn't really help or hurt this solution much. But it was interesting to see their effect on the query.

Thanks to everyone that entered. I think I'll do another one of these in a few weeks. I've got a couple of queries that would make good candidates. Thanks for "playing" and I hope you all give it a try next time.