Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Calculating lower and upper quartiles for box plot

Author  Topic 

towardabettercountry
Starting Member

26 Posts

Posted - 2013-01-22 : 12:24:06
Hi All,

I'm trying to calculate median, lower quartile, and upper quartile for a box plot chart using SQL.

Here is some sample data:

CREATE TABLE EStats
(
PersonID VARCHAR(30) NOT NULL,
Grade VARCHAR(25) NOT NULL,
CourseDate Date NOT NULL
)
;

INSERT INTO EStats
(
PersonID, Grade, CourseDate
)

VALUES
('100', '91', '2010-03-01'),
('101', '96', '2010-03-01'),
('102', '88', '2010-03-01'),
('103', '92', '2010-03-01'),
('104', '81', '2010-03-01'),
('105', '85', '2010-03-01'),
('106', '91', '2010-03-01'),
('107', '89', '2010-03-01'),
('108', '99', '2010-03-01'),
('109', '88', '2010-03-01'),
('110', '81', '2011-03-02'),
('111', '77', '2011-03-02'),
('112', '88', '2011-03-02'),
('113', '76', '2011-03-02'),
('114', '69', '2011-03-02'),
('115', '70', '2011-03-02'),
('116', '75', '2011-03-02'),
('117', '88', '2011-03-02'),
('118', '76', '2011-03-02'),
('119', '95', '2012-03-01'),
('120', '96', '2012-03-01'),
('121', '90', '2012-03-01'),
('122', '80', '2012-03-01'),
('123', '85', '2012-03-01'),
('124', '94', '2012-03-01'),
('125', '89', '2012-03-01'),
('126', '97', '2012-03-01'),
('127', '94', '2012-03-01'),
('128', '72', '2012-03-01'),
('129', '88', '2012-03-01'),
('130', '91', '2012-03-01')


This query calculates the median:

--SELECT * FROM EStats

SELECT
CourseDate,
AVG(CAST(Grade AS Numeric)) AS Median

FROM
(
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CourseDate
ORDER BY CourseDate

However, I don't know how to calculate lower and upper quartiles (in order to put in a box plot or box-and-whisker chart). Can anyone help?

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-01-22 : 13:05:37
You should be able to use the NTILE function available in SQL 2005 and later. The MSDN page has an example of calculating quartile http://msdn.microsoft.com/en-us/library/ms175126.aspx
Go to Top of Page

towardabettercountry
Starting Member

26 Posts

Posted - 2013-01-22 : 14:19:14
Hi James,

Thanks for the reply. I've played around some with NTILE regarding this solution, and can't seem to get it to work. As I understand it the lower and upper quartiles for box plots are the median of the two halves. I wasn't sure how to get NTILE to deal with variables such as when there are an even amount of numbers.

Any help (especially an example!) would be great!!
Go to Top of Page

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-01-22 : 15:08:18
If the groups are not divisible by 4, the distribution will not be even. The MSDN example is NTILE(5) when there are 53 rows. First 3 will get 11 and the next two will get 10.

In your example for 2010, where you have 10 rows, what do you want the first quartile boundary to be? 2, 3, 2.5 or something else?
Go to Top of Page

towardabettercountry
Starting Member

26 Posts

Posted - 2013-01-22 : 15:23:29
Okay, I think I am moving toward a solution. The below is ugly and cumbersome, but its the best I've got right now. However, I noticed my median in off in some places. When I researched the problem, I found that the row_number in my inner query doesn't handle "ties" very well. in other words, it may have something like this:

rn grade
1 88
2 89
4 90
3 90
5 95

You can run this to see what I'm talking about:

SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats

Notice for CourseDate 2010-03-01 the row_number works as follows:

RowAsc
10
9
8
6
7
5
3
4
2
1

Would you mind suggesting how to fix this?


Additionally, here is what I have that calculates everything for a boxplot. Once I get this row_number issue fixed, it should work.

WITH Q3 AS
(
SELECT
CourseDate,
AVG(CAST(Grade AS Numeric)) AS Median

FROM
(
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CourseDate
--ORDER BY CourseDate
),

Q2 AS
(
SELECT
x.CourseDate,
AVG(CAST(Grade AS Numeric)) AS LowerQuartile

FROM
(
SELECT
Estats.CourseDate,
Estats.Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(EStats.CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(Estats.CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate
WHERE EStats.Grade < Q3.Median
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.CourseDate
),

Q4 AS
(
SELECT
x.CourseDate,
AVG(CAST(Grade AS Numeric)) AS UpperQuartile

FROM
(
SELECT
Estats.CourseDate,
Estats.Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(EStats.CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(Estats.CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate
WHERE EStats.Grade > Q3.Median
)x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.CourseDate
)

SELECT Q3.CourseDate, Q3.Median AS Median, Q2.LowerQuartile, Q4.UpperQuartile, MIN(EStats.Grade) AS Min, MAX(EStats.Grade) AS Max
FROM Q3
JOIN Q2 ON Q3.CourseDate = Q2.CourseDate
JOIN Q4 ON Q3.CourseDate = Q4.CourseDate
JOIN EStats ON Q3.CourseDate = EStats.CourseDate
GROUP BY Q3.CourseDate, Q3.Median, Q2.LowerQuartile, Q4.UpperQuartile
ORDER BY Q3.CourseDate
Go to Top of Page

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-01-22 : 17:18:41
Instead of row_number function can you try RANK function or DENSE_RANK function? They work the same way as row_number function, except, would assign the same rank for ties. DENSE_RANK would not leave any gaps, RANK would. Not quite sure which of these would work correctly for you when the value around the median has ties - you will have to experiment.
Go to Top of Page

towardabettercountry
Starting Member

26 Posts

Posted - 2013-01-22 : 17:54:11
Hi James,

Thanks again for hanging in there with me. Rank and Dense_Rank don't work because the numbering isn't sequential in the case of ties.
Go to Top of Page

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-01-22 : 19:11:45
To make it consistent for desc and ascending orders, there has to be some other tie-breaker, so the result is determinate.
Alternatively, one could do the following - here I am modifying the query that you posted in your last reply:
--- Original query
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats

---- Revised query
;WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RN
FROM EStats
)
SELECT CourseDate,
Grade,
ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN ASC) AS
RowAsc,
ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN DESC) AS
RowDesc
FROM cte

Go to Top of Page

towardabettercountry
Starting Member

26 Posts

Posted - 2013-01-23 : 10:43:12
James,

That is fantastic! Works perfectly, thanks a TON!
Go to Top of Page

James K
Master Smack Fu Yak Hacker

3873 Posts

Posted - 2013-01-23 : 13:38:09
You are very welcome - glad to be of help.
Go to Top of Page
   

- Advertisement -