Author |
Topic |
towardabettercountry
Starting Member
26 Posts |
Posted - 2013-01-22 : 12:24:06
|
Hi All,I'm trying to calculate median, lower quartile, and upper quartile for a box plot chart using SQL.Here is some sample data:CREATE TABLE EStats ( PersonID VARCHAR(30) NOT NULL, Grade VARCHAR(25) NOT NULL, CourseDate Date NOT NULL);INSERT INTO EStats( PersonID, Grade, CourseDate)VALUES ('100', '91', '2010-03-01'), ('101', '96', '2010-03-01'), ('102', '88', '2010-03-01'), ('103', '92', '2010-03-01'), ('104', '81', '2010-03-01'), ('105', '85', '2010-03-01'), ('106', '91', '2010-03-01'), ('107', '89', '2010-03-01'), ('108', '99', '2010-03-01'), ('109', '88', '2010-03-01'), ('110', '81', '2011-03-02'), ('111', '77', '2011-03-02'), ('112', '88', '2011-03-02'), ('113', '76', '2011-03-02'), ('114', '69', '2011-03-02'), ('115', '70', '2011-03-02'), ('116', '75', '2011-03-02'), ('117', '88', '2011-03-02'), ('118', '76', '2011-03-02'), ('119', '95', '2012-03-01'), ('120', '96', '2012-03-01'), ('121', '90', '2012-03-01'), ('122', '80', '2012-03-01'), ('123', '85', '2012-03-01'), ('124', '94', '2012-03-01'), ('125', '89', '2012-03-01'), ('126', '97', '2012-03-01'), ('127', '94', '2012-03-01'), ('128', '72', '2012-03-01'), ('129', '88', '2012-03-01'), ('130', '91', '2012-03-01')This query calculates the median:--SELECT * FROM EStatsSELECT CourseDate, AVG(CAST(Grade AS Numeric)) AS Median FROM( SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats)xWHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)GROUP BY CourseDateORDER BY CourseDateHowever, I don't know how to calculate lower and upper quartiles (in order to put in a box plot or box-and-whisker chart). Can anyone help? |
|
James K
Master Smack Fu Yak Hacker
3873 Posts |
Posted - 2013-01-22 : 13:05:37
|
You should be able to use the NTILE function available in SQL 2005 and later. The MSDN page has an example of calculating quartile http://msdn.microsoft.com/en-us/library/ms175126.aspx |
|
|
towardabettercountry
Starting Member
26 Posts |
Posted - 2013-01-22 : 14:19:14
|
Hi James,Thanks for the reply. I've played around some with NTILE regarding this solution, and can't seem to get it to work. As I understand it the lower and upper quartiles for box plots are the median of the two halves. I wasn't sure how to get NTILE to deal with variables such as when there are an even amount of numbers.Any help (especially an example!) would be great!! |
|
|
James K
Master Smack Fu Yak Hacker
3873 Posts |
Posted - 2013-01-22 : 15:08:18
|
If the groups are not divisible by 4, the distribution will not be even. The MSDN example is NTILE(5) when there are 53 rows. First 3 will get 11 and the next two will get 10.In your example for 2010, where you have 10 rows, what do you want the first quartile boundary to be? 2, 3, 2.5 or something else? |
|
|
towardabettercountry
Starting Member
26 Posts |
Posted - 2013-01-22 : 15:23:29
|
Okay, I think I am moving toward a solution. The below is ugly and cumbersome, but its the best I've got right now. However, I noticed my median in off in some places. When I researched the problem, I found that the row_number in my inner query doesn't handle "ties" very well. in other words, it may have something like this:rn grade1 882 894 903 905 95You can run this to see what I'm talking about: SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStatsNotice for CourseDate 2010-03-01 the row_number works as follows:RowAsc10986753421Would you mind suggesting how to fix this?Additionally, here is what I have that calculates everything for a boxplot. Once I get this row_number issue fixed, it should work.WITH Q3 AS( SELECT CourseDate, AVG(CAST(Grade AS Numeric)) AS Median FROM ( SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats )x WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1) GROUP BY CourseDate --ORDER BY CourseDate),Q2 AS( SELECT x.CourseDate, AVG(CAST(Grade AS Numeric)) AS LowerQuartile FROM ( SELECT Estats.CourseDate, Estats.Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(EStats.CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(Estats.CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate WHERE EStats.Grade < Q3.Median )x WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1) GROUP BY x.CourseDate),Q4 AS( SELECT x.CourseDate, AVG(CAST(Grade AS Numeric)) AS UpperQuartile FROM ( SELECT Estats.CourseDate, Estats.Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(EStats.CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(Estats.CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate WHERE EStats.Grade > Q3.Median )x WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1) GROUP BY x.CourseDate)SELECT Q3.CourseDate, Q3.Median AS Median, Q2.LowerQuartile, Q4.UpperQuartile, MIN(EStats.Grade) AS Min, MAX(EStats.Grade) AS MaxFROM Q3 JOIN Q2 ON Q3.CourseDate = Q2.CourseDate JOIN Q4 ON Q3.CourseDate = Q4.CourseDate JOIN EStats ON Q3.CourseDate = EStats.CourseDateGROUP BY Q3.CourseDate, Q3.Median, Q2.LowerQuartile, Q4.UpperQuartileORDER BY Q3.CourseDate |
|
|
James K
Master Smack Fu Yak Hacker
3873 Posts |
Posted - 2013-01-22 : 17:18:41
|
Instead of row_number function can you try RANK function or DENSE_RANK function? They work the same way as row_number function, except, would assign the same rank for ties. DENSE_RANK would not leave any gaps, RANK would. Not quite sure which of these would work correctly for you when the value around the median has ties - you will have to experiment. |
|
|
towardabettercountry
Starting Member
26 Posts |
Posted - 2013-01-22 : 17:54:11
|
Hi James,Thanks again for hanging in there with me. Rank and Dense_Rank don't work because the numbering isn't sequential in the case of ties. |
|
|
James K
Master Smack Fu Yak Hacker
3873 Posts |
Posted - 2013-01-22 : 19:11:45
|
To make it consistent for desc and ascending orders, there has to be some other tie-breaker, so the result is determinate. Alternatively, one could do the following - here I am modifying the query that you posted in your last reply:--- Original querySELECTCourseDate,Grade,ROW_NUMBER() OVER (PARTITION BY LEFT(CourseDate, 4)ORDER BY Grade ASC) AS RowAsc,ROW_NUMBER() OVER (PARTITION BY LEFT(CourseDate, 4)ORDER BY Grade DESC) AS RowDescFROM EStats---- Revised query;WITH cte AS( SELECT *, ROW_NUMBER() OVER (PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RN FROM EStats)SELECT CourseDate, Grade, ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN ASC) AS RowAsc, ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN DESC) AS RowDescFROM cte |
|
|
towardabettercountry
Starting Member
26 Posts |
Posted - 2013-01-23 : 10:43:12
|
James,That is fantastic! Works perfectly, thanks a TON! |
|
|
James K
Master Smack Fu Yak Hacker
3873 Posts |
Posted - 2013-01-23 : 13:38:09
|
You are very welcome - glad to be of help. |
|
|
|
|
|