Computing Percentiles in SQL ServerBy Guest Authors on 20 May 2004 | Tags: SELECT Say you score 570 on the verbal section of the SAT. How well did you do? According to the College Board, 570 falls into the 71st percentile. This means you did better than 71% of the other students who took the test. Percentiles are a useful way to present rankings among a large population. You may want to add percentiles to reports you produce with SQL. This article by Bob Newstadt shows how to calculate a percentile for a given group of scores. It also covers finding a score for a given percentile. By definition, a percentile is the score at or below which a particular percentage of scores fall. The code given here was tested on Microsoft SQL Server 2000. Using SQL may be the wrong tool for the job if the number of data points is very large. An alternative is to use Analysis Services which provides access to EXCEL worksheet functions for use in MDX queries. We’ll present solutions for two different problems. One is to find the pth percentile that a score falls in. The other problem is to find the score (the percentile) that characterizes a given percentage of the distribution. Find the pth percentile of a given scoreLet’s create a table variable containing a set of scores. Declare @TestScores table (StudentID int, Score int) insert @TestScores (StudentID, Score) Values (1, 20) insert @TestScores (StudentID, Score) Values (2, 03) insert @TestScores (StudentID, Score) Values (3, 40) insert @TestScores (StudentID, Score) Values (4, 45) insert @TestScores (StudentID, Score) Values (5, 50) insert @TestScores (StudentID, Score) Values (6, 20) insert @TestScores (StudentID, Score) Values (7, 90) insert @TestScores (StudentID, Score) Values (8, 20) insert @TestScores (StudentID, Score) Values (9, 11) insert @TestScores (StudentID, Score) Values (10, 30) Now let’s find the percentage of scores at or below the score 20. select CAST(SUM(CASE WHEN Score <= 20 THEN 1 ELSE 0 END) as float)/COUNT(*) as PthPercentileFor20 from @TestScores PthPercentileFor20 ------------------ 0.5 A score of 20 is the 50th percentile in the TestScores distribution. In other words, half of the students scored at or below 20. The SQL statement counts the number of scores below 20 then divides by the total number of scores in the table. The CAST forces the division to be a floating point division instead of an integer division. To simplify, let’s define a user defined function for floating point division. CREATE FUNCTION dbo.FDIV (@numerator float, @denominator float) RETURNS float AS BEGIN RETURN CASE WHEN @denominator = 0.0 THEN 0.0 ELSE @numerator / @denominator END END GO Using this function the code can be rewritten as: select dbo.FDIV(SUM(CASE WHEN Score <= 20 THEN 1 ELSE 0 END), COUNT(*)) as PthPercentileFor20 from @TestScores This code yields reliable results when the distribution is defined by many samples. InterpolatingConsider an extreme case where we only have 2 samples with values 3 and 90 defining the distribution. How does a score of 20 rank? The code above would yield 0.5. A “better” estimate of the rank is obtained by linear interpolation. rank = (20-3)/(90-3) = 17/87 = 0.1954 Let’s create a user defined function to do the linear interpolation. CREATE FUNCTION dbo.LERP (@value float, -- between low and high @low float, @high float, @newlow float, @newhigh float) RETURNS float -- between newlow and newhigh AS BEGIN RETURN CASE WHEN @value between @low and @high and @newlow <= @newhigh THEN @newlow+dbo.FDIV((@value-@low), @high-@low))*(@newhigh-@newlow) WHEN @value = @low and @newlow is not NULL THEN @newlow WHEN @value = @high and @newhigh is not NULL THEN @newhigh ELSE NULL END END GO Using this LERP function we can code the above formula as: select dbo.LERP(20,3,90,0.0,1.0) as rank Read this as: given a value 20 between 3 and 90 return a value having the same proportion between 0.0 and 1.0. The result returned is 0.1954. Find the percentile rank of a given scoreThe percentile rank is useful when the distribution is sparse (defined by relatively few samples). Interpolating between the pth percentile of the nearest samples provides a reasonable estimate of the pth percentile of a given score. This calculation is modeled after the percentilerank function in EXCEL. Note that the results may not agree with the first solution shown above. However, the percentilerank does have some nice properties. The percentilerank(min score) is 0.0, the percentilerank(max score) is 1.0, and the percentilerank(median score) is 0.5. declare @val float set @val = 25 select @val as val, dbo.LERP(@val, scoreLT, scoreGE, dbo.FDIV(countLT-1,countMinus1), dbo.FDIV(countLT,countMinus1)) as percentrank from ( select SUM(CASE WHEN Score < @val THEN 1 ELSE 0 END) as countLT, count(*)-1 as countMinus1, MAX(CASE WHEN Score < @val THEN Score END) as scoreLT, MIN(CASE WHEN Score >= @val THEN Score END) as scoreGE from @TestScores ) as x1 The derived table makes one scan over the data values to compute some aggregates. The outer select interpolates between the pth percentile of the nearest samples below and above the given value. The result is 25, 0.5. That means a score of 25 is at the 50th percentile, the median, of the distribution. Find the percentile (the score) that characterizes a given percentageThis calculation is modeled after the percentile function in EXCEL. It computes the inverse of the percentile rank function except when there are duplicate values in the samples which define the distribution. declare @pp float set @pp = .5 select @pp as factor, dbo.LERP(max(d), 0.0, 1.0, max(a.Score), max(b.Score)) as percentile from ( select floor(kf) as k, kf-floor(kf) as d from ( select 1+@pp*(count(*)-1) as kf from @TestScores ) as x1 ) as x2 join @TestScores a on (select count(*) from @TestScores aa where aa.Score < a.Score) < k join @TestScores b on (select count(*) from @TestScores bb where bb.Score < b.Score) < k+1 The derived table x2 converts the factor @pp to a position k and a fractional remainder d. Table a is used to find the kth score. Table b is used to find the k+1th score. The outer select interpolates between the kth score and k+1th score based on the remainder d. The max(d) is just a trick. We know derived table x2 will only produce one value for d. We communicate this to SQL by using max(d) to create one value for d in the general case. The result of this query is .5, 25 indicating that the 50th percentile (the median) of the distribution is the score 25. References[1] College Board. SAT Percentiles. [2] Analysis Services Excel functions. [3] NIST Engineering Statistics Handbook. Percentiles. Bob Newstadt is a seasoned project manager, application developer, and algorithm designer. For the past 8 years he has consulted on data quality initiatives and has lead development teams in the financial services industry. He is available to assist on your next SQL Server project. This article (c) Copyright 2004 Bob Newstadt and reprinted with permission.
|
- Advertisement - |