Please start any new threads on our new
site at http://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.

Our new SQL Server Forums are live!
Come on over! We've restricted the ability to create new threads on these forums.

For a set of data points (x, y), this algorithm can be used to fit the data to any of the following curves:

1. Straight line (linear regresion); y = A + b*x 2. Exponential curve; y = A*EXP(b*x); nb a > 0 3. Logarithmic curve; y = A + b*LN(x) 4. Power curve; y = A*x^b; nb a > 0

The coefficient of determination is R2 (how well does the curve fit)

-- Prepare test data
CREATE TABLE cf
(
x decimal(38, 10),
y decimal(38, 10)
)
-- Calculate Linear regression
INSERT cf
SELECT 40.5, 104.5 UNION ALL
SELECT 38.6, 102 UNION ALL
SELECT 37.9, 100 UNION ALL
SELECT 36.2, 97.5 UNION ALL
SELECT 35.1, 95.5 UNION ALL
SELECT 34.6, 94
SELECT 'Linear regression' AS Type, A, b, R2
FROM dbo.fnCurveFitting(1)
UNION ALL
SELECT 'Bestfit = ' + CAST(Type AS VARCHAR), A, b, R2
FROM dbo.fnBestFit()
-- Calculate Exponential regression
DELETE
FROM cf
INSERT cf
SELECT .72, 2.16 UNION ALL
SELECT 1.31, 1.61 UNION ALL
SELECT 1.95, 1.16 UNION ALL
SELECT 2.58, .85 UNION ALL
SELECT 3.14, .5
SELECT 'Exponential regression' AS Type, A, b, R2
FROM dbo.fnCurveFitting(1)
UNION ALL
SELECT 'Bestfit = ' + CAST(Type AS VARCHAR), A, b, R2
FROM dbo.fnBestFit()
-- Calculate Logarithmic regression
DELETE
FROM cf
INSERT cf
SELECT 3, 1.5 UNION ALL
SELECT 4, 9.3 UNION ALL
SELECT 6, 23.4 UNION ALL
SELECT 10, 45.8 UNION ALL
SELECT 12, 60.1
SELECT 'Logarithmic regression' AS Type, A, b, R2
FROM dbo.fnCurveFitting(1)
UNION ALL
SELECT 'Bestfit = ' + CAST(Type AS VARCHAR), A, b, R2
FROM dbo.fnBestFit()
-- Calculate Power regression
DELETE
FROM cf
INSERT cf
SELECT 10, .95 UNION ALL
SELECT 12, 1.05 UNION ALL
SELECT 15, 1.25 UNION ALL
SELECT 17, 1.41 UNION ALL
SELECT 20, 1.73 UNION ALL
SELECT 22, 2 UNION ALL
SELECT 25, 2.53 UNION ALL
SELECT 27, 2.98 UNION ALL
SELECT 30, 3.85 UNION ALL
SELECT 32, 4.59 UNION ALL
SELECT 35, 6.02
SELECT 'Power regression' AS Type, A, b, R2
FROM dbo.fnCurveFitting(1)
UNION ALL
SELECT 'Bestfit = ' + CAST(Type AS VARCHAR), A, b, R2
FROM dbo.fnBestFit()
DROP TABLE cf

Here are the functions. When using a type that is not valid, the function defaults to linear regression.

CREATE FUNCTION dbo.fnCurveFitting
(
@Type TINYINT
)
RETURNS @p TABLE (A DECIMAL(38, 10), b DECIMAL(38, 10), R2 DECIMAL(38, 10))
AS
/*
Type = 1 Linear y = a + b*x
Type = 2 Exponential y = a*e^(b*x) nb a > 0
Type = 3 Logarithmic y = a + b*ln(x)
Type = 4 Power y = a*x^b nb a > 0
*/
BEGIN
DECLARE @n DECIMAL(38, 10),
@x DECIMAL(38, 10),
@x2 DECIMAL(38, 10),
@y DECIMAL(38, 10),
@xy DECIMAL(38, 10),
@y2 DECIMAL(38, 10),
@d DECIMAL(38, 10),
@a DECIMAL(38, 10),
@b DECIMAL(38, 10),
@r2 DECIMAL(38, 10)
SELECT @n = COUNT(*),
@x = CASE
WHEN @Type = 2 THEN SUM(x)
WHEN @Type = 3 THEN SUM(LOG(x))
WHEN @Type = 4 THEN SUM(LOG(x))
ELSE SUM(x)
END,
@x2 = CASE
WHEN @Type = 2 THEN SUM(x * x)
WHEN @Type = 3 THEN SUM(LOG(x) * LOG(x))
WHEN @Type = 4 THEN SUM(LOG(x) * LOG(x))
ELSE SUM(x * x)
END,
@y = CASE
WHEN @Type = 2 THEN SUM(LOG(y))
WHEN @Type = 3 THEN SUM(y)
WHEN @Type = 4 THEN SUM(LOG(y))
ELSE SUM(y)
END,
@xy = CASE
WHEN @Type = 2 THEN SUM(x * LOG(y))
WHEN @Type = 3 THEN SUM(LOG(x) * y)
WHEN @Type = 4 THEN SUM(LOG(x) * LOG(y))
ELSE SUM(x * y)
END,
@y2 = CASE
WHEN @Type = 2 THEN SUM(LOG(y) * LOG(y))
WHEN @Type = 3 THEN SUM(y * y)
WHEN @Type = 4 THEN SUM(LOG(y) * LOG(y))
ELSE SUM(y * y)
END,
@d = @n * @x2 - @x * @x
FROM cf
IF @d = 0
RETURN
SELECT @a = (@x2 * @y - @x * @xy) / @d,
@b = (@n * @xy - @x * @y) / @d,
@r2 = (@a * @y + @b * @xy - @y * @y / @n) / (@y2 - @y * @y / @n)
INSERT @p
SELECT CASE
WHEN @Type = 2 THEN EXP(@a)
WHEN @Type = 3 THEN @a
WHEN @Type = 4 THEN EXP(@a)
ELSE @a
END,
@b,
@r2
RETURN
END
CREATE FUNCTION dbo.fnBestFit
(
)
RETURNS @p TABLE (Type TINYINT, A DECIMAL(38, 10), b DECIMAL(38, 10), R2 DECIMAL(38, 10))
AS
BEGIN
INSERT @p
SELECT 1,
A,
b,
R2
FROM dbo.fnCurveFitting(1)
INSERT @p
SELECT 2,
A,
b,
R2
FROM dbo.fnCurveFitting(2)
INSERT @p
SELECT 3,
A,
b,
R2
FROM dbo.fnCurveFitting(3)
INSERT @p
SELECT 4,
A,
b,
R2
FROM dbo.fnCurveFitting(4)
DELETE
FROM @p
WHERE R2 <> (SELECT MAX(R2) FROM @p)
RETURN
END

This algorithm is providential for me. Im a student experimenting path discovering on Internet. I succeeded to put some traceroute data into a mysql database. And now, i tried to find a way to find shortests paths...

This is wonderful. It looks like it will be greatly helpful toward some trend calculation that I hope to do. I think I can use this as is, but I would like to learn from and understand it better. So, apologies in advance for the noob questions that follow...

Can anyone point to places in the code if/where intermediate data (such as slope, intercept, or "c") are calculated?

I think it's clear that r2 is R squared. But I've never seen "a" as a variable name in this sort of equation. So that makes me wonder if "b" is just a variable that was needed, or if it is the "b" that is commonly used in calculating regressions.

Thanks for the code, and thanks in advance for any further explanation.

One more follow-up... I'm using MS SQL Server 2005. In my dataset, my y-column includes negative numbers. I am getting the error: "A domain error occurred."

I believe this is due to the use of "LOG(y)", where LOG() is expecting a positive number. Is there a common strategy for dealing with this issue?

Are you using the BestFit function? Well... That hasn't been optimized for this situations. However I have a SQLCLR that has! See http://www.developerworkshop.net/software.html There are scripts and samples too.

I guess I should get more sleep... I appear to add confusion with every visit. Anyway, here is a thread that describes how to get around issues one may encounter with enabling CLR:

Bit of a long shot this one, but I have stumbled on this forum post. I have a requirement to replicate some of the Excel Solver functionality within SQL. Ideally I wanted a purely TSQL function, but this looks unlikely now, so I'm also looking at CLR options.

What I need to do is analyse some financial fund data (36 price values) against somewhere between 2-5 benchmarks (also 36 price values each) and determine the sensitivities (weights) of the benchmarks that best follow the fund. In other words find a set of benchmark weights which minimizes the tracking error between the resulting benchmark and the fund. It's explained perfectly here: - http://www.andreassteiner.net/performanceanalysis/?External_Performance_Analysis:Style_Analysis

This is done with a few clicks in Excel, which is why it's so frustrating that I can't find a SQL Server contained solution. However, I do appreciate that it's a quadratic problem, so may not be so easily portable to SQL Server. I've have looked at the Frontline Solver (http://www.solver.com) and building a C# Dll, but I'd rather avoid that if possible.

Anyone got any thoughts or experience that might assist?

Swepeso, I am so grateful for your post. IT saved my life. In my college project , I have to use similar kind of functionality for forecasting of events. But again, I will be more helpful if I get any more insight on Polynomial regression .. I am not able to find fool-proof and suitable method like yours. Your expertise is needed.