
 More Efficient History Table Joins


AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-20 : 14:42:01
After searching the forums (and enduring a temporary SQLTeam outage) I have not found anything in the last 60 days that addresses this. I'm seeking input on a more efficient way to JOIN my history tables. Here's the short story: I'm building custom reports on a commercial application. Therefore, I have no option to restructure the database. To protect the interested parties, I have created an example focusing on the specific issue, which is the JOIN.

The question is: Is there a better way to join the history table (such as a technique that does not use a subquery)? Unfortunately, my actual query will be referencing nigh-on to a dozen of these suckers, and I'm just a little nervous on general principles. Actual performance is acceptable for now, but we're early yet in the database's life and record counts are low.

Here's the example:
CREATE TABLE #LineItem (
    ItemID int NOT NULL,
    ItemDate datetime NOT NULL,
    ProductID int NOT NULL
)

CREATE TABLE #ProductHistory (
    ProductID int NOT NULL,
    UpdateDate datetime NOT NULL,
    Price money
)
GO

INSERT INTO #LineItem (ItemID, ItemDate, ProductID) VALUES (1, '20050401', 1)
INSERT INTO #LineItem (ItemID, ItemDate, ProductID) VALUES (2, '20050401', 2)
INSERT INTO #LineItem (ItemID, ItemDate, ProductID) VALUES (3, '20050515', 1)

INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (1, '20050101', 7.99)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (1, '20050320', 10.00)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (1, '20050514', 15.00)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (1, '20050720', 22.22)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (2, '20050101', 18.00)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (2, '20050219', 19.50)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (2, '20050401', 25.50)
INSERT INTO #ProductHistory (ProductID, UpdateDate, Price) VALUES (2, '20050623', 37.77)
GO

SELECT
    L.ItemID,
    CONVERT(varchar(12), L.ItemDate, 101) AS ItemDate,
    P.Price
FROM #LineItem L
JOIN #ProductHistory P
    ON L.ProductID = P.ProductID
   AND P.UpdateDate = ( SELECT MAX(UpdateDate)
                        FROM #ProductHistory innerP
                        WHERE innerP.ProductID = L.ProductID
                          AND innerP.UpdateDate <= L.ItemDate )


GO

DROP TABLE #LineItem
DROP TABLE #ProductHistory


The expected result is:

ItemID      ItemDate     Price
----------- ------------ ---------------------
1           04/01/2005   10.0000
2           04/01/2005   25.5000
3           05/15/2005   15.0000



---------------------------
EmeraldCityDomains.com

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2005-07-20 : 15:08:56
Here's an option (same results without the correlated subquery)

select  b.ItemID
       ,convert(varchar, b.ItemDate, 101) ItemDate
       ,a.Price
from #ProductHistory a
join (
        -- derived table: the latest UpdateDate on or before each line item's date
        select  li.ItemID
               ,li.ItemDate
               ,li.ProductID
               ,max(ph.UpdateDate) UpdateDate
        from #lineitem li
        join #producthistory ph
          on li.productid = ph.productid
        where ph.UpdateDate <= li.ItemDate
        group by li.ItemID
               ,li.ItemDate
               ,li.ProductID
     ) b
  on  a.ProductID = b.ProductID
 and  a.UpdateDate = b.UpdateDate


If the reporting doesn't need to be real time and the tables get huge, you could go to either summary tables rebuilt on some acceptable schedule, or a warehousing scenario where every <timeperiod> the next period of data is added to the warehouse.
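As a rough sketch of the summary-table idea (the table name and rebuild job here are hypothetical; the rebuild frequency is whatever the reports can tolerate):

-- hypothetical summary table: one resolved price per line item, rebuilt on a schedule
TRUNCATE TABLE LineItemPriceSummary

INSERT INTO LineItemPriceSummary (ItemID, ItemDate, ProductID, Price)
SELECT li.ItemID, li.ItemDate, li.ProductID, ph.Price
FROM LineItem li
JOIN ProductHistory ph
    ON ph.ProductID = li.ProductID
   AND ph.UpdateDate = ( SELECT MAX(UpdateDate)
                         FROM ProductHistory x
                         WHERE x.ProductID = li.ProductID
                           AND x.UpdateDate <= li.ItemDate )

Reports would then hit LineItemPriceSummary directly instead of re-resolving prices on every run.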


Be One with the Optimizer
TG

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-20 : 18:11:52
Here is another option:

SELECT
    L.ItemID,
    CONVERT(varchar(12), L.ItemDate, 101) AS ItemDate,
    priceInterval.Price
FROM #LineItem L
JOIN
(
    -- a price history with from/to validity dates
    SELECT
        frmd.ProductID
        ,frmd.UpdateDate
        ,( SELECT MIN(tod.UpdateDate)-1
           FROM #ProductHistory tod
           WHERE frmd.ProductID = tod.ProductID
             AND frmd.UpdateDate < tod.UpdateDate ) AS ValidTo
        ,frmd.Price
    FROM #ProductHistory frmd
) AS priceInterval
    ON L.ProductID = priceInterval.ProductID
   AND L.ItemDate BETWEEN priceInterval.UpdateDate AND ISNULL(priceInterval.ValidTo, L.ItemDate)


As a plain join it will most probably do worse than the other options.
You could, however, maintain the ValidTo column in the history tables themselves (with triggers).
The derived table (priceInterval) is basically just a history table with from/to dates.
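For instance, a trigger along these lines could keep ValidTo current (a sketch only, assuming the permanent ProductHistory table has been given a ValidTo column):

-- hypothetical: assumes ProductHistory has been altered to add a ValidTo column
CREATE TRIGGER trg_ProductHistory_ValidTo ON ProductHistory
AFTER INSERT
AS
    -- recompute ValidTo for every product touched by the insert;
    -- the newest row per product keeps a NULL ValidTo
    UPDATE ph
    SET ValidTo = ( SELECT MIN(later.UpdateDate) - 1
                    FROM ProductHistory later
                    WHERE later.ProductID = ph.ProductID
                      AND later.UpdateDate > ph.UpdateDate )
    FROM ProductHistory ph
    WHERE ph.ProductID IN (SELECT ProductID FROM inserted)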





rockmoose

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2005-07-20 : 18:48:15
>> Therefore, I have no option to restructure the database
Can you create views or TVF's or indexes if needed?


DavidM

A front-end is something that tries to violate a back-end.

Kristen
Test

22859 Posts

Posted - 2005-07-21 : 01:57:30
Any way to create [even by adding another table, etc.] a "Flag" column on #ProductHistory marking the row which, for a given product, is the most recently dated (and enforce it with, say, a trigger)?

Kristen

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-21 : 05:15:15
^^^^^^^^^^^^^^^^^^^^^^
Did not read the specs
----------------------

Tell them that joining to the history tables and retrieving the correct price for each line item is slow by design (SBD).



rockmoose

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-21 : 11:58:38
quote:
Originally posted by byrmol

>> Therefore, I have no option to restructure the database
Can you create views or TVF's or indexes if needed?


DavidM

A front-end is something that tries to violate a back-end.


I must need more coffee because I can't figure out what a TVF is right now. But in answer to your question, yes, I can add objects to the database so long as I don't blow up existing processing. I have written several stored procedures and functions for our custom reporting needs.

Are you suggesting creating a view that would retrieve only the desired set of history records? Or what did you have in mind?

---------------------------
EmeraldCityDomains.com

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-21 : 12:03:00
quote:
Originally posted by Kristen

Any way to create [even by adding another table etc.] a "Flag" column on #ProductHistory for the row, for a given product, which is the most recently dated (and enforce it with, say, a trigger?)

Kristen

I could; unfortunately, I don't want "the most recent" record. I want the one that was the most recent at the time the record in the other table was created.

Think of it like an order entry system where, instead of creating a snapshot of the order with its pricing at the time it was created like they should have done, the "developer" just kept a history of all price changes with dates. So the only way to know what price you charged for an item is to join it to the history table and figure out what price was in effect at the time the order was created.

---------------------------
EmeraldCityDomains.com

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-21 : 12:38:28
TG and rockmoose, would you like to add any commentary on what you like about your approaches over mine? They may not use correlated subqueries, which I agree are generally slower than non-correlated ones, but interestingly, when I convert the temp tables to real tables in my sandbox, paste all three of our solutions into one window, and show the execution plan, the "Query cost (relative to the batch)" shows:

Mine: 29.37%
TG: 36.69%
rockmoose: 33.94%

Of course these could all be radically skewed due to the small table sizes and lack of indexing since I literally just converted the example to use real tables instead of temps.
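(For reference, the kind of supporting indexes that would matter once the example lives in real tables might look like this; the names are illustrative:)

-- hypothetical supporting indexes for permanent versions of the example tables
CREATE CLUSTERED INDEX IX_ProductHistory_Product_Date
    ON ProductHistory (ProductID, UpdateDate)

CREATE NONCLUSTERED INDEX IX_LineItem_Product_Date
    ON LineItem (ProductID, ItemDate)

A clustered index on (ProductID, UpdateDate) should let the MAX(UpdateDate) lookup resolve with a quick seek within each product instead of a scan.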

---------------------------
EmeraldCityDomains.com

emilg
Starting Member

13 Posts

Posted - 2005-07-21 : 12:43:36
The interesting fact is that the EXECUTION PLAN shows 44% for the SUBQUERY version and 55% for the JOIN version of the query.

Try running both in the same batch on your database, from Query Analyzer with Show Execution Plan on, and see what the execution plan says.

The results might be different than mine due to indexes and statistics, but this is a good start in optimising the query.
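For example (a sketch, assuming the temp tables from the first post are still loaded; the text-based plan is an alternative to the graphical one):

-- compare both versions in one batch; Query Analyzer's graphical plan (Ctrl+K)
-- shows each query's cost relative to the whole batch
SET STATISTICS PROFILE ON

-- version 1: correlated subquery
SELECT L.ItemID, P.Price
FROM #LineItem L
JOIN #ProductHistory P
    ON L.ProductID = P.ProductID
   AND P.UpdateDate = ( SELECT MAX(UpdateDate)
                        FROM #ProductHistory innerP
                        WHERE innerP.ProductID = L.ProductID
                          AND innerP.UpdateDate <= L.ItemDate )

-- version 2: derived-table join
SELECT b.ItemID, a.Price
FROM #ProductHistory a
JOIN ( SELECT li.ItemID, li.ProductID, MAX(ph.UpdateDate) UpdateDate
       FROM #LineItem li
       JOIN #ProductHistory ph ON li.ProductID = ph.ProductID
       WHERE ph.UpdateDate <= li.ItemDate
       GROUP BY li.ItemID, li.ProductID ) b
    ON a.ProductID = b.ProductID
   AND a.UpdateDate = b.UpdateDate

SET STATISTICS PROFILE OFF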

--Emil

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-21 : 12:58:26
Could you convert the derived table (priceInterval) from my example to a real table, join to that instead of the derived version, and run a comparison test again?
(I did compare plans (each said 0%), but I have no data to speak of and can't say anything decisive about your situation.)

However, if you can "denormalize for speed" by creating the additional column on the history table, or by creating a precalculated table with that column, I think that would be the easiest solution to maintain,
and "the biggest" bang for the buck
(relatively easy to maintain, small impact, facilitates queries, improves performance).
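A sketch of that precalculated table (hypothetical names; rebuild it on whatever schedule is acceptable):

-- hypothetical interval table built from the permanent history table
SELECT  frmd.ProductID,
        frmd.UpdateDate,
        ( SELECT MIN(tod.UpdateDate) - 1
          FROM ProductHistory tod
          WHERE tod.ProductID = frmd.ProductID
            AND tod.UpdateDate > frmd.UpdateDate ) AS ValidTo,
        frmd.Price
INTO ProductPriceInterval
FROM ProductHistory frmd

CREATE CLUSTERED INDEX IX_ProductPriceInterval
    ON ProductPriceInterval (ProductID, UpdateDate)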

rockmoose

Kristen
Test

22859 Posts

Posted - 2005-07-21 : 13:02:37
"I don't want "the most recent" record"

Ah, sorry. I'll read the question next time!

Kristen

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2005-07-21 : 13:10:31
>>TG and rockmoose, would you like to add any commentary on what you like about your approaches over mine?

I try not to "like" one way over another; I was just offering an alternative. I wouldn't make a decision until testing was done on real DDL loaded with data. Then let the results determine the decision.

I still like the idea of summary tables or warehousing the data with a time dimension. That can make any of the reporting queries way faster than anything we could come up with against the raw tables.

Be One with the Optimizer
TG

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-21 : 17:08:33
Unfortunately we don't have much data in this yet, even in production, and so I suspect that any test results will be skewed by the small sizes of the tables. Given that nobody has smacked me upside the head yet, saying, "You should use method x which is unquestionably better than your approach" I'll stick with what I've got for now, but will know to look back here if/when we have performance issues as the data size grows.

TG, I agree that a separate reporting database or warehouse would be a great solution, but unfortunately, work demands and project resources do not lend themselves to that approach at this time. Again, something I'll keep in mind for future reference.

Thanks for the help!

---------------------------
EmeraldCityDomains.com

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2005-07-21 : 17:18:51
Don't forget about the idea of generating a few million rows for testing in your testbed or sandbox or whatever we call it these days.
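For example, one common row-generator idiom (a sketch; the table names, row counts, and value distributions here are purely illustrative):

-- hypothetical mass-load: build a numbers table via cross join, then derive test rows
SELECT TOP 1000000 IDENTITY(int, 1, 1) AS n
INTO #nums
FROM master.dbo.syscolumns a CROSS JOIN master.dbo.syscolumns b

INSERT INTO ProductHistory (ProductID, UpdateDate, Price)
SELECT n % 1000 + 1,                          -- spread over 1000 products
       DATEADD(day, n % 730, '20040101'),     -- two years of update dates
       (n % 500) / 10.0 + 1                   -- arbitrary prices
FROM #nums

DROP TABLE #nums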

good luck

Be One with the Optimizer
TG

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-07-21 : 17:42:03
quote:
Originally posted by TG

Don't forget about the idea of generating a few million rows for testing in your testbed or sandbox or whatever we call it these days.

good luck

Be One with the Optimizer
TG


Concur,
work done now might save work later.
(Or be a waste of time...)

rockmoose

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-21 : 19:11:43
Yes, good point. I'll have to give some serious thought to the best way to produce a good, large set of test data for this monster. Rockmoose, I like your caveat.

BTW, TG, what I was referring to earlier as my sandbox is a database on my local SQL Server that is used for testing ideas, especially table structures, that does not need to be cleaned up. It's a database that I can do anything I want to, including DROPPING it at any time without any consequences to actual dev/test/live environments. But hey, that's some of the "fun" of IT... Reusing acronyms and jargon to mean completely different things and then wondering why we can't seem to communicate clearly with others.

---------------------------
EmeraldCityDomains.com

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2005-07-21 : 19:21:35
>>I must need more coffee because I can't figure out what a TVF is right now.

No, I need to stop being lazy on the keyboard... Table-Valued Function. Easier to type than "Table-Valued UDF". The irony....

I think your original solution is the easiest to understand and would probably stick with it until things hit fans...

I was thinking more about making the reporting queries easier to understand and maintain. By moving the scalar subquery value into the "core" table (LineItem), we have a central location for the "history logic"...


create view LineItem_Historical
as
SELECT  L.ItemID, L.ItemDate, L.ProductID,
        ( SELECT MAX(UpdateDate)
          FROM ProductHistory innerP
          WHERE innerP.ProductID = L.ProductID
            AND innerP.UpdateDate <= L.ItemDate ) AS PriceDate -- repeat for other historical values
FROM LineItem L
go

SELECT L.ItemID, L.ItemDate, L.ProductID, P.Price
FROM LineItem_Historical AS L
INNER JOIN ProductHistory P
    ON L.ProductID = P.ProductID
   AND P.UpdateDate = L.PriceDate


Hopefully SQL Server will be smart enough to use a plan identical to your original query's.


DavidM

A front-end is something that tries to violate a back-end.

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2005-07-22 : 13:12:51
David, I really like the view idea for a couple of reasons. One is that it makes the main SELECT easier to read. It also simplifies the reusability of this concept, and we will definitely need to reuse it. Thanks for the push.

More irony is that the grand SELECT statement I'm working on will probably be placed inside a TVF.
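For anyone following along, an inline TVF wrapping David's view might look something like this (the function name is hypothetical):

-- hypothetical inline table-valued function wrapping the LineItem_Historical view
CREATE FUNCTION dbo.LineItemPrices ()
RETURNS TABLE
AS
RETURN
(
    SELECT L.ItemID, L.ItemDate, L.ProductID, P.Price
    FROM dbo.LineItem_Historical AS L
    JOIN dbo.ProductHistory AS P
        ON P.ProductID = L.ProductID
       AND P.UpdateDate = L.PriceDate
)

-- usage: SELECT * FROM dbo.LineItemPrices()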

---------------------------
EmeraldCityDomains.com