Revisitation of an old problem

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

Revisitation of an old problem

Author

Topic

SamC
White Water Yakist

3467 Posts

Posted - 2005-01-31 : 20:02:31

I'm sure there's a generic "name" for this problem, but I can't put my finger on it.

Given a table

CREATE TABLE MyTable (
SomeSetOfKeys INT ,
KeyDate DATETIME )

I need to order the keys by KeyDate, and on each row, have access to KeyDate and KeyDate(+1) - the next Date.

My understanding of the least expensive way to do this is:

SELECT A.SomeSetOfKeys, A.KeyDate, MIN(B.KeyDate) As NextKeyDate
FROM MyTable A
INNER JOIN MyTable B ON B.SomeSetOfKeys = A.SomeSetOfKeys
AND B.KeyDate > A.KeyDate
GROUP BY SomeSetOfKeys, KeyDate

This can be an expensive way to get the job done, even with optimized clustered indexes, if the number of rows is large.

Any other tricks to improve performance???

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-01-31 : 20:49:02

yes, this code is like a correlated subquery. Does the optimizer force nested loops when this executes? That could be bad (as you point out) if your table is large.

seems like a fun puzzle. I'll have to grab a pint of guinness or two and ponder it.

-ec

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2005-01-31 : 20:51:19

maybe:
SELECT A.SomeSetOfKeys, A.KeyDate, (select top 1 KeyDate from MyTable where SomeSetOfKeys = A.SomeSetOfKeys
and KeyDate > A.KeyDate) As NextKeyDate
FROM MyTable A

but if this is faster i'd be very surprised...

Go with the flow & have fun! Else fight the flow

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-01-31 : 22:05:23

neither one of these solutions actually works as posted. I tweaked spirit's to make it work (eliminated a join)

here it is:

SELECT A.SomeSetOfKeys, A.KeyDate, (select top 1 KeyDate from test where KeyDate > A.KeyDate) As NextKeyDate
FROM test A

I tried a solution joining on a virtual rownum, but that was much slower than spirit's solution. I won't even bother posting it.

My test case only had ~7000 rows too. I have no idea what would happen if you had millions of rows.

-ec

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-01-31 : 22:19:21

btw here is the little test case I created:


CREATE TABLE [dbo].[test] (
	[somesetofkeys] [int] IDENTITY (50, 5) NOT NULL ,
	[keydate] [datetime] NULL 
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[test] WITH NOCHECK ADD 
	CONSTRAINT [PK_test] PRIMARY KEY  CLUSTERED 
	(
		[somesetofkeys]
	)  ON [PRIMARY] 
GO

ALTER TABLE [dbo].[test] ADD 
	CONSTRAINT [DF_test_keydate] DEFAULT (getdate()) FOR [keydate]
GO

 CREATE  INDEX [test2] ON [dbo].[test]([keydate], [somesetofkeys]) ON [PRIMARY]
GO

I inserted into this table some test data in a loop (I put a waitfor delay in the loop in order to get different getdate() values).

I tweaked Sam's code to work (eliminate join) and here is showplan when executed:


StmtText                                                                                                                                                          
--------------------------------------- 
SELECT A.SomeSetOfKeys, A.KeyDate, MIN(B.KeyDate) As NextKeyDate
FROM test A
INNER JOIN test B ON B.KeyDate > A.KeyDate
GROUP BY a.SomeSetOfKeys, a.KeyDate

(1 row(s) affected)

StmtText                                                                                                                                             
---------------------------------------------- 
  |--Parallelism(Gather Streams)
       |--Stream Aggregate(GROUP BY:([A].[somesetofkeys]) DEFINE:([Expr1002]=MIN([partialagg1003]), [A].[keydate]=ANY([A].[keydate])))
            |--Parallelism(Repartition Streams, PARTITION COLUMNS:([A].[somesetofkeys]), ORDER BY:([A].[somesetofkeys] ASC))
                 |--Stream Aggregate(GROUP BY:([A].[somesetofkeys]) DEFINE:([partialagg1003]=MIN([B].[keydate]), [A].[keydate]=ANY([A].[keydate])))
                      |--Nested Loops(Inner Join, OUTER REFERENCES:([A].[keydate]) WITH PREFETCH)
                           |--Clustered Index Scan(OBJECT:([DBA].[dbo].[test].[PK_test] AS [A]), ORDERED FORWARD)
                           |--Index Seek(OBJECT:([DBA].[dbo].[test].[test2] AS [B]), SEEK:([B].[keydate] > [A].[keydate]) ORDERED FORWARD)

(7 row(s) affected)

Here is the statistics io and statistics time output for Sam's query


SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 13 ms, elapsed time = 13 ms.

(7279 row(s) affected)

Table 'test'. Scan count 7282, logical reads 72521, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31250 ms,  elapsed time = 15761 ms.

SQL Server Execution Times:
   CPU time = 31250 ms,  elapsed time = 15761 ms.

SQL Server Execution Times:
   CPU time = 31250 ms,  elapsed time = 15762 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Here is the same info for spirit's code

StmtText                                                                                                                     
----------------------------------------------------------- 
SELECT A.SomeSetOfKeys, A.KeyDate, (select top 1 KeyDate from test where KeyDate > A.KeyDate) As NextKeyDate
FROM test A

(1 row(s) affected)

StmtText                                                                                                                           
------------------------------------------------------------ 
  |--Compute Scalar(DEFINE:([test].[keydate]=[test].[keydate]))
       |--Nested Loops(Left Outer Join, OUTER REFERENCES:([A].[keydate]))
            |--Index Scan(OBJECT:([DBA].[dbo].[test].[test2] AS [A]))
            |--Hash Match(Cache, HASH:([A].[keydate]), RESIDUAL:([A].[keydate]=[A].[keydate]))
                 |--Top(1)
                      |--Index Seek(OBJECT:([DBA].[dbo].[test].[test2]), SEEK:([test].[keydate] > [A].[keydate]) ORDERED FORWARD)

here is the statistics io and statistics time info:


SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 3 ms.

(7280 row(s) affected)

Table 'test'. Scan count 7281, logical reads 14594, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 120 ms,  elapsed time = 773 ms.

SQL Server Execution Times:
   CPU time = 120 ms,  elapsed time = 773 ms.

SQL Server Execution Times:
   CPU time = 120 ms,  elapsed time = 774 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

I just noticed that the two queries differ by one row. Not sure why, time to hit the pub though.

-ec

SamC
White Water Yakist

3467 Posts

Posted - 2005-01-31 : 23:49:42

quote:
Originally posted by eyechart

CREATE INDEX [test2] ON [dbo].[test]([keydate], [somesetofkeys]) ON [PRIMARY]

What led you to add somesetofkeys after keydate when somesetofkeys is already a clustered index? (I think this may be my performance problem.)

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-02-01 : 01:36:33

quote:

What led you to add somesetofkeys after keydate when somesetofkeys is already a clustered index? (I think this may be my performance problem.)

I think I was experimenting a bit. I had tried several combinations of indexes and wanted to see if a covered index helped any.

this was really for the test that I was trying using a pseudo rownum. I essentially did a self join on some derived tables using this rownum (the 2nd table being rownum -1) to get the offset you were looking for. The performance was better than your query, but not as good as spirit's. It was cool looking though. :)

I can post that code if you are interested, but it doesn't seem to be a path worth exploring at this point.

I'll retest your query without the covered index to see what the outcome is.

-ec

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-02-01 : 01:55:48

I re-ran without using the covered index and got the following numbers:


SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 39 ms, elapsed time = 39 ms.

(7279 row(s) affected)

Table 'Worktable'. Scan count 3547, logical reads 64049, physical reads 0, read-ahead reads 0.
Table 'test'. Scan count 4, logical reads 75, physical reads 0, read-ahead reads 0.
Table 'Worktable'. Scan count 3731, logical reads 66993, physical reads 0, read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 53830 ms,  elapsed time = 27344 ms.

SQL Server Execution Times:
   CPU time = 53830 ms,  elapsed time = 27344 ms.

SQL Server Execution Times:
   CPU time = 53830 ms,  elapsed time = 27345 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

here is the execution plan:


StmtText                                                                                                                                                          
------------------------------------------------------- 
SELECT A.SomeSetOfKeys, A.KeyDate, MIN(B.KeyDate) As NextKeyDate
FROM test A
INNER JOIN test B ON B.KeyDate > A.KeyDate
GROUP BY a.SomeSetOfKeys, a.KeyDate

(1 row(s) affected)

StmtText                                                                                                                                             
------------------------------------------------------------------- 
  |--Parallelism(Gather Streams)
       |--Stream Aggregate(GROUP BY:([A].[somesetofkeys]) DEFINE:([Expr1002]=MIN([partialagg1003]), [A].[keydate]=ANY([A].[keydate])))
            |--Parallelism(Repartition Streams, PARTITION COLUMNS:([A].[somesetofkeys]), ORDER BY:([A].[somesetofkeys] ASC))
                 |--Stream Aggregate(GROUP BY:([A].[somesetofkeys]) DEFINE:([partialagg1003]=MIN([B].[keydate]), [A].[keydate]=ANY([A].[keydate])))
                      |--Nested Loops(Inner Join, WHERE:([B].[keydate]>[A].[keydate]))
                           |--Clustered Index Scan(OBJECT:([DBA].[dbo].[test].[PK_test] AS [A]), ORDERED FORWARD)
                           |--Table Spool
                                |--Clustered Index Scan(OBJECT:([DBA].[dbo].[test].[PK_test] AS [B]))

(8 row(s) affected)

It is clearly much faster using the composite index, at least in this test case.

-ec

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2005-02-01 : 02:26:52

> I just noticed that the two queries differ by one row.

ec, of course, it must be one row less -- the maximal KeyDate joins to nothing
and its row gets lost (because of INNER (not LEFT) JOIN).

> SELECT A.SomeSetOfKeys, A.KeyDate,
> (select top 1 KeyDate from test where KeyDate > A.KeyDate) As NextKeyDate
> FROM test A

and I don't see ORDER BY in (select top 1 ... ).

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-02-01 : 04:16:28

quote:
Originally posted by Stoad

> I just noticed that the two queries differ by one row.

ec, of course, it must be one row less -- the maximal KeyDate joins to nothing
and its row gets lost (because of INNER (not LEFT) JOIN).

> SELECT A.SomeSetOfKeys, A.KeyDate,
> (select top 1 KeyDate from test where KeyDate > A.KeyDate) As NextKeyDate
> FROM test A

and I don't see ORDER BY in (select top 1 ... ).

You are right on both accounts. I see that now after a visit to the pub.

Actually, both queries could benefit from an order by. Good catch on spirit's code though. TOP without an ORDER BY is asking for trouble.

-ec

SamC
White Water Yakist

3467 Posts

Posted - 2005-02-01 : 08:35:44

I get a 13 second execution time in this proc, and maybe that's the best it can do.

I've isolated the join in the proc that we're discussing in this thread. It is slightly more complicated because there are 3 keys to join...


SELECT	-- Join each row to find the next required module timestamp
	P1.CourseID, P1.UserID, P1.ModuleID, DATEDIFF(ss, P1.URLDate, MIN(P2.URLDate)) As Seconds
	FROM	dbo.PageHits P1
	INNER JOIN dbo.PageHits P2
		ON	P2.URLDate	> P1.URLDate
		AND	P2.CourseID	= P1.CourseID
		AND	P2.UserID	= P1.UserID
		AND	P2.ModuleID	> 0
	WHERE	P1.CourseID	= 5932
	AND	P1.ModuleID	> 0
	GROUP BY P1.CourseID, P1.UserID, P1.ModuleID, P1.URLDate

Now, this isn't that bad. Lemme 'splain:
The table holds entries for all courses, all users, and records every module in the course visited with a DATETIME stamp. We want to calculate the time spent in the course for every user, so the time differential is what we are after in this thread.

"SomeSetOfKeys" is: CourseID, UserID. I chose to eliminate any row where ModuleID=0 (these are optional modules) which makes the calculation find a time spent on a "required module" basis.

That wasn't so hard.

My execution plan may be as good as it gets?

  |--Compute Scalar(DEFINE:([Expr1003]=datediff(second, [P1].[UrlDate], [Expr1002])))
       |--Stream Aggregate(GROUP BY:([P1].[UserID], [P1].[ModuleID], [P1].[UrlDate]) DEFINE:([Expr1002]=MIN([P2].[UrlDate]), [P1].[CourseID]=ANY([P1].[CourseID])))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([P1].[UserID], [P1].[UrlDate]))
                 |--Clustered Index Seek(OBJECT:([nimc].[dbo].[pagehits].[IDX_PateHitsCluster2] AS [P1]), SEEK:([P1].[CourseID]=5932),  WHERE:([P1].[ModuleID]>0) ORDERED FORWARD)
                 |--Index Spool(SEEK:([P2].[UserID]=[P1].[UserID] AND [P2].[UrlDate] > [P1].[UrlDate]))
                      |--Clustered Index Seek(OBJECT:([nimc].[dbo].[pagehits].[IDX_PateHitsCluster2] AS [P2]), SEEK:([P2].[CourseID]=5932),  WHERE:([P2].[ModuleID]>0) ORDERED FORWARD)

The graphic execution plan shows an "index spool/eager spool". Maybe I'm not readhing the SHOWPLAN_TEXT right, but I don't see the eager spool there.

Anyone know why these spools are called "eager"?

CREATE TABLE dbo.PageHits2 (
	P_ID	INT NOT NULL IDENTITY (1,1) ,
	[UrlDate] [datetime] NOT NULL  DEFAULT (getdate()),
	[UserID] [int] NOT NULL ,
	[CourseID] [int] NULL ,
	[ModuleID] [int] NULL ,
	CONSTRAINT [IDX_PateHitsCluster2] UNIQUE CLUSTERED 
	(
		CourseID,
		UserID,
		ModuleID,
		URLDate
	) WITH  FILLFACTOR = 80  ON [PRIMARY] 
) ON [PRIMARY]
GO

CREATE INDEX IDX_PageHitsURLDate2 on dbo.PageHits2 (URLDate, CourseID, UserID, ModuleID)

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2005-02-01 : 10:47:12

Sam.... let me ask you something about this

In regards to

quote:
"SomeSetOfKeys" is: CourseID, UserID. I chose to eliminate any row where ModuleID=0 (these are optional modules) which makes the calculation find a time spent on a "required module" basis.

Is that true?? If I am in course1 (required) and I then got to course2 (not required) and I then go to course2 (required) and I then go to course4 (required)

Isn't it true that the time spent on a course is the differential to the next chronological course, and not the next 'required' course. The required should not matter for determining time spent. (IMHO)

With the index I think I would use spirit's solution as you are only looking up 1 value per row.

Good Luck

Corey

"If the only tool you have is a hammer, the whole world looks like a nail." - Mark Twain

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources