Delete Puzzler - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

Delete Puzzler

Author

Topic

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 12:59:20

Damian recently posted a solution for deleting older records from a Table.

I've had a nagging concern that there may be a "less costly" way to find and delete older records without requiring an INNER JOIN . Before chasing a new solution, a check of the execution plan on the INNER JOIN solution show a surprising two table scans (sad to see since the table is FULLY indexed). This can't be right, but it may be the *bad* result of doing an INNER JOIN using a " < " condition?

Here's the query as it is used today...

DELETE A
FROM dbo.Feedback2 A
INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

Moving on in hopes that a revised query won't use a table scan... An alternative DELETE (conceptually) would delete all records of a given catagory where the record's "CreateDate" is less than MAX(CreateDate) for that catagory.

DELETE FROM dbo.Feedback2
WHERE catagory = this catagory AND CreateDate < MAX(CreateDate this catagory)

Of course this won't work as is, and using a MAX would require GROUP BY somwhere.

This thought process leads to a formulation for the recordset to SAVE (not DELETE)

SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback2
GROUP BY CourseID, UserID, FormID

Deleting the unwanted records is then

DELETE A
FROM dbo.Feedback2 A
INNER JOIN (
SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback2
GROUP BY CourseID, UserID, FormID
) B on B.CourseID=A.CourseID AND B.UserID = A.UserID AND B.FormID = A.FormID AND B.CreateDate > A.CreateDate

Comparing the execution plan of both solutions shows disappointing table scans in both queries. This suggests both queries are winners, but I have a problem with the table statistics forcing the table scans. Is there a way to reset statistics on a table which might lead to a better execution plan?

Sam

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 18:03:13

How many rows in the table?
Are you seeking for a better performance solution or trying to
get more eyes-pleasant execution plan?
How about simple:
delete from t where dt<>(select max(dt) from t tt where tt.uid=t.uid) ??
PS Common mistake: to take "it all" for a kind of Real Science...

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 18:18:17

Right now, only 300 rows. It's going to grow large over time however. I need efficent execution, not a good-looking query. Right now Damian's INNER JOIN is as good (or better) than anything else, but it's hard to compare when table scans show up in the execution plan.

Could the compiler be choosing a table scan because it's more efficent than an index seek for a few hundred rows?

Sam

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 18:54:24

I think it is so. Too few rows.

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 19:00:55

quote:
it's hard to compare when table scans show up in the execution plan

Sorry, Sam, but I think that Experiment is the best Judge.

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 19:16:09

quote:

(sad to see since the table is FULLY indexed

SamC, Can you post Feedback2 DDL and indexes?

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 19:51:04

ehorn,

do you mean the DateTime column is indexed on its own?

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 19:52:36

I am trying to better understand the indexes you have on the Feedback2 tabel as you had mentioned it is "fully indexed"..

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 19:55:28

[code]CREATE TABLE [Feedback2] (
[UserID] [int] NOT NULL ,
[CourseID] [int] NOT NULL ,
[CreateDate] [datetime] NOT NULL ,
[FormID] [int] NOT NULL ,
[Question] [nvarchar] (1000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[ChoiceID] [int] NOT NULL ,
[ChoiceText] [nvarchar] (2000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]
GO
CREATE INDEX [IDX_Feedback1] ON [dbo].[Feedback2]([UserID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback2] ON [dbo].[Feedback2]([CourseID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback3] ON [dbo].[Feedback2]([FormID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback4] ON [dbo].[Feedback2]([ChoiceID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback5] ON [dbo].[Feedback2]([CreateDate]) ON [PRIMARY]
GO
[/code]

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 23:54:15

One option is to force an index...
You can add a primary key, such as:

FeedBack2ID INT (IDENTITY(0,1) PRIMARY KEY

spooling costs but scales better than the table scans

DELETE A
FROM FeedBack2 A
INNER JOIN FeedBack2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

quote:
This thought process leads to a formulation for the recordset to SAVE (not DELETE)

Which leads to another option to use the nonclustered indexes to build the alias (d) and the clustered index (FeedBack2ID) to perform the delete.
This option still produces the spooling but also scales much better than the scan and makes use of all the indexes.

delete 	FeedBack2 
where Feedback2ID not in
(
	select Feedback2ID from FeedBack2 c 
	join 		(
				select 	
					distinct a.courseid,
					a.userid,
					a.formid,
					(select max(createdate) from FeedBack2 b where b.courseid=a.courseid and b.userid = a.userid and b.formid=a.formid) createdate
				from FeedBack2 a
			) d
	on c.courseid=d.courseid and c.userid = d.userid and c.formid=d.formid and c.createdate = d.createdate
)

How to prevent those spools???

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 07:05:24

quote:
One option is to force an index...
You can add a primary key, such as:

FeedBack2ID INT (IDENTITY(0,1) PRIMARY KEY

spooling costs but scales better than the table scans

DELETE A
FROM FeedBack2 A
INNER JOIN FeedBack2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

Why would adding a primary key (which isn't used in the query) cause the compiler not to generate table scans?

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 10:02:08

Sam, I do not understand this...

My assumption is that the optimizer cannot assume any uniqueness on the self-join w/o a unique constraint. But it does seem to always use the Clustered index for additional "optimization" ??? information. As You and Stoad discussed, maybe the optimizer sees the scan as the best solution, although, regardless of how large the table gets, it appears to never use the non-clustered indexes for the hash match. Perhaps the Gurus can better explain what is happening here or suggest a more appropriate course.

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 10:43:35

I regret naming this thread "Delete puzzler" since the indexing (not the query) seems to be generating the poor execution plan.

The DDL is posted above. If anyone suggests a better index scheme, I'll give it a try.

Sam

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 11:15:47

Sam,

Are the existing indexes (UserID,CourseID,FormID,Createdate) serving some purpose for other application functionality (ie other types of searching)?

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 11:27:31

The only other operations are INSERTs, and there's a report which uses this query:

SELECT	CASE WHEN ChoiceID > 0 THEN Total ELSE NULL END AS Total ,
	CAST(CASE WHEN ChoiceID > 0 AND GrandTotal > 0 THEN CAST(100.0 * Total / GrandTotal AS NUMERIC(3,0)) ELSE NULL END AS VARCHAR) + '%' AS [Percent]  ,
	Question 
	FROM	(
		SELECT	FormID, 0 as ChoiceID,
			'<b>' + MAX(Question) + '</b>' AS Question, 
			COUNT(*) as Total , -- List of all the questions
			1 as GrandTotal
			FROM	dbo.Feedback2
			WHERE	CourseID = @CourseID
			AND	ChoiceID > 0 -- Select just the "Choice" Responses
			AND	CreateDate BETWEEN @StartDate AND @FinishDate
			GROUP BY FormID 
		UNION ALL
		SELECT	F.FormID, ChoiceID, ChoiceText, COUNT(*) AS Total , MAX(GrandTotal) AS GrandTotal -- FormID, Choice and Total for each choosable item
			FROM	dbo.Feedback2 F
			LEFT OUTER JOIN (SELECT FormID, COUNT(*) AS GrandTotal
						FROM dbo.Feedback2
						WHERE	CourseID=@CourseID
						AND	ChoiceID > 0
						AND	CreateDate BETWEEN @StartDate AND @FinishDate
						GROUP BY FormID) GT ON GT.FormID = F.FormID
			WHERE	CourseID = @CourseID
			AND	ChoiceID > 0 -- Select just the "Choice" Responses
			AND	CreateDate BETWEEN @StartDate AND @FinishDate
			GROUP BY F.FormID, ChoiceID, ChoiceText
	) X
	ORDER BY FormID ASC, ChoiceID ASC

This query produces a summary of the ChoiceIDs (radio buttons) user's selected on certain FormIDs, totals 'em up, calculates a percentage.

It's not pretty.

Sam

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 11:44:09

Sorry for so many questions. Are the indexes being used for the insert?
As you probably know indexes slow down insert,updates and deletes. If the indexes are not be effectively used you may consider removing them with the exception of maybe createdate and a primary key. I tested the delete and it was much faster with no spooling. Not sure how this will effect insert but it may ultimately be a design tradeoff.

X002548
Not Just a Number

15586 Posts

Posted - 2003-11-10 : 13:00:03

Sam,

I was playing aroung with this, when I realized the Delete will never delete anything...anyway, back toi the index thingee...

I swear I saw a one of the plans using parallesim...but can't get it to repeat it...Takes about 15 minutes to run, and it needs to be run in stages...but it's interesting...what's eager spool?


CREATE TABLE Feedback2 (
	  UserID 	int 		NOT NULL
	, CourseID 	int 		NOT NULL
	, CreateDate 	datetime 	NOT NULL DEFAULT Getdate()
	, FormID 	int 		NOT NULL
	, Question 	varchar (1000)
	, ChoiceID 	int
	, ChoiceText 	varchar (2000)
)
GO

 CREATE  INDEX IDX_Feedback1 ON dbo.Feedback2(UserID)
 CREATE  INDEX IDX_Feedback2 ON dbo.Feedback2(CourseID)
 CREATE  INDEX IDX_Feedback3 ON dbo.Feedback2(FormID)
 CREATE  INDEX IDX_Feedback4 ON dbo.Feedback2(ChoiceID)
 CREATE  INDEX IDX_Feedback5 ON dbo.Feedback2(CreateDate)
 CREATE  INDEX IDX_Feedback6 ON dbo.Feedback2(UserID, CourseID, FormID, CreateDate)
GO

CREATE TABLE Feedback3 (
	  UserID 	int 		NOT NULL
	, CourseID 	int 		NOT NULL
	, CreateDate 	datetime 	NOT NULL DEFAULT Getdate()
	, FormID 	int 		NOT NULL
	, Question 	varchar (1000)
	, ChoiceID 	int
	, ChoiceText 	varchar (2000)
)
GO

 CREATE  INDEX IDX_Feedback1 ON dbo.Feedback3(UserID)
 CREATE  INDEX IDX_Feedback2 ON dbo.Feedback3(CourseID)
 CREATE  INDEX IDX_Feedback3 ON dbo.Feedback3(FormID)
 CREATE  INDEX IDX_Feedback4 ON dbo.Feedback3(ChoiceID)
 CREATE  INDEX IDX_Feedback5 ON dbo.Feedback3(CreateDate)
 CREATE  INDEX IDX_Feedback6 ON dbo.Feedback3(UserID, CourseID, FormID, CreateDate)
GO

--SET SHOWPLAN_TEXT OFF
--SET SHOWPLAN_ALL OFF
--GO

SET NOCOUNT ON

DECLARE @x int, @y int, @z int

SELECT @x = 1, @y = 1, @z = 1

WHILE @x < 100
	BEGIN
		WHILE @y < 5
			BEGIN
				WHILE @z < 2
					BEGIN
						INSERT INTO Feedback2 (UserId, CourseId, FormId)
						SELECT @x, @y, @z
						SELECT @z = @z + 1
					END
					SELECT @z = 1, @y = @y + 1
			END
			SELECT @y = 1, @x = @x + 1
	END

    SELECT COUNT(*) FROM Feedback2

SELECT @x = 1, @y = 1, @z = 1

WHILE @x < 200
	BEGIN
		WHILE @y < 100
			BEGIN
				WHILE @z < 10
					BEGIN
						INSERT INTO Feedback3 (UserId, CourseId, FormId)
						SELECT @x, @y, @z
						SELECT @z = @z + 1
					END
					SELECT @z = 1, @y = @y + 1
			END
			SELECT @y = 1, @x = @x + 1
	END

    SELECT COUNT(*) FROM Feedback3
GO


    DELETE A
      FROM dbo.Feedback3 A
INNER JOIN (  SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
		FROM dbo.Feedback3
	    GROUP BY CourseID, UserID, FormID
	    ) B 
	ON B.CourseID   = A.CourseID 
       AND B.UserID     = A.UserID 
       AND B.FormID     = A.FormID 
       AND B.CreateDate > A.CreateDate

SELECT 'Rows Deleted: ' + CONVERT(varchar(15),@@ROWCOUNT)


    DELETE A
      FROM dbo.Feedback2 A
INNER JOIN (  SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
		FROM dbo.Feedback2
	    GROUP BY CourseID, UserID, FormID
	    ) B 
	ON B.CourseID   = A.CourseID 
       AND B.UserID     = A.UserID 
       AND B.FormID     = A.FormID 
       AND B.CreateDate > A.CreateDate


SELECT 'Rows Deleted: ' + CONVERT(varchar(15),@@ROWCOUNT)
GO

SET NOCOUNT OFF
DROP TABLE FeedBack2
DROP TABLE FeedBack3
GO

Brett

8-)

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 13:49:47

Hi Brett,

Searching on "Eager spool", yours is the only post that speaks of it.

Sam

X002548
Not Just a Number

15586 Posts

Posted - 2003-11-10 : 14:05:54

Check out the plan on the secons one...

Table Spool/ Eager Spool...

Never seen it before...

And I'm still confused on how I saw parallelism...and now can't recreate it...

damn hangover...

Brett

8-)

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 16:03:42

Hi Brett,

I ran your test scripts (thanks) and found that there's no duplicate data to delete - so that explains why 0 rows were deleted.

I see the eager spool. I can imagine the SQL design team thinking up these names... (apethetic spool? no...)

The execution plan for

DELETE	A
	FROM	dbo.Feedback2 A
	INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

is much shorter, but has table scans, not eager spools when I paste it into the script you wrote. Same DB and table, but it runs WAY faster than the test query in your script. (20 ms versus 2100 ms) using GETDATE to time 'em like so..

declare @T1 DATETIME
SET @t1 = GETDATE()

DELETE A
      FROM dbo.Feedback3 A
INNER JOIN (  SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
		FROM dbo.Feedback3
	    GROUP BY CourseID, UserID, FormID
	    ) B 
	ON B.CourseID   = A.CourseID 
       AND B.UserID     = A.UserID 
       AND B.FormID     = A.FormID 
       AND B.CreateDate > A.CreateDate

Print DATEDIFF(MS, @t1, getdate())

SET @t1 = GETDATE()
DELETE	A
	FROM	dbo.Feedback2 A
	INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

Print DATEDIFF(MS, @t1, getdate())

Moral: No matter how good it looks, you may be look'en at the wrong thing altogether.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources