Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Development (2000)
 Delete Puzzler

Author  Topic 

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 12:59:20
Damian recently posted a solution for deleting older records from a Table.

I've had a nagging concern that there may be a "less costly" way to find and delete older records without requiring an INNER JOIN . Before chasing a new solution, a check of the execution plan on the INNER JOIN solution show a surprising two table scans (sad to see since the table is FULLY indexed). This can't be right, but it may be the *bad* result of doing an INNER JOIN using a " < " condition?

Here's the query as it is used today...

DELETE A
FROM dbo.Feedback2 A
INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate


Moving on in hopes that a revised query won't use a table scan... An alternative DELETE (conceptually) would delete all records of a given catagory where the record's "CreateDate" is less than MAX(CreateDate) for that catagory.

DELETE FROM dbo.Feedback2
WHERE catagory = this catagory AND CreateDate < MAX(CreateDate this catagory)


Of course this won't work as is, and using a MAX would require GROUP BY somwhere.

This thought process leads to a formulation for the recordset to SAVE (not DELETE)

SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback2
GROUP BY CourseID, UserID, FormID

Deleting the unwanted records is then

DELETE A
FROM dbo.Feedback2 A
INNER JOIN (
SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback2
GROUP BY CourseID, UserID, FormID

) B on B.CourseID=A.CourseID AND B.UserID = A.UserID AND B.FormID = A.FormID AND B.CreateDate > A.CreateDate

Comparing the execution plan of both solutions shows disappointing table scans in both queries. This suggests both queries are winners, but I have a problem with the table statistics forcing the table scans. Is there a way to reset statistics on a table which might lead to a better execution plan?

Sam

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 18:03:13
How many rows in the table?
Are you seeking for a better performance solution or trying to
get more eyes-pleasant execution plan?
How about simple:
delete from t where dt<>(select max(dt) from t tt where tt.uid=t.uid) ??
PS Common mistake: to take "it all" for a kind of Real Science...
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 18:18:17
Right now, only 300 rows. It's going to grow large over time however. I need efficent execution, not a good-looking query. Right now Damian's INNER JOIN is as good (or better) than anything else, but it's hard to compare when table scans show up in the execution plan.

Could the compiler be choosing a table scan because it's more efficent than an index seek for a few hundred rows?

Sam
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 18:54:24
I think it is so. Too few rows.
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 19:00:55
quote:
it's hard to compare when table scans show up in the execution plan
Sorry, Sam, but I think that Experiment is the best Judge.
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 19:16:09
quote:

(sad to see since the table is FULLY indexed


SamC, Can you post Feedback2 DDL and indexes?
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-09 : 19:51:04
ehorn,

do you mean the DateTime column is indexed on its own?
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 19:52:36
I am trying to better understand the indexes you have on the Feedback2 tabel as you had mentioned it is "fully indexed"..
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-09 : 19:55:28
[code]CREATE TABLE [Feedback2] (
[UserID] [int] NOT NULL ,
[CourseID] [int] NOT NULL ,
[CreateDate] [datetime] NOT NULL ,
[FormID] [int] NOT NULL ,
[Question] [nvarchar] (1000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL ,
[ChoiceID] [int] NOT NULL ,
[ChoiceText] [nvarchar] (2000) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
) ON [PRIMARY]
GO
CREATE INDEX [IDX_Feedback1] ON [dbo].[Feedback2]([UserID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback2] ON [dbo].[Feedback2]([CourseID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback3] ON [dbo].[Feedback2]([FormID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback4] ON [dbo].[Feedback2]([ChoiceID]) ON [PRIMARY]
CREATE INDEX [IDX_Feedback5] ON [dbo].[Feedback2]([CreateDate]) ON [PRIMARY]
GO
[/code]
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-09 : 23:54:15
One option is to force an index...
You can add a primary key, such as:
FeedBack2ID INT (IDENTITY(0,1) PRIMARY KEY

spooling costs but scales better than the table scans
DELETE A
FROM FeedBack2 A
INNER JOIN FeedBack2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

quote:
This thought process leads to a formulation for the recordset to SAVE (not DELETE)
Which leads to another option to use the nonclustered indexes to build the alias (d) and the clustered index (FeedBack2ID) to perform the delete.
This option still produces the spooling but also scales much better than the scan and makes use of all the indexes.
delete 	FeedBack2 
where Feedback2ID not in
(
select Feedback2ID from FeedBack2 c
join (
select
distinct a.courseid,
a.userid,
a.formid,
(select max(createdate) from FeedBack2 b where b.courseid=a.courseid and b.userid = a.userid and b.formid=a.formid) createdate
from FeedBack2 a
) d
on c.courseid=d.courseid and c.userid = d.userid and c.formid=d.formid and c.createdate = d.createdate
)


How to prevent those spools???
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 07:05:24
quote:
One option is to force an index...
You can add a primary key, such as:

FeedBack2ID INT (IDENTITY(0,1) PRIMARY KEY


spooling costs but scales better than the table scans

DELETE A
FROM FeedBack2 A
INNER JOIN FeedBack2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate


Why would adding a primary key (which isn't used in the query) cause the compiler not to generate table scans?

Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 10:02:08
Sam, I do not understand this...

My assumption is that the optimizer cannot assume any uniqueness on the self-join w/o a unique constraint. But it does seem to always use the Clustered index for additional "optimization" ??? information. As You and Stoad discussed, maybe the optimizer sees the scan as the best solution, although, regardless of how large the table gets, it appears to never use the non-clustered indexes for the hash match. Perhaps the Gurus can better explain what is happening here or suggest a more appropriate course.
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 10:43:35
I regret naming this thread "Delete puzzler" since the indexing (not the query) seems to be generating the poor execution plan.

The DDL is posted above. If anyone suggests a better index scheme, I'll give it a try.

Sam
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 11:15:47
Sam,

Are the existing indexes (UserID,CourseID,FormID,Createdate) serving some purpose for other application functionality (ie other types of searching)?
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 11:27:31
The only other operations are INSERTs, and there's a report which uses this query:

SELECT	CASE WHEN ChoiceID > 0 THEN Total ELSE NULL END AS Total ,
CAST(CASE WHEN ChoiceID > 0 AND GrandTotal > 0 THEN CAST(100.0 * Total / GrandTotal AS NUMERIC(3,0)) ELSE NULL END AS VARCHAR) + '%' AS [Percent] ,
Question
FROM (
SELECT FormID, 0 as ChoiceID,
'<b>' + MAX(Question) + '</b>' AS Question,
COUNT(*) as Total , -- List of all the questions
1 as GrandTotal
FROM dbo.Feedback2
WHERE CourseID = @CourseID
AND ChoiceID > 0 -- Select just the "Choice" Responses
AND CreateDate BETWEEN @StartDate AND @FinishDate
GROUP BY FormID
UNION ALL
SELECT F.FormID, ChoiceID, ChoiceText, COUNT(*) AS Total , MAX(GrandTotal) AS GrandTotal -- FormID, Choice and Total for each choosable item
FROM dbo.Feedback2 F
LEFT OUTER JOIN (SELECT FormID, COUNT(*) AS GrandTotal
FROM dbo.Feedback2
WHERE CourseID=@CourseID
AND ChoiceID > 0
AND CreateDate BETWEEN @StartDate AND @FinishDate
GROUP BY FormID) GT ON GT.FormID = F.FormID
WHERE CourseID = @CourseID
AND ChoiceID > 0 -- Select just the "Choice" Responses
AND CreateDate BETWEEN @StartDate AND @FinishDate
GROUP BY F.FormID, ChoiceID, ChoiceText
) X
ORDER BY FormID ASC, ChoiceID ASC


This query produces a summary of the ChoiceIDs (radio buttons) user's selected on certain FormIDs, totals 'em up, calculates a percentage.

It's not pretty.

Sam
Go to Top of Page

ehorn
Master Smack Fu Yak Hacker

1632 Posts

Posted - 2003-11-10 : 11:44:09
Sorry for so many questions. Are the indexes being used for the insert?
As you probably know indexes slow down insert,updates and deletes. If the indexes are not be effectively used you may consider removing them with the exception of maybe createdate and a primary key. I tested the delete and it was much faster with no spooling. Not sure how this will effect insert but it may ultimately be a design tradeoff.
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2003-11-10 : 13:00:03
Sam,

I was playing aroung with this, when I realized the Delete will never delete anything...anyway, back toi the index thingee...

I swear I saw a one of the plans using parallesim...but can't get it to repeat it...Takes about 15 minutes to run, and it needs to be run in stages...but it's interesting...what's eager spool?


CREATE TABLE Feedback2 (
UserID int NOT NULL
, CourseID int NOT NULL
, CreateDate datetime NOT NULL DEFAULT Getdate()
, FormID int NOT NULL
, Question varchar (1000)
, ChoiceID int
, ChoiceText varchar (2000)
)
GO

CREATE INDEX IDX_Feedback1 ON dbo.Feedback2(UserID)
CREATE INDEX IDX_Feedback2 ON dbo.Feedback2(CourseID)
CREATE INDEX IDX_Feedback3 ON dbo.Feedback2(FormID)
CREATE INDEX IDX_Feedback4 ON dbo.Feedback2(ChoiceID)
CREATE INDEX IDX_Feedback5 ON dbo.Feedback2(CreateDate)
CREATE INDEX IDX_Feedback6 ON dbo.Feedback2(UserID, CourseID, FormID, CreateDate)
GO

CREATE TABLE Feedback3 (
UserID int NOT NULL
, CourseID int NOT NULL
, CreateDate datetime NOT NULL DEFAULT Getdate()
, FormID int NOT NULL
, Question varchar (1000)
, ChoiceID int
, ChoiceText varchar (2000)
)
GO

CREATE INDEX IDX_Feedback1 ON dbo.Feedback3(UserID)
CREATE INDEX IDX_Feedback2 ON dbo.Feedback3(CourseID)
CREATE INDEX IDX_Feedback3 ON dbo.Feedback3(FormID)
CREATE INDEX IDX_Feedback4 ON dbo.Feedback3(ChoiceID)
CREATE INDEX IDX_Feedback5 ON dbo.Feedback3(CreateDate)
CREATE INDEX IDX_Feedback6 ON dbo.Feedback3(UserID, CourseID, FormID, CreateDate)
GO

--SET SHOWPLAN_TEXT OFF
--SET SHOWPLAN_ALL OFF
--GO

SET NOCOUNT ON

DECLARE @x int, @y int, @z int

SELECT @x = 1, @y = 1, @z = 1

WHILE @x < 100
BEGIN
WHILE @y < 5
BEGIN
WHILE @z < 2
BEGIN
INSERT INTO Feedback2 (UserId, CourseId, FormId)
SELECT @x, @y, @z
SELECT @z = @z + 1
END
SELECT @z = 1, @y = @y + 1
END
SELECT @y = 1, @x = @x + 1
END

SELECT COUNT(*) FROM Feedback2

SELECT @x = 1, @y = 1, @z = 1

WHILE @x < 200
BEGIN
WHILE @y < 100
BEGIN
WHILE @z < 10
BEGIN
INSERT INTO Feedback3 (UserId, CourseId, FormId)
SELECT @x, @y, @z
SELECT @z = @z + 1
END
SELECT @z = 1, @y = @y + 1
END
SELECT @y = 1, @x = @x + 1
END

SELECT COUNT(*) FROM Feedback3
GO


DELETE A
FROM dbo.Feedback3 A
INNER JOIN ( SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback3
GROUP BY CourseID, UserID, FormID
) B
ON B.CourseID = A.CourseID
AND B.UserID = A.UserID
AND B.FormID = A.FormID
AND B.CreateDate > A.CreateDate

SELECT 'Rows Deleted: ' + CONVERT(varchar(15),@@ROWCOUNT)


DELETE A
FROM dbo.Feedback2 A
INNER JOIN ( SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback2
GROUP BY CourseID, UserID, FormID
) B
ON B.CourseID = A.CourseID
AND B.UserID = A.UserID
AND B.FormID = A.FormID
AND B.CreateDate > A.CreateDate


SELECT 'Rows Deleted: ' + CONVERT(varchar(15),@@ROWCOUNT)
GO

SET NOCOUNT OFF
DROP TABLE FeedBack2
DROP TABLE FeedBack3
GO




Brett

8-)
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 13:49:47
Hi Brett,

Searching on "Eager spool", yours is the only post that speaks of it.

Sam
Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2003-11-10 : 14:05:54
Check out the plan on the secons one...

Table Spool/ Eager Spool...

Never seen it before...

And I'm still confused on how I saw parallelism...and now can't recreate it...

damn hangover...



Brett

8-)
Go to Top of Page

SamC
White Water Yakist

3467 Posts

Posted - 2003-11-10 : 16:03:42
Hi Brett,

I ran your test scripts (thanks) and found that there's no duplicate data to delete - so that explains why 0 rows were deleted.

I see the eager spool. I can imagine the SQL design team thinking up these names... (apethetic spool? no...)

The execution plan for
DELETE	A
FROM dbo.Feedback2 A
INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate


is much shorter, but has table scans, not eager spools when I paste it into the script you wrote. Same DB and table, but it runs WAY faster than the test query in your script. (20 ms versus 2100 ms) using GETDATE to time 'em like so..

declare @T1 DATETIME
SET @t1 = GETDATE()

DELETE A
FROM dbo.Feedback3 A
INNER JOIN ( SELECT CourseID, UserID, FormID, MAX(CreateDate) as CreateDate
FROM dbo.Feedback3
GROUP BY CourseID, UserID, FormID
) B
ON B.CourseID = A.CourseID
AND B.UserID = A.UserID
AND B.FormID = A.FormID
AND B.CreateDate > A.CreateDate

Print DATEDIFF(MS, @t1, getdate())

SET @t1 = GETDATE()
DELETE A
FROM dbo.Feedback2 A
INNER JOIN dbo.Feedback2 B ON A.CourseID=B.CourseID AND A.UserID=B.UserID AND A.FormID=B.FormID AND A.CreateDate < B.CreateDate

Print DATEDIFF(MS, @t1, getdate())


Moral: No matter how good it looks, you may be look'en at the wrong thing altogether.
Go to Top of Page
   

- Advertisement -