
 SQL Server Development (2000)
 Help with a massive deletion


joelpravin
Starting Member

3 Posts

Posted - 2006-02-25 : 07:53:44
I am cleaning up a table that currently has ~24 million records, and I need to delete about 21.5 million of them.
There are 6 columns of concern; let's call the table DynamicData.

DynamicData consists of
id uniqueidentifier,
c1 uniqueidentifier,
c2 uniqueidentifier,
c3 varchar(50),
c4 ntext,
c5 uniqueidentifier

c1 has a clustered index on it.
id has a nonclustered index on it.
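
In DDL terms it looks roughly like this (the index names here are only illustrative, not the real ones):

CREATE TABLE dbo.DynamicData
(
id uniqueidentifier,
c1 uniqueidentifier,
c2 uniqueidentifier,
c3 varchar(50),
c4 ntext,
c5 uniqueidentifier
)

CREATE CLUSTERED INDEX IX_DynamicData_c1 ON dbo.DynamicData (c1)
CREATE NONCLUSTERED INDEX IX_DynamicData_id ON dbo.DynamicData (id)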

The statement in question is as follows:

delete from DynamicData with (rowlock)
where c1 is null
or
(
c2 is null and
c3 is null and
c4 is null and
c5 is null
)

Running this query causes the transaction log to grow from 2 GB to about 50 GB and takes ~23 hours in our 4-processor / 3 GB RAM environment.

I have rewritten the previous script to try to optimize the query. It is now as follows:

CREATE TABLE temp_DynamicData(id uniqueidentifier)
CREATE UNIQUE CLUSTERED INDEX XPKNULLDynamicData ON temp_DynamicData(id)



INSERT INTO temp_DynamicData
select id from DynamicData (nolock)
where
c2 is null and
c3 is null and
c4 is null and
c5 is null

delete from DynamicData with (rowlock)
where c1 is null

DELETE dd with (tablock)
FROM DynamicData dd
INNER JOIN temp_DynamicData t (nolock) ON t.id = dd.id

I have created a separate table to store the ids to be deleted, and have done a join on the DynamicData id (which has an index on it). I have changed the rowlock to tablock to minimize paging. Can anyone validate whether the above changes would be beneficial in optimizing the original delete script? If further changes are required, can anyone propose what should be done?

My major areas of concern are that I don't want the transaction log to grow so large, and I would like to cut down the time this query takes to run. This is all in the context of a conversion script, so no users will be using the database while it runs.

Thanks in advance

Kristen
Test

22859 Posts

Posted - 2006-02-25 : 08:15:50
I reckon you have a couple of choices.

As you want to delete 21.5/24ths of your data - about 90% - you would probably be better off with:

Copy the rows you want to keep to a temporary table
Drop the original table
Rename the temporary table to original name
Reinstate any indexes

This is slightly more tedious if you need to preserve FKs (have to drop them first, then reinstate them)

If this is 24/7 and people need to use the table, or there are Delete TRIGGERS and/or Cascading-deletes on the table, then that will be a bad plan!
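
Something along these lines (just a rough sketch using your DynamicData columns - the index names are only illustrative):

-- Copy the rows you want to KEEP (the inverse of your delete criteria)
-- NB: SELECT INTO is only minimally logged under SIMPLE / BULK_LOGGED recovery
SELECT *
INTO dbo.DynamicData_NEW
FROM dbo.DynamicData
WHERE c1 IS NOT NULL
AND NOT (c2 IS NULL AND c3 IS NULL AND c4 IS NULL AND c5 IS NULL)

-- Swap the tables over
DROP TABLE dbo.DynamicData
EXEC sp_rename 'dbo.DynamicData_NEW', 'DynamicData'

-- Reinstate the indexes
CREATE CLUSTERED INDEX IX_DynamicData_c1 ON dbo.DynamicData (c1)
CREATE NONCLUSTERED INDEX IX_DynamicData_id ON dbo.DynamicData (id)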

Second approach is to delete the rows in a loop. Delete a modest number each time round the loop. Keep looping until no more rows exist to delete.

This can be aborted at any time with rollback restricted to the current batch only. I reckon it will be oodles faster too, as it will be putting less strain on the Tlog file (e.g. avoiding numerous file extensions).

You will need to either back up the TLog frequently during this process (to stop it extending to a vast size), or change the RECOVERY MODEL to SIMPLE whilst this is running, and back to FULL again after it has finished.

If your recovery model is already SIMPLE then skip this step and you are good to go. If not, AND you are 24/7, changing the recovery model is probably a bad idea as you will lose your ability to recover to a point in time during the delete operation - might be an acceptable risk though!
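
e.g. (database name and backup path are just placeholders):

-- Before the delete: switch to SIMPLE so the TLog is truncated at each checkpoint
ALTER DATABASE MyDatabase SET RECOVERY SIMPLE

-- ... run the batched delete loop here ...

-- Afterwards: back to FULL, then take a full backup to restart the log backup chain
ALTER DATABASE MyDatabase SET RECOVERY FULL
BACKUP DATABASE MyDatabase TO DISK = 'D:\Backups\MyDatabase.bak'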

If this is 24/7 you should also put a WAITFOR DELAY of 5 seconds or so inside the loop so that during each iteration other connected users get "their chance".

Here's an example that I use (needs local variables DECLARing):

SELECT @intRowCount = 1, -- Force first iteration
@intErrNo = 0,
@intLoops = 300 -- maximum loops to make (to prevent running for too long)

SELECT @intRowsToDelete = COUNT(*) -- Number of rows to be deleted
FROM dbo.MyTable
WHERE ... MyDeleteCriteria ...

WHILE @intRowCount > 0 AND @intErrNo = 0 AND @intLoops > 0
BEGIN
SELECT @dtLoop = GetDate()

SET ROWCOUNT @DEL_ROWCOUNT -- number of delete rows / iteration

DELETE D
FROM dbo.MyTable AS D
WHERE ... MyDeleteCriteria ...
SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT

SET ROWCOUNT 0 -- Reset batch size to "all"

SELECT @intRowsToDelete = @intRowsToDelete - @intRowCount,
@intLoops = @intLoops - 1

-- Debugging usage only:
PRINT 'Deleted: ' + CONVERT(varchar(20), @intRowCount)
+ ', Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtLoop, GetDate())) + ' seconds,'
+ ' remaining to delete=' + CONVERT(varchar(20), @intRowsToDelete)
+ ', Loops left=' + CONVERT(varchar(20), @intLoops)

WAITFOR DELAY '000:00:05' -- 5 seconds for other users to gain access
END

Kristen

joelpravin
Starting Member

3 Posts

Posted - 2006-02-25 : 08:48:15
Thanks for that Kristen.
The option for going through the deletion in chunks seems the best option for my scenario. A question for the code you posted,

the SET ROWCOUNT @DEL_ROWCOUNT
is that supposed to be
SET ROWCOUNT @intRowCount?

Thanks in advance

Joel

Kristen
Test

22859 Posts

Posted - 2006-02-25 : 12:08:34
Sorry, didn't define any of the variables as they were defined "elsewhere" in the code I cut it from.

@DEL_ROWCOUNT is an INT and should be set to the number of rows you want to delete in each batch. Or hardwire it:

SET ROWCOUNT 100000

@intRowCount is a different variable that holds the number of rows deleted in the previous batch - when it gets to Zero the job is done.

We use a variable for @DEL_ROWCOUNT because we change it dynamically in the delete loop according to the load on the server - to make it delete fewer rows per loop iteration if the server gets busy for some reason. I should have explained that - but I suggest it's best that you ignore that for your one-off application though!
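
For completeness, the DECLAREs the loop needs are along these lines:

DECLARE @intRowCount int
DECLARE @intErrNo int
DECLARE @intLoops int
DECLARE @intRowsToDelete int
DECLARE @dtLoop datetime
DECLARE @DEL_ROWCOUNT int

SET @DEL_ROWCOUNT = 100000 -- rows to delete per iteration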

Kristen

pphillips001
Starting Member

23 Posts

Posted - 2010-12-15 : 05:00:42
I know I am standing on the shoulders of giants here (all credit to Kristen).

I needed a routine to constantly whittle away at a 250Gb transaction table. The db is 24/7 in production so stealth was a primary concern. So I reduced the delete chunk size right down and had the routine monitor how many active tasks were running. If the upper threshold was met then the routine effectively hibernated until the db was doing nothing again.

I disabled the loop exit clause as well - to keep the routine running.

Here's my adapted version:

-- Transaction Log Syphon
-- http://www.sqlteam.com/forums/post.asp?method=Reply&TOPIC_ID=62356&FORUM_ID=5
-- Original Code : Kristen
-- Developed by : P Phillips 12/2010

DECLARE @intRowsToDelete int
DECLARE @intRowCount int
DECLARE @intErrNo int
DECLARE @intLoops int
DECLARE @dtLoop datetime
DECLARE @intProcCount int
DECLARE @txtStatus varchar(255)

SELECT
@intRowCount = 1, -- Force first iteration
@intErrNo = 0,
@intLoops = 300 -- maximum loops to make (to prevent running for too long)

RAISERROR ('Work out how many rows to delete',0,1) WITH NOWAIT

SELECT
@intRowsToDelete = COUNT(*) -- Number of rows to be deleted
FROM
dbo.YOUR_TABLE
WHERE
... where clause ...

SET @txtStatus = CONVERT(varchar(20),@intRowsToDelete) + ' rows found.'
RAISERROR (@txtStatus,0,1) WITH NOWAIT

RAISERROR ('Find out how many processes are running',0,1) WITH NOWAIT

SELECT @intProcCount = count(*)
FROM sys.dm_exec_requests req
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

RAISERROR ( 'Starting delete loop',0,1) WITH NOWAIT

WHILE @intRowCount > 0 AND @intErrNo = 0 --AND @intLoops > 0
BEGIN
SELECT @dtLoop = GetDate()

SET ROWCOUNT 1000 -- number of delete rows / iteration
DELETE
D
FROM
dbo.YOUR_TABLE AS D WITH (ROWLOCK)
WHERE
... where clause ...


SELECT
@intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT

SET ROWCOUNT 0 -- Reset batch size to "all"

SELECT
@intRowsToDelete = @intRowsToDelete - @intRowCount,
@intLoops = @intLoops - 1

SELECT @intProcCount = count(*)
FROM sys.dm_exec_requests req
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

WAITFOR DELAY '000:00:03' -- 3 seconds for other users to gain access

WHILE @intProcCount > 10 -- hibernation threshold ie how many active tasks need to be running before sleeping
BEGIN
WAITFOR DELAY '000:00:03' -- 3 seconds for other users to gain access

RAISERROR ( 'Exceed Process Limit - Hibernating',0,1) WITH NOWAIT

SELECT @intProcCount = count(*)
FROM sys.dm_exec_requests req
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

END

-- Debugging usage only:

SET @txtStatus = 'Deleted: ' + CONVERT(varchar(20), @intRowCount)
+ ', Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtLoop, GetDate())) + ' seconds,'
+ ' remaining to delete=' + CONVERT(varchar(20), @intRowsToDelete)
+ ', Loops left=' + CONVERT(varchar(20), @intLoops)
+ ', Active Procs=' + CONVERT(varchar(20), @intProcCount)

RAISERROR (@txtStatus,0,1) WITH NOWAIT

END



P. Phillips


===========================
There is no reality - only perspective.

Kristen
Test

22859 Posts

Posted - 2010-12-15 : 05:22:51
I think that the

SET ROWCOUNT 1000 -- number of delete rows / iteration

should be adjustable according to how many rows can actually be deleted - varied up/down depending on how long each iteration takes, i.e. how "intrusive" it is to other concurrent processes. It is very unlikely that there will be any difference between deleting 1,000 and 10,000 rows per batch - but there might be! But the "cost" of finding each batch to process may be more significant than the actual delete, particularly if the batch size is small.
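
For example, using a variable batch size again (@DEL_ROWCOUNT from my earlier example, with SET ROWCOUNT @DEL_ROWCOUNT in place of the hardwired 1000), something like this at the end of each iteration - the thresholds are only illustrative:

-- Adjust the batch size for the next iteration according to how long this one took
IF DATEDIFF(Second, @dtLoop, GetDate()) > 30 AND @DEL_ROWCOUNT > 1000
    SET @DEL_ROWCOUNT = @DEL_ROWCOUNT / 2 -- too slow / too intrusive: back off
ELSE IF DATEDIFF(Second, @dtLoop, GetDate()) < 5
    SET @DEL_ROWCOUNT = @DEL_ROWCOUNT * 2 -- plenty of headroom: ramp up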

I also think that

DELETE
D
FROM
dbo.YOUR_TABLE AS D WITH (ROWLOCK)
WHERE
... where clause ...

is also very expensive, and the deletes may be random (across the table) - particularly if a secondary index is used to locate the candidate records, rather than the clustered index.

So we have moved on and now prefer to get a complete list of PK IDs to be deleted into a temporary table, up-front, ordered by Clustered Index, and then delete ranges of them in a loop:

Create #TEMP with IDENTITY and MyPK columns, with Primary Key on ID

INSERT INTO #TEMP(MyPK)
SELECT MyPK
FROM MyTable
WHERE
... where clause for DELETE ...
ORDER BY MyPK -- Needs to be the sequence for the Clustered Index

and then each delete loop will be

DELETE D
FROM MyTable AS D
JOIN #TEMP AS T
ON T.MyPK = D.MyPK
WHERE T.ID >= @Start
AND T.ID < @Start + @BatchSize
SELECT @Start = @Start + @BatchSize
... loop ...

this "locates" a @BatchSize block of records to be deleted much more efficiently than re-applying the WHERE clause for Delete to the whole table and, being in Clustered Index Order, may more efficiently identify an adjacent block of records to be deleted making the actual process more efficient (that assumes that stale rows are begin deleted, and "all old data is 'stale'" - rather than the "stale" data being randomly distributed - although even then this is probably more efficient).

However, I would draw the line at putting the PKs for 250GB of data into #TEMP! So that, itself, may have to be restricted in size per loop iteration.

Lastly, we increase the Tlog backup frequency during our housekeeping (which includes stale data delete and also index rebuild etc.) from 15 minutes to 2 minutes to prevent excessive TLog growth. Our maintenance runs when the site is, usually, very quiet so tends to be allowed to monopolise the server (the @BatchSize tends to safely increase, rather than be reduced because the operation is found to be slow / intrusive)

pphillips001
Starting Member

23 Posts

Posted - 2010-12-15 : 06:17:45
Both an excellent and elegant solution.

Unfortunately I'm dealing with an un-indexed heap, so sorting offers little benefit.




===========================
There is no reality - only perspective.