joelpravin
Starting Member
3 Posts
Posted - 2006-02-25 : 07:53:44
I currently have a scenario where I am cleaning up a table that currently has ~24 million records. I need to delete about 21.5 million records from this table. There are 6 columns of concern; let's call the table DynamicData. DynamicData consists of:

id uniqueidentifier,
c1 uniqueidentifier,
c2 uniqueidentifier,
c3 varchar(50),
c4 ntext,
c5 uniqueidentifier

c1 has a clustered index on it. id has a nonclustered index on it.

The statement in question is as follows:

delete from DynamicData with (rowlock)
where c1 is null
or (c2 is null and c3 is null and c4 is null and c5 is null)

Running this query causes the transaction log to grow from 2 GB to about 50 GB and takes ~23 hours in our 4-processor / 3 GB RAM environment.

I have rewritten the previous script to try to optimize the query. It now is as follows:

CREATE TABLE temp_DynamicData (id uniqueidentifier)

CREATE UNIQUE CLUSTERED INDEX XPKNULLDynamicData ON temp_DynamicData (id)

INSERT INTO temp_DynamicData
select id from DynamicData (nolock)
where c2 is null and c3 is null and c4 is null and c5 is null

delete from DynamicData with (rowlock)
where c1 is null

DELETE dd with (tablock)
FROM DynamicData dd
INNER JOIN temp_DynamicData t (nolock) ON t.id = dd.id

I have created a separate table to store the values to be deleted, and have done a join on the DynamicData id (which has an index on it). I have changed the rowlock to tablock to minimize paging. Can anyone validate whether the above changes would be beneficial in optimizing the original delete script? If further changes are required, can anyone propose what should be done?

My major areas of concern are that I don't want the transaction log to grow so large, and I would like to cut down the time this query takes to run. This is all in the context of a conversion script, so no users will be using the database while this script runs. Thanks in advance
Kristen
Test
22859 Posts
Posted - 2006-02-25 : 08:15:50
I reckon you have a couple of choices.

As you want to delete 21.5/24ths of your data - about 90% - you would probably be better off with:

Copy the rows you want to keep to a temporary table
Drop the original table
Rename the temporary table to the original name
Reinstate any indexes

This is slightly more tedious if you need to preserve FKs (you have to drop them first, then reinstate them). If this is 24/7 and people need to use the table, or there are DELETE triggers and/or cascading deletes on the table, then that will be a bad plan!

Second approach is to delete the rows in a loop. Delete a modest number each time round the loop. Keep looping until no more rows exist to delete. This can be aborted at any time, with rollback restricted to the current batch only. I reckon it will be oodles faster too, as it will be putting less strain (e.g. numerous extensions) on the TLog file.

You will need to either back up the TLog frequently during this process (to stop it extending to a vast size), or change the RECOVERY MODEL to SIMPLE whilst this is running, and back to FULL again after it has finished. If your recovery model is already SIMPLE then skip this step and you are good to go. If not, AND you are 24/7, changing the recovery model is probably a bad idea, as you will lose your ability to recover to a point in time during the delete operation - it might be an acceptable risk though!

If this is 24/7 you should also put a WAIT of 5 seconds or so inside the loop so that during each iteration other connected users get "their chance".

Here's an example that I use (it needs local variables DECLARing):

SELECT @intRowCount = 1, -- Force first iteration
       @intErrNo = 0,
       @intLoops = 300 -- maximum loops to make (to prevent running for too long)

SELECT @intRowsToDelete = COUNT(*) -- Number of rows to be deleted
FROM dbo.MyTable
WHERE ... MyDeleteCriteria ...

WHILE @intRowCount > 0 AND @intErrNo = 0 AND @intLoops > 0
BEGIN
    SELECT @dtLoop = GetDate()

    SET ROWCOUNT @DEL_ROWCOUNT -- number of rows to delete per iteration

    DELETE D
    FROM dbo.MyTable AS D
    WHERE ... MyDeleteCriteria ...

    SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT
    SET ROWCOUNT 0 -- Reset batch size to "all"

    SELECT @intRowsToDelete = @intRowsToDelete - @intRowCount,
           @intLoops = @intLoops - 1

    -- Debugging usage only:
    PRINT 'Deleted: ' + CONVERT(varchar(20), @intRowCount)
        + ', Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtLoop, GetDate())) + ' seconds,'
        + ' remaining to delete=' + CONVERT(varchar(20), @intRowsToDelete)
        + ', Loops left=' + CONVERT(varchar(20), @intLoops)

    WAITFOR DELAY '000:00:05' -- 5 seconds for other users to gain access
END

Kristen
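The first "copy, drop, rename" approach can be sketched as follows against the OP's DynamicData table. The keep-predicate is just the negation of the delete-predicate; the index names are illustrative, not from the original schema. SELECT ... INTO is minimally logged under the SIMPLE or BULK_LOGGED recovery model, which is what avoids the huge TLog growth:

```sql
-- Copy the ~2.5 million rows to KEEP (minimally logged under
-- SIMPLE / BULK_LOGGED recovery, unlike a 21.5M-row DELETE)
SELECT *
INTO DynamicData_keep
FROM DynamicData
WHERE c1 IS NOT NULL
  AND (c2 IS NOT NULL OR c3 IS NOT NULL OR c4 IS NOT NULL OR c5 IS NOT NULL)

-- Swap the tables (drop any FKs referencing DynamicData first)
DROP TABLE DynamicData
EXEC sp_rename 'DynamicData_keep', 'DynamicData'

-- Reinstate the original indexes (names here are hypothetical)
CREATE CLUSTERED INDEX IX_DynamicData_c1 ON DynamicData (c1)
CREATE NONCLUSTERED INDEX IX_DynamicData_id ON DynamicData (id)
```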
joelpravin
Starting Member
3 Posts
Posted - 2006-02-25 : 08:48:15
Thanks for that, Kristen.

The option for going through the deletion in chunks seems the best option for my scenario. A question about the code you posted: the

SET ROWCOUNT @DEL_ROWCOUNT

- is that supposed to be SET ROWCOUNT @intRowCount?

Thanks in advance
Joel
Kristen
Test
22859 Posts
Posted - 2006-02-25 : 12:08:34
Sorry, I didn't define any of the variables as they were defined "elsewhere" in the code I cut it from.

@DEL_ROWCOUNT is an INT and should be set to the number of rows you want to delete in each batch. Or hardwire it:

SET ROWCOUNT 100000

@intRowCount is a different variable that holds the number of rows deleted by the previous batch - when it gets to zero the job is done.

We use a variable for @DEL_ROWCOUNT because we change it dynamically in the delete loop according to the load on the server - to make it delete fewer rows per loop iteration if the server gets busy for some reason. I should have explained that - but I suggest it's best that you ignore that for your one-off application!

Kristen
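Putting the loop together with the declarations it needs, the preamble would look something like this (the batch size of 100000 is only an example value; tune it to your environment):

```sql
DECLARE @DEL_ROWCOUNT   int       -- rows to delete per batch
DECLARE @intRowCount    int       -- rows deleted by the previous batch
DECLARE @intErrNo       int       -- last error number
DECLARE @intLoops       int       -- remaining loop iterations
DECLARE @intRowsToDelete int      -- running count of rows still to delete
DECLARE @dtLoop         datetime  -- start time of current iteration

SET @DEL_ROWCOUNT = 100000        -- example batch size

SELECT @intRowCount = 1,          -- force first iteration
       @intErrNo = 0,
       @intLoops = 300            -- cap total runtime
```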
pphillips001
Starting Member
23 Posts
Posted - 2010-12-15 : 05:00:42
I know I am standing on the shoulders of giants here (all credit to Kristen).

I needed a routine to constantly whittle away at a 250 GB transaction table. The db is 24/7 in production, so stealth was a primary concern. So I reduced the delete chunk size right down and had the routine monitor how many active tasks were running. If the upper threshold was met, the routine effectively hibernated until the db was doing nothing again. I disabled the loop exit clause as well - to keep the routine running.

Here's my adapted version:

-- Transaction Log Syphon
-- http://www.sqlteam.com/forums/post.asp?method=Reply&TOPIC_ID=62356&FORUM_ID=5
-- Original Code : Kristen
-- Developed by : P Phillips 12/2010

DECLARE @intRowsToDelete int
DECLARE @intRowCount int
DECLARE @intErrNo int
DECLARE @intLoops int
DECLARE @dtLoop datetime
DECLARE @intProcCount int
DECLARE @txtStatus varchar(255)

SELECT @intRowCount = 1, -- Force first iteration
       @intErrNo = 0,
       @intLoops = 300 -- maximum loops to make (to prevent running for too long)

RAISERROR ('Work out how many rows to delete',0,1) WITH NOWAIT
SELECT @intRowsToDelete = COUNT(*) -- Number of rows to be deleted
FROM dbo.YOUR_TABLE
WHERE ... where clause ...

SET @txtStatus = CONVERT(varchar(20), @intRowsToDelete) + ' rows found.'
RAISERROR (@txtStatus,0,1) WITH NOWAIT

RAISERROR ('Find out how many processes are running',0,1) WITH NOWAIT
SELECT @intProcCount = count(*)
FROM sys.dm_exec_requests req
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

RAISERROR ('Starting delete loop',0,1) WITH NOWAIT
WHILE @intRowCount > 0 AND @intErrNo = 0 --AND @intLoops > 0
BEGIN
    SELECT @dtLoop = GetDate()

    SET ROWCOUNT 1000 -- number of rows to delete per iteration
    DELETE D
    FROM dbo.YOUR_TABLE AS D WITH (ROWLOCK)
    WHERE ... where clause ...

    SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT
    SET ROWCOUNT 0 -- Reset batch size to "all"

    SELECT @intRowsToDelete = @intRowsToDelete - @intRowCount,
           @intLoops = @intLoops - 1

    SELECT @intProcCount = count(*)
    FROM sys.dm_exec_requests req
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext

    WAITFOR DELAY '000:00:03' -- 3 seconds for other users to gain access

    WHILE @intProcCount > 10 -- hibernation threshold: how many active tasks before sleeping
    BEGIN
        WAITFOR DELAY '000:00:03' -- 3 seconds for other users to gain access
        RAISERROR ('Exceeded Process Limit - Hibernating',0,1) WITH NOWAIT
        SELECT @intProcCount = count(*)
        FROM sys.dm_exec_requests req
        CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext
    END

    -- Debugging usage only:
    SET @txtStatus = 'Deleted: ' + CONVERT(varchar(20), @intRowCount)
        + ', Elapsed: ' + CONVERT(varchar(20), DATEDIFF(Second, @dtLoop, GetDate())) + ' seconds,'
        + ' remaining to delete=' + CONVERT(varchar(20), @intRowsToDelete)
        + ', Loops left=' + CONVERT(varchar(20), @intLoops)
        + ', Active Procs=' + CONVERT(varchar(20), @intProcCount)
    RAISERROR (@txtStatus,0,1) WITH NOWAIT
END

P. Phillips
===========================
There is no reality - only perspective.
Kristen
Test
22859 Posts
Posted - 2010-12-15 : 05:22:51
I think that the

SET ROWCOUNT 1000 -- number of rows to delete per iteration

should be adjustable according to how many rows can actually be deleted - varied up/down depending on how long each iteration takes, i.e. how "intrusive" it is to other concurrent processes. It is very unlikely that there will be any difference between deleting 1,000 and 10,000 rows per batch - but there might be! - and the "cost" of finding each batch to process may be more significant than the actual delete, particularly if the batch size is small.

I also think that

DELETE D
FROM dbo.YOUR_TABLE AS D WITH (ROWLOCK)
WHERE ... where clause ...

is very expensive, and the deletes may be random (across the table) - particularly if a secondary index is used to locate the candidate records, rather than the clustered index.

So we have moved on and now prefer to get a complete list of PK IDs to be deleted into a temporary table, up-front, ordered by clustered index, and then delete ranges of them in a loop:

Create #TEMP with IDENTITY and MyPK columns, with a primary key on ID:

INSERT INTO #TEMP (MyPK)
SELECT MyPK
FROM MyTable
WHERE ... where clause for DELETE ...
ORDER BY MyPK -- Needs to be the sequence for the Clustered Index

and then each delete loop will be:

DELETE D
FROM MyTable AS D
JOIN #TEMP AS T ON T.MyPK = D.MyPK
WHERE T.ID >= @Start AND T.ID < @Start + @BatchSize

SELECT @Start = @Start + @BatchSize
... loop ...

This "locates" a @BatchSize block of records to be deleted much more efficiently than re-applying the DELETE's WHERE clause to the whole table, and, being in clustered index order, may more efficiently identify an adjacent block of records to be deleted, making the actual delete more efficient (that assumes that stale rows are being deleted and "all old data is 'stale'", rather than the "stale" data being randomly distributed - although even then this is probably more efficient).

However, I would draw the line at putting the PKs for 250 GB of data into #TEMP! So that, itself, may have to be restricted in size per loop iteration.

Lastly, we increase the TLog backup frequency during our housekeeping (which includes stale-data deletes and also index rebuilds etc.) from every 15 minutes to every 2 minutes, to prevent excessive TLog growth. Our maintenance runs when the site is usually very quiet, so it tends to be allowed to monopolise the server (the @BatchSize tends to safely increase, rather than be reduced because the operation is found to be slow / intrusive)
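Spelled out a little more fully, the #TEMP range-delete pattern described above could look like this (MyTable / MyPK are placeholder names, MyPK is assumed to be the clustered index key of type int, and 10000 is just an example batch size):

```sql
CREATE TABLE #TEMP
(
    ID   int IDENTITY(1,1) PRIMARY KEY,  -- dense, ascending batch key
    MyPK int NOT NULL                     -- clustered index key of target rows
)

-- Capture all candidate PKs up-front, in clustered index order,
-- so the IDENTITY values number them 1..N in physical-ish order
INSERT INTO #TEMP (MyPK)
SELECT MyPK
FROM MyTable
WHERE ... where clause for DELETE ...
ORDER BY MyPK

DECLARE @Start int, @BatchSize int
SELECT @Start = 1, @BatchSize = 10000

WHILE EXISTS (SELECT * FROM #TEMP WHERE ID >= @Start)
BEGIN
    -- Delete one contiguous range of the captured PKs
    DELETE D
    FROM MyTable AS D
    JOIN #TEMP AS T ON T.MyPK = D.MyPK
    WHERE T.ID >= @Start AND T.ID < @Start + @BatchSize

    SELECT @Start = @Start + @BatchSize
    WAITFOR DELAY '000:00:05' -- let other users get their chance
END
```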
pphillips001
Starting Member
23 Posts
Posted - 2010-12-15 : 06:17:45
Both an excellent and elegant solution.

Unfortunately I'm dealing with an un-indexed heap, so sorting offers little benefit.

===========================
There is no reality - only perspective.