zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 05:36:26
To all you DB/SQL experts out there - I hope you can shed some light on my problem. I'll try to keep it short.

The DB has a very large table: about 60m rows, 11GB in size, 2.9GB index size. Each day new data has to be merged into the table. I do that in 64MB batches which I bulk insert into a temporary table and then merge into the large target table. Each of those batches contains roughly half a million rows, some of which are new to the DB (resulting in an insert) and some of which already exist (resulting in an update). I do the operation above via a MERGE using 2 bigint indexes (which are the same in the temporary table as well as the large target table).

The issue is simply that the time the merge statement takes to complete grows as the DB grows - and not just by a little bit, it grows A LOT. Starting with an empty DB, a merge as described above takes about 11 seconds. Now, at the size described above, it takes over 10 minutes(!). I'm aware I can't keep insert times constant, but this is a vast drop. I already put the table onto its own filegroup spanning 4 disks - no real difference.

I found out that it seems to be the inserts of new rows within the merge which cause the performance issue. When I run the same batch twice, or any other batch which I know has already been processed, it only takes about 2 minutes (as all records already exist in the DB and therefore only an update needs to be performed).

I also set the table to autogrow (by 10%) so it doesn't constantly grow in 1MB steps. Recovery model is set to simple to keep the log file small. I also checked the query plan: no table scans are done, but the plan looks very complex for a fairly simple merge. The most expensive items are these 3: "Table Spool", "RID Lookup" and "Index Seek".

It's running on a fairly beefy machine (3GHz, 8GB mem). Am I doing something fundamentally wrong, or do I simply have to accept that SQL becomes very slow at this size? (I rather believe it's the former.) I mean, adding 500k new rows to a 60m-row table shouldn't take 11 minutes - it's only 60MB of data to add.

Any ideas? I look forward to your responses.

ZL
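For readers following along: a RID Lookup operator only appears when the base table is a heap (i.e. has no clustered index). One way to check that, and to see index fragmentation at the same time, is a query like this sketch (it uses the FullURLs table name that appears later in the thread):

    -- index_id 0 / type_desc 'HEAP' means the table has no clustered index;
    -- high avg_fragmentation_in_percent on the nonclustered indexes also hurts.
    SELECT i.name, i.index_id, i.type_desc,
           ps.avg_fragmentation_in_percent, ps.page_count
    FROM sys.indexes AS i
    JOIN sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FullURLs'),
                                        NULL, NULL, 'LIMITED') AS ps
        ON ps.object_id = i.object_id
       AND ps.index_id  = i.index_id;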
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 06:14:16
What is the structure of the table - is it very wide? Are the inserts into the int column sequential? Is there a clustered index?

Change autogrow from 10% to a fixed size, or allocate more space to the table up front. Autogrow takes I/O (but isn't the problem here). Is the merge doing an update as well?

Poor planning on your part does not constitute an emergency on my part.
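For the autogrow suggestion, a fixed-increment setting looks roughly like this (the database and logical file names are placeholders - check sys.master_files for the real ones):

    -- Grow the data file in fixed 512 MB steps instead of 10% of current size.
    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_Data, FILEGROWTH = 512MB);

    -- Or pre-allocate space up front so the merge never waits on file growth:
    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_Data, SIZE = 20GB);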
zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 06:45:28
Hey Dataguru, thx for the reply. To your questions: the table has 10 columns - bigint, int, datetime, datetime, varchar(1024), varchar(512), bigint, bigint, bigint, bigint. 2 of the bigints are indexed (individually) and used in the merge clause. The primary key is ID, but it isn't really used. I tried using the 2 bigints as a combined, clustered primary key, but it made no difference to performance, so I figured it's better not to use them as the primary key to avoid too much page splitting going on (correct me if I'm wrong on this one).

Yes, the merge updates all rows which already exist and inserts those which don't exist yet. When I run a batch which consists only of rows which already exist in the large table, the merge updates all those rows, which is A LOT faster (2 mins). However, on a batch where the data is not yet in the table, the merge will insert, and the merge time is huge (10 mins+ and growing with table size).

Running a little bit out of ideas as to what else to try, tbh.
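From that description, the structure is roughly the following (a reconstruction for illustration only - index names, nullability and the identity property are guesses; only the column names that appear later in the thread come from the posts):

    CREATE TABLE dbo.FullURLs
    (
        ID          bigint IDENTITY PRIMARY KEY,  -- "primary key is ID but not really used"
        TimesSeen   int,
        SeenFirst   datetime,
        SeenLast    datetime,
        URL         varchar(1024),
        [File]      varchar(512),
        URLEExLo    bigint,
        URLEExHi    bigint,
        DomainExLo  bigint,
        DomainExHi  bigint
    );

    -- The two merge keys, indexed individually as described:
    CREATE INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs (URLEExLo);
    CREATE INDEX IX_FullURLs_URLEExHi ON dbo.FullURLs (URLEExHi);

    -- A single composite index on both halves of the key would match the MERGE's
    -- ON clause more directly than two separate single-column indexes:
    -- CREATE INDEX IX_FullURLs_URLEEx ON dbo.FullURLs (URLEExLo, URLEExHi);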
Kristen
Test
22859 Posts
Posted - 2011-01-14 : 06:58:41
Are the indexes rebuilt regularly, and do they have recently updated statistics?

Do any of the UPDATE statements actually change nothing (i.e. the data in the import can be the SAME as the existing data)? If so, I would exclude those rows from the UPDATE with a more complex WHERE clause to reduce the number of null-effect updates (which bloat LOG files too).
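The usual way to skip null-effect updates in a MERGE is to add that comparison to the WHEN MATCHED predicate. A minimal sketch with placeholder table and column names (add ISNULL handling if the compared columns are nullable):

    MERGE dbo.TargetTable AS t
    USING #Staging AS s
        ON s.KeyLo = t.KeyLo AND s.KeyHi = t.KeyHi
    -- Only fire the UPDATE when something actually differs:
    WHEN MATCHED AND (t.SomeCol <> s.SomeCol OR t.OtherCol <> s.OtherCol) THEN
        UPDATE SET t.SomeCol  = s.SomeCol,
                   t.OtherCol = s.OtherCol
    WHEN NOT MATCHED THEN
        INSERT (KeyLo, KeyHi, SomeCol, OtherCol)
        VALUES (s.KeyLo, s.KeyHi, s.SomeCol, s.OtherCol);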
Sachin.Nand
2937 Posts
Posted - 2011-01-14 : 07:01:31
Table spools can be very expensive. What about tempdb - is it on its own independent disk? An inefficient tempdb can make spools very expensive.

PBUH
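A quick way to see where tempdb currently lives, and how many files it has (nothing assumed here beyond a standard install):

    -- Spools and other work tables spill into tempdb, so slow or shared
    -- tempdb storage makes them much more expensive.
    SELECT name, physical_name, size * 8 / 1024 AS size_mb
    FROM tempdb.sys.database_files;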
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:13:46
If you have clustered indexes and are updating records, that takes time. So does inserting records into them if they aren't sequentially ordered. If you have nonclustered indexes, try dropping them, doing the merge, then recreating them.

How many of the records are actually being updated - do you use the OUTPUT clause ($action) to check? It might be useful to see. (The indexes would have to be updated as well, of course, and as said, a clustered index would possibly require re-ordering of pages - or something like that, forgive the terminology if wrong.)

I have also noticed that, despite the efficiency of MERGE, I got much better results when I dropped the index, selected ALL records into a temp table and reinserted them. Also fast was a delete of matching rows followed by an insert of everything from stage (with no indexes in place at the time of the delete or insert). I am still working with it, but I can't do much with the final result structure yet (I can drop and recreate the indexes in place now, but can't change them permanently other than that).

I had one job that updates a 29 million row table from a stage table of 12 million rows and inserts new records (I don't like the operation, but had to duplicate it - we are building a proper add/update/delete staging arena shortly). It took forever. After much experimentation with various index combinations, the fastest result was to delete the matching records from the destination, insert all staged records including the new ones, then rebuild the indexes. That takes maybe 8 minutes total every day, as opposed to 45-50 minutes with the merge and leaving the indexes in place - despite a good query plan. I have a nonclustered index on the merge columns, and I tried combinations of covering indexes using INCLUDE and a number of other things.

My experience is trial and error - I am no index or tuning expert like some here - but like you, I couldn't tolerate the poor performance, so I simply went with the time element until I could properly develop a faster merge.

Poor planning on your part does not constitute an emergency on my part.
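For illustration, the delete-matching-then-insert-everything pattern described above looks roughly like this (table, column and index names are placeholders):

    BEGIN TRAN;

    -- Remove the rows that also exist in the staging table...
    DELETE t
    FROM dbo.TargetTable AS t
    JOIN #Staging AS s
        ON s.KeyLo = t.KeyLo AND s.KeyHi = t.KeyHi;

    -- ...then insert the whole staging table, new and existing rows alike.
    INSERT INTO dbo.TargetTable (KeyLo, KeyHi, SomeCol)
    SELECT KeyLo, KeyHi, SomeCol
    FROM #Staging;

    COMMIT;

    -- Finally rebuild (re-enable) whatever nonclustered indexes were disabled
    -- before the load, or recreate them if they were dropped:
    ALTER INDEX ALL ON dbo.TargetTable REBUILD;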
Sachin.Nand
2937 Posts
Posted - 2011-01-14 : 07:28:41
quote: If you have nonclustered indexes, try dropping them, doing the merge, then recreating them..
I would recommend not doing that. It's better if you just disable them. Dropping and recreating them will cause the T-log file to bloat even in the simple recovery model.

Also, of the 8GB of memory, how much is set for SQL Server?

PBUH
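Checking and capping that setting looks like this (the 6144 MB value is just an example for an 8 GB box):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Show the current setting (the default is effectively unlimited):
    EXEC sp_configure 'max server memory (MB)';

    -- Example: leave roughly 2 GB for the OS and everything else.
    EXEC sp_configure 'max server memory (MB)', 6144;
    RECONFIGURE;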
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:31:06
Good point on disabling the indexes... I had thought of that, but figured rebuilding them might be necessary anyway, since a few of the updated columns (in my case) are used in the indexes.

Poor planning on your part does not constitute an emergency on my part.
zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 07:35:11
Great to hear from you guys.

Kristen: Statistics are set to update automatically. I haven't rebuilt the index yet - the DB is only 2 days old :) The update clause actually does change something: it updates one of the ints and the datetime field of any existing row.

Sachin, thank you, good point regarding tempdb, I shall add another disk for it :)

Btw, here is the query I use:

    merge FullURLs as target
    using #tmpimport as source
        on source.URLExLo = target.URLEExLo and source.URLExHi = target.URLEExHi
    when matched then
        update set SeenLast = CURRENT_TIMESTAMP,
                   target.TimesSeen = target.TimesSeen + source.TimesSeen
    when not matched then
        insert ( TimesSeen, SeenFirst, SeenLast, URL, [File],
                 URLEExLo, URLEExHi, DomainExLo, DomainExHi )
        values ( 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, source.FullURL, source.[File],
                 source.URLExLo, source.URLExHi,
                 fnMD5ExLo( fnMD5Ex(Domain) ), fnMD5ExHi( fnMD5Ex(Domain) ) )
    output $action into #tempresult;

URLEExLo and URLEExHi are the 2 bigints which are individually indexed in the target as well as the temp table.

@dataguru: As you see, I use the OUTPUT clause to log what has been done. On average it's half and half: half of the rows exist already, half are new and need inserting (250k avg each).

Re the idea of reinserting everything: I see this table growing to a few hundred gig in size eventually. I'm not sure how efficient it would be to shuffle all of that into a temp table and back again?
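To split the 2 min vs 10 min observation by action, the #tempresult rows can be tallied like this (a sketch - it assumes #tempresult has a single column, called MergeAction here, that received $action):

    SELECT MergeAction, COUNT(*) AS row_count
    FROM #tempresult
    GROUP BY MergeAction;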
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:37:00
Actually, correct me if I am wrong, but won't disabling the index prevent any updates from being possible? The data remains accessible, but can't be changed... I could be wrong - time to google me some info.

Poor planning on your part does not constitute an emergency on my part.
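For what it's worth, disabling a nonclustered index does not block changes to the table; the index simply stops being maintained (and used) until it is rebuilt. Disabling the clustered index or primary key is different - that takes the whole table offline. A sketch using the hypothetical index name from earlier:

    -- The table stays fully readable and writable while this index is disabled.
    ALTER INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs DISABLE;

    -- ... run the large MERGE here ...

    -- Rebuilding re-enables the index and repopulates it in one pass.
    ALTER INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs REBUILD;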
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:40:59
quote: @dataguru: As you see, I use the OUTPUT clause to log what has been done. On average it's half and half: half of the rows exist already, half are new and need inserting (250k avg each). Re the idea of reinserting everything: I see this table growing to a few hundred gig in size eventually. I'm not sure how efficient it would be to shuffle all of that into a temp table and back again?
That might be managed by partitions and switching? It would depend... if it gets that large, it might be useful to explore partitioning - if you can segregate the data somehow to leave some partitions out of the equation. I agree, though: if it doubles in size, the operation might be a bit too hefty. Like I said, I am continuing to work on my scenario as well, to be more effective and future-efficient.

Poor planning on your part does not constitute an emergency on my part.
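A very small sketch of the partition-switch idea (every name and boundary here is illustrative; a real design needs the table built on the partition scheme, aligned indexes, and a staging table with identical structure):

    CREATE PARTITION FUNCTION pfBySeenFirst (datetime)
        AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

    CREATE PARTITION SCHEME psBySeenFirst
        AS PARTITION pfBySeenFirst ALL TO ([PRIMARY]);

    -- With FullURLs created ON psBySeenFirst(SeenFirst), a fully loaded staging
    -- table can be swapped into an empty partition as a metadata-only operation:
    -- ALTER TABLE dbo.FullURLs_Stage SWITCH TO dbo.FullURLs PARTITION 4;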