zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 05:36:26
To all you DB/SQL experts out there - I hope you can shed some light on my problem. I'll try to keep it short.

The DB has a very large table: about 60m rows, 11GB in size, 2.9GB index size. Each day new data has to be merged into the table. I do that in 64MB batches which I bulk insert into a temporary table and then merge into the large target table. Each of those batches contains roughly half a million rows, some of which are new to the DB (resulting in an insert) and some of which already exist (resulting in an update). I do the operation above via a MERGE using 2 bigint indexes (which are the same in the temporary table as well as the large target table).

The issue is simply that the time the merge statement takes to complete grows as the DB grows - and not just by a little bit, it grows A LOT. Starting with an empty DB, a merge as described above takes about 11 seconds. Now, at the size described above, it takes over 10 minutes(!). I'm aware I can't keep insert times constant, but this is a vast drop. I already put the table onto its own filegroup spanning 4 disks - no real difference.

I found out that it seems to be the inserts of new rows within the merge which cause the performance issue. When I run the same batch twice, or any other batch which I know has already been processed, it only takes about 2 minutes (as all records already exist in the DB and therefore only an update needs to be performed).

I also set the table to autogrow (by 10%) so it doesn't constantly grow in 1MB steps. Recovery model is set to simple to keep the log file small. I also checked the query plan: no table scans are done, but the plan looks very complex for a fairly simple merge. The most expensive items are these 3: "Table Spool", "RID Lookup" and "Index Seek".

It's running on a fairly beefy machine (3GHz, 8GB mem). Am I doing something fundamentally wrong, or do I simply have to accept that SQL becomes very slow at this size? (I rather believe it's the former.) I mean, adding 500k new rows to a 60m-row table shouldn't take 11 minutes - it's only 60MB of data to add.

Any ideas? I look forward to your responses.

ZL
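For readers following along: a RID Lookup operator only appears when the base table is a heap (i.e. has no clustered index). One way to check that, and to see index fragmentation at the same time, is a query like this sketch (it uses the FullURLs table name that appears later in the thread):

    -- index_id 0 / type_desc 'HEAP' means the table has no clustered index;
    -- high avg_fragmentation_in_percent on the nonclustered indexes also hurts.
    SELECT i.name, i.index_id, i.type_desc,
           ps.avg_fragmentation_in_percent, ps.page_count
    FROM sys.indexes AS i
    JOIN sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FullURLs'),
                                        NULL, NULL, 'LIMITED') AS ps
        ON ps.object_id = i.object_id
       AND ps.index_id  = i.index_id;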
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 06:14:16
What is the structure of the table - is it very wide? Are the inserts into the int column sequential? Is there a clustered index?

Change autogrow from 10% to a fixed size, or allocate more space to the table up front. Autogrow takes I/O (but isn't the problem here). Is the merge doing an update as well?

Poor planning on your part does not constitute an emergency on my part.
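For the autogrow suggestion, a fixed-increment setting looks roughly like this (the database and logical file names are placeholders - check sys.master_files for the real ones):

    -- Grow the data file in fixed 512 MB steps instead of 10% of current size.
    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_Data, FILEGROWTH = 512MB);

    -- Or pre-allocate space up front so the merge never waits on file growth:
    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_Data, SIZE = 20GB);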
zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 06:45:28
Hey Dataguru, thx for the reply. To your questions: the table has 10 columns - bigint, int, datetime, datetime, varchar(1024), varchar(512), bigint, bigint, bigint, bigint. 2 of the bigints are indexed (individually) and used in the merge clause. The primary key is ID, but it isn't really used. I tried using the 2 bigints as a combined, clustered primary key, but it made no difference to performance, so I figured it's better not to use them as the primary key to avoid too much page splitting going on (correct me if I'm wrong on this one).

Yes, the merge updates all rows which already exist and inserts those which don't exist yet. When I run a batch which consists only of rows which already exist in the large table, the merge updates all those rows, which is A LOT faster (2 mins). However, on a batch where the data is not yet in the table, the merge will insert, and the merge time is huge (10 mins+ and growing with table size).

Running a little bit out of ideas as to what else to try, tbh.
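From that description, the structure is roughly the following (a reconstruction for illustration only - index names, nullability and the identity property are guesses; only the column names that appear later in the thread come from the posts):

    CREATE TABLE dbo.FullURLs
    (
        ID          bigint IDENTITY PRIMARY KEY,  -- "primary key is ID but not really used"
        TimesSeen   int,
        SeenFirst   datetime,
        SeenLast    datetime,
        URL         varchar(1024),
        [File]      varchar(512),
        URLEExLo    bigint,
        URLEExHi    bigint,
        DomainExLo  bigint,
        DomainExHi  bigint
    );

    -- The two merge keys, indexed individually as described:
    CREATE INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs (URLEExLo);
    CREATE INDEX IX_FullURLs_URLEExHi ON dbo.FullURLs (URLEExHi);

    -- A single composite index on both halves of the key would match the MERGE's
    -- ON clause more directly than two separate single-column indexes:
    -- CREATE INDEX IX_FullURLs_URLEEx ON dbo.FullURLs (URLEExLo, URLEExHi);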
Kristen
Test
22859 Posts
Posted - 2011-01-14 : 06:58:41
Are the indexes rebuilt regularly, and do they have recently updated statistics?

Do any of the UPDATE statements actually change nothing (i.e. the data in the import can be the SAME as the existing data)? If so, I would exclude those rows from the UPDATE with a more complex WHERE clause to reduce the number of null-effect updates (which bloat LOG files too).
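The usual way to skip null-effect updates in a MERGE is to add that comparison to the WHEN MATCHED predicate. A minimal sketch with placeholder table and column names (add ISNULL handling if the compared columns are nullable):

    MERGE dbo.TargetTable AS t
    USING #Staging AS s
        ON s.KeyLo = t.KeyLo AND s.KeyHi = t.KeyHi
    -- Only fire the UPDATE when something actually differs:
    WHEN MATCHED AND (t.SomeCol <> s.SomeCol OR t.OtherCol <> s.OtherCol) THEN
        UPDATE SET t.SomeCol  = s.SomeCol,
                   t.OtherCol = s.OtherCol
    WHEN NOT MATCHED THEN
        INSERT (KeyLo, KeyHi, SomeCol, OtherCol)
        VALUES (s.KeyLo, s.KeyHi, s.SomeCol, s.OtherCol);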
Sachin.Nand
2937 Posts
Posted - 2011-01-14 : 07:01:31
Table spools can be very expensive. What about tempdb - is it on its own independent disk? An inefficient tempdb can make spools very expensive.

PBUH
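A quick way to see where tempdb currently lives, and how many files it has (nothing assumed here beyond a standard install):

    -- Spools and other work tables spill into tempdb, so slow or shared
    -- tempdb storage makes them much more expensive.
    SELECT name, physical_name, size * 8 / 1024 AS size_mb
    FROM tempdb.sys.database_files;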
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:13:46
If you have clustered indexes and are updating records, that takes time. So does inserting records into them if they aren't sequentially ordered. If you have nonclustered indexes, try dropping them, doing the merge, then recreating them.

How many of the records are actually being updated - do you use the OUTPUT clause ($action) to check? It might be useful to see. (The indexes would have to be updated as well, of course, and as said, a clustered index would possibly require re-ordering of pages - or something like that, forgive the terminology if wrong.)

I have also noticed that, despite the efficiency of MERGE, I got much better results when I dropped the index, selected ALL records into a temp table and reinserted them. Also fast was a delete of matching rows followed by an insert of everything from stage (with no indexes in place at the time of the delete or insert). I am still working with it, but I can't do much with the final result structure yet (I can drop and recreate the indexes in place now, but can't change them permanently other than that).

I had one job that updates a 29 million row table from a stage table of 12 million rows and inserts new records (I don't like the operation, but had to duplicate it - we are building a proper add/update/delete staging arena shortly). It took forever. After much experimentation with various index combinations, the fastest result was to delete the matching records from the destination, insert all staged records including the new ones, then rebuild the indexes. That takes maybe 8 minutes total every day, as opposed to 45-50 minutes with the merge and leaving the indexes in place - despite a good query plan. I have a nonclustered index on the merge columns, and I tried combinations of covering indexes using INCLUDE and a number of other things.

My experience is trial and error - I am no index or tuning expert like some here - but like you, I couldn't tolerate the poor performance, so I simply went with the time element until I could properly develop a faster merge.

Poor planning on your part does not constitute an emergency on my part.
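For illustration, the delete-matching-then-insert-everything pattern described above looks roughly like this (table, column and index names are placeholders):

    BEGIN TRAN;

    -- Remove the rows that also exist in the staging table...
    DELETE t
    FROM dbo.TargetTable AS t
    JOIN #Staging AS s
        ON s.KeyLo = t.KeyLo AND s.KeyHi = t.KeyHi;

    -- ...then insert the whole staging table, new and existing rows alike.
    INSERT INTO dbo.TargetTable (KeyLo, KeyHi, SomeCol)
    SELECT KeyLo, KeyHi, SomeCol
    FROM #Staging;

    COMMIT;

    -- Finally rebuild (re-enable) whatever nonclustered indexes were disabled
    -- before the load, or recreate them if they were dropped:
    ALTER INDEX ALL ON dbo.TargetTable REBUILD;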
Sachin.Nand
2937 Posts
Posted - 2011-01-14 : 07:28:41
quote: If you have nonclustered indexes, try dropping them, doing the merge, then recreating them..
I would recommend not doing that. It's better if you just disable them. Dropping and recreating them will cause the T-log file to bloat even in the simple recovery model.

Also, of the 8GB of memory, how much is set for SQL Server?

PBUH
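Checking and capping that setting looks like this (the 6144 MB value is just an example for an 8 GB box):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Show the current setting (the default is effectively unlimited):
    EXEC sp_configure 'max server memory (MB)';

    -- Example: leave roughly 2 GB for the OS and everything else.
    EXEC sp_configure 'max server memory (MB)', 6144;
    RECONFIGURE;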
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:31:06
Good point on disabling the indexes... I had thought of that, but figured rebuilding them might be necessary anyway, since a few of the updated columns (in my case) are used in the indexes.

Poor planning on your part does not constitute an emergency on my part.
zerolan
Starting Member
3 Posts
Posted - 2011-01-14 : 07:35:11
Great to hear from you guys.

Kristen: Statistics are set to update automatically. I haven't rebuilt the index yet - the DB is only 2 days old :) The update clause actually does change something: it updates one of the ints and the datetime field of any existing row.

Sachin, thank you, good point regarding tempdb, I shall add another disk for it :)

Btw, here is the query I use:

    merge FullURLs as target
    using #tmpimport as source
        on source.URLExLo = target.URLEExLo and source.URLExHi = target.URLEExHi
    when matched then
        update set SeenLast = CURRENT_TIMESTAMP,
                   target.TimesSeen = target.TimesSeen + source.TimesSeen
    when not matched then
        insert ( TimesSeen, SeenFirst, SeenLast, URL, [File],
                 URLEExLo, URLEExHi, DomainExLo, DomainExHi )
        values ( 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, source.FullURL, source.[File],
                 source.URLExLo, source.URLExHi,
                 fnMD5ExLo( fnMD5Ex(Domain) ), fnMD5ExHi( fnMD5Ex(Domain) ) )
    output $action into #tempresult;

URLEExLo and URLEExHi are the 2 bigints which are individually indexed in the target as well as the temp table.

@dataguru: As you see, I use the OUTPUT clause to log what has been done. On average it's half and half: half of the rows exist already, half are new and need inserting (250k avg each).

Re the idea of reinserting everything: I see this table growing to a few hundred gig in size eventually. I'm not sure how efficient it would be to shuffle all of that into a temp table and back again?
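To split the 2 min vs 10 min observation by action, the #tempresult rows can be tallied like this (a sketch - it assumes #tempresult has a single column, called MergeAction here, that received $action):

    SELECT MergeAction, COUNT(*) AS row_count
    FROM #tempresult
    GROUP BY MergeAction;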
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:37:00
Actually, correct me if I am wrong, but won't disabling the index prevent any updates from being possible? The data remains accessible, but can't be changed... I could be wrong - time to google me some info.

Poor planning on your part does not constitute an emergency on my part.
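For what it's worth, disabling a nonclustered index does not block changes to the table; the index simply stops being maintained (and used) until it is rebuilt. Disabling the clustered index or primary key is different - that takes the whole table offline. A sketch using the hypothetical index name from earlier:

    -- The table stays fully readable and writable while this index is disabled.
    ALTER INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs DISABLE;

    -- ... run the large MERGE here ...

    -- Rebuilding re-enables the index and repopulates it in one pass.
    ALTER INDEX IX_FullURLs_URLEExLo ON dbo.FullURLs REBUILD;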
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts
Posted - 2011-01-14 : 07:40:59
quote: @dataguru: As you see, I use the OUTPUT clause to log what has been done. On average it's half and half: half of the rows exist already, half are new and need inserting (250k avg each). Re the idea of reinserting everything: I see this table growing to a few hundred gig in size eventually. I'm not sure how efficient it would be to shuffle all of that into a temp table and back again?
That might be managed by partitions and switching? It would depend... if it gets that large, it might be useful to explore partitioning - if you can segregate the data somehow to leave some partitions out of the equation. I agree, though: if it doubles in size, the operation might be a bit too hefty. Like I said, I am continuing to work on my scenario as well, to be more effective and future-efficient.

Poor planning on your part does not constitute an emergency on my part.
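A very small sketch of the partition-switch idea (every name and boundary here is illustrative; a real design needs the table built on the partition scheme, aligned indexes, and a staging table with identical structure):

    CREATE PARTITION FUNCTION pfBySeenFirst (datetime)
        AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

    CREATE PARTITION SCHEME psBySeenFirst
        AS PARTITION pfBySeenFirst ALL TO ([PRIMARY]);

    -- With FullURLs created ON psBySeenFirst(SeenFirst), a fully loaded staging
    -- table can be swapped into an empty partition as a metadata-only operation:
    -- ALTER TABLE dbo.FullURLs_Stage SWITCH TO dbo.FullURLs PARTITION 4;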