Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 large table performance

Author  Topic 

matti
Starting Member

4 Posts

Posted - 2010-12-27 : 10:34:14
Hi, I created a table with 10 columns, one column (RecordID INT) as the primary key, with identity and a clustered index. There are about 70 million records in this table now and some performance issues are arising in queries.

This select runs fast: SELECT * FROM MyTable WHERE RecordID = MyID

But this is another story: SELECT * FROM MyTable WHERE RecordID IN (some large set of IDs, e.g. 2,000 values).

The latter one takes up to 120 seconds, which is extremely slow.
I'm confused because I thought this was the simplest possible query and there was no way I could tune it or make it simpler. I'm looking up the clustered index, and as far as I know that is the simplest lookup that can be made. Am I wrong, or is it just caused by such a large number of rows? Is there anything I can do other than partitioning the table?

Thank you for any advice!

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2010-12-27 : 10:49:05
>> Is there anything I can do except using partitions on the table?
What makes you think that will help?

Have you updated statistics and recompiled plans?
Have a look at the query plan - I guess it's doing a scan.
Do you need the SELECT * or can you restrict the number of columns and add a covering index?

Shouldn't make a difference, but try:
SELECT t.*
FROM MyTable t
JOIN (some large set of IDs, e.g. 2,000 values) a
ON t.RecordID = a.RecordID

Also try:
SELECT (some large set of IDs, e.g. 2,000 values)
INTO #a
CREATE UNIQUE CLUSTERED INDEX ix ON #a (RecordID)
SELECT t.*
FROM MyTable t
JOIN #a a
ON t.RecordID = a.RecordID


==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2010-12-27 : 10:59:35
70 million rows isn't all that big.

Post the execution plan of the "slow" query.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

Kristen
Test

22859 Posts

Posted - 2010-12-27 : 12:27:28
"IN (some large set of IDs, e.g 2000 values)."

Frequently slow IME.

Use JOIN (if there is only one matching row) or EXISTS (if there are multiple matching rows), as they are, IME, usually faster than "IN". Make sure the columns used in the JOIN / EXISTS are indexed.
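The EXISTS form described here might look like the following — a minimal sketch, assuming the candidate IDs have already been loaded into a temp table named #ids (a hypothetical name, not from the thread):

```sql
-- Sketch only: #ids is a hypothetical temp table holding the candidate IDs.
SELECT m.*
FROM MyTable AS m
WHERE EXISTS (SELECT 1
              FROM #ids AS i
              WHERE i.RecordID = m.RecordID);
```

Unlike a plain JOIN, EXISTS returns each MyTable row at most once even if #ids happens to contain duplicates.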

matti
Starting Member

4 Posts

Posted - 2010-12-28 : 07:46:21
Thank you all for your comments.

To make it as simple as possible, I'm selecting only the RecordID from MyTable.

At first I used a table variable as the storage table for the set of IDs:

DECLARE @34 TABLE([RecordID] INT PRIMARY KEY CLUSTERED)

and joined it with MyTable. The select took the same time as before, so JOIN is not faster than IN in this case.

Then I created a #tmp table and joined it with MyTable again, and this was a crucial improvement: from 120 seconds at best in the first case, it now takes only 15 seconds. Still, that is too long. I would like to get it down to 5 seconds. Do you think that is possible?

This is the code:

CREATE TABLE #tmp([RecordID] INT PRIMARY KEY CLUSTERED)

INSERT INTO #tmp MyIDs

SELECT o.RecordID
FROM [dbo].[Detail] o
INNER JOIN #tmp ON #tmp.RecordID = o.RecordID


Here is execution plan: http://www.4freeimagehost.com/show.php?i=aadc2754eae3.png

As for the partitions, it is possible to divide my large table into partitions by month, and then the select is supposed to scan only one partition (there would be an additional date condition).
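A monthly partitioning setup of the kind described could be sketched like this (the function and scheme names and the boundary dates are illustrative, not from the thread):

```sql
-- Illustrative only: names and boundary dates are made up.
CREATE PARTITION FUNCTION pfMonthly (DATETIME)
AS RANGE RIGHT FOR VALUES ('2010-11-01', '2010-12-01', '2011-01-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- The table would then be created (or rebuilt) on psMonthly, partitioned
-- by its date column, so that a WHERE clause on that column lets the
-- optimizer eliminate the other partitions.
```

Note that partition elimination only helps if the query filters on the partitioning column; a lookup by RecordID alone would still have to touch every partition.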

Kristen
Test

22859 Posts

Posted - 2010-12-29 : 09:49:26
I can't see that selecting 2,000 rows from a table is going to take 15 seconds ... or 5 seconds for that matter ... provided you have indexes and their statistics are up to date.

Can't tell from the picture you posted, but it looks like you are selecting on the Clustered Index on the main table.

So maybe the time is taken up with something else? Whatever you are doing in the "INSERT INTO #tmp MyIDs" step perhaps?
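One way to confirm where the time actually goes is to have the server report per-statement timings — a quick sketch, assuming the query is run from a tool such as SSMS where the Messages output is visible:

```sql
SET STATISTICS TIME ON;  -- elapsed and CPU time reported per statement
SET STATISTICS IO ON;    -- logical/physical reads reported per table

-- Step 1: populate the temp table (however the ID list is produced)
-- Step 2: the join against the main table

-- The Messages output then shows a separate timing for each statement,
-- rather than a single figure for the whole batch.
```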

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2010-12-29 : 11:00:52
The time isn't on the insert as the clustered index seek on the main table shows 100% of the cost.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

Kristen
Test

22859 Posts

Posted - 2010-12-29 : 13:41:02
Indeed, but the Query Plan shown is only for the SELECT I think?

Doesn't seem reasonable to me that a JOIN between an indexed temp table containing 2,000 records and the clustered INT index on the main table would take 5 seconds or more.

Thus I wonder if the timing includes other steps - such as splitting a CSV or some such ...

(Or the index on the main table is badly fragmented perhaps)

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2010-12-29 : 13:47:57
It's only for the select, but the batch cost is 96%. And since 100% of that 96% batch cost is on the clustered index seek for the main table, the problem lies there.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

Kristen
Test

22859 Posts

Posted - 2010-12-29 : 16:53:38
I missed the 96% of batch cost, thanks.

Still doesn't feel right to me that selecting 2,000 rows on an INT Clustered Index key would take so long. So maybe fragmentation?

e.g.

SELECT *
FROM
(
    SELECT [Database] = db.name,
           [Object ID] = ps.object_id,
           [Name] = OBJECT_NAME(ps.object_id),
           [Index ID] = ps.index_id,
           [Index Name] = I.[name],
           [Partition No.] = ps.partition_number,
           [Fragmentation] = ps.avg_fragmentation_in_percent,
           [Page count] = ps.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(DB_NAME()), NULL, NULL, NULL, N'Limited') AS ps
    JOIN sys.databases AS db
      ON db.database_id = ps.database_id
    JOIN sys.indexes AS I
      ON I.object_id = ps.object_id
     AND I.index_id = ps.index_id
    WHERE ps.index_id > 0
      AND OBJECT_NAME(ps.object_id) = 'MyTableName'
) AS X
ORDER BY [Name], [Index ID]
OPTION (MAXDOP 1)

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2010-12-29 : 20:43:51
% batch cost doesn't translate to time.
It might do, but it's possible to have 90% of the cost and 10% of the time.
Treat it with caution.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2010-12-29 : 22:49:39
Do you have an article regarding this or an example? I've never seen it not correlate to time.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2010-12-29 : 23:03:17
Take a query and run it twice.
Each run will show 50% of the cost.
If there is enough memory for the data to be held, the first query will be doing disk reads, the second won't, and it will be much faster.

All part of the art of building efficient systems. Make sure you do as much processing as possible on data while it is in memory.
I've seen people spend a lot of time optimising things because of their cost when doing so doesn't actually improve the time spent executing the batch. That's why I always say not to rely on the query plan alone - also log the start and end times of statements in batch systems.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2010-12-29 : 23:17:09
Ah yes, well, all things have to be equal as a starting point. Make sure your data is in cache, or wiped from cache, to make accurate comparisons.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2010-12-29 : 23:25:09
That example has one starting point and the assumption that the cache will be empty.

Problem is that if you are running an SP, the data probably won't be in cache when it starts, but as the statements run, data will be read and remain available for a time. If you don't take this into account you can spend time optimising things that aren't necessary, and also miss out on a lot of efficiency to be gained by ordering the processing.

Anyway, cost and time are not related. I worked with someone who was adamant he could predict the time taken by any statement from its cost - even after I showed him timings that disproved it.


==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

matti
Starting Member

4 Posts

Posted - 2011-01-03 : 11:24:15
I always run these statements before I do any measurements on my select:
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS


4% of the batch is for the initial commands:

CREATE TABLE #tmp([RecordID] INT PRIMARY KEY CLUSTERED)
INSERT INTO #tmp MyIDs

and 96% is for the main select (which takes 15 seconds or more). So the time is measured only for the main select.

The fragmentation of the clustered primary key index on RecordID is 4%. The page count for this index is 400,000. Index maintenance is performed every day.

I have tried these "improvements" to no avail:

1.) SELECT o.RecordID FROM [Detail] o INNER JOIN #tmp ON #tmp.RecordID = o.RecordID OPTION (MERGE JOIN)

2.) SELECT #tmp.RecordID FROM #tmp INNER JOIN [Detail] o ON #tmp.RecordID = o.RecordID OPTION (FORCE ORDER)

3.) SELECT o.RecordID FROM [Detail] o WITH (NOLOCK) INNER JOIN #tmp ON #tmp.RecordID = o.RecordID

4.) I applied a cursor to the #tmp table and, for each record from the cursor, tried to find the RecordID in the [Detail] table (I tried this only because the select executes really fast when the condition is WHERE RecordID = SomeRecordID). This last shot was really the slowest one.

To sum up all your comments:

- 70 million rows is not that much for a properly designed table (which I think this is, because there is only the one clustered primary key index)
- this table has a clustered primary key index on RecordID, and there is no apparent reason for my select to take so long given that the only column selected and used in the condition/join clause is RecordID
- table indexes are rebuilt every day, and my measurements are performed with the clustered index on RecordID fragmented at most 4%


Now I'm going to check the situation at the hardware level to see if there is any bottleneck. Since this database is heavily used, with server I/O rising to 10 MB/s to serve data to many applications, this could be an issue. I will post the results this week.
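File-level I/O latency can be checked from the DMVs before resorting to hardware counters — a sketch using sys.dm_io_virtual_file_stats, available in SQL Server 2005 and later:

```sql
-- Average stall (ms) per read and per write for each database file.
-- High read stalls on the data files would point at the disk subsystem.
SELECT DB_NAME(vfs.database_id) AS [Database],
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_stall_ms DESC;
```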

Thank you all for your comments

Kristen
Test

22859 Posts

Posted - 2011-01-03 : 11:34:27
I think the hardware is unlikely to be the limiting factor. But ... I share your conclusions ... and thus checking the hardware would at least allow it to be excluded as a possible cause (or, indeed, for it to BECOME the cause)

matti
Starting Member

4 Posts

Posted - 2011-01-20 : 11:55:19
First of all, thank you all for your advice - I learned a lot from your comments.

The problem I had was a hardware issue. There is a large number of databases on the server. I always checked processor and memory usage and only occasionally took a look at disk I/O. Moreover, disk I/O was not critically high (up to 20 MB/sec at peak), but the bottleneck was the large number of random accesses. Once we found this out, we moved some databases to a new hard drive and everything works fine.

   
