R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-06-30 : 12:22:19
Hi

I'm not sure if this is posted in the correct forum topic area; please advise if not and I'll re-post. I'm just looking for some general advice.

I have an online application that uses SQL 2005 and ASP.NET. Some database tables get a lot more use than others, and today I was asked how "full" these tables need to get before the server starts struggling to process requests. Currently the database tables are very small (fewer than 10,000 rows), so to test the question I duplicated a table into a test database and inserted 15,000,000 rows into it (making sure to duplicate the table's index as well).

Suddenly, SELECT queries that take a fraction of a second on the main database table now take over a minute on the test data. This is okay for me, but not for my clients, who have no patience and would think the application has stopped working.

So I'm looking for advice, please, on what DBAs generally do when databases become very large. I know I could upgrade the server to a more powerful spec, but I've heard buzzwords like "data warehousing" and wondered whether that is relevant here? Please feel free to share your methods with me so I can go away and research them. Thank you in advance!
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-06-30 : 15:59:15
quote: Originally posted by tkizer
15 million rows in a table is small these days. You are probably missing indexes on your table if your queries are performing slowly.

Hi, and thanks for your replies. Here is the DDL:

CREATE TABLE [dbo].[tbl_Test](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [sID] [int] NOT NULL,      -- foreign key
    [sgID] [int] NOT NULL,     -- foreign key
    [userID] [int] NOT NULL,   -- foreign key
    [EventID] [int] NOT NULL,  -- foreign key
    [newSkillLevel] [numeric](5, 2) NULL,
    [targetScore] [numeric](5, 2) NOT NULL,
    [comments] [nvarchar](max) NULL,
    [adminComments] [nvarchar](max) NULL,
    [dateAttained] [datetime] NOT NULL DEFAULT (getutcdate()),
    CONSTRAINT [PK_tbl_Test_1] PRIMARY KEY CLUSTERED ([ID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

And here are the two indexes that exist:

ALTER TABLE [dbo].[tbl_Test] ADD CONSTRAINT [PK_tbl_Test_1] PRIMARY KEY CLUSTERED ([ID] ASC)
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
          IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
          ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [sgID_sID_uID] ON [dbo].[tbl_Test]
    ([sgID] ASC, [sID] ASC, [userID] ASC)
    WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
          IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

Please note that I haven't copied the primary key tables/columns from my original database, because it would have taken a long time to recreate the primary key data for each, so this is a simplified example.

The SELECT statements vary quite a lot depending on what data is needed, but the WHERE clause always has the following form:

WHERE (sID = @sID)
  AND (sgID = @sgID)
  AND (userID = @userID OR @userID IS NULL)
  AND (eventID = @eventID OR @eventID IS NULL)

Because the database is small, I haven't rebuilt the indexes. When I inserted the test rows, would I have needed to rebuild afterwards? I would have thought the index would be in a good state directly afterwards.

It gives me some reassurance that 15 million rows is relatively small! I look forward to reading more of your advice!
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-06-30 : 16:05:53
http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/

--
Gail Shaw
SQL Server MVP
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-01 : 04:05:24
quote: Originally posted by tkizer
Are there joins in your query referencing the foreign key tables? If so, are those columns indexed? SQL Server does not add indexes on foreign keys, though it does for primary keys. I make it a standard practice to add an index to all of my foreign keys.

That's good advice; I was not aware that indexes are not added on foreign keys. My code doesn't always join to the primary key columns/tables, but my index on the above table does include all the foreign key columns.

quote: Originally posted by GilaMonster
http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/

Hi Gail. Among the replies to your post there is this one:

quote: Instead of using this form:

WHERE (key1 = @key1 OR @key1 IS NULL)
  AND (key2 = @key2 OR @key2 IS NULL)
  AND (key3 = @key3 OR @key3 IS NULL)

you should consider using this:

WHERE (key1 = @key1 AND @key1 IS NOT NULL)
   OR (key2 = @key2 AND @key2 IS NOT NULL)
   OR (key3 = @key3 AND @key3 IS NOT NULL)

...I have found in my testing that this solution performs very well; indexes ARE used, performance is great, and dynamic code is not required.

Is this something that you agree with? I'd have to try this code on my test table, but I assume it produces the same output using fewer resources? Or is the code in your original post actually better:

WHERE (ProductID = CASE WHEN @Product IS NULL THEN ProductID ELSE @Product END) AND ...
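[Editor's note: the two WHERE forms quoted above are not logically equivalent, as the thread goes on to discover. A quick sketch outside SQL Server makes the difference concrete; this uses Python's sqlite3 purely as an illustration, with a made-up two-column table standing in for the keys:]

```python
import sqlite3

# Hypothetical table standing in for key1/key2 in the quoted comment.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (key1 INTEGER, key2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (1, 99), (3, 2)])

params = {"k1": 1, "k2": 2}  # both parameters supplied

# Original catch-all form: EVERY predicate must hold (or its parameter be NULL).
original = con.execute(
    "SELECT key1, key2 FROM t"
    " WHERE (key1 = :k1 OR :k1 IS NULL)"
    "   AND (key2 = :k2 OR :k2 IS NULL)"
    " ORDER BY key1, key2", params).fetchall()

# Commenter's rewrite: a row qualifies if ANY supplied predicate holds.
rewrite = con.execute(
    "SELECT key1, key2 FROM t"
    " WHERE (key1 = :k1 AND :k1 IS NOT NULL)"
    "    OR (key2 = :k2 AND :k2 IS NOT NULL)"
    " ORDER BY key1, key2", params).fetchall()

print(original)  # [(1, 2)] -- only the row matching BOTH keys
print(rewrite)   # [(1, 2), (1, 99), (3, 2)] -- any row matching EITHER key
```

The AND of ORs demands every supplied filter match; the OR of ANDs accepts a row that matches any one of them, so the rewrite returns a superset whenever more than one parameter is supplied.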
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-07-01 : 07:05:06
Read the comments below that one as well. It's not something I've tested (no time recently), but a later commenter noted that the revised query form had different behaviour from the original.

As for the CASE in my original post, I explained right below how that behaves: table scan. A table scan on 15 million rows is not going to be pretty.

--
Gail Shaw
SQL Server MVP
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-01 : 09:24:28
Hi Gail

Sorry if I've missed something really obvious, but I don't follow your answer. I appreciate that CASE is not good, as I read in your article. However, the article comment I quoted above doesn't use CASE.

quote: Read the comments below that as well.

The only comments below seem to refer to dynamic SQL, which I'm not using. I'm just loading up my laptop with the test database on it, to try out the alternative approach I quoted above.

Taking my post a step further: suppose I use the correct indexes and get the SELECTs tuned up, but the table grows to, say, 100,000,000 rows and things are still too slow. Provided I haven't retired and am living in paradise, what is the next step in improving efficiency? Can the database be spread across multiple machines? Again, I'm sorry for my lack of knowledge here; this is the first time I've addressed the subject of huge database management.
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-01 : 09:43:56
Actually, I just ran the queries and they produce totally different results, which makes perfect sense once I actually read the code.
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-01 : 10:13:39
One thing is confusing me slightly. When I run a stored procedure against my table (which only searches the three foreign key (indexed) columns), the execution plan shows:

Nested Loops (Inner Join)    Cost: 22%
Index Scan  [dbo].[tbl_Test]  Cost: 78%
Key Lookup  [dbo].[tbl_Test]  Cost: 0%

Surely if the stored procedure is only searching indexed columns, shouldn't it use an Index Seek rather than a Scan? Also, why is an Inner Join mentioned when there is no join in the stored procedure?

The query took 55 seconds to search 15,000,000 rows, and I'd like to reduce that if possible without a hardware upgrade!
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-07-01 : 10:30:09
It's because of the catch-all query. SQL cannot do an index seek with a query of that form; an index scan is the best it can do.

The join is between the data retrieved from the index scan and the data (columns needed but not present in the index used) retrieved by the key lookup.

Remove/replace those (@Variable IS NULL OR Column = @Variable) predicates and you'll almost certainly get much better performance. Queries of that form DO NOT perform well.

--
Gail Shaw
SQL Server MVP
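[Editor's note: this behaviour is not specific to SQL Server. Even SQLite's planner gives up on a seek for the catch-all shape, which can be shown with EXPLAIN QUERY PLAN. A sketch in Python's sqlite3, with a cut-down stand-in for tbl_Test; SQLite's optimizer differs from SQL Server's, so treat this only as an analogy:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_Test"
            " (ID INTEGER PRIMARY KEY, sID INT, sgID INT, userID INT)")
con.execute("CREATE INDEX sgID_sID_uID ON tbl_Test (sgID, sID, userID)")

def plan(sql, params=()):
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql,
                                                   params))

# Plain equality predicates: the planner can seek ("SEARCH ... USING INDEX").
seek_plan = plan("SELECT * FROM tbl_Test"
                 " WHERE sgID = 1 AND sID = 2 AND userID = 3")

# Catch-all form: the planner falls back to a full scan ("SCAN").
catchall_plan = plan("SELECT * FROM tbl_Test"
                     " WHERE (sgID = :a OR :a IS NULL)"
                     "   AND (sID  = :b OR :b IS NULL)"
                     "   AND (userID = :c OR :c IS NULL)",
                     {"a": None, "b": None, "c": None})

print(seek_plan)      # SEARCH ... USING ... INDEX sgID_sID_uID ...
print(catchall_plan)  # SCAN ...
```

The plan is compiled before parameter values are known, so the `OR :a IS NULL` branch forces the engine to prepare for the all-rows case on every execution, which is exactly Gail's point about SQL Server's cached plans.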
Kristen
Test
22859 Posts |
Posted - 2010-07-01 : 10:46:41
One option is to do something like:

Is @Param1 used? Make a first cut on that, getting PKs into @TempTable.
If @Param1 IS NULL and @Param2 IS NOT NULL, then make a first cut on that instead.

Choose @Param1 (i.e. the one you process first for the first cut) to be the one that will give the most efficient query, balanced against how frequently it is likely to be used. No sense having something "first" that grabs 90% of the database every time!

Then join @TempTable to the others, using the catch-all query to refine to exactly what you want.

You may have to move the various first-cut queries into separate sprocs to get the query plans cached without interference from the variety of paths that could be exercised in an all-in-one sproc.

But, having typed that, we are moving towards dynamic (but parametrised) SQL in SQL 2008, now that we can have EXECUTE permission on sprocs and use certificates to allow dynamic SQL to be used without SELECT permissions being required on the individual tables.
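[Editor's note: Kristen's two-step idea can be sketched outside T-SQL. A toy version in Python's sqlite3, with the table and values invented for illustration; in SQL Server the first cut would go into a real temp table or table variable:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_Test"
            " (ID INTEGER PRIMARY KEY, sID INT, sgID INT, userID INT)")
con.execute("CREATE INDEX ix_sID ON tbl_Test (sID)")
con.executemany("INSERT INTO tbl_Test (sID, sgID, userID) VALUES (?, ?, ?)",
                [(500, 600, 700), (500, 600, 999), (1, 2, 3)])

sID, sgID, userID = 500, 600, None  # @userID not supplied on this call

# First cut: a plain, seekable predicate on the parameter we know is present,
# capturing just the PKs into a temp table.
con.execute("CREATE TEMP TABLE firstcut AS"
            " SELECT ID FROM tbl_Test WHERE sID = ?", (sID,))

# Refine: the catch-all predicates now only sift the small first-cut set,
# instead of forcing a scan of the whole table.
rows = con.execute(
    "SELECT t.ID, t.sID, t.sgID, t.userID"
    " FROM tbl_Test t JOIN firstcut f ON f.ID = t.ID"
    " WHERE (:sgID IS NULL OR t.sgID = :sgID)"
    "   AND (:userID IS NULL OR t.userID = :userID)"
    " ORDER BY t.ID",
    {"sgID": sgID, "userID": userID}).fetchall()

print(rows)  # both sID=500 rows; userID was NULL so it filtered nothing
```

The catch-all cost is paid only against the handful of first-cut rows, which is why choosing a selective first-cut parameter matters so much.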
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-01 : 11:06:57
Thank you both for your comments. I wish I had learned this advice a long time ago.

Within the table, the [eventID] column can only ever be one of seven possible integer values. Therefore, should I use that first in the WHERE clause to eliminate the most data straight away?

Could you possibly show me a working example of your advice, based on this initial sproc that I used? I realise this query is not suitable for my large table as currently written:

CREATE PROC myTest
    @sID int = NULL,
    @sgID int = NULL,
    @userID int = NULL,
    @eventID int = NULL
AS
    SELECT *
    FROM [myTest]
    WHERE (eventID = @eventID OR @eventID IS NULL)
      AND (sID = @sID OR @sID IS NULL)
      AND (sgID = @sgID OR @sgID IS NULL)
      AND (userID = @userID OR @userID IS NULL)

Thanks again. This is really a good learning exercise for me.
Kristen
Test
22859 Posts |
Posted - 2010-07-01 : 11:37:50
The WHERE clause won't care what comes first: the query optimiser will look at the whole thing and then decide "how" to tackle it, i.e. create a query plan.

If eventID has only seven distinct values it will be a poor index candidate (not highly selective), so it is likely to cause a table scan.
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-07-01 : 13:04:44
I strongly recommend that you look at the dynamic SQL alternative on my blog. It will be fast regardless of which parameters are passed and which aren't, and it can effectively use indexes.

--
Gail Shaw
SQL Server MVP
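[Editor's note: the shape of that dynamic approach is to assemble only the predicates whose parameters were actually supplied, while keeping placeholders so the query stays parameterised. A sketch in Python's sqlite3, used purely as a stand-in for the parameterised sp_executesql pattern on Gail's blog; the table and values are invented:]

```python
import sqlite3

def search(con, sID=None, sgID=None, userID=None, eventID=None):
    """Build the WHERE clause from only the parameters actually supplied,
    so the engine compiles a simple, seekable query for each combination."""
    clauses, args = [], []
    for col, val in (("sID", sID), ("sgID", sgID),
                     ("userID", userID), ("eventID", eventID)):
        if val is not None:
            clauses.append(col + " = ?")  # placeholders keep it parameterised
            args.append(val)
    sql = "SELECT ID, sID, sgID, userID, eventID FROM tbl_Test"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return con.execute(sql + " ORDER BY ID", args).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_Test (ID INTEGER PRIMARY KEY,"
            " sID INT, sgID INT, userID INT, eventID INT)")
con.executemany("INSERT INTO tbl_Test (sID, sgID, userID, eventID)"
                " VALUES (?, ?, ?, ?)",
                [(500, 600, 700, 1), (500, 600, 701, 2), (9, 9, 9, 1)])

print(search(con, sID=500, sgID=600))  # the two sID=500/sgID=600 rows
print(search(con, eventID=1))          # the two rows with eventID = 1
```

Because each distinct combination of supplied parameters produces a distinct SQL string, each gets its own cached plan, avoiding the one-plan-fits-all problem of the catch-all form.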
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-02 : 12:14:59
Having removed eventID from the index, rebuilt the index, and finally altered the sproc so that it now only searches sID, sgID and userID (the indexed columns), the execution time has fallen to 31 seconds.

The execution plan still shows INDEX SCAN at 72% when the proc is used as follows:

EXEC testProc @sID = 500, @sgID = 500, @userID = 500

Could you please advise why an index SEEK is not being used instead, when only indexed columns are being searched?
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-07-02 : 13:00:39
As I stated earlier:

quote: It's because of the catch-all query. SQL cannot index seek with a query of that form. Index scan is the best that it can do.

--
Gail Shaw
SQL Server MVP
R
Constraint Violating Yak Guru
328 Posts |
Posted - 2010-07-04 : 03:57:14
Hi Gail

I incorrectly presumed that when values were passed to the sproc for each parameter, the query optimizer would interpret this as:

WHERE (sID = 500) AND (sgID = 600) AND (userID = 700)

Thanks for clarifying, and for sharing your knowledge with me. It is much appreciated.
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-07-04 : 05:37:22
No. The optimiser cannot rewrite the proc. Remember, execution plans are cached and reused for later executions of the proc, so any plan generated has to be safe for reuse. That is, it has to return correct results no matter what parameters are passed. Hence it must allow for the possibility of the parameters being NULL, because on future executions that may be the case.

--
Gail Shaw
SQL Server MVP
Kristen
Test
22859 Posts |
Posted - 2010-07-04 : 05:51:26
"WHERE (sID = 500) AND (sgID = 600) AND (userID = 700)"

Hence my suggestion to check for parameters being non-null:

IF @sID IS NOT NULL
BEGIN
    SELECT ...
    FROM ...
    WHERE (sID = @sID)
      AND (@sgID IS NULL OR sgID = @sgID)
      AND (@userID IS NULL OR userID = @userID)
END
ELSE IF @sgID IS NOT NULL
BEGIN
    SELECT ...
    FROM ...
    WHERE sgID = @sgID
      AND (@userID IS NULL OR userID = @userID)
END
ELSE
    ...

...although, as I suggested, each of those queries may need to be a separate sub-sproc call so that each gets its own query plan. Put the tests for the most common / most efficient query first.