
 Query optimization with a clustered index


sqllamer
Starting Member

9 Posts

Posted - 2010-11-11 : 07:08:00
Hello, friends!

I have the following problem:
I need to optimize the query without changing the structure of the tables, without changing existing indexes, and without adding new ones.

Query:
SELECT
T.Account_Id,
T.Date,
T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'


Structure:
-- Table of Accounts - 1000 records
CREATE TABLE Accounts
(
Account_Id int PRIMARY KEY,
Name varchar(255) NOT NULL
)

-- Table of Transactions - 10,000,000 records
CREATE TABLE Transactions
(
Transaction_Id int PRIMARY KEY NONCLUSTERED,
Account_Id int REFERENCES Accounts (Account_Id) NOT NULL,
Date datetime NOT NULL,
Value numeric(18, 2) NOT NULL
)
CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)


I do not understand how else this query can be optimized! The execution plan contains only a Clustered Index Scan.

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-11 : 07:57:42
A clustered index scan is a table scan. It's about the worst operator that you can have for accessing data. It's reading the entire 10 million rows to get the data you want.
You need to add an index to optimise this query. There is no way to do anything without a new index on (Date) INCLUDE (Value).

If you put that index on, your clustered index (table) scan will be replaced by a nonclustered index seek that just reads the rows needed.
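The suggested index could be sketched as follows (an illustrative T-SQL sketch, not taken from the post; the index name is an assumption):

```sql
-- Hypothetical covering index for the posted query: the Date range
-- becomes a seek, and Value (plus the clustered keys Account_Id and
-- Date, which SQL Server carries into every nonclustered index row)
-- covers the whole SELECT list.
CREATE NONCLUSTERED INDEX IX_Transactions_Date
ON Transactions (Date)
INCLUDE (Value);
```

Because Account_Id and Date are the clustering keys, they are present in this index automatically, so the query should compile to a covered nonclustered index seek with no lookups.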

--
Gail Shaw
SQL Server MVP

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2010-11-11 : 12:42:51
or reverse the order of the fields

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id)


TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2010-11-11 : 13:24:08
>>Necessary to optimize the query, without changing the structure of the tables without changing existing or adding new indexes

With no changes allowed to the structure there is nothing to optimize. The query is so simple there is no better way to write it.

Be One with the Optimizer
TG

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-11 : 15:59:58
quote:
Originally posted by AndrewMurphy

or reverse the order of the fields

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id)


I would strongly recommend against that.

With the order reversed, any queries that filter just on AccountID can no longer seek on that index. Any queries that filter on AccountID and have a ranged seek on Date can no longer optimally seek on that index.

It would optimise the posted query, but it may seriously hinder any number of other queries.

--
Gail Shaw
SQL Server MVP

namman
Constraint Violating Yak Guru

285 Posts

Posted - 2010-11-11 : 21:36:30
quote:

or reverse the order of the fields

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id)


I would strongly recommend against that.

With the order reversed, any queries that filter just on accountID can no longer seek on that index. Any queries that filter on Accountid and have a ranged seek on date can no longer optimally seek on that index.



Hi GilaMonster,
I agree with you on the first part, but disagree on the second. For me, in that case, the index is still used for a seek; whether it seeks efficiently or not depends on the nature of the data.

Could you explain in more detail? Thanks.

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-12 : 03:32:59
If the predicate on Date is an inequality and the predicate on AccountID is an equality then:

If the index is (AccountID, Date), SQL can make a single seek to the desired AccountID and the start of the date range, read until the end of the date range, and know that it has all the rows it needs; no additional filters are necessary.

If the index is (Date, AccountID), SQL cannot seek on the AccountID at all; it has to seek to the beginning of the date range, read until the end of the date range, then apply a secondary filter to eliminate the unwanted AccountIDs.

General rule for indexes - equality columns before inequality in the key.

http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/
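The two cases can be sketched side by side (an illustrative T-SQL sketch, not from the post; the index names are assumptions):

```sql
-- Case 1: equality column first. For
--   WHERE Account_Id = @id AND Date >= @from AND Date < @to
-- the whole predicate becomes one composite range seek.
CREATE NONCLUSTERED INDEX IX_Acct_Date ON Transactions (Account_Id, Date);

-- Case 2: inequality column first. Only the Date range can be a seek;
-- Account_Id = @id becomes a residual filter applied to every row
-- read out of that range.
CREATE NONCLUSTERED INDEX IX_Date_Acct ON Transactions (Date, Account_Id);
```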

--
Gail Shaw
SQL Server MVP

namman
Constraint Violating Yak Guru

285 Posts

Posted - 2010-11-13 : 14:53:26
quote:

If the index is (Date, AccountID), SQL cannot seek on the AccountID at all; it has to seek to the beginning of the date range, read until the end of the date range, then apply a secondary filter to eliminate the unwanted AccountIDs.

General rule for indexes - equality columns before inequality in the key.

http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/



It is about the data, not equality/inequality.

Example:

SELECT
T.Account_Id,
T.Date,
T.Value
FROM Transactions T
WHERE Account_Id = 10 AND T.Date >= '20100101' AND T.Date < '20100201'

Index:
1 - CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)
2 - CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Date, Account_Id)

If Account_Id = 10 returns 2 rows, and T.Date >= '20100101' AND T.Date < '20100201' returns 1000 rows, then index 1 is better for that query.

But if Account_Id = 10 returns 1000 rows, and T.Date >= '20100101' AND T.Date < '20100201' returns 2 rows, then index 2 is better for that query.

Composite indexes are a bit confusing with clustered and non-clustered indexes, especially regarding how the data is stored (see BOL).

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-14 : 04:42:12
No, it is not. I'm guessing you didn't read my blog post.

Regardless of what the selectivity is for the individual columns, with Index 1, SQL can do a double column seek and go straight to the 2 rows it wants. It is not limited to first seeking on the leading column, then seeking on the second.

With Index 2, the best SQL can do is fetch the rows that match by date (because date is an inequality) and then do a secondary filter. If the date predicate returns only 2 rows, then that will be as efficient (but not more) than Index 1.

--
Gail Shaw
SQL Server MVP

sqllamer
Starting Member

9 Posts

Posted - 2010-11-14 : 05:24:00
I'm sorry, I'm a little behind the discussion, but I would like to return to the original problem.
As you may remember, I had to optimize the query without changing existing indexes or adding new ones. Therefore, the option of changing the order of fields in the index does not suit me. Adding a new index does not fit either.
quote:
A clustered index scan is a table scan. It's about the worst operator that you can have for accessing data.

As I understand it, if a Clustered Index Scan is the worst operator in this case, then the best should be a Clustered Index Seek, should it not?
Thus, even though the query is really very simple, I need to make the optimizer use an index seek instead of a scan. And since IX_Transactions is a composite index whose first field is Account_Id, Account_Id needs to appear in the search condition.
What do you think: would the optimizer behave as I described if I rewrote the query as follows?
SELECT
T.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'

Now I am preparing the test data to test this theory, so I will soon be able to answer for sure...

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-14 : 07:00:48
No better at all. It'll likely run slower because there's now a second (unnecessary) table involved.
That won't result in a seek. SQL can still see that the only filter on the Transactions table is the one on Date (the one on Account_Id is not a filter, as it doesn't limit the rows).

Without a new index, you are not going to be able to optimise this at all. It needs an index on Date, and the query can't get any simpler than it already is.
Why can you not add another index? It's not going to break anything (changing the order of the cluster, however, might).

--
Gail Shaw
SQL Server MVP

sqllamer
Starting Member

9 Posts

Posted - 2010-11-14 : 07:36:37
quote:
Originally posted by GilaMonster

No better at all. It'll likely run slower because there's now a second (unnecessary) table involved.
That won't result in a seek. SQL can still see that the only filter on the Transactions table is the one on Date (the one on Account_Id is not a filter, as it doesn't limit the rows).

Without a new index, you are not going to be able to optimise this at all. It needs an index on Date, and the query can't get any simpler than it already is.
Why can you not add another index? It's not going to break anything (changing the order of the cluster, however, might).



You're right, but only partly.
Indeed, in the form I wrote earlier, nothing changed in the query speed or in the execution plan.
However, my approach still proved right. I even found two ways:
First:
SELECT
A.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'


Second:
SET FORCEPLAN ON
SELECT
T.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'


I think the change does not need explaining. Or rather, perhaps you can explain it to me.

Kristen
Test

22859 Posts

Posted - 2010-11-14 : 07:45:44
Is there a Foreign Key between the Accounts and Transactions tables?

If so, don't bother with the Accounts table in your query at all. That may mean the query will be covered by the clustered index (although you are using a date range and thus not using the first key field, so SQL has to scan the whole index - but hopefully it will cover the query from the index alone).

sqllamer
Starting Member

9 Posts

Posted - 2010-11-14 : 07:45:53
Basic query execution plan
  |--Parallelism(Gather Streams)
|--Clustered Index Scan(OBJECT:([test2].[dbo].[Transactions].[IX_Transactions] AS [T]), WHERE:([test2].[dbo].[Transactions].[Date] as [T].[Date]>='2010-01-01 00:00:00.000' AND [test2].[dbo].[Transactions].[Date] as [T].[Date]<'2010-02-01 00:00:00.000'))


Execution plan of the modified query
  |--Nested Loops(Inner Join, OUTER REFERENCES:([A].[Account_Id], [Expr1005]) WITH UNORDERED PREFETCH)
|--Clustered Index Scan(OBJECT:([test2].[dbo].[Accounts].[PK__Accounts__7C8480AE] AS [A]))
|--Clustered Index Seek(OBJECT:([test2].[dbo].[Transactions].[IX_Transactions] AS [T]), SEEK:([T].[Account_Id]=[test2].[dbo].[Accounts].[Account_Id] as [A].[Account_Id] AND [T].[Date] >= '2010-02-01 00:00:00.000' AND [T].[Date] < '2010-03-01 00:00:00.000') ORDERED FORWARD)


At the same time, query performance increased more than fivefold.

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-14 : 09:14:58
Run
SET STATISTICS IO ON
SET STATISTICS TIME ON
Then run your original query and the one where you added the accounts table and post the contents of the messages pane for both.


Do you know what FORCEPLAN does? Don't use hints if you don't know their complete effect and downsides. Forceplan is a very dangerous setting to use.

--
Gail Shaw
SQL Server MVP

namman
Constraint Violating Yak Guru

285 Posts

Posted - 2010-11-14 : 14:44:59
Hi GilaMonster,

I come back to that topic later ..... may be in another thread.

Hi sqllammer,
quote:

I need to optimize the query without changing the structure of the tables, without changing existing indexes, and without adding new ones.

Query:

SELECT
T.Account_Id,
T.Date,
T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'



Structure:

-- Table of Accounts - 1000 records
CREATE TABLE Accounts
(
Account_Id int PRIMARY KEY,
Name varchar(255) NOT NULL
)

-- Table of Transactions - 10,000,000 records
CREATE TABLE Transactions
(
Transaction_Id int PRIMARY KEY NONCLUSTERED,
Account_Id int REFERENCES Accounts (Account_Id) NOT NULL,
Date datetime NOT NULL,
Value numeric(18, 2) NOT NULL
)
CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)



I don't think we can optimize the query under that restriction. I see some developers try to play around optimizing similar queries; you are not alone. Let's look at your work-around query:
quote:

SELECT
T.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'

CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)



I think your main point is to get Account_Id involved in the WHERE (or ON) clause so that an index seek will be used (even though the Accounts table may not be necessary).

If so, it still does not work. Having the indexed column in the condition is the first requirement for an index seek, but it is not the only one.

Your query is similar to this one:
quote:

SELECT
T.Account_Id,
T.Date,
T.Value
FROM Transactions T
WHERE T.Account_Id = T.Account_Id
AND T.Date >= '20100201' AND T.Date < '20100301'


It uses a scan because there is no real filter on Account_Id (the first condition accepts every Account_Id).

The optimizer weighs many factors when deciding between a seek and a scan. One of the most important is the data itself. In general, if the predicate on the indexed column matches less than 20% of the rows in the table, the optimizer uses a seek; otherwise it uses a scan.

So is there any case in which the index seek is used, as you expected, with your work-around query?

Yes, but it implies a bad or even wrong database design: the Accounts table having far fewer rows than the Transactions table. If the Accounts table has 2 rows and the Transactions table has 1000 rows, the index seek may be used in your query. However, that is not a good design.

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-11-15 : 00:18:37
quote:
Originally posted by namman

In general, if value of indexed column is less than 20% of all in the table, it uses seek, otherwise it uses scan.


Between 0.1% and 1%, and only if it's a non-covering index.

--
Gail Shaw
SQL Server MVP

sqllamer
Starting Member

9 Posts

Posted - 2010-11-15 : 03:45:08
quote:
Originally posted by GilaMonster

Run
SET STATISTICS IO ON
SET STATISTICS TIME ON
Then run your original query and the one where you added the accounts table and post the contents of the messages pane for both.


The result of the basic query with SET STATISTICS IO ON and SET STATISTICS TIME ON (first run):
SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 108 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(45346 row(s) affected)
Table 'Transactions'. Scan count 3, logical reads 71444, physical reads 4, read-ahead reads 64636, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2140 ms, elapsed time = 201597 ms.


One of the subsequent runs (characteristic time about 2 seconds; actual times varied from 2 to 60 seconds):
SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

(45346 row(s) affected)
Table 'Transactions'. Scan count 3, logical reads 71444, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2202 ms, elapsed time = 2477 ms.


The result of the modified query with SET STATISTICS IO ON and SET STATISTICS TIME ON (first run):
SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 224 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

(45443 row(s) affected)
Table 'Transactions'. Scan count 1000, logical reads 5313, physical reads 70, read-ahead reads 1973, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 1, logical reads 7, physical reads 1, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 10616 ms.


One of the subsequent runs (characteristic time about 1 second):
SQL Server parse and compile time: 
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

(45443 row(s) affected)
Table 'Transactions'. Scan count 1000, logical reads 3512, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 1647 ms.



quote:
Originally posted by GilaMonster

Do you know what FORCEPLAN does? Don't use hints if you don't know their complete effect and downsides. Forceplan is a very dangerous setting to use.

I do not pretend to have a deep understanding of the SET FORCEPLAN instruction, but I think that in this case it is quite appropriate.
MSDN:
quote:
When FORCEPLAN is set to ON, the Microsoft SQL Server query optimizer processes a join in the same order as the tables appear in the FROM clause of a query. In addition, setting FORCEPLAN to ON forces the use of a nested loop join unless other types of joins are required to construct a plan for the query, or they are requested with join hints or query hints.

But I still prefer the first option:
SELECT A.Account_Id, ...


Kristen
Test

22859 Posts

Posted - 2010-11-15 : 04:47:29
Note that you will see a difference on repeat runs because the Query Plan and the Data are cached.

Just in case you are not familiar with the output from STATISTICS: the key things to look at are "Scan count" and "logical reads" - these are the same for all runs of the same query (unless indexes, statistics, the number of data rows, or some such change).

"physical reads" and "read-ahead reads" etc. show what physically happened; in your first example there is a figure present, while these are zero in the second example - so the data was already in cache.

When I use STATISTICS I experiment (e.g. by creating different indexes, or changing the query) to see if I can reduce "Scan count" and/or "logical reads", as they are a good indicator of whether an experiment was more, or less, efficient.
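That experiment loop could be sketched like this (an illustrative T-SQL pattern, not taken from the post):

```sql
-- Hypothetical harness: measure a candidate query, change one thing
-- (an index or the query text), re-run, and compare the counters.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201';

-- Read "Scan count" and "logical reads" from the Messages pane;
-- fewer logical reads on the same data generally means a better plan.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

On a dev server, DBCC DROPCLEANBUFFERS between runs would also expose cold-cache physical reads, though that is not something to run in production.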

So basically that boils down to your first query being:

Table 'Transactions'. Scan count 3, logical reads 71444

and your second being

Table 'Transactions'. Scan count 1000, logical reads 3512
Table 'Accounts'. Scan count 1, logical reads 7

Strikes me as an awful lot of SCANS on [Transactions] table.

Why is [Accounts] not appearing in the STATISTICS for the first query? Has SQL optimised it out of the query altogether?

(Maybe I have misunderstood from the thread exactly which queries you are now using for "First" and "Second" tests)

Maybe it indicates, as per my earlier suggestion, that (assuming the Foreign Key is in place) the [Accounts] table is not required in this query at all.

sqllamer
Starting Member

9 Posts

Posted - 2010-11-15 : 05:56:16
quote:
Originally posted by namman

I think your main point is to get Account_Id involve in the where (or on) clause so that the index seek will be used (even the table Account may not be necessary).

In general, you're right. More precisely, I wanted to get the Account_Id field into the search conditions in order to enable a seek on the clustered index.

quote:
Originally posted by namman

If so, it still does not work. Indexed column involving in the condition is the first req to use index seek, but it is not only one.

The version you gave (and which I quoted earlier) indeed does not really work. But note that I also gave a pair of modified versions of this query which forced the optimizer to use a seek:
SELECT
A.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'

In this case nested loops are used, so the condition T.Account_Id = A.Account_Id actually acts as a filter on the Transactions table, selecting on each iteration of the Accounts scan the records with one specific Account_Id. The idea is that scanning 1000 Accounts records, each paired with a seek over the ordered date range (we certainly cannot get away from that range), is cheaper than scanning 10,000,000 records.

quote:
Originally posted by namman

That is the Account table has much less rows than ones on Transaction table. If Account table has 2 rows and Transaction table has 1000 rows, the index seek may be used in your query. However, that is not a good design.

Remember the initial conditions: the Accounts table has 1000 records, the Transactions table has 10 million records and is growing. These sizes reflect the physical reality, so why do you think this is not a good design?

sqllamer
Starting Member

9 Posts

Posted - 2010-11-15 : 07:35:46
To Kristen:

Query for the "first" test:
SELECT
T.Account_Id,
T.Date,
T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'


Query for the "second" test:
SELECT
A.Account_Id,
T.Date,
T.Value
FROM
Accounts A
inner join Transactions T on
T.Account_Id = A.Account_Id
and T.Date >= '20100201' AND T.Date < '20100301'


Accordingly, the Accounts table is not included in the statistics of the first query because that table is not used there. In that sense, all is right.

The number of scans on the Transactions table corresponds to the number of records in the Accounts table, because nested loops are used.

A foreign key between Transactions(Account_Id) and Accounts(Account_Id) is in place. The Accounts table is really not necessary for obtaining the required results. As I said earlier, it is used in the query only to steer the optimizer toward a more optimal (in my opinion) execution plan. That is what makes it a work-around query.
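The work-around boils down to this pattern (a sketch of the queries above; the LOOP JOIN variant is an alternative added here for comparison, not something proposed in the thread):

```sql
-- Joining to the small Accounts table (1000 rows) invites a nested
-- loop plan: one clustered-index seek on (Account_Id, Date) per
-- account, instead of one scan over 10,000,000 Transactions rows.
SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON  T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301';

-- A join hint states the same intent without the session-wide
-- SET FORCEPLAN setting:
SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER LOOP JOIN Transactions T
    ON  T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301';
```

Whether 1000 seeks beat one scan depends on the width of the date range; the statistics posted above suggest they do for a one-month window.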
