Author |
Topic |
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-11 : 07:08:00
|
Hello, friends! I have the following problem: I need to optimize a query without changing the structure of the tables and without changing existing or adding new indexes.

Query:

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'

Structure:

-- Table of Accounts - 1,000 records
CREATE TABLE Accounts
(
    Account_Id int PRIMARY KEY,
    Name varchar(255) NOT NULL
)

-- Table of Transactions - 10,000,000 records
CREATE TABLE Transactions
(
    Transaction_Id int PRIMARY KEY NONCLUSTERED,
    Account_Id int REFERENCES Accounts (Account_Id) NOT NULL,
    Date datetime NOT NULL,
    Value numeric(18, 2) NOT NULL
)

CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)

I do not understand how else this query can be optimized! The execution plan contains only a Clustered Index Scan. |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-11 : 07:57:42
|
A clustered index scan is a table scan. It's about the worst operator that you can have for accessing data. It's reading the entire 10 million rows to get the data you want.

You need to add an index to optimise this query. There is no way to do anything without a new index on (Date) INCLUDE (Value). If you put that index on, your clustered index (table) scan will be replaced by a nonclustered index seek that reads just the rows needed.

--
Gail Shaw
SQL Server MVP |
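A minimal sketch of the index Gail describes (the index name is illustrative):

```sql
-- Nonclustered index with Date as the key and Value as an included column.
-- On SQL Server, every nonclustered index row also carries the clustering
-- key (Account_Id, Date), so this index covers the posted query and the
-- clustered index scan can become a nonclustered index range seek.
CREATE NONCLUSTERED INDEX IX_Transactions_Date
    ON Transactions (Date)
    INCLUDE (Value);
```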
 |
|
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts |
Posted - 2010-11-11 : 12:42:51
|
or reverse the order of the fields:

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id) |
 |
|
TG
Master Smack Fu Yak Hacker
6065 Posts |
Posted - 2010-11-11 : 13:24:08
|
>> Necessary to optimize the query, without changing the structure of the tables without changing existing or adding new indexes

With no changes allowed to the structure there is nothing to optimize. The query is so simple there is no better way to write it.

Be One with the Optimizer
TG |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-11 : 15:59:58
|
quote: Originally posted by AndrewMurphy

or reverse the order of the fields:

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id)

I would strongly not recommend that. With the order reversed, any queries that filter just on Account_Id can no longer seek on that index. Any queries that filter on Account_Id and have a ranged seek on Date can no longer optimally seek on that index.

It would optimise the posted query, but it may seriously hinder any number of other queries.

--
Gail Shaw
SQL Server MVP |
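An example of the kind of query that would suffer (the Account_Id value is hypothetical):

```sql
-- With the original clustered index (Account_Id, Date) this resolves with a
-- seek on the leading key column. With the reversed key (Date, Account_Id),
-- Account_Id is no longer the leading column, so the whole index must be
-- scanned instead.
SELECT T.Date, T.Value
FROM Transactions T
WHERE T.Account_Id = 42;
```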
 |
|
namman
Constraint Violating Yak Guru
285 Posts |
Posted - 2010-11-11 : 21:36:30
|
quote: or reverse the order of the fields:

CREATE CLUSTERED INDEX [IX_Transactions_NEW] ON Transactions (Date, Account_Id)

I would strongly not recommend that. With the order reversed, any queries that filter just on Account_Id can no longer seek on that index. Any queries that filter on Account_Id and have a ranged seek on Date can no longer optimally seek on that index.

Hi GilaMonster, I agree with you on the blue part, but disagree on the red part. For me, in that case, the index can still be used for a seek; whether it seeks efficiently or not depends on the nature of the data. Could you explain in more detail? Thanks |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-12 : 03:32:59
|
If the predicate on Date is an inequality and the predicate on AccountID is an equality, then:

If the index is (AccountID, Date), SQL can make a single seek to the desired AccountID and the start of the date range, read until the end of the date range, and know that it has all the rows it needs; no additional filters are necessary.

If the index is (Date, AccountID), SQL cannot seek on the AccountID at all. It has to seek to the beginning of the date range, read until the end of the date range, then do a secondary filter to eliminate the unwanted AccountIDs.

General rule for indexes: equality columns before inequality columns in the key.

http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/

--
Gail Shaw
SQL Server MVP |
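The rule can be sketched against the posted schema (index names are illustrative):

```sql
-- Predicates: equality on Account_Id, inequality (range) on Date, e.g.
--   WHERE Account_Id = 10 AND Date >= '20100101' AND Date < '20100201'

-- Equality column first: one two-column seek lands on the start of the
-- date range for that account; no residual filter is needed.
CREATE INDEX IX_Eq_First ON Transactions (Account_Id, Date);

-- Inequality column first: the seek can only position on the start of the
-- date range; every row in the range is read, and Account_Id is applied
-- as a secondary (residual) filter afterwards.
CREATE INDEX IX_Ineq_First ON Transactions (Date, Account_Id);
```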
 |
|
namman
Constraint Violating Yak Guru
285 Posts |
Posted - 2010-11-13 : 14:53:26
|
quote: If the index is Date, AccountID then SQL cannot seek on the accountID at all, it has to just seek to the beginning of the date range, read until the end of the date range, then do a secondary filter to eliminate the unwanted accountIDs

General rule for indexes - equality columns before inequality in the key.
http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/

It is about the data, not equality/inequality.

Example:

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE Account_Id = 10 AND T.Date >= '20100101' AND T.Date < '20100201'

Indexes:

1 - CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)
2 - CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Date, Account_Id)

If Account_Id = 10 returns 2 rows, and the Date range returns 1000 rows, then index 1 is better for that query. But if Account_Id = 10 returns 1000 rows, and the Date range returns 2 rows, then index 2 is better for that query.

A composite index is a bit confusing with clustered and non-clustered indexes, especially regarding how the data is stored... (BOL) |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-14 : 04:42:12
|
No, it is not. I'm guessing you didn't read my blog post.

Regardless of the selectivity of the individual columns, with index 1 SQL can do a double-column seek and go straight to the 2 rows it wants. It is not limited to first seeking on the leading column, then seeking on the second. With index 2, the best SQL can do is fetch the rows that match by date (because Date is an inequality) and then do a secondary filter. If the date predicate returns only 2 rows, then that will be as efficient as (but not more efficient than) index 1.

--
Gail Shaw
SQL Server MVP |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-14 : 05:24:00
|
I'm sorry, I'm a little behind the discussion, but I would like to return to the original problem. As you may remember, I have to optimize the query without changing existing indexes or adding new ones. Therefore, the option of changing the order of the fields in the index does not suit me, and adding a new index also does not fit.

quote: A clustered index scan is a table scan. It's about the worst operator that you can have for accessing data.

As I understand it, if a Clustered Index Scan is the worst operator in this case, the best should be a Clustered Index Seek, shouldn't it? Thus, even though the query is really very simple, I need to make the optimizer use an index seek instead of a scan. But since IX_Transactions is a composite index whose first field is Account_Id, Account_Id needs to appear in the search condition.

Do you think the optimizer would behave as I described if I rewrote the query as follows?

SELECT T.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

I am now preparing test data to check this theory, so soon I will be able to answer for sure... |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-14 : 07:00:48
|
No better at all. It'll likely run slower because there's now a second (unnecessary) table involved. That won't result in a seek. SQL can still see that the only filter on the Transactions table is the one on Date (the one on Account_Id is not a filter, as it doesn't limit the rows).

Without a new index, you are not going to be able to optimise this at all. It needs an index on Date, and the query can't get any simpler than it already is.

Why can you not add another index? It's not going to break anything (changing the order of the cluster, however, might).

--
Gail Shaw
SQL Server MVP |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-14 : 07:36:37
|
quote: Originally posted by GilaMonster

No better at all. It'll likely run slower because there's now a second (unnecessary) table involved. That won't result in a seek. SQL can still see that the only filter on the Transactions table is the one on Date (the one on Account_Id is not a filter, as it doesn't limit the rows). Without a new index, you are not going to be able to optimise this at all. It needs an index on Date, and the query can't get any simpler than it already is. Why can you not add another index? It's not going to break anything (changing the order of the cluster, however, might).

You're right, but only partly. Indeed, in the form I wrote it earlier, nothing changed in the query speed or in the execution plan. However, my idea still proved right. I even found two ways:

First:

SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

Second:

SET FORCEPLAN ON
SELECT T.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

I think I do not need to explain the change. Rather, someone can explain it to me. |
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-14 : 07:45:44
|
Is there a Foreign Key between the Accounts and Transactions tables? If so, don't bother with the Accounts table in your query at all. That may mean the query will be covered by the Clustered Index (although you are using a date range and thus not using the first key field, so SQL has to scan the whole index; but hopefully it will cover the query from the index alone). |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-14 : 07:45:53
|
Basic query execution plan:

|--Parallelism(Gather Streams)
     |--Clustered Index Scan(OBJECT:([test2].[dbo].[Transactions].[IX_Transactions] AS [T]), WHERE:([test2].[dbo].[Transactions].[Date] as [T].[Date]>='2010-01-01 00:00:00.000' AND [test2].[dbo].[Transactions].[Date] as [T].[Date]<'2010-02-01 00:00:00.000'))

Execution plan of the modified query:

|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[Account_Id], [Expr1005]) WITH UNORDERED PREFETCH)
     |--Clustered Index Scan(OBJECT:([test2].[dbo].[Accounts].[PK__Accounts__7C8480AE] AS [A]))
     |--Clustered Index Seek(OBJECT:([test2].[dbo].[Transactions].[IX_Transactions] AS [T]), SEEK:([T].[Account_Id]=[test2].[dbo].[Accounts].[Account_Id] as [A].[Account_Id] AND [T].[Date] >= '2010-02-01 00:00:00.000' AND [T].[Date] < '2010-03-01 00:00:00.000') ORDERED FORWARD)

At the same time, query performance has increased more than fivefold. |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-14 : 09:14:58
|
Run:

SET STATISTICS IO ON
SET STATISTICS TIME ON

Then run your original query and the one where you added the Accounts table, and post the contents of the Messages pane for both.

Do you know what FORCEPLAN does? Don't use hints if you don't know their complete effect and downsides. FORCEPLAN is a very dangerous setting to use.

--
Gail Shaw
SQL Server MVP |
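The measurement script Gail asks for might look like this (queries taken from earlier in the thread):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Original query
SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201';

-- Rewritten query with the Accounts join
SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Compare "Scan count" and "logical reads" in the Messages pane.
```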
 |
|
namman
Constraint Violating Yak Guru
285 Posts |
Posted - 2010-11-14 : 14:44:59
|
Hi GilaMonster, I will come back to that topic later, maybe in another thread.

Hi sqllamer,

quote: Necessary to optimize the query, without changing the structure of the tables without changing existing or adding new indexes.

Query:

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'

Structure:

-- Table of Accounts - 1,000 records
CREATE TABLE Accounts
(
    Account_Id int PRIMARY KEY,
    Name varchar(255) NOT NULL
)

-- Table of Transactions - 10,000,000 records
CREATE TABLE Transactions
(
    Transaction_Id int PRIMARY KEY NONCLUSTERED,
    Account_Id int REFERENCES Accounts (Account_Id) NOT NULL,
    Date datetime NOT NULL,
    Value numeric(18, 2) NOT NULL
)

CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)

I don't think we can optimize the query under the red condition. I have seen some developers try to play around to optimize similar queries; you are not alone. Let's look at your work-around query:

quote:

SELECT T.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

CREATE CLUSTERED INDEX [IX_Transactions] ON Transactions (Account_Id, Date)

I think your main point is to get Account_Id involved in the WHERE (or ON) clause so that an index seek will be used (even though the Accounts table may not be necessary). If so, it still does not work. An indexed column appearing in the condition is the first requirement for an index seek, but it is not the only one. Your query is similar to this one:

quote:

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Account_Id = T.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

It uses a scan because I do not apply any real filter on Account_Id (the first condition accepts all Account_Id values). The optimizer bases its choice of seek or scan on many factors. One of the most important is the data: in general, if the value of the indexed column matches less than 20% of all rows in the table, it uses a seek; otherwise it uses a scan.

So is there any case in which an index seek is used as you expected with your work-around query? Yes, but it requires a bad or even wrong database design: the Accounts table having far fewer rows than the Transactions table. If the Accounts table has 2 rows and the Transactions table has 1000 rows, the index seek may be used in your query. However, that is not a good design. |
 |
|
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-11-15 : 00:18:37
|
quote: Originally posted by namman

In general, if the value of the indexed column matches less than 20% of all rows in the table, it uses a seek; otherwise it uses a scan.
Between 0.1% and 1%, and only if it's a non-covering index.--Gail ShawSQL Server MVP |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-15 : 03:45:08
|
quote: Originally posted by GilaMonster

Run SET STATISTICS IO ON and SET STATISTICS TIME ON. Then run your original query and the one where you added the accounts table and post the contents of the messages pane for both.

The result of the basic query with SET STATISTICS IO ON and SET STATISTICS TIME ON (first run):

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 108 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.
(45346 row(s) affected)
Table 'Transactions'. Scan count 3, logical reads 71444, physical reads 4, read-ahead reads 64636, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 2140 ms, elapsed time = 201597 ms.

One of the subsequent runs (typical time 2 seconds, actual variation from 2 to 60 seconds):

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.
(45346 row(s) affected)
Table 'Transactions'. Scan count 3, logical reads 71444, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 2202 ms, elapsed time = 2477 ms.

The result of the modified query with SET STATISTICS IO ON and SET STATISTICS TIME ON (first run):

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 224 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
(45443 row(s) affected)
Table 'Transactions'. Scan count 1000, logical reads 5313, physical reads 70, read-ahead reads 1973, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 1, logical reads 7, physical reads 1, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 62 ms, elapsed time = 10616 ms.

One of the subsequent runs (typical time 1 second):

SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
(45443 row(s) affected)
Table 'Transactions'. Scan count 1000, logical reads 3512, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Accounts'. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 94 ms, elapsed time = 1647 ms.

quote: Originally posted by GilaMonster

Do you know what FORCEPLAN does? Don't use hints if you don't know their complete effect and downsides. FORCEPLAN is a very dangerous setting to use.

I do not pretend to have a deep understanding of SET FORCEPLAN, but I think that in this case it is quite appropriate.

MSDN:

quote: When FORCEPLAN is set to ON, the Microsoft SQL Server query optimizer processes a join in the same order as the tables appear in the FROM clause of a query. In addition, setting FORCEPLAN to ON forces the use of a nested loop join unless other types of joins are required to construct a plan for the query, or they are requested with join hints or query hints.

But I still prefer the first option: SELECT A.Account_Id, ... |
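If a forced join order is really wanted, it can be scoped to a single statement with the FORCE ORDER query hint instead of the session-level SET FORCEPLAN; a sketch (the same caveats about second-guessing the optimizer apply):

```sql
SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'
OPTION (FORCE ORDER);  -- FROM-clause join order is preserved for this query only
```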
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-15 : 04:47:29
|
Note that you will see a difference on repeat runs because the query plan and the data are cached.

Just in case you are not familiar with the output from STATISTICS: the key things to look at are "Scan count" and "logical reads". These are the same for all runs of the same query (unless indexes, statistics, the number of data rows, or some such changes). "Physical reads" and "read-ahead reads" etc. show what physically happened; in your first example there is a figure present, and these are zero in the second example, so the data was already in cache.

When I use STATISTICS I experiment (e.g. by creating different indexes, or changing the query) to see if I can reduce "Scan count" and/or "logical reads", as they are a good indicator of whether an experiment was more, or less, efficient.

So basically that boils down to your first query being:

Table 'Transactions'. Scan count 3, logical reads 71444

and your second being:

Table 'Transactions'. Scan count 1000, logical reads 3512
Table 'Accounts'. Scan count 1, logical reads 7

Strikes me as an awful lot of SCANs on the [Transactions] table.

Why is [Accounts] not appearing in the STATISTICS for the first query? Has SQL optimised it out of the query altogether? (Maybe I have misunderstood from the thread exactly which queries you are now using for the "first" and "second" tests.)

Maybe it indicates, as per my earlier suggestion, that (assuming the Foreign Key is in place) the [Accounts] table is not required in this query at all. |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-15 : 05:56:16
|
quote: Originally posted by namman

I think your main point is to get Account_Id involve in the where (or on) clause so that the index seek will be used (even the table Account may not be necessary).

In general, you're right. More precisely, I wanted to add the Account_Id field to the search conditions to enable a seek on the clustered index.

quote: Originally posted by namman

If so, it still does not work. Indexed column involving in the condition is the first req to use index seek, but it is not only one.

The one variant that you gave (and which I quoted earlier) really does not work. But note that I also gave a pair of modified versions of this query, which forced the optimizer to use a seek:

SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

In this case nested loops are used, so the condition T.Account_Id = A.Account_Id actually is a filter for the Transactions table, selecting on each iteration of the Accounts scan the records with a specific Account_Id. The idea is that 1000 seeks over an ordered range of dates (we certainly cannot get away from the date range) are cheaper than a scan of 10,000,000 records.

quote: Originally posted by namman

That is the Account table has much less rows than ones on Transaction table. If Account table has 2 rows and Transaction table has 1000 rows, the index seek may be used in your query. However, that is not a good design.

Remember the initial conditions: the Accounts table has 1000 records, and the Transactions table has 10 million records and is growing. This reflects the physical reality, so why do you think it is not a good design? |
 |
|
sqllamer
Starting Member
9 Posts |
Posted - 2010-11-15 : 07:35:46
|
2 Kristen:

Query for the "first" test:

SELECT T.Account_Id, T.Date, T.Value
FROM Transactions T
WHERE T.Date >= '20100101' AND T.Date < '20100201'

Query for the "second" test:

SELECT A.Account_Id, T.Date, T.Value
FROM Accounts A
INNER JOIN Transactions T
    ON T.Account_Id = A.Account_Id
    AND T.Date >= '20100201' AND T.Date < '20100301'

Accordingly, the Accounts table is not included in the statistics of the first query because that query does not use the table. In this sense, all is right. The number of scans on the Transactions table corresponds to the number of records in the Accounts table, because nested loops are used.

The foreign key between Transactions(Account_Id) and Accounts(Account_Id) is in place. The Accounts table is really not necessary to obtain the required results. As I said earlier, it is used in the query only to steer the optimizer toward a more optimal execution plan (in my opinion); that is what the work-around query does. |
 |
|