Author |
Topic |
Kristen
Test
22859 Posts |
Posted - 2010-11-24 : 18:10:00
|
I can't put my finger on the cause of this. We are getting predictable / repeatable bottlenecks. When the server gets to a particular level of activity the application slows to a crawl. As soon as activity drops below the threshold the application immediately picks up.

The server is running at about 50% CPU and all the disk activity and queue lengths look OK - i.e. the server does not look maxed out.

I have taken a snapshot of sp_who2 using TC's script http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=141197#553872235

status               processes
-------------------- -----------
BACKGROUND           14
RUNNABLE             17
sleeping             122
SUSPENDED            82

Blockers:

blkBy      Count
---------- -----------
64         876
1586       15
.          156348
1537       1606
28627      11

Seems like a lot of processes being blocked ...

(Not entirely sure that the data is "sound" as the output from sp_who2 seems to have duplicates in the SPID column, and I'm not sure that should be the case?)

I'd appreciate suggestions on where to put my efforts into analysing the bottleneck. |
|
X002548
Not Just a Number
15586 Posts |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-25 : 01:33:01
|
"And you did you DBCC Checks I'm sure"

CHECKDB? Yes, run regularly.

"And you ran a trace...what did that say?"

OK, a trace of what?

"And it's all sproc only access correct?"

Yup. No users have access. Web application. Everything is through sprocs, no dynamic queries from users etc.

"Did you ever install manual sproc execution logging?"

Dunno what that is, sorry.

"Are you using WebSphere?"

Nope. Dunno what that is either. Classic ASP application. Been using it for years. New hardware though - upgraded to SQL 2008 a few months ago, and this is the first "busy period" we have had with it.

"I say my guess is the middleware?"

We don't have any ... just application and database - unless you mean the database transport layer? We are still using the original driver; we can't get the SQLCLI10 native driver to work in the connect string - a number of calls just fail, no error message, nothing useful. |
 |
|
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts |
Posted - 2010-11-25 : 11:56:08
|
"Did you ever install manual sproc execution logging?"

I'm guessing at what's meant here....but inside myproc:

log myproc start
run myproc code
log myproc end

would seem to fit the rough description. From this you should be able to see average run time at key time periods. A home-made profiler, per se. |
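For what it's worth, a minimal sketch of that pattern in T-SQL. All the object names here (usp_MyProc, dbo.SprocLog and its columns) are invented for illustration:

```sql
-- Hypothetical start/end logging wrapper inside a sproc.
CREATE PROCEDURE dbo.usp_MyProc
AS
BEGIN
    DECLARE @LogID int, @Start datetime;
    SET @Start = GETDATE();

    -- log myproc start
    INSERT dbo.SprocLog (SprocName, StartTime) VALUES ('usp_MyProc', @Start);
    SET @LogID = SCOPE_IDENTITY();

    -- run myproc code
    -- ... the proc's real work goes here ...

    -- log myproc end
    UPDATE dbo.SprocLog
    SET    EndTime   = GETDATE(),
           ElapsedMs = DATEDIFF(ms, @Start, GETDATE())
    WHERE  LogID = @LogID;
END
```

Average run time per sproc per time period then falls out of a simple GROUP BY over dbo.SprocLog.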
 |
|
nigelrivett
Master Smack Fu Yak Hacker
3385 Posts |
Posted - 2010-11-25 : 12:07:03
|
Duplicates in the spid column are multiple threads - parallel processing. The main thread will wait for the completion before continuing.

If you have processes being blocked then just follow the chain to find out what is blocking. You can use dbcc inputbuffer to find out what is being called.

You can also use http://www.nigelrivett.net/SQLAdmin/sp_nrSpidByStatus.html to give a snapshot of what is running and the commands executed. It has something which is too small for v2008 - you will get a truncation error but it's easy to find and increase the size.

Blocking is easy. For performance, look for things with rapidly increasing disk IO.

If you have long-running SPs - i.e. more than a couple of minutes - then it's probably easy to find the issue. It'll probably be one or two that need fixing.

You could also try using the data collector, which will give the top SPs by resource usage.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy. |
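On SQL 2008 the blocking chain can also be pulled straight from the DMVs, which saves running DBCC INPUTBUFFER once per spid. A sketch only - adjust to taste:

```sql
-- Who is blocked, who is blocking them, and the statement being run.
SELECT  r.session_id,
        r.blocking_session_id,
        r.wait_type,
        r.wait_time,          -- ms spent waiting so far
        t.text AS current_sql
FROM    sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE   r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;

-- Or the classic, for a single spid (e.g. the head blocker):
DBCC INPUTBUFFER (64);
```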
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-25 : 12:56:27
|
quote: Originally posted by AndrewMurphy

"Did you ever install manual sproc execution logging?"

I'm guessing at what's meant here....but inside myproc:

log myproc start
run myproc code
log myproc end
Ah, OK. Yes we do have that, and yes I can see the stats.

What I think I am seeing is contention for something, and then the queue outside the door rapidly grows longer; impatient people press REFRESH for web pages which are already half way through a query - which exacerbates the problem (we do have a client-disconnected test, but the current sproc has to finish before disconnection is detected and the rest of the page-building is then aborted).

"Duplicates in the spid column are multiple threads - parallel processing. The main thread will wait for the completion before continuing."

That explains that, thanks. Didn't think of parallel processing.

"If you have processes being blocked then just follow the chain to find out what is blocking."

Yup, done that. Let's say we have 100 sprocs all wanting to insert a row in TableA - they are all blocked by one-of-their-own doing the same thing. The next time I look at the report it's a different set of 100 sprocs all wanting to insert into TableB ...

We have a count of deadlocks - we see one or two a day at most. (We are using READ_COMMITTED_SNAPSHOT on the database.)

"You can use dbcc inputbuffer to find out what is being called."

The script from T.C. does that bit, which is very helpful. I'll take a look at yours too, thanks Nigel.

"Blocking is easy. For performance look for things with rapidly increasing disk io."

What sort of things? We see TCP connections increase to over 500 and then immediately the site is slow. The two go together like hand-in-glove. Probably TCP connections are increasing as more people try to use the service, but also, as the server takes longer to process anything, more people are in the queue outside the door (and some are pressing Refresh of course).

How do I review disk I/O to tie it to a particular culprit?

We have an option to display a Server Busy page. Our Server Busy page only denies new sessions, not existing ones, so we can let people IN the store finish up and leave. However, Server Busy is currently only trigger-able by CPU, Transactions or Connections - and there is no sensible setting for those which we can see on this server. The server is not maxed out. So really I could do with a better metric (ideally the render-time for our most common dynamic web page, but just calculating that is going to add some strain).

"If you have long running SPs - i.e. more than a couple of minutes then it's probably easy to find the issue. It'll probably be one or two that need fixing."

Sadly we don't have any low-hanging fruit like that. We have sprocs that ordinarily perform in 100ms or less; it's when the site gets busy that we get into trouble. At that point everything is responding slowly. (Of course there may well be queries that can be optimised that I can find, but we've done lots of work in that area over the years, so there are no easy ones left to find - barring any recent cock-ups of course!)

Slightly strange (and possibly relevant) observation:

We have 4 web sites connecting to the same database. They are basically just different branding of the same products. Some brands have more / fewer products, but there is significant overlap. One store, which has only 30% of the orders of the biggest store, consistently runs 50% faster.

Imagine an sproc which says "Get products for Catalogue-A". Obviously asking for "Catalogue-B" is a different query, but as it is parametrised I presume it's using the same cached query plan. These two brands have different numbers of products in their catalogues, so that could be the cause.

But! For an sproc that does something like "Get the content for Page-X" this will be identical for the two brands (different data, but pulled by PK, and the same number of matching rows). Yet still one brand's sproc execution is running 50% faster than the other.

It's almost as though SQL can get the data fast enough but is then throttled - as though IIS cannot take the data quickly enough, so SQL has to sit-and-wait a bit. The IIS box is running all the sites, and it's barely touching 10% CPU (it's a 16-CPU box, and plenty large for its job, but its spare capacity is part of the DR plan).

We've looked to see if we can find any sort of throttling in IIS, and in the data transport layer, and not found anything significant. We did connect an additional server + website to the same database to see if a second machine + website would reduce the bottleneck (the plan being to use virtual web-servers and load-balance if that was the case) but we didn't see an improvement (which is contrary, in my mind, to the fact that some of our brands run faster than others on what is, essentially, the same concurrent queries).

"You could also try using the data collector which will give the top SPs by resource usage."

Is that going to be OK to install and activate on a busy SQL box? (I've not used it before) |
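On "how do I tie disk I/O to a culprit": one rough way, without a trace, is the plan-cache stats DMV. The counters are cumulative since each plan was cached, so it is a league table rather than a point-in-time view - compare two snapshots for deltas:

```sql
-- Top cached statements by logical reads; a rough I/O league table.
SELECT TOP (20)
       qs.total_logical_reads,
       qs.total_physical_reads,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       t.text AS batch_text
FROM   sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) t
ORDER BY qs.total_logical_reads DESC;
```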
 |
|
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2010-11-26 : 00:31:31
|
What do the wait stats show?

--http://www.mssqltips.com/tip.asp?tip=1949
WITH Waits AS
(
    SELECT wait_type,
           wait_time_ms / 1000. AS wait_time_s,
           100. * wait_time_ms / SUM(wait_time_ms) OVER() AS pct,
           ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS rn
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE',
        'SLEEP_TASK', 'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR',
        'CLR_AUTO_EVENT', 'CLR_MANUAL_EVENT')
    -- filter out additional irrelevant waits
)
SELECT W1.wait_type,
       CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
       CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
       CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/
Subscribe to my blog |
 |
|
spirit1
Cybernetic Yak Master
11752 Posts |
Posted - 2010-11-26 : 03:44:14
|
run Adam's sp_whoIsActive and see what that gives you.
http://sqlblog.com/files/folders/release/entry29675.aspx

A few possibilities to look for:

Your Catalog-A and Catalog-B queries probably have different cardinalities, so the exec plan for one is not good for the other. If CatA has good seeks, maybe for CatB a scan is better because CatB has more random IO.

Another thing that might be the problem is sorts, spools, etc in tempdb. Poll sys.dm_io_virtual_file_stats every 5 minutes and then check the delta values for the io_stall_<everything> columns.

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.9 out!
SQL Server MVP |
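A minimal version of that polling query. The counters are cumulative since instance start, so capture it twice ~5 minutes apart and diff the numbers:

```sql
-- Per-file I/O stalls; big io_stall deltas point at the hot file/database.
SELECT  DB_NAME(vfs.database_id) AS db_name,
        vfs.file_id,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) vfs
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```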
 |
|
elwoos
Master Smack Fu Yak Hacker
2052 Posts |
Posted - 2010-11-26 : 04:49:18
|
You say you've recently gone to SQL 2008; has anything else changed?

-----------
I used to think I wasn't a morning person but it never got better in the afternoon |
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-26 : 07:39:16
|
Sorry ... been putting out fires ...

"What do the wait stats show?"

Server Time = 2010-11-26 12:09:39.147 (in case we want to run one later and compare)

wait_type                      wait_time_s    pct            running_pct
------------------------------ -------------- -------------- --------------
LCK_M_X                        3416012.94     70.02          70.02
LATCH_EX                       215333.98      4.41           74.43
REQUEST_FOR_DEADLOCK_SEARCH    139669.75      2.86           77.29
XE_TIMER_EVENT                 139653.52      2.86           80.16
FT_IFTS_SCHEDULER_IDLE_WAIT    138488.93      2.84           83.00
CHECKPOINT_QUEUE               137820.17      2.82           85.82
LOGMGR_QUEUE                   130338.31      2.67           88.49
XE_DISPATCHER_WAIT             100262.57      2.06           90.55
CXPACKET                       82422.65       1.69           92.24
BROKER_TO_FLUSH                69828.45       1.43           93.67
SOS_SCHEDULER_YIELD            63183.67       1.30           94.96
WRITELOG                       61643.69       1.26           96.23

(12 row(s) affected)

"run Adam's sp_whoIsActive and see what that gives you."

Thanks. At the moment nothing significant. The system will get busy later on, and I'll recheck.

"your catalog A and catalog B queries probably have different cardinalities ... if CatA has good seeks, maybe for CatB a scan is better because CatB has more random IO."

Yes, this is true. But there are other queries where this is not the case - different data for each brand, but in my opinion all the data is same-effort to reach, and thus should be same-time for both brands:

SELECT Col1, Col2, ...
FROM TableA
     JOIN TableB
        ON B.ID = A.ID
       AND B.Brand = @MyBrand
WHERE A.ID = @MyID

Identical number of "child" records in TableB for Brand-A and Brand-B. Data is almost static. TableA = 1,646 rows, TableB = 4,277 rows (there are more tables involved, but they are of that sort of size). Last time I looked, Brand-A was doing this in half the time of Brand-B when the server was over the tipping-point.

"You say you've recently gone to SQL 2008, has anything else changed?"

Not that recently. 6 months, say. The application has had bits added month-by-month, with very few changes in the core systems. Customers are way up. The amount of rubbish traffic from what the client perceives as being "useful affiliates" is way up - which results in lots of rubbish page views: one-page, or only-a-couple-of-page, views per session. The client has built far more complex HTML pages than normal - so the amount of HTML we shuffle around (in our CMS system) is certainly twice what it normally is. Even so the pages are not huge, but it does increase the size of the CMS recordsets and the memory-shuffling in the IIS box.

I cannot fathom what the traffic is. We are getting web page render times of 30 seconds or so at peak times, but still get 200 orders an hour or more (that's completed orders, not attempts!) ... Customers have more patience than me.

We have a Server Busy page. We turned that on and throttled the customers (not literally!). It prevents new sessions, but allows existing sessions. We let SQL's user count fall to around 150 and released the Server Busy page. Within 60 - 90 seconds it had grown to 650 users. I just don't believe these are all real users ... but there is no evidence from weblog analytics that there is any adverse traffic, or a snow-storm of spiders and BOTs.

We don't have a sensible way of automatically activating the Server Busy page - we are working on that today - we want to trigger it when page (best) render time gets to 10 seconds (say). Previously we have used CPU (@@cpu_busy), Users (SQLServer:General Statistics / User Connections) and Transactions/second (SQLServer:Databases / Transactions/sec) but those metrics are not working this time around, as the server is not under 100% load, and the numbers skyrocket so quickly when the Server Busy page turns off that we don't catch it quickly enough.

We are also considering changing the Server Busy page to let through anyone who has a persistent cookie (indicating they have been before). That would keep out all the crud traffic.

Client right-hand-and-left-hand tends to make things worse. Their marketing people sent out an email-shot at peak time last night on all 4 store-brands simultaneously, with no staged release. So I guess several hundred thousand - could easily have been half a million, even - customers all got the email at around 7PM. |
 |
|
spirit1
Cybernetic Yak Master
11752 Posts |
Posted - 2010-11-26 : 07:59:10
|
Since your largest wait type is LCK_M_X you might have a problem with fully qualified sproc names being looked up:
http://support.microsoft.com/kb/263889

Do your sprocs recompile every time because of temp table use? Try table variables.

How many CPUs and tempdb files do you have?

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.9 out!
SQL Server MVP |
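To make the two suggestions concrete - a sketch only; the sproc, table, and variable names (usp_GetBasket, dbo.BasketRows, @SessionID) are invented, and the fragment assumes it sits inside a proc:

```sql
-- Owner-qualifying the call avoids the extra cache lookup / COMPILE lock
-- probe described in KB263889:
EXEC dbo.usp_GetBasket @SessionID = @SessionID;   -- qualified: plan found directly
-- EXEC usp_GetBasket @SessionID = @SessionID;    -- unqualified: per-user probe first

-- A table variable instead of a #temp table sidesteps the recompiles that
-- temp-table name resolution causes:
DECLARE @Basket TABLE (ProductID int PRIMARY KEY, Qty int);
INSERT  @Basket (ProductID, Qty)
SELECT  ProductID, Qty
FROM    dbo.BasketRows
WHERE   SessionID = @SessionID;
```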
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-26 : 08:34:02
|
I suspect that our logging is causing some / part / all of the bottleneck. We are inserting 13,000,000 rows a day into that table (and each one is updated with elapsed time and probably some sort of result data that makes the row wider).

"do your sprocs recompile every time because of temp table use? try table variables."

There is a temp table involved. It is referenced by a child sproc, so we could not use @TableVar in SQL 2000.

"how many CPU's and tempdb files do you have?"

4 sockets, 16 cores. 16 x TEMPDB files |
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-26 : 08:38:08
|
"you might have a problem with fully qualified sproc names being looked up:"

Everything is qualified to the FROM / JOIN dbo.MyTable and EXEC dbo.MySproc level (possibly some bits of code have missed this, but that's how the code is intended to be written, and I doubt that we have more than 0.1% omissions, probably less than that).

Nothing is qualified to the database level (unless it is referencing something in another database, and the only code that does that is the batch-driven data integration stuff, which we have been turning off during peak periods). |
 |
|
nigelrivett
Master Smack Fu Yak Hacker
3385 Posts |
Posted - 2010-11-26 : 08:38:22
|
>> There is a temp table involved. It is referenced by a child sproc

That will cause recompiles as it needs to resolve the temp table id with every run.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy. |
 |
|
spirit1
Cybernetic Yak Master
11752 Posts |
Posted - 2010-11-26 : 08:40:50
|
Why 16 tempdb files? For your situation 8 or 4 would be better, no? Or did you get this value out of testing?

Also, have you thought about doing your logging code async with Service Broker? IMO it would make things better. You could also have a system that switches to Service Broker receiving your orders when you get past some threshold.

EDIT: database-level referencing is not needed.

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.9 out!
SQL Server MVP |
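The async-logging idea is roughly this shape with Service Broker. A bare-bones sketch with invented names; error handling and the activated reader proc (which dequeues and does the real INSERT) are omitted:

```sql
-- One-off setup: message type, contract, queue and service for log traffic.
CREATE MESSAGE TYPE LogMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT LogContract (LogMessage SENT BY INITIATOR);
CREATE QUEUE dbo.LogQueue;
CREATE SERVICE LogService ON QUEUE dbo.LogQueue (LogContract);

-- Inside the logging sproc, SEND instead of INSERT; an activated proc on
-- LogQueue writes the row later, off the caller's critical path:
DECLARE @h uniqueidentifier;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE LogService
    TO SERVICE 'LogService'
    ON CONTRACT LogContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @h
    MESSAGE TYPE LogMessage (N'<log sproc="usp_Whatever" session="123" />');
```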
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-26 : 09:26:55
|
">> There is a temp table involved. It is referenced by a child sproc. That will cause recompiles as it needs to resolve the temp table id with every run."

OK, that explains that, thanks. The child sproc is used by other parents. Parent gets data to #TEMP, child retrieves the appropriate columns of the recordset to send to the APP (the APP has a number of different sprocs to call, but all should return the same recordset / handle paging / handle sorting).

Perhaps we should duplicate the child-sproc SELECT statement in the parent sprocs instead (and just treat it as a code management issue, rather than reducing coding work by trying to share the child sproc?)

I have run Profiler for SP:Starting, SP:Recompile, SP:Completed and they are coming along in threes ...

"why 16 tempdb files? for your situation 8 or 4 would be better, no? or did you get this value out of testing?"

Tara told me to (and I didn't argue!)
http://weblogs.sqlteam.com/tarad/archive/2008/05/20/60604.aspx

We were running just one before that (up until a few days ago). It was only 100MB or so - so I don't suppose it was being heavily used (which is surprising, as we have READ_COMMITTED_SNAPSHOT on this DB).

"also have you thought about doing your logging code in async with service broker code? IMO it would make things better."

Yeah, that's a good idea. Not really an instant fix though, is it? (Every sproc has EXEC usp_LogMe at the top of the code ...) but maybe there is a way to change usp_LogMe to pass the data to Service Broker instead?

Currently we do:

EXEC @MyLogID = dbo.usp_LogMe @MySessionID, @MySprocID, @strParameterList

and then at the bottom we do:

EXEC dbo.usp_UpdateMyLog @MyLogID, @intReturnResult, @strAnyOtherStuffToLog

I suspect this UPDATE is not helpful (widening the row, causing contention). I have been wondering whether we should just INSERT into another table (instead of UPDATEing the original row) and then, on the odd occasions where we need to report on it, just JOIN the two tables. There may be some code that does multiple updates and relies on the intermediate state, but it won't be much.

Can Service Broker give me a transaction ID back - so that I have a @MyLogID value for the exit-sproc record?

"you could also have a system to switch to service broker receiving your orders when you get past some value."

I don't think we have a problem with order volume. That would be a nice-to-have problem. I doubt our client's customers would exceed 400 - 500 orders-per-hour maximum. |
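The insert-instead-of-update idea would look something like this (dbo.SprocLogStart / dbo.SprocLogEnd are invented names; the variables are the ones from the existing usp_LogMe calls):

```sql
-- Entry: narrow append-only row; no later UPDATE to widen it.
INSERT dbo.SprocLogStart (SessionID, SprocID, Params, StartTime)
VALUES (@MySessionID, @MySprocID, @strParameterList, GETDATE());
SET @MyLogID = SCOPE_IDENTITY();

-- Exit: a second append-only row keyed on the same LogID.
INSERT dbo.SprocLogEnd (LogID, ReturnResult, OtherStuff, EndTime)
VALUES (@MyLogID, @intReturnResult, @strAnyOtherStuffToLog, GETDATE());

-- Reporting joins the two only when needed:
SELECT  s.SprocID,
        DATEDIFF(ms, s.StartTime, e.EndTime) AS ElapsedMs
FROM    dbo.SprocLogStart s
JOIN    dbo.SprocLogEnd   e ON e.LogID = s.LogID;
```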
 |
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
7020 Posts |
Posted - 2010-11-26 : 09:52:58
|
If you have "Auto Update Statistics" set to True, do you also have "Auto Update Statistics Asynchronously" set to True? That will keep queries from waiting until the statistics are updated.

CODO ERGO SUM |
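For reference, the setting and a quick check (swap in the real database name):

```sql
-- Requires AUTO_UPDATE_STATISTICS ON as well; async means queries run against
-- the stale stats while the update happens in the background, instead of waiting.
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Verify both settings across all databases:
SELECT name, is_auto_update_stats_on, is_auto_update_stats_async_on
FROM   sys.databases;
```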
 |
|
Kristen
Test
22859 Posts |
Posted - 2010-11-26 : 10:02:57
|
I have copied all the code from the child procs into the parent proc. I see a recompile when some path through the code has not been exercised before/yet, but after that I'm not seeing more recompiles. But that's on a STAGING database with no contention (although it is on the same machine, so subject to the same CPU/disk throughput). So far so good ...

"If you have "Auto Update Statistics" set to True, do you also have "Auto Update Statistics Asynchronously" set to True? That will keep queries from waiting until the statistics are updated."

"Auto Update Statistics" was True.
"Auto Update Statistics Asynchronously" was False.

Brilliant, thanks for that. I expect adding 13M rows a day to the logging table means it updates stats a fair bit on that table? |
 |
|
spirit1
Cybernetic Yak Master
11752 Posts |
Posted - 2010-11-26 : 10:23:12
|
@Kristen

Wouldn't it be easier to just replace the temp table with a table variable? That way you don't need to change the parent sprocs.

You can get the current transaction from sys.dm_tran_current_transaction.

The logging sprocs can easily be changed to support Service Broker; you don't have to change the other sprocs. If you need any help with that let me know.

One tempdb file per core is a myth: http://www.sqlskills.com/BLOGS/PAUL/post/A-SQL-Server-DBA-myth-a-day-(1230)-tempdb-should-always-have-one-data-file-per-processor-core.aspx
It's more like: number of tempdb files = #CPUs/4 or #CPUs/2

@MVJ good catch about the stats!

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.9 out!
SQL Server MVP |
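For the two lookups mentioned, a quick sketch:

```sql
-- Current transaction id, usable as a correlation key for the exit-log row:
SELECT transaction_id FROM sys.dm_tran_current_transaction;

-- Sanity check: how many tempdb data files are actually configured:
SELECT COUNT(*) AS tempdb_data_files
FROM   tempdb.sys.database_files
WHERE  type_desc = 'ROWS';
```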
 |
|
elwoos
Master Smack Fu Yak Hacker
2052 Posts |
Posted - 2010-11-26 : 10:52:30
|
For the longer term you might want to look at some sort of caching on the web side (assuming you don't do it already), e.g. Squid or Nginx.

-----------
I used to think I wasn't a morning person but it never got better in the afternoon |
 |
|