Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Administration (2000)
 sp4 - indexdefrag blocking and HUGE perf problems

Author  Topic 

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-28 : 08:54:30
We installed sp4 two weeks ago and have been through a living hell ever since. I just wanted to quickly outline our problems and see if anyone has encountered anything similar and has some tips.

1) INDEXDEFRAG. We have run this for a couple hours every night for years with no problems, but it is now causing major blocking. We are going to have to disable it, but our past history tells us that if we go too long without proper index maintenance performance degrades eventually to unacceptable levels. But I don't know what else to do, as our customers are pissed that they cannot do their normal jobs. (BTW, we run 24/7, and we run indexdefrag during our least load, but it doesn't matter). I saw an old post on dbforums.com that this was a bug in sp4, but I haven't seen anything on it since, and there is nothing on Microsoft support.

2) CPU time has doubled. The amount of time SQL processes spend using CPU doubled the day after sp4 was put in place. We think this is because of...

3) Compile lock wait time has increased 100 fold. Previously the accumulative lockwait duration for compile locks was about 10 seconds for a 15-minute timeblock. It is now at 12-15 minutes per 15 minute timeblock. However, the number of compiles and recompiles has not increased, which would lead us to believe that the amount of time held by those same compile locks has increased 100-fold, and thereby increasing CPU.

4) A large number of our sprocs are now using execution plans that have increased reads and duration exponentially from pre-sp4. We haven't figured out a pattern yet, but we have been addressing each of these sprocs one by one with our normal performance-improvement methodology. We have had moderate success beating these sprocs back into submission.

If our problems were only #4, we could deal with it. But #'s 1-3 not only have us stumped, but have us in the doghouse with our customers. We have no recourse for going back to sp3 - we are worldwide 24/7, and therefore get only 2 hours of maintenance downtime per month; not enough to uninstall, reinstall, and patch back up to sp3.

We opened a case with Microsoft PSS a week and a half ago. It has been painfully slow and has yielded no help whatsoever so far. When I initiated the call I told the screener we had significant business interruption as a result of our downtime. Often during the worse times our customers cannot ship their product due to the performance problems. This is no small thing - if a truck shows up from the Big Three (Ford, GM, Chrysler) and they can't ship, they start getting fined thousands of dollars for every minute those trucks sit in the dock. Think that makes for happy customers?

Anyway, I was told I could wait an undetermined number of hours for a "Level A" technician, or I could get a "Level B" technician right away. I went with the Level B as I figured getting started was better than waiting. Now I think I made a mistake. He's trying, but all I've done so far is run numerous pss diags (which kills our system and causes more timeouts) and send him gigabytes of files. Argh! Can you tell I'm frustrated? Sorry, don't mean to use the forum as a place to unload, just hoping someone has had similar problems and success in fixing them, especially the indexdefrag issue.

Thanks.

Jerry Foster
Plexus Systems
www.plex.com

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2005-11-28 : 09:15:11
Get paulrandal (a member here from microsoft) involved...he wrote some of the DBCC commands.
Go to Top of Page

MichaelP
Jedi Yak

2489 Posts

Posted - 2005-11-28 : 09:52:08
Well, first let me say that it's pretty awesome that you have the above metrics. That's the best "it was working fast and now sucks" post I've ever seen.

That being said, could you not just rollback to SP3a? I've never had to rollback a SP on SQL Server, but there must be a way. Worst case, detace your databases, uninstall SQL Server, reinstall, install sp3a, re-attach databases. I think Master isn't going to be happy with this, so be sure to read the docs on "restoring master" before you attempt anything like this. Depending on how many databases and database users you have this might not be feasible, esp. at 24x7x365.

This is just a guess / thought. Please research this idea.
You milage may vary, etc, etc.

Michael

<Yoda>Use the Search page you must. Find the answer you will. Cursors, path to the Dark Side they are. Avoid them, you must. Use Order By NewID() to get a random record you will.</Yoda>
Go to Top of Page

MichaelP
Jedi Yak

2489 Posts

Posted - 2005-11-28 : 09:53:23
Be sure your case stays elevated. When it's at the highest level, lots of people get lots of e-mails about your case, and that tends to get people on the stick.

Michael

<Yoda>Use the Search page you must. Find the answer you will. Cursors, path to the Dark Side they are. Avoid them, you must. Use Order By NewID() to get a random record you will.</Yoda>
Go to Top of Page

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-11-28 : 11:28:52
escalate the call to a crit sit immediately. That is the only way you are going to get good service out of microsoft support these days. Do you have a TAM? get them involved too.

Also, there have been a few threads here about poor performance after applying SP4. I'll see if I can find them and post the links.


EDIT:
check this thread out:
http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=52230



-ec
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-11-28 : 12:25:44
The INDEXDEFRAG issue is a known problem and there's a QFE for it. See http://support.microsoft.com/kb/907250 for details and the hotfix. Can't believe the person didn't give this to you already.

What's your PSS case number? (Begins 'SR'...) And the name/email of the person you're dealing with? I shall have a word with the Director of SQL PSS.

Thanks

Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-11-28 : 12:26:23
BTW - please let me apologize for the problems you've had and the poor service you've received.


Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-11-28 : 12:56:49
quote:
Originally posted by paulrandal

The INDEXDEFRAG issue is a known problem and there's a QFE for it. See http://support.microsoft.com/kb/907250 for details and the hotfix. Can't believe the person didn't give this to you already.

What's your PSS case number? (Begins 'SR'...) And the name/email of the person you're dealing with? I shall have a word with the Director of SQL PSS.

Thanks

Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)



Paul,

Are there any other post SP4 hotfixes we should be aware of? There is the AWE memory fix and now this one. Is that all that we should apply?



-ec

Go to Top of Page

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-28 : 12:59:02
Paul, bless you.

I searched and searched MS support for a hotfix as I figured it should be up there. But I must be daft, because the terms I searched on came up with nothing. I would think "sp4 indexdefrag" would flag it, but it did not. In any case, thank you. We will proceed with the hotfix if we can acquire the downtime, though our SLA's might prevent that until our next scheduled maintenance window. In that case we'll see if we have any luck with the trace flag workaround described in the article.

That still leaves the CPU and compile lock issues, but I'll keep at it with the PSS rep; hopefully he'll be able to escalate the issue to someone who can get to the bottom of it, if he cannot.

The support case is: SRX051118602523
The PSS rep's name is Avik, his email is avikroy@microsoft.com. He's a nice guy. I'm just not sure he is grasping the gravity of the situation for us.

Thanks again, Paul, have a great day. It's dedicated people like yourself that restore my faith in MS when it is dealt a blow.

Jerry
Go to Top of Page

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-11-28 : 16:53:36
quote:
Originally posted by ThreePea

Paul, bless you.

I searched and searched MS support for a hotfix as I figured it should be up there. But I must be daft, because the terms I searched on came up with nothing. I would think "sp4 indexdefrag" would flag it, but it did not. In any case, thank you. We will proceed with the hotfix if we can acquire the downtime, though our SLA's might prevent that until our next scheduled maintenance window. In that case we'll see if we have any luck with the trace flag workaround described in the article.

That still leaves the CPU and compile lock issues, but I'll keep at it with the PSS rep; hopefully he'll be able to escalate the issue to someone who can get to the bottom of it, if he cannot.

The support case is: SRX051118602523
The PSS rep's name is Avik, his email is avikroy@microsoft.com. He's a nice guy. I'm just not sure he is grasping the gravity of the situation for us.

Thanks again, Paul, have a great day. It's dedicated people like yourself that restore my faith in MS when it is dealt a blow.

Jerry





Jerry, escalate this to a crit sit. Do you have a TAM?



-ec
Go to Top of Page

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-28 : 17:07:16
quote:
Originally posted by eyechart

Jerry, escalate this to a crit sit. Do you have a TAM?
-ec



EC:

I talked to our support rep today and asked that he escalate it to a critical situation. He refused. He said that he had already discussed the situation with the escalation engineers and they felt the action plan was a good one, so he won't escalate it until we move through all the steps in the "plan."

Here's the current plan requested of me:

- Do UPDATE STATISTICS WITH FULLSCAN on all tables in the entire database
- Turn on trace flag 8690, an undocumented flag which disables table spooling
- Run Index Tuning Wizard.

If these don't fix it he said he would consider escalating. I have detailed and indepth statistics, peformance counters, waitstats, and traces of all our procedure executions before and after sp4, but I still need to run ITW? @#$*%#*!*$@#$*% I have worked here 15 years and I don't ever remember so many of my customers angry and frustrated with us as they have been these last two weeks.

He was not aware of the KB that Paul Randal referred to for the index defrag blocking. I gave it to him and asked him to get us the hotfix for us.

Also, I am not sure what a TAM is, so I am supposing we don't have one.

Thanks!

Jerry
Go to Top of Page

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-11-28 : 17:46:32
quote:
Originally posted by ThreePea

quote:
Originally posted by eyechart

Jerry, escalate this to a crit sit. Do you have a TAM?
-ec



EC:

I talked to our support rep today and asked that he escalate it to a critical situation. He refused. He said that he had already discussed the situation with the escalation engineers and they felt the action plan was a good one, so he won't escalate it until we move through all the steps in the "plan."

Here's the current plan requested of me:

- Do UPDATE STATISTICS WITH FULLSCAN on all tables in the entire database
- Turn on trace flag 8690, an undocumented flag which disables table spooling
- Run Index Tuning Wizard.

If these don't fix it he said he would consider escalating. I have detailed and indepth statistics, peformance counters, waitstats, and traces of all our procedure executions before and after sp4, but I still need to run ITW? @#$*%#*!*$@#$*% I have worked here 15 years and I don't ever remember so many of my customers angry and frustrated with us as they have been these last two weeks.

He was not aware of the KB that Paul Randal referred to for the index defrag blocking. I gave it to him and asked him to get us the hotfix for us.

Also, I am not sure what a TAM is, so I am supposing we don't have one.

Thanks!

Jerry



I have never had a problem escalating a call. This is pretty unacceptable in my opinion. Unfortunately Microsoft's first level support has been really bad the last year or so, the only time you get the really good people on the phone (the paul randal's) is if you get things escalated.

A TAM is a technical account manager and you get a TAM when you have purchased Premier Support agreement. You pay for some portion of their time (like 1/8 or 1/4 of a year or whatever) and they typically get to know who's who at your company, what technologies are in place and being used. The really helpful part is that they can get things bumped to a high priority very quickly. Whenever we get into one of these situations, we always have our TAM get things escalated to the right people.

Keep us informed as to what happens and what solutions you come up with.



-ec

Go to Top of Page

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-11-28 : 18:39:29
I've escalated this situation to PSS management.

Sqlteam regulars: Please notify me when you think MS is failing a customer to see if there's anything I can do about it (this is clearly such a case).

Regards



Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2005-11-28 : 21:39:05
ThreePea, can I be blunt and ask what the testing process was before rollout of SP4?

You clearly know what you are doing and I am wondering which of several thoughts in my mind fits the situation:

1) Tested, no problem found, only showed up in production
2) Tight-fisted budgets don't allow for full hardware duplication for testing of this nature
3) No feasible test plan exists (i.e. effort is available for a new version, but not for a SP. possibly no automated testing suite exists, etc.)
4) MS Service Pack assumed "safe"

but I expect there are other possible scenarios too.

The other observation I've made is that the downtime for a rollback to SP3a strikes me as being less than the accumulated downtime by NOT rolling back until the next scheduled maintenance window. This must suck from both your perspective AND the customers.

I'm going to take a careful look at our SLAs and make sure that in future we can agree emergency downtime where that would prevent more aggregate time lost. Doesn't seem much point in an SLA that ties both hands behind your back - and the customer's too! but I reckon our current SLAs would cause the same situation as you are in.

Good luck getting it sorted out

Kristen
Go to Top of Page

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-11-29 : 00:00:39
good points kristen.

Sometimes no amount of testing will uncover the bug you run into though. In those situations it would be nice if there was a faster way to "undo" a servicepack then the uninstall/reinstall of SQL Server method that we currently have to use.



-ec

Go to Top of Page

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-29 : 04:29:01
quote:
Originally posted by Kristen

ThreePea, can I be blunt and ask what the testing process was before rollout of SP4?


No.



Of course.

We installed sp4 first on our development server, where it resided over a month. During this time most of our crucial sprocs and apps would have been run and tested by developers and implementers. We then installed it on our test server (where our customers test upcoming changes) and our report server for a period of two weeks, both of which get a solid representation of work by our customers. After encountering no problems or issues we rolled it out to our main workhorse production server on Black Sunday, Nov 13. :)

However, the one thing that we could not test, and this has always been a sticking point for us, is the main production system under heavy load. We have similar hardware available for testing, so functionality has never been an issue, but load-testing is. And of course, this is precisely where sp4 fell down for us.

We did do some due diligence and research on any problems associated with sp4. Of course there were some reports by users of performance problems, but to be honest, there are always similar reports with a new sp or release. Other than that, there didn't seem to be any backlash from any of the sources that we trust.

We are an extremely transaction-heavy OLTP system. We have made numerous attempts at recreating our load on a test system, but simply have not been successful. While we do many things very well (if I say so myself), this is one area we need to improve. And to be honest a large part of it is simply making it a priority, which it hasn't been so far.

We have tried to use our own full-out traces from production to recreate a load on our test server, but we are unable to set something up that:
(a) accurately recreates 100's of users all submitting batches simultaneously with the resultant contention for resources, and
(b) recreates the transactions so that the modification of data occurs precisely as it does/did on production (FK, identity, etc.).

We have also investigated some of the 3rd-party stress test tools, but found them lacking (this research was done by a colleague, so I cannot elaborate on exactly what products he tested and why they were lacking w/o talking to him first).

So, that is the situation. If anyone has any good suggestions on accurate stress-testing for a heavy OLTP system, I'm all ears. This will be extremely pertinent for us again soon, as we are getting ready to move from 2000/32bit to 2005/64bit.

Thanks for your input, everyone.
Go to Top of Page

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-29 : 04:50:05
quote:
Originally posted by Kristen


The other observation I've made is that the downtime for a rollback to SP3a strikes me as being less than the accumulated downtime by NOT rolling back until the next scheduled maintenance window. This must suck from both your perspective AND the customers.



Yes, exactly. Now if we could just convince them of that. But since downtime for a rollback means we are completely unavailable, as opposed to just being crippled, albeit severely at times, downtime is just not acceptable to our customers.

How long for a reinstall/install with all service packs and hotfixes - 2 or 3 hours (if everything goes perfectly)? 6 hours? Our software runs the entire manufacturing process, beginning to end. If we go down, our customer's plants literally shut down. And since we have customers around the world, there is no such thing as "off" hours, though the majority of our customers are USA-based so night-time load is less.

[And yes, a swappable hot site would be awesome in this case. It's something we are working on, but haven't arrived at that nirvana yet.]

Jerry
Go to Top of Page

Rovastar
Starting Member

38 Posts

Posted - 2005-11-29 : 11:54:00
I was thinking of implementing a global update of SP4 for our webservers that run SQL server. I am thinking again now. It is hard enough to get CMS 2002 and Commerce Server 2002 run totally smoothly as well as all the databases.

I am not sure about what decent load testing is available as I don't just do SQL server.

But what disaster recover plan did you have for this SP4 project?

Surely you had one..........
Go to Top of Page

ThreePea
Yak Posting Veteran

83 Posts

Posted - 2005-11-29 : 13:27:41
quote:
Originally posted by Rovastar


But what disaster recover plan did you have for this SP4 project?

Surely you had one..........


Of course, the same as everyone else's sp4 disaster recover plan - rollback to sp3a. Unless there's a different disaster plan that I'm unable to envision right now, that is what we would do if we absolutely had to. But since doing so would violate some of our SLA's due to downtime required we are avoiding that alternative at the moment.

Jerry
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2005-11-30 : 02:15:38
"We installed sp4 first ..."

And I thought we went the Extra Mile ...

Thanks for the info ThreePea, and good luck with sorting it out.

Kristen
Go to Top of Page

Rovastar
Starting Member

38 Posts

Posted - 2005-11-30 : 04:50:36
quote:
Originally posted by ThreePea

quote:
Originally posted by Rovastar


But what disaster recover plan did you have for this SP4 project?

Surely you had one..........


Of course, the same as everyone else's sp4 disaster recover plan - rollback to sp3a. Unless there's a different disaster plan that I'm unable to envision right now, that is what we would do if we absolutely had to. But since doing so would violate some of our SLA's due to downtime required we are avoiding that alternative at the moment.

Jerry



I understand the techincal aspects (i.e. Rollback to SP3a) but if you cannot do this in your enivornment (we cannot have any downtime) then the plan fails on day one. So the plan appears invalid. Part of the disaster recover plan is for all the other things too that are none technical. Who to talk to, what timeframe to implement this, etc.

It is difficult when the Server still works but is painfully slow/unreliable to decide when to rollback. DO you or do you not wait until the next window.

Can you not give x days notice for emergency maintence work that can be aranged with you clients?

They either have a slow/unreliable service or have 1 hour downtime off-peak to install SP3a to get this resolved.
Go to Top of Page
    Next Page

- Advertisement -