Author |
Topic |
Brain2000
Starting Member
3 Posts |
Posted - 2006-06-03 : 16:56:49
|
I have a very interesting scenario occurring during a DBCC CHECKDB, which causes the server and I/O bus to hang completely when using an ExtremeRAID3000 fibre card. Here is the full hardware scenario:

- Two Windows 2000 Advanced Servers (clustered) w/2GB RAM each
- ExtremeRAID3000 card
- JBOD with Fibre Drives
- SQL2000 SP3 (have not tried SP4 yet)

I have been able to reproduce this on 3 different clusters. Each had different brands of memory/motherboard/hard drives. The only common hardware item was the ExtremeRAID3000.

Each clustered server runs about 400 databases, 24/7/365. I only reboot the computers about once every 3000 hours as the Windows Updates pile up and need to be applied.

If I run a DBCC CHECKDB on various databases, the command hangs and the I/O subsystem crashes, requiring a full reboot of both nodes in order to get the fibre I/O working again.

All other operations work fine, including:

- physical defragmenting of the hard drive
- logical defragmentation of the databases
- full/differential database backups
- chkdsk (fsutil /set dirty)
- failover
- multiple terabytes read/written through SQL for weeks on end

I currently back up and restore the databases onto a different server each week, and run DBCC CHECKDB to verify them, which is how I know they are fine (plus the customers don't complain of anything corrupt).

Any ideas what DBCC CHECKDB could possibly be doing to cause the I/O subsystem to go offline? This is not consistent either: it fails on some databases some weeks, and on other databases other weeks. It varies from time to time. But if I find one it crashes on, it will hang again if I run it again immediately after rebooting. So I haven't run DBCC CHECKDB on the production servers with ExtremeRAID 3000s in over two years.

One final thing: I have several other clusters that do not use these cards, and they have no problems with DBCC CHECKDB.

Intelligent Design is Christianity Evolving due to Natural Selection. |
|
paulrandal
Yak with Vast SQL Skills
899 Posts |
Posted - 2006-06-03 : 17:38:50
|
CHECKDB does nothing special to read pages - it issues readahead through the buffer pool, using parallel threads if parallelism is not disabled.

Now, CHECKDB can put a significant load on the IO subsystem, especially if you're running multiple concurrent CHECKDBs on databases that are accessed through the same IO system.

The fact that the IO system crashes is the smoking gun that shows that the issue is with the IO system and not anything to do with SQL Server. All SQL Server can do is issue read requests - the IO system has to be able to cope with the load that you're putting on it. In short, it's your IO system that has a problem, not the computers themselves or SQL Server.

My theory is that the IO system is getting sufficiently stressed that the disk queue length is getting very long and there's a rare issue with that hardware (either in the driver or firmware) that can cause it to crash.

Do you have the latest driver/firmware versions? Looking on LSI Logic's support page (http://www.lsilogic.com/cm/LookupDownloads.do), the latest firmware rev is 7.01-0-00, released on 4/18/02.

You could try running the SQLIOStress utility to see if you can force the hardware to fail under intensive IO load too.

Regards
Paul Randal
Lead Program Manager, Microsoft SQL Server Storage Engine + SQL Express
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.) |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2006-06-12 : 07:47:39
|
A link to SQLIOStress would be nice :)

Peter Larsson
Helsingborg, Sweden |
|
|
paulrandal
Yak with Vast SQL Skills
899 Posts |
|
Brain2000
Starting Member
3 Posts |
Posted - 2006-07-14 : 01:58:58
|
Hi Paul,

Thank you very much for the information. The thing that boggles me is that this SQL server has 20 hard drives on it through a fibre channel, and runs 24/7 with US/International customers. It's hit pretty hard, and never crashes (max uptime before a manual reboot for OS patches is over 3,000 hours). I run full backups once a week, and incrementals each night. Yet if I run a DBCC CHECKDB on the right 100MB database, it crashes the entire I/O subsystem. I thought SQL wasn't doing much more than standard I/O requests, but wanted to see if maybe it was doing something at a lower level.

I do agree with you that it is most likely the I/O subsystem, because when it crashes, both nodes lose connectivity to the Fibre drives, it can't fail over, and I have to completely power down both nodes remotely through an APC power switch. I have the latest versions of the firmware and drivers.

Thanks again for the reply. My current workaround is restoring all databases to a "warm backup" array each week, and running CHECKDB on them with a script. Of course, the warm backup does not use the same brand of card :)

Intelligent Design is Christianity Evolving due to Natural Selection. |
|
|
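For anyone wanting to try the same workaround, here is a rough sketch of what such a weekly script might look like. It is not the poster's actual script: the function names, the database names, and the backup folder are all made up for illustration, and it assumes each database has a full backup file named after it. It only builds the T-SQL text (one RESTORE followed by one DBCC CHECKDB per database, run one after another rather than concurrently); feeding that text to the warm-backup server is left to whatever client tool you use.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text):
    """Single-quote a T-SQL string literal, doubling embedded quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_check_script(databases, backup_dir):
    """Return T-SQL that restores each database and then checks it.

    Assumption: the backup for database X lives at <backup_dir>\\X.bak.
    A real restore onto a different server may also need WITH MOVE
    clauses to relocate the data and log files; that is omitted here.
    """
    lines = []
    for db in databases:
        backup_file = backup_dir + "\\" + db + ".bak"
        lines.append(
            "RESTORE DATABASE " + quote_ident(db)
            + " FROM DISK = " + quote_literal(backup_file)
            + " WITH REPLACE"
        )
        # NO_INFOMSGS suppresses informational output so only errors show up.
        lines.append("DBCC CHECKDB (" + quote_literal(db) + ") WITH NO_INFOMSGS")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical database names and backup folder.
    print(build_check_script(["Customers", "Orders"], "D:\\WarmBackup"))
```

Generating the statements in a fixed order keeps the checks serial, which avoids the extra load from concurrent CHECKDBs mentioned earlier in the thread.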