
 Haunted by inconsistencies


symes
Starting Member

7 Posts

Posted - 2005-08-09 : 12:52:39
Our client's database is hosted by a third party in a clustered SQL Server 2000 environment. Things ran swimmingly for 18 months until, in January, some inconsistencies started to appear. The first symptom was three records in one of the tables that looked corrupted, i.e. they contained dates in the years 2079 and 6619! We simply removed the data that had been bulk loaded that day, re-bcp'd it, and were able to continue without issue.

We never quite got to the bottom of the exact cause of this corruption and were happy to leave it be until, on 30 June, a DBCC CHECKDB run after a failure gave the following (trimmed) results:

DBCC results for 'DealPeriodParmsFlexShadow'.
There are 0 rows in 1 pages for object 'DealPeriodParmsFlexShadow'.
DBCC results for 'DealPeriodParmsShadow'.
Server: Msg 8928, Level 16, State 1, Line 1
Object ID 1383012008, index ID 0: Page (1:128399) could not be processed. See other errors for details.
There are 127575 rows in 1057 pages for object 'DealPeriodParmsShadow'.
DBCC results for 'recvspre17'.
There are 97593 rows in 3367 pages for object 'recvspre17'.
DBCC results for 'DealPeriodParmsSubShadow'.
There are 0 rows in 1 pages for object 'DealPeriodParmsSubShadow'.
DBCC results for 'System'.
There are 22 rows in 1 pages for object 'System'.
DBCC results for 'Recvs'.
There are 4980591 rows in 185222 pages for object 'Recvs'.
CHECKDB found 0 allocation errors and 1 consistency errors in table 'Recvs' (object ID 1383012008).
DBCC results for 'DealPeriodsShadow'.
There are 1182 rows in 32 pages for object 'DealPeriodsShadow'.
DBCC results for 'DealsShadow'.
There are 0 rows in 1 pages for object 'DealsShadow'.


Running DBCC CHECKDB a second time seemed to imply that the issue was fixed, but the application crashed again within 45 minutes. Manually running the stored procedure that was executing at the time of the crash resulted in the following message:
Server: Msg 21, Level 21, State 1, Procedure sp_FlagGalpIneligibleOverdueObligors, Line 109
Warning: Fatal error 644 occurred at Jun 30 2005 4:23PM


Running the same SP immediately again resulted in a new message:
Server: Msg 21, Level 20, State 1, Procedure sp_FlagGalpIneligibleOverdueObligors, Line 109
Warning: Fatal error 3624 occurred at Jun 30 2005 4:31PM
Location: scanrid.cpp:321
Expression: m_len != 0
SPID: 52
Process ID: 3756


A DBCC CHECKDB resulted in "CHECKDB found 0 allocation errors and 68 consistency errors in database" with messages such as:
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:245:193) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663407:10)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:245:194) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663407:11)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:245:195) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663407:12)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:245:196) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663407:16)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:245:197) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663407:17)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:


{These two messages appeared 42 consecutive times, then there were some standard DBCC result messages and then...}

Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:571866:69) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663445:25)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:
Server: Msg 8956, Level 16, State 1, Line 1
Index row (1:571867:244) with values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'I') points to the data row identified by (RID = (1:663382:1)).
Server: Msg 8952, Level 16, State 1, Line 1
Table error: Database 'dbCresta', index 'Recvs.XIEDealPeriodTypeStatus' (ID 1383012008) (index ID 2). Extra or invalid key for the keys:


{a further 26 consecutive times.}

Running DBCC CHECKDB a second time reported 0 inconsistencies. This seemed to be a growing pattern: DBCC would show problems on the first attempt and come back clean on immediate subsequent attempts. The third party hosting the database insists that there are no hard drive issues (we pointed them to a number of forum threads suggesting this might be the cause). They are using a RAID 5 configuration and also insist that no configuration changes have taken place in the last year.
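(For anyone wanting to see the mismatch outside of DBCC, a query along the lines of the sketch below should show it while the corruption is present - the table hints force SQL Server to count the rows once via the suspect non-clustered index and once via the heap, so differing counts mean the two disagree. The column values are taken from the 8956 errors above; this is only an illustrative check, not something we run in production.)

SELECT COUNT(*)
FROM Recvs WITH (INDEX(XIEDealPeriodTypeStatus))
WHERE DealId = 2 AND PeriodId = 469 AND Type = 'R' AND Status = 'I'

SELECT COUNT(*)
FROM Recvs WITH (INDEX(0))   -- INDEX(0) forces a scan of the base table (the heap)
WHERE DealId = 2 AND PeriodId = 469 AND Type = 'R' AND Status = 'I'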

That night a job was run to rebuild all of the indexes in the database. The very first thing I did the following morning was run a DBCC CHECKDB, with the following (trimmed) results:
...
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:1) identified by (RID = (1:663382:1) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
Server: Msg 8951, Level 16, State 1, Line 1
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:2) identified by (RID = (1:663382:2) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
Server: Msg 8951, Level 16, State 1, Line 1
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:3) identified by (RID = (1:663382:3) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
...
CHECKDB found 0 allocation errors and 3 consistency errors in database


{Running DBCC immediately again resulted in:}
...
DBCC results for 'DealPeriodParmsShadow'.
Server: Msg 8951, Level 16, State 1, Line 1
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:1) identified by (RID = (1:663382:1) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
Server: Msg 8951, Level 16, State 1, Line 1
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:2) identified by (RID = (1:663382:2) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
Server: Msg 8951, Level 16, State 1, Line 1
Table error: Table 'Recvs' (ID 1383012008). Missing or invalid key in index 'XIEDealPeriodTypeStatus' (ID 2) for the row:
Server: Msg 8955, Level 16, State 1, Line 1
Data row (1:663382:3) identified by (RID = (1:663382:3) ) has index values (DealId = 2 and PeriodId = 469 and Type = 'R' and Status = 'E').
There are 127803 rows in 1059 pages for object 'DealPeriodParmsShadow'.
...
CHECKDB found 0 allocation errors and 3 consistency errors in database


It should be noted that no other external database activity took place between the overnight job running and my running DBCC in the morning; our application had been closed, and no other databases on the server are currently in use.

The third party host subsequently executed DBCC CHECKDB with REPAIR_REBUILD, and after this a standard DBCC CHECKDB came back clean. We later found that some of the data we had deleted, in an attempt to roll back to a state before the errors, had actually been resurrected by this activity, and we then had to run many manual checks on the affected tables to verify the data was as expected.
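(For reference, the repair presumably amounted to something along these lines - the database has to be in single-user mode for any repair option, and REPAIR_REBUILD only rebuilds index structures, so it should not discard data:)

ALTER DATABASE dbCresta SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC CHECKDB ('dbCresta', REPAIR_REBUILD)
ALTER DATABASE dbCresta SET MULTI_USER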

Over the weekend of the 2nd and 3rd July diagnostic tests were run against the hardware in question in order to rule out h/w issues. The system was given a clean bill of health.

On Monday 4 July the DBCC inconsistencies were back... They come and go, appearing on average once a week (sometimes two days in a row, sometimes not for two weeks). We have tried archiving data to make the tables smaller, removing the statistics (which we discovered were severely affecting performance anyway) and running nightly DBCC CHECKDBs, all to no avail. The Recvs table seems to be the main problem. The bizarre thing is that the indexes on this table are dropped and recreated daily, since we bcp our daily data in, so the indexes should always be 'fresh'. There is no clustered index on this table, just four non-clustered ones.
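(The nightly check is nothing exotic - roughly the command below. NO_INFOMSGS keeps the output down to actual errors rather than per-table row counts, and ALL_ERRORMSGS stops the error list being truncated at 200 errors per object:)

DBCC CHECKDB ('dbCresta') WITH NO_INFOMSGS, ALL_ERRORMSGS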

The support work resulting from the application failing each time these DBCC inconsistencies occur is time we would much rather spend productively. We would greatly appreciate any assistance in getting to the root cause of the problem so that it can be addressed and the problems made to go away.

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-08-09 : 13:03:11
What version of SQL Server 2000 are you running?

Does the query workload include updates to already existing rows that would make the rows longer? This may be a known issue that you've run into.

Paul Randal
Dev Lead, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)

X002548
Not Just a Number

15586 Posts

Posted - 2005-08-09 : 13:03:39
You don't mention it, but did the application have a release just prior to the corruption?

How many tables are in your database, and what's its size?

Do you have a disaster box or dev box?

You say you rebuild indexes nightly. Is that for everything? And why do you do this?



Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-08-09 : 13:20:42
http://support.microsoft.com/default.aspx?scid=kb;en-us;827178 gives more details of the problem I think you may have hit.

If this is happening constantly, can you try creating a clustered index and see if the problem goes away (I realize that this may be prohibitively expensive for you to try).

Another way we can see if you have the problem described in the KB article is to use DBCC PAGE to examine one of the heap pages referenced in the CHECKDB output. If you want to do this, can you post the output from the latest CHECKDB failure and I'll tell you how to do this.
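(For reference, the basic form is below; the file and page ids come straight out of the 8956 errors - e.g. RID (1:663407:10) means file 1, page 663407, slot 10. DBCC PAGE is undocumented, so treat this as a read-only diagnostic sketch rather than a supported command.)

DBCC TRACEON (3604)                   -- send DBCC PAGE output to the client instead of the errorlog
DBCC PAGE ('dbCresta', 1, 663407, 3)  -- database name, file id, page id, print option 3 = detailed per-row dump
DBCC TRACEOFF (3604)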

Thanks

Paul Randal
Dev Lead, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)

symes
Starting Member

7 Posts

Posted - 2005-08-10 : 06:53:30
quote:

Originally posted by paulrandal

What version of SQL Server 2000 are you running?

Does the query workload include updates to already existing rows that would make the rows longer? This may be a known issue that you've run into.



Hi Paul,

@@Version:
Microsoft SQL Server 2000 - 8.00.760 (Intel X86)
Dec 17 2002 14:22:05
Copyright (c) 1988-2003 Microsoft Corporation
Enterprise Edition on Windows NT 5.0 (Build 2195: Service Pack 4)

There are very few updates to existing rows in the Recvs table, and none that would make the row longer.

This morning the process stopped during the building of the indexes: the indexes were dropped, new rows were bcp'd in, the first index built okay, and then nothing happened. I am trying to get hold of any logs that might indicate exactly what happened and see whether any error messages were reported. A DBCC CHECKDB reported no inconsistencies on this occasion.

Thanks for the feedback,
Scott

symes
Starting Member

7 Posts

Posted - 2005-08-10 : 07:03:53
quote:
Originally posted by X002548

You don't mention it, but did the application have a release just prior to the corruption?

How many tables are in your database, and what's its size?

Do you have a disaster box or dev box?

You say you rebuild indexes nightly. Is that for everything? And why do you do this?


Hi Brett,
There have been no new releases of the application for over a year. We supply new or altered stored procedures from time to time, but nothing around the time the issues started.

There are four main tables in the database, and many supporting ones. Recvs is the most important table and, typically, also the one with the apparent problems.

There is no stand-by disaster box capable of hosting the database and processing queries fast enough to make it viable. We have dev boxes here in London, and the client has a dev box hosted by their third party outsourcers. Typically we archive as much data as possible before using these environments.

That said, we have recently helped to archive as much of the live data as possible. I will try to get some actual sizes for the database and the relevant tables and post them here later.

The indexes on three of the four main tables are dropped each morning, the day's data is bcp'd in, and the indexes are then recreated. This is to allow the bulk load to run as quickly as possible. No other users or processes access the database during this phase; it is mostly a single-user app anyway.
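(In outline, the morning load looks something like the sketch below. The file name is just illustrative, the key columns are inferred from the DBCC output above, and in reality all of the non-clustered indexes on each of the three tables are dropped and recreated.)

-- 1. Drop the non-clustered indexes so the load runs against an unindexed heap
DROP INDEX Recvs.XIEDealPeriodTypeStatus
-- ...and likewise for the other non-clustered indexes

-- 2. Bulk load the day's file from the command line (run outside Query Analyzer):
--    bcp dbCresta..Recvs in D:\feeds\recvs.dat -n -S <server> -T -b 50000

-- 3. Recreate the indexes once the load has finished
CREATE NONCLUSTERED INDEX XIEDealPeriodTypeStatus ON Recvs (DealId, PeriodId, Type, Status)
-- ...and likewise for the remaining indexes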

Thanks for your feedback,

Scott

symes
Starting Member

7 Posts

Posted - 2005-08-10 : 07:17:38
quote:
Originally posted by paulrandal

http://support.microsoft.com/default.aspx?scid=kb;en-us;827178 gives more details of the problem I think you may have hit.

If this is happening constantly, can you try creating a clustered index and see if the problem goes away (I realize that this may be prohibitively expensive for you to try).

Another way we can see if you have the problem described in the KB article is to use DBCC PAGE to examine one of the heap pages referenced in the CHECKDB output. If you want to do this, can you post the output from the latest CHECKDB failure and I'll tell you how to do this.


Hi Paul,
As seen above, SP4 has already been applied to the SQL Server in question. A clustered index is probably not feasible, but I'll look into the idea further.
I'll post the next CHECKDB failure here, if we catch it. Will the DBCC PAGE still be useful if the issues are transient and disappear after the CHECKDB?

Thanks again,
Scott

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-08-10 : 12:10:11
quote:
Originally posted by symes

quote:
Originally posted by paulrandal

http://support.microsoft.com/default.aspx?scid=kb;en-us;827178 gives more details of the problem I think you may have hit.

If this is happening constantly, can you try creating a clustered index and see if the problem goes away (I realize that this may be prohibitively expensive for you to try).

Another way we can see if you have the problem described in the KB article is to use DBCC PAGE to examine one of the heap pages referenced in the CHECKDB output. If you want to do this, can you post the output from the latest CHECKDB failure and I'll tell you how to do this.


Hi Paul,
As seen above, SP4 has already been applied to the SQL Server in question. A clustered index is probably not feasible, but I'll look into the idea further.
I'll post the next CHECKDB failure here, if we catch it. Will the DBCC PAGE still be useful if the issues are transient and disappear after the CHECKDB?

Thanks again,
Scott



If you've already got SP4 and you're not updating heap rows to make them longer, it's not the problem I suspected.

Also, the number/size of the tables/database is irrelevant to working out the cause of the corruption.

Looking back at your original post, the error
Object ID 1383012008, index ID 0: Page (1:128399) could not be processed. See other errors for details.
is very concerning, and there should have been other errors relating to that page. This says that something was corrupt on the heap page itself. Do you have the complete output from that initial CHECKDB that you could post/attach (or email me if it's very large)?

The scanrid.cpp assert happens during deletion of a non-clustered index entry, and is most likely caused by the corruption on the heap page.

Can you access the SQL errorlog to see if there are any IO failures noted in it? I know that hosting company says there aren't any h/w issues - do they have all the latest firmware/drivers?
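(If it's easier than getting file access to the LOG directory, the undocumented xp_readerrorlog can be run from a query window - no argument or 0 reads the current errorlog, 1 the previous one, and so on. Look for 823 errors or other I/O-related messages:)

EXEC master..xp_readerrorlog      -- current errorlog
EXEC master..xp_readerrorlog 1    -- previous errorlog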

The missing nc-index row errors are suspicious as they're all grouped in one area of the disk, the first ones on page 1:663407, then page 1:663445, then page 1:663382.
Can you try dropping and rebuilding the indexes with no other activity and then run a DBCC CHECKDB to see if there is corruption straight off the bat?

Next time you discover corruption, we can examine the heap and index pages to see where the corruption is - whether there really are missing index rows (SQL Server bug) or something's corrupted the heap/index pages (hardware).

Thanks


Paul Randal
Dev Lead, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)

symes
Starting Member

7 Posts

Posted - 2005-08-10 : 12:38:11
Hi Paul,

I have emailed you the full DBCC CHECKDB results as requested. I did not spot any other error messages in there though.

Thanks,
Scott

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-08-10 : 12:43:59
quote:
Originally posted by symes

Hi Paul,

I have emailed you the full DBCC CHECKDB results as requested. I did not spot any other error messages in there though.

Thanks,
Scott



Got them, and I'm stuck. Was this CHECKDB run from a SQL Agent job? If so, it turns all the error states into '1', so the meaning of an error that was raised with a different state in a different piece of code is partially lost. All I can think of is that it was really raised with state '6', which means a page was wrongly marked as an IAM page in a PFS page, in which case other errors may not always be generated.

I guess we'll have to wait for the next occurrence. Sorry about that.

Paul Randal
Dev Lead, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)

symes
Starting Member

7 Posts

Posted - 2005-08-10 : 13:03:25
Hi Paul,

I'm fairly sure these results were obtained manually through SQL Query Analyzer. As the gist of this thread shows, though, the results seem to be pretty varied and inconsistent with anything that might explain things.

Just to clarify the procedure for "next time":
Once the app seems to have stopped, run a DBCC CHECKDB;
If any inconsistencies are reported (and there aren't always), post them here for you;
Refrain from further database activity until we receive further instructions from you (probably a DBCC PAGE statement);
Post the results of that statement here for you and hope you can perform your magic with it.

I need to forewarn the client that we will have to sacrifice the daily processing results the next time such a database error occurs. They are required to get the results out by 13:00 GMT, and halting all activity between the DBCC CHECKDB and the DBCC PAGE while we wait for the US business day to start will push us past that deadline. At this point such a sacrifice is acceptable from our point of view, and I'm sure the client will see it that way too.

Thanks again for the help,

Scott

paulrandal
Yak with Vast SQL Skills

899 Posts

Posted - 2005-08-10 : 13:24:57
quote:
Originally posted by symes

Hi Paul,

I'm fairly sure these results were obtained manually through SQL Query Analyzer. As the gist of this thread shows, though, the results seem to be pretty varied and inconsistent with anything that might explain things.

Just to clarify the procedure for "next time":
Once the app seems to have stopped, run a DBCC CHECKDB;
If any inconsistencies are reported (and there aren't always), post them here for you;
Refrain from further database activity until we receive further instructions from you (probably a DBCC PAGE statement);
Post the results of that statement here for you and hope you can perform your magic with it.

I need to forewarn the client that we will have to sacrifice the daily processing results the next time such a database error occurs. They are required to get the results out by 13:00 GMT, and halting all activity between the DBCC CHECKDB and the DBCC PAGE while we wait for the US business day to start will push us past that deadline. At this point such a sacrifice is acceptable from our point of view, and I'm sure the client will see it that way too.

Thanks again for the help,

Scott



I can't guarantee that I'll see your post immediately (I may be on vacation, for instance), so I think a better course of action would be: if you get corruption, take a full backup of the database and then continue. We can analyze the copy without interrupting your business.
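(Nothing special is needed - a plain full backup along these lines, where the path and backup name are just placeholders:)

BACKUP DATABASE dbCresta
    TO DISK = 'D:\Backups\dbCresta_corrupt.bak'
    WITH INIT, NAME = 'dbCresta copy taken while corruption present'

A full backup copies the pages as they are on disk, corruption included, which is exactly what we'd want to look at.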

Thanks

Paul Randal
Dev Lead, Microsoft SQL Server Storage Engine
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)

symes
Starting Member

7 Posts

Posted - 2005-08-16 : 03:27:27
Hi Paul,

I have just had it confirmed that the write-cache is enabled on the host machine in question. Could this be a possible cause of the problems being experienced? I recall seeing something in another thread where write-caching was mentioned as potentially resulting in apparent data corruption.

Kind regards,

Scott

ryanston
Microsoft SQL Server Product Team

89 Posts

Posted - 2005-08-16 : 12:22:49
quote:
Originally posted by symes

Hi Paul,

I have just had it confirmed that the write-cache is enabled on the host machine in question. Could this be a possible cause of the problems being experienced? I recall seeing something in another thread where write-caching was mentioned as potentially resulting in apparent data corruption.

Kind regards,

Scott



It is certainly something to suspect, but there's nothing inherently wrong with having write-caching enabled on the controller, as long as it correctly supports the write-through and battery-backup requirements for SQL Server.

Take a look at http://support.microsoft.com/default.aspx?scid=kb;en-us;234656, and the SQL Server IO Basics whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx. They should do a good job of answering your questions.

Thanks,


----------------------
Ryan Stonecipher
Developer, Microsoft SQL Server Storage Engine, DBCC
(Legalese: This posting is provided "AS IS" with no warranties, and confers no rights.)
   
