Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Data Corruption Issues
 MSDB Corruption

Author  Topic 

cmbergin
Starting Member

14 Posts

Posted - 2011-09-26 : 12:40:35
My weekly system database maintenance plan failed on the "check database integrity" step. Running CHECKDB this morning, I found corruption in MSDB.

I don't care about the data in sysmail_allitems; it's nice to have, but not a big deal. I need to have the shortest maintenance window possible, so if repair allow data loss is faster than a restore from last week's MSDB backup, I'd like to do that instead.

However, I'm concerned about the errors found not associated with any particular object. I don't know what might happen there.

In a related question, will my user databases remain online while restoring MSDB? I make use of database mail and some CLR procedures; if only those things will fail while MSDB is being restored, I can get around those failures by letting the users know what features will be disabled so they can work around it.

Anyway, sorry for the rambling. Here's the CHECKDB output:

Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:5126312) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:5126313) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:5126314) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 3 consistency errors not associated with any single object.
Msg 8929, Level 16, State 1, Line 1
Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049331200 (type In-row data): Errors found in off-row data with ID 155254784 owned by data record identified by RID = (1:5125138:1)
Msg 8928, Level 16, State 1, Line 1
Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data): Page (1:5126312) could not be processed. See other errors for details.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data). The off-row data node at page (1:5126312), slot 0, text ID 155254784 is referenced by page (1:5125138), slot 1, but was not seen in the scan.
Msg 8928, Level 16, State 1, Line 1
Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data): Page (1:5126313) could not be processed. See other errors for details.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data). The off-row data node at page (1:5126313), slot 0, text ID 155254784 is referenced by page (1:5125138), slot 1, but was not seen in the scan.
Msg 8928, Level 16, State 1, Line 1
Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data): Page (1:5126314) could not be processed. See other errors for details.
Msg 8965, Level 16, State 1, Line 1
Table error: Object ID 1454628225, index ID 1, partition ID 72057594044874752, alloc unit ID 72057594049462272 (type LOB data). The off-row data node at page (1:5126314), slot 0, text ID 155254784 is referenced by page (1:5125138), slot 1, but was not seen in the scan.
CHECKDB found 0 allocation errors and 7 consistency errors in table 'sysmail_mailitems' (object ID 1454628225).
CHECKDB found 0 allocation errors and 10 consistency errors in database 'msdb'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (msdb).

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-26 : 13:10:25
The damage is entirely in sysmail_mailitems. If you look, the page numbers for the three 'errors not associated with any single object' are the same page numbers for the errors in sysmail...

Do you know that last week's backup was clean?
As for repairing, maybe... http://sqlskills.com/BLOGS/PAUL/post/Is-running-repair-on-msdb-safe.aspx
If you're absolutely certain you don't need the mail items that will be removed then it's probably OK.

SQL agent will be down while MSDB repairs or restores. Probably service broker, definitely mail. CLR procedures should be fine (unless you have ones stored in MSDB). Backups may fail because they can't write history.

--
Gail Shaw
SQL Server MVP
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2011-09-26 : 13:28:51
"Backups may fail because they can't write history."

Just curious: If they are triggered by Sshduled Tasks will they not run anyway (whilst MSDB inaccessible)?
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-26 : 13:30:04
SQL Agent can't run without MSDB. Manually (or windows scheduled or 3rd party tool scheduled) backups may fail as well.

--
Gail Shaw
SQL Server MVP
Go to Top of Page

cmbergin
Starting Member

14 Posts

Posted - 2011-09-26 : 14:34:47
Thank you. I do know that last week's backup is clean because the maintenance plan is set up to check integrity first, then only take a backup if the integrity check is successful; the "verify backup integrity" option is turned on as well. I guess I could always restore the backup as a user database and run CHECKDB on that to be sure.

Thank you for the link. Since the damage is entirely to the mail items table, I feel safe running the repair option. We only use the mail records there to prove that certain messages were sent, and we've only needed to do that twice in the last three years. It's an acceptable data loss for us.

I read somewhere about a couple of techniques to see exactly what's going to be deleted by using DBCC PAGE or something like that. Perhaps I'll find one of those articles and see if I can extract the data the repair will delete just to be safe.

Thank you for your help!

PS - is there a way to estimate how long a repair will run based on the size of the data files or something like that? I know how long my backups typically take if those times are in any way related. I need to request an appropriately-sized maintenance window from my IT director.
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-26 : 14:59:18
Nope. It takes as long as it takes, usually longer than a normal checkDB. Not related to backup duration.

Have fun trying to get the raw data back... those are LOB columns on the damaged pages, and that table has multiple LOB columns included in it.

--
Gail Shaw
SQL Server MVP
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-26 : 15:00:58
p.s. Yes, you should restore and run checkDB to ensure that a backup is usable. The verify option just checks that the backup header is intact (unless you are backing up with checksum)

--
Gail Shaw
SQL Server MVP
Go to Top of Page

cmbergin
Starting Member

14 Posts

Posted - 2011-09-26 : 15:05:09
Alright. Thank you again. I'll definitely test the backup, then I'll probably just restore over msdb since I know roughly how long that takes.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2011-09-26 : 17:35:57
You could restore a fresh MSDB backup to A.N.Other machine (or A.N.Other database on main machine) and do a repair there - that might give you some confidence on the repair time.

Why doesn't MSDB run in Full Recovery model? If it did you'd probably get the whole lot back from TLog backups. Oh well ...

Is the cause of this likely to be a hardware issue? if so that needs to be found & fixed before it strikes again.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2011-09-26 : 17:40:22
quote:
Originally posted by GilaMonster

SQL Agent can't run without MSDB. Manually (or windows scheduled or 3rd party tool scheduled) backups may fail as well.



As I thought, thanks. So only got to worry about manually launched things, (we don't use 3rd party backups, but I'll try to remember that in case I'm working on a site where they do)

I might build into our adhoc "Make A Backup Sproc" to check that MSDB is available - it would be a bit neater than potentially generating a backup file that then cannot be logged in MSDB.
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-27 : 05:32:07
I did a quick test:

quote:
Processed 3760 pages for database 'Masters', file 'Masters' on file 1.
Processed 1 pages for database 'Masters', file 'Masters_log' on file 1.
Msg 924, Level 14, State 1, Line 1
Database 'msdb' is already open and can only have one user at a time.
Msg 3009, Level 16, State 1, Line 1
Could not insert a backup or restore history/detail record in the msdb database. This may indicate a problem with the msdb database. The backup/restore operation was still successful.
BACKUP DATABASE successfully processed 3761 pages in 1.418 seconds (20.721 MB/sec).


--
Gail Shaw
SQL Server MVP
Go to Top of Page

cmbergin
Starting Member

14 Posts

Posted - 2011-09-27 : 15:58:25
Thank you for your suggestion, Kristen. That makes a lot of sense.

With corruption, I normally assume something is wrong with the hardware and immediately send over a work order to the hardware group. It's only happened twice before, and both times it was something going wrong on the RAID controller. This time, they said the RAID cabinet has no errors or warnings, and there's nothing in the Windows event log, either. I'm not too surprised, as this was after a failover and subsequent failback to our passive server, and we use a third-party tool for that. I'd feel better using SQL Server's built-in HA options, but we only have Standard Edition.
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-27 : 16:09:22
Corruption is (in the vast majority of cases) a hardware error. If that 3rd party tool is replicating the data files, it could have made mistakes. What tool is it?

Something has scrambled the page IDs of 3 contiguous pages in MSDB. If they contain recent data, could be that their changes were not replicated correctly

--
Gail Shaw
SQL Server MVP
Go to Top of Page

cmbergin
Starting Member

14 Posts

Posted - 2011-09-27 : 16:47:38
We use DoubleTake, which I've found to be a little touchy. It works fine when it's a controlled failover for the purpose of moving to a new server with minimal downtime. I don't trust it to handle failovers during disasters, as in my experience, we're usually just left with a suspect database on the passive target and have to restore from backup anyway. My best guess is that the product is designed to work with two servers in close physical proximity, but we've got it replicating between two different data centers.

I, unfortunately, don't get to make the call regarding backup and HA solutions. My SQL Server backups are "unofficial" since, in theory, BackupExec is taking care of the database backups, but I don't see any reason to trust BackupExec, and plenty of reasons to trust native backups.
Go to Top of Page

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2011-09-27 : 17:18:53
Ok... Nice environment. Considered looking for somewhere a little saner?

--
Gail Shaw
SQL Server MVP
Go to Top of Page
   

- Advertisement -