Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 SQL Server Administration (2005)
 Resolve Strange Deadlocks

Author  Topic 

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-02 : 04:59:12
Hi all,

Sorry in advance for the lengthy post:

I've got a weird situation that I can't figure out. Here's the situation.

There are 3 separate stacks comprising of

1 application server (running linux with our java application)
1 SQL SERVER 2005 Database Server

The hardware of the stacks is the same (HP Proliant servers)
The operating systems and the sql server setup are the same. (same tempddb settings, RAID setup, same everything)

On each of the database servers we have multiple databases (maybe 20 per stack) (1 for each of our clients). Each database has an associated SQL SERVER AGENT job set up that runs at 1AM every day which does various things to implement our business logic. (future dated changes to employee data etc). The databases have the same schemas but different data.

At 1AM no-one is going to be using our systems barring special cases - the load on severs at that time is non-existent.

Here's the problem.

On one of the stacks (stack 3) we get sporadic deadlock issues with the sql server agent jobs. There seems to be no rhyme or reason for it - but the jobs all deadlock in step 1 which handles future dated data changes (for address / name / salary / whatever). We never get deadlocks on the other stacks even though everything is the same.

Deadlock frequency seems to be random. We'll have days, or weeks, when there are no deadlocks and suddenly (like last night) 15 clients deadlocks during this step.

What I've done so far:

1) Enclosed the step in a TRY / CATCH block that attempts to email me the details of the deadlock using a VIEW. However, I don't get any indication of the object locked and the sql text I return always is just the sql of the TRY / CATCH block.

2) partially rewritten step 1 to make is SLIGHTLY less horrible - it basically makes a cursor and derives dynamic sql updates from an approval event table. I'm made sure that the cursor is local.

What I'm going to do:
===================================================================
Run profiler on the db server from 1 to 2. I didn't want to do this because of the randomness of the problem.



Do you guys have any advice?


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

abeman
Starting Member

3 Posts

Posted - 2009-12-02 : 06:12:57
i got one idea as we had the same sort of issue with our software.

Have you looked at SIDS being re-used due to muti- threading in SQL server?

Is there any chance an application is reusing the same SID , causing the deadlock?

Go to Top of Page

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-12-02 : 07:02:39
try to staggger the servers to work at 1am, 2am & 3am.... to totally eliminate any possibility of cross-server interference.

can you also stagger the database updates to eliminate cross-database links.

when all else fails, grasp at straws....or post on the internet.
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-02 : 08:24:23
quote:
Originally posted by abeman

i got one idea as we had the same sort of issue with our software.

Have you looked at SIDS being re-used due to muti- threading in SQL server?

Is there any chance an application is reusing the same SID , causing the deadlock?




Cheers for the suggestion.

I guess I'll see that in profiler tonight (even if there are no deadlocks).


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-02 : 08:26:46
quote:
Originally posted by AndrewMurphy

try to staggger the servers to work at 1am, 2am & 3am.... to totally eliminate any possibility of cross-server interference.

can you also stagger the database updates to eliminate cross-database links.

when all else fails, grasp at straws....or post on the internet.


Yup -- at that stage!

I was going to stagger the jobs if all else failed but 1am - 2am is our downtime window. The jobs can take some time if there happens to be a lot of data to change.

That wouldn't explain why the problems all happen on only one box though.

But yeah -- try everything!


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2009-12-02 : 08:40:15
Do these jobs operate on data only on their own server or do they all perform operations on the same set of data? If the former, staggering the job times won't help.

Are the reads and writes occurring against the same tables or are you reading from one and writing to another?

Is the cursor read only?

Is the select statement in the cursor utilizing an index?

Sure nothing else is happening at the same time on server that may be contributing to the deadlocks?
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-02 : 11:25:52
quote:
Originally posted by russell

Do these jobs operate on data only on their own server or do they all perform operations on the same set of data? If the former, staggering the job times won't help.

Are the reads and writes occurring against the same tables or are you reading from one and writing to another?

Is the cursor read only?

Is the select statement in the cursor utilizing an index?

Sure nothing else is happening at the same time on server that may be contributing to the deadlocks?



Cheers Russel,

The jobs all operate on their own data. Each database has an [approvalEventRow] table that keeps track of changes local to that database. The changes are things like name changes, salary changes, grade changes, etc. Once the effective date has been reached then the job generates UPDATE statements to change the relevant data in the real tables.

The UPDATES are blind. The real tables aren't read from just updated by the statements. In general only a few tables will ever be touched. the [Employee] table, the [EmployeeCompensation] table and the [employeeDependant] table.

The loop logic is a nested CURSOR (urgh I know) The CURSORS are


-- Outside loop
DECLARE eventCursor CURSOR LOCAL READ_ONLY FOR SELECT
[id]
, [employeeId]
, [effectiveDate]
FROM
ApprovalEvent
WHERE
[status] IN ('P', 'A')
AND [effectiveDate] < @tomorrow
AND [companyId] = @companyId
AND (
[employeeID] = @employeeID
OR @employeeID IS NULL
)

-- Inside loop
DECLARE eventRowCursor CURSOR LOCAL READ_ONLY FOR SELECT
[tableName]
, [idValue]
, [columnName]
, [newValue]
, [oldValue]
FROM
ApprovalEventRow
WHERE
[eventId] = @eventId
ORDER BY
[tableName]
, [idValue]

There are a mess of indices on the tables. Over the years developers have tended to throw an index on one or more columns whenever they remembered that indices existed.

However, the deadlock can happen even when the outer cursor returns no rows (there's no event to be processed).

I've modified these cursors (making them LOCAL) in case that was the problem. Before they were just declared with no additional flags - it hasn't seemed to make much difference

quote:

Sure nothing else is happening at the same time on server that may be contributing to the deadlocks?

99% sure. I'm going to set up a profiler task to run between 1 and 2 am and see what's what.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2009-12-02 : 13:04:25
where do the updates occur in all this? also should check the exec plan on those queries.

profiler may reveal the problem tonight
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-12-02 : 13:56:42
I haven't read the whole thread, but you may want to consider using READ_COMMITTED_SNAPSHOT isolation level to avoid deadlocks between writes and reads. If your deadlocks are between writes and writes, that setting will not help. But it is the recommended isolation level for OLTP system, so it is worth it to change it anyway. The default isolation level is READ_COMMITTED, which causes performance issues with reads blocking writes.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-03 : 04:11:02
Thanks for the suggestion Tara - that's something I can set on the individual batch isn't it?

SET TRANSACTION ISOLATION LEVEL READ_COMMITTED_SNAPSHOT
...
...
<stuff>

The only problem is that testing such changes is really hard when the problem is so intermittent. I'll change the isolation level and see though.


Russell -- sadly no deadlocks happened last night. Even though nothing is different from yesterday (that I can see)...

I'll keep on tracing.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-12-03 : 05:06:56
1. whats different is the portion of the database that is updated each night. bacause of the future dated transactions I suspect different employees are beign updated each night. any chance some employees are having simialr updtes on problem nights.


2. can you load a "problem" night into a test environment and re-run there? you should get the same results if the environment & machine power is the same
Go to Top of Page

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2009-12-03 : 07:37:16
here's a method that will tell you about deadlocks and give yo uthe whole report without you having to run profiler and strain the system. it can even send you an email.

http://weblogs.sqlteam.com/mladenp/archive/2008/07/18/Immediate-deadlock-notifications-without-changing-existing-code.aspx

note that you have to enable service broker on the monitored database

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.7 out!
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-03 : 07:54:29
quote:
Originally posted by spirit1

here's a method that will tell you about deadlocks and give yo uthe whole report without you having to run profiler and strain the system. it can even send you an email.

http://weblogs.sqlteam.com/mladenp/archive/2008/07/18/Immediate-deadlock-notifications-without-changing-existing-code.aspx

note that you have to enable service broker on the monitored database

___________________________________________________________________________
Causing trouble since 1980
Blog: http://weblogs.sqlteam.com/mladenp
Speed up SSMS development: www.ssmstoolspack.com <- version 1.7 out!



Hi Spirit1

Yeah -- I did look at that and wasn't allowed to enable service broker . I did try using a TRY CATCH method and do get mails but unfortunately they aren't very helpful as they just contain the sql text of the stored proc that makes up the email


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-03 : 07:59:28
quote:
Originally posted by AndrewMurphy

1. whats different is the portion of the database that is updated each night. bacause of the future dated transactions I suspect different employees are beign updated each night. any chance some employees are having simialr updtes on problem nights.


2. can you load a "problem" night into a test environment and re-run there? you should get the same results if the environment & machine power is the same




Hi Andrew.

Sadly that doesn't work.

I've tested getting a database backup from just before (backup jobs run at 2200 -- nothing happened between then and 1AM)

I then restored the backup onto the same database server (with a different name) and added the overnight job. Then if I run the overnight job on that database it works fine. no deadlocks.

A deadlock can occur even if there are no changes to be made . It doesn't seem to matter if there are valid changes or not. That's why I thought there must have been some other process running at 1am but I can't find anything.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-03 : 08:06:58
Hi Spirit1. Can you advice if there is a glaring error in my TRY CATCH block and associated mailing sp's?

Here's the sql job

DECLARE
@Body VARCHAR(4000)
, @From VARCHAR(50)
, @To VARCHAR(50)
, @Subject VARCHAR(100)
, @companyId INT
, @tomorrow DATETIME
, @error INT
, @message XML
, @ErrorNumber INT
, @ErrorMessage NVARCHAR(2048)
, @ErrorSeverity INT
, @errorState INT
, @MailBody NVARCHAR(MAX)
SET @companyId = 3
SET @tomorrow = CONVERT(VARCHAR(50), DATEADD(DAY, 1, GETDATE()), 101)
EXEC logStartFlexJobAudit @companyId, 'adminEffectEvents', 1
BEGIN TRAN
BEGIN TRY
EXEC @error = adminEffectEvents @companyId, @tomorrow
COMMIT
IF @error <> 0 BEGIN
SELECT @Body = 'The overnight process has failed during the adminEffectEvents step, with return code '+CONVERT(VARCHAR(50),@error)
SELECT @From = '<REMOVED>'
SELECT @To = '<REMOVED>'
SELECT @Subject = 'SQLMail: Overnight Job Failure - [' + @@SERVERNAME + '] ' + DB_NAME()
EXEC master..sp_send_cdosysmail @From, @To, @Subject, @Body
END
END TRY
BEGIN CATCH
SELECT
@error = ERROR_NUMBER()
, @errorMessage = ERROR_MESSAGE()
, @ErrorSeverity = ERROR_SEVERITY()
, @ErrorState = ERROR_STATE()

IF @error = 1205 BEGIN
SET @subject = 'OVERNIGHT DEADLOCK ON ' + @@SERVERNAME + '.' + DB_NAME()
EXEC master.dbo.lockMailer
@subject
, '<REMOVED>'
END
RAISERROR(@errorMessage, @errorSeverity, @errorState)
ROLLBACK
RETURN
END CATCH
EXEC logCompleteFlexJobAudit @companyId, 'adminEffectEvents', @error, 1


Here's the lockMailer

ALTER PROCEDURE [dbo].[lockMailer]
@subject NVARCHAR(255)
, @recipient NVARCHAR(255)
AS BEGIN
DECLARE
@spid INT
, @dbname NVARCHAR(255)
, @lockedObjNAme NVARCHAR(512)
, @sqlText NVARCHAR(MAX)
, @loginName NVARCHAR(512)
, @hostName NVARCHAR(512)
, @isUserTran BIT
, @transName NVARCHAR(512)
, @msgSubject VARCHAR(2000)
, @msgBody VARCHAR(8000)
, @uniqueBatch UNIQUEIDENTIFIER
, @from VARCHAR(255)

SET @uniqueBatch = NEWID()
SET @from = '<REMOVED>'

DECLARE lockCursor CURSOR LOCAL FOR SELECT
ISNULL([SPID], -1)
, ISNULL([DatabaseName], 'NULL')
, ISNULL([LockedObjectName], 'NULL')
, ISNULL([SqlStatementText], 'NULL')
, ISNULL([LoginName], 'NULL')
, ISNULL([HostName], 'NULL')
, ISNULL([IsUserTransaction], 1)
, ISNULL([TransactionName], 'NULL')
FROM
master.dbo.vLocks
ORDER BY
[spid]

OPEN lockCursor

FETCH NEXT FROM lockCursor INTO
@spid
, @dbname
, @lockedObjNAme
, @sqlText
, @loginName
, @hostName
, @isUserTran
, @transName

WHILE ( @@FETCH_STATUS = 0 ) BEGIN

SET @msgSubject = @subject
+ ' (BATCH: '
+ CAST(@uniqueBatch AS VARCHAR(64))
+ ') (SPID: '
+ CAST(@spid AS NVARCHAR(255))
+ ')'

SET @msgBody = '
<body>
<p>
<table>
<tr><td>SPID: ' + CAST(@spid AS VARCHAR(255)) + '</td></tr>
<tr><td>DBNAME: ' + @dbName + '</td></tr>
<tr><td>LOCKED: ' + @lockedObjName + '</td></tr>
<tr><td>HOST: ' + @hostName + '</td></tr>
<tr><td>USER TRAN: ' + CASE @isUserTran WHEN 1 THEN 'YES' ELSE 'NO' END + '</td></tr>
<tr><td>TRAN NAME: ' + @transName + '</td></tr>
<tr><td>SQL: ' + CAST(@sqlText AS VARCHAR(1000)) + '</td></tr>
</table>
</p>
</body>'

-- Send the message
EXEC master.dbo.sp_send_cdosysmail
@from
, @recipient
, @msgSubject
, @msgBody

FETCH NEXT FROM lockCursor INTO
@spid
, @dbname
, @lockedObjNAme
, @sqlText
, @loginName
, @hostName
, @isUserTran
, @transName
END

CLOSE lockCursor
DEALLOCATE lockCursor
END


And finally the view

USE [master]
GO
/****** Object: View [dbo].[vLocks] Script Date: 12/03/2009 13:07:56 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER VIEW [dbo].[vLocks] AS SELECT
L.[request_session_id] AS SPID
, DB_NAME(L.[resource_database_id]) AS DatabaseName
, O.[Name] AS LockedObjectName
, P.[object_id] AS LockedObjectId
, L.[resource_type] AS LockedResource
, L.[request_mode] AS LockType
, ST.[text] AS SqlStatementText
, ES.[login_name] AS LoginName
, ES.[host_name] AS HostName
, TST.[is_user_transaction] AS IsUserTransaction
, AT.[name] AS TransactionName
FROM
sys.dm_tran_locks L
LEFT JOIN sys.partitions P ON P.[hobt_id] = L.[resource_associated_entity_id]
LEFT JOIN sys.objects O ON O.[object_id] = P.[object_id]
LEFT JOIN sys.dm_exec_sessions ES ON ES.[session_id] = L.[request_session_id]
LEFT JOIN sys.dm_tran_session_transactions TST ON ES.[session_id] = TST.[session_id]
LEFT JOIN sys.dm_tran_active_transactions AT ON TST.[transaction_id] = AT.[transaction_id]
LEFT JOIN sys.dm_exec_requests ER ON AT.[transaction_id] = ER.[transaction_id]

CROSS APPLY sys.dm_exec_sql_text(ER.[sql_handle]) AS ST
WHERE
resource_database_id = db_id()


With this I do get multiple mails but the sqltext is always the text of dbo.lockmailer which isn't helpful!


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-12-03 : 12:23:51
quote:
Originally posted by Transact Charlie

Thanks for the suggestion Tara - that's something I can set on the individual batch isn't it?

SET TRANSACTION ISOLATION LEVEL READ_COMMITTED_SNAPSHOT
...
...
<stuff>

The only problem is that testing such changes is really hard when the problem is so intermittent. I'll change the isolation level and see though.



I recommend doing this at the database level and not at the query level. It does require downtime though since the database needs exclusive access during the change. You can make the change via ALTER DATABASE.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-04 : 04:27:11
quote:
Originally posted by tkizer

quote:
Originally posted by Transact Charlie

Thanks for the suggestion Tara - that's something I can set on the individual batch isn't it?

SET TRANSACTION ISOLATION LEVEL READ_COMMITTED_SNAPSHOT
...
...
<stuff>

The only problem is that testing such changes is really hard when the problem is so intermittent. I'll change the isolation level and see though.



I recommend doing this at the database level and not at the query level. It does require downtime though since the database needs exclusive access during the change. You can make the change via ALTER DATABASE.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."



Cheers for the advice.

Can you advise if setting the isolation level for the batch would allow me to test this rather than jumping on and changing the default isolation level?

Changing the isolation level settings on the database proper is something that I'd have to argue for. I'm not a DBA (we don't really have one) - just the guy that has to deal with stuff like this. I'd have to argue the case with out infrastructure team and they will (quite rightly) be a little concerned about this for the downtime alone.

Looking at the docs for 2005 it says that "READ_COMMITED_SNAPSHOT database option is set OFF, which is the default setting, read committed isolation behaves as it did in earlier versions of SQL Server. Both implementations meet the ANSI definition of read committed isolation."

I take it that means that if a certain block of code has no problems at the current isolation level (READ_COMMITED) then it will * definitely* still have no problems at (READ_COMMITED_SNAPSHOT)?

I'm not doubting your advice at all. I'd just like to know that it will help before arguing for such a change.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-12-04 : 10:58:32
I would assume you could test it at the query level, but I've never tried that. Just to repeat what I earlier said, switching the isolation level will only help if your deadlocks are between reads and writes. If they are between writes and writes, then it will not help. Have you identified the deadlock victim and owner via SQL Profiler or the trace flag?

We haven't had any issues with READ_COMMITTED_SNAPSHOT. We are running it on very mission critical systems with a heavy load (3000 queries per second). The recommendation to switch from the default to READ_COMMITTED_SNAPSHOT is actually being made by Microsoft. Each time I have opened a case with them regarding deadlocks and performance issues, their first question is always if we are running READ_COMMITTED_SNAPSHOT. After a while, we got the hint and start switching to it.

The downtime needed is only a few seconds.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2009-12-04 : 12:18:24
quote:
I would assume you could test it at the query level, but I've never tried that. Just to repeat what I earlier said, switching the isolation level will only help if your deadlocks are between reads and writes. If they are between writes and writes, then it will not help. Have you identified the deadlock victim and owner via SQL Profiler or the trace flag?

Sadly (or not!) we haven't had any deadlocks since I started profiling. I've got no clue what's causing the deadlock or even which object is the point of contention :(

What's the trace flag you mention?

I've set the isolation level for the batch but obviously is is hard to prove a negative.

I'll get us switched into READ_COMMITTED_SNAPSHOT on our client testing environment soon and then hopefully make the switch to live after a week or so.

BTW -- is there any reason why the documentation for SQL server 2005 has both

READ_COMMITED_SNAPSHOT

and

READ_COMMITTED_SNAPSHOT

as options?

is COMMITED an American spelling?


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-12-04 : 12:23:53
The trace flag is 1222. You'll get deadlock output in the Error Log if you enable it that way you don't need to constantly run a trace. DBCC TRACEON(1222, -1)

As far as I know, the command to run to switch the isolation level is:
ALTER DATABASE dbName SET READ_COMMITTED_SNAPSHOT ON

To see what databases have it, you check the is_read_committed_snapshot_on column in sys.databases.

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."
Go to Top of Page
   

- Advertisement -