Sunday, October 10, 2010

How to detect the cause of a deadlock

hi,

This is the situation I was in last week:

Four Ax processes (all batch threads) executing the same logic (executing the same class).
Different processes were blocking each other (deadlock situation).

All steps below were taken during the deadlock situation.

The quick and drastic solution would be to kill some processes in SQL server ... but it would be a matter of time before the same would happen again.
The better solution implied some processes were hanging in the production system during the research, but we did get the time to sort this out decently.

Overview of the SPID's, based on the online users form:
Ax session 14 -> SPID's 59, 66, 66, 6
Ax session 13 -> SPID's 67, 68, 69
Ax session 12 -> SPID's 57, 58, 54
Ax session 11 -> SPID's 56, 60, 65

A dump of the blocking sessions on SQL server (based on SP_WHO2, only the relevant SPID's - the ones blocked + the source SPID for the blocking - are shown here):


A quick look would point to SPID 57 as process causing the blocking.
I repeatedly ran 'SP_WHO2 57' over about 30 seconds, and found the CPUtime nor the diskIO changing over time.
On top SPID 57 has status 'sleeping'.

Another view on this situation, including the Ax sessions:

Ax session 14 is blocked by Ax session 12 (because SPID 66 is blocked by 57)
Ax sessoin 13 is blocked by Ax session 11 (because SPID 69 is blocked by 65)
Ax sessoin 12 is blocked by Ax session 11 (because SPID 54 is blocked by 65)
Ax sessoin 11 is blocked by Ax session 12 and 14 (because SPID 56 is locked by 57 and 65 is blocked by 66)

This very much looks like a deadlock situation : 12 blocked by 11, 11 blocked by 12
And even more: 11 is blocked by 14, and 14 is blocked by 12

Still: Ax session 12 with SPID 57 seems to be the cause of the problem
but SPID 57 is sleeping ... and not moving at all ... (remember CPUTime, DiskIO)

Since SPID 57 is sleeping, I'm guessing we need to look a the other SPID's from Ax session 12: SPID 54 is blocked by SPID 65, SPID 65 is blocked by SPID 66.
SPID 66 from Ax session 14 seems to be the one to focus on, because
- this is the one blocked by SPID 57 in Ax session 12.
- if we manage to unblock Ax session 14, Ax session 11 will most likely be able to continue (because Ax sessoin 11 is only blocked by SPID 66 from Ax session 14)
 ... which will resolve (or even better, if we manage to trace down the problem to it's origin: prevent the possibility on) the deadlock situation.

In the Queyr Analyser 'sp_lock 66' gives me:

So the one with status 'wait' is the interesting one.
This means SPID 66 wants to update (U) data in object 2062018477, but it's blocked by SPID 57 (SPID 66 blocked by SPID 57)

'sp_lock 57' in the Query Analyser gives us a whole lot more, but the only locks in common with SPID 66 are the ones on objectID 20062018477.

To find the locks that SPID's 57 and 66 have in common, I used the statement below :
(based on what's described at http://weblogs.sqlteam.com/mladenp/archive/2008/04/29/SQL-Server-2005-Get-full-information-about-transaction-locks.aspx)

SELECT locksForSPID57.*,
'<->' as '<->',
locksForSPID66.*
FROM
(SELECT  L.request_session_id AS SPID,
        DB_NAME(L.resource_database_id) AS DatabaseName,
        O.Name AS LockedObjectName,
        P.object_id AS LockedObjectId,
        L.resource_type AS LockedResource,
        L.request_mode AS LockType,
        L.request_type,
        L.request_status
FROM    sys.dm_tran_locks L
        JOIN sys.partitions P ON P.hobt_id = L.resource_associated_entity_id
        JOIN sys.objects O ON O.object_id = P.object_id
        JOIN sys.dm_exec_sessions ES ON ES.session_id = L.request_session_id
        JOIN sys.dm_tran_session_transactions TST ON ES.session_id = TST.session_id
        JOIN sys.dm_tran_active_transactions AT ON TST.transaction_id = AT.transaction_id
        JOIN sys.dm_exec_connections CN ON CN.session_id = ES.session_id
        CROSS APPLY sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
WHERE   resource_database_id = db_id()
AND L.request_session_id = '57') as locksForSPID57,
(SELECT  L.request_session_id AS SPID,
        DB_NAME(L.resource_database_id) AS DatabaseName,
        O.Name AS LockedObjectName,
        P.object_id AS LockedObjectId,
        L.resource_type AS LockedResource,
        L.request_mode AS LockType,
        L.request_type,
        L.request_status

FROM    sys.dm_tran_locks L
        JOIN sys.partitions P ON P.hobt_id = L.resource_associated_entity_id
        JOIN sys.objects O ON O.object_id = P.object_id
        JOIN sys.dm_exec_sessions ES ON ES.session_id = L.request_session_id
        JOIN sys.dm_tran_session_transactions TST ON ES.session_id = TST.session_id
        JOIN sys.dm_tran_active_transactions AT ON TST.transaction_id = AT.transaction_id
        JOIN sys.dm_exec_connections CN ON CN.session_id = ES.session_id
        CROSS APPLY sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
WHERE   resource_database_id = db_id()
AND L.request_session_id = '66') as locksForSPID66
WHERE locksForSPID57.DatabaseName = locksForSPID66.DatabaseName
AND locksForSPID57.LockedObjectId = locksForSPID66.LockedObjectId

I'll probably get jeered by the SQL experts for this kind of statement, but it does the job.

The next step then is to find out which table is behind object 2062018477.
'SELECT object_name(2062018477)' tells us this is the numberSequenceTTS table

This table is used by Ax for numbersequences defined as 'continuous'.

We would like to see the actual data that is being locked, but there seems to be no way to see the specific record(s) (KEY 6a018d6c5a90 I'd guess).
Therefor we're looking into the page (PAG 1:76393) that is not actually part of the problem (only KEY (ec0110e62862) with status WAIT is), but since SQL Server has this lock hierarchy (http://msdn.microsoft.com/en-us/library/ms189849.aspx) the page will include the data from KEY (ec0110e62862).
Again, if there's anyone out there who does know a way to identify wich specific data (resource type KEY or RID instead of complete pages) is held locked based on the sp_lock output ... please let me know!
For now, up to the data page:

DBCC traceon(3604) -- enables a specific trace flag to get the output in the console
DBCC PAGE(5, 1, 76393, 3) -- 5 = the ID of the DB we're working in, 1 is the datafile, 76393 the page, we get this from the SP_LOCK output
DBCC traceoff(3604) -- disables the specific trace flag (cover up your tracks)

A huge dump is created by 'DBCC PAGE' but we do recognize some things:
a compare between the 'TRANSID = ' lines from the dump and the records currently in the numberSequenceTTS table

SELECT *
FROM NumberSequenceTTS WITH (READUNCOMMITTED)

shows us 5652282410 and 5652282411 are the transID values to dig in to.
These are the onces tha appear in both the NumberSequenceTTS output and the page dump.
Or in other words: these are the records in use by SPID 57 while SPID 66 wants to use them as well.

This statement:
SELECT *
FROM NumberSequenceList WITH (READUNCOMMITTED)
WHERE EXISTS (SELECT RecId FROM NumberSequenceTTS WITH (READUNCOMMITTED) where NumberSequenceTTS.TRANSID = NumberSequenceList.TRANSID)
AND TRANSID != 0

tells us the lock is related to the TOB number sequence ... which is set to continuous .. but shouldn't be (or it is not strictly required anyway).

So the problem here is a number sequence that was set to be continuous and several processes fetching a new number from this number sequence simultaniously.
I adviced to make this specific number sequence not continuous and review all other continuous number sequences.

The solution is being applied ... fingers crossed ...

1 comment:

  1. I don't understand why there is a lock anyway.

    Why is process 57 sleeping? The correct behavior would be that other processes be blocked until process 57 is over, that I can understand; but 57 going to sleep? WDF!!

    I have the same behaviour by the way.. really sux!

    ReplyDelete