Friday, October 22, 2010

AOS and clustering

hi,

There seems to be quite some vagueness when it comes to Ax and AOS clustering.
A few statements to (hopefully) clear things up a bit:

Single AOS
The simplest configuration is an Ax implementation with one single AOS server.
Extra AOS servers can be added in scenarios where you want to increase availability or scale up the Ax environment to handle a larger number of users/connections.
Several possibilities from that point on:

AOS side by side
By default multiple AOS's in a single Ax environment will exist side by side. They do know about each others existence, sync their caches, try not to hamper each other too much,  …
A user connects to a specific AOS as defined in his client configuration. In case there is more than one AOS defined in the configuration the first available will be used.
Adding all available AOS's to one client config file would probably be a smart thing to do. If you only add one node of the cluster to your config file, and this one node is down, you won't be able to connect. If you add all - let's say - 4 nodes, the client config will connect to the first available.
Unless you want to control they way users connect and have a specific department or region connect to a specific AOS of course. I'm thinking latency, bandwidth, ...

Clustered AOS
You can create a load balancing cluster (administration - setup - cluster configuration) and link AOS instances to it. Again two options:

Cluster without load balancer
The naming here is a bit tricky, even if there is no dedicated load balancer set up in a cluster, there will still be load balancing: all AOS's in a cluster implicitly act as load balancers
A user connects to a any of the AOS's as defined in his client configuration. He cannot predict which AOS he will eventually host his session.
The only criteria maintained to balance the load is the number of clients (including business connectors, worker threads, ...) connected to an AOS. If a cluster has two nodes: AOS A and AOS B and A has 3 client connections and B has 4. The next connection will most likely be hosted by A, even if the connecting user's client config only has AOS B in the list of available AOS instances. Most likely, because if AOS A does not respond (in a fashionable time), it will try to connect to B.
In this same example it is possible the load (CPU, memory usage, ...) created by those 3 connections on AOS A is much higher than the load generated by the 4 connections on AOS B. Still the next connection will be hosted by A.
Therefore the client config should have all AOS's listed. You're not sure which AOS will host your session, but you do want all entry point available.

Cluster with load balancer
If you set up an AOS instance to be a load balancer (checkbox 'load balancer' in server configuration) you dedicate  this instance to spread incoming user connection among the non-load balancing AOS's. Ideally the one  with the lowest number of active connections in the cluster will host the first incoming connection. That's all the dedicated load balancer will do for you.
Frankly I don't see the point of setting up a dedicated load balancing server. Maybe when you're dealing with a huge number of concurrent users, logging on and off all the time and you don't want a 'real' AOS to waste time handling those? On the other side: a dedicated load balancer does not consume an AOS license. So it's actually free.

The client config would only contain the dedicated load balancing AOS instance(s), because that's the only valid entry point here. But then again, if your load balancer is down …
Therefore:
- Is it possible to directly connect to a non-load balancer AOS in an AOS cluster with dedicated load balancer? Yes it is.
- Does a non-load balancer AOS in an AOS cluster with a dedicated load balancer also balances load?
Yes it does.
So even if you've got a dedicated load balancer … setting the non-dedicated AOS's instances from the cluster in the client config may be a satisfying plan.
You can go all fancy an set up difference clusters on one Ax environment, for example to split the online and batch processing. Most likely you want client config files per AOS cluster to control who's connecting to which cluster.

Failover
A word on failover might also be in place. An AOS cluster does not mean failover capabilities. It just indicates two or more AOS instances are grouped. That's about it. No automatic moving of sessions from AOS A to AOS B if server A crashes, users will lose their connection to the AOS, manual restart of Ax client by user is required, …
All bad news? No ma'am: an AOS cluster makes sure the load is equally spread over the different nodes. You can add or remove nodes to the cluster online. But in fact … that's about it … and besides the load spreading, a cluster doesn't do more than a well considered config file does.



Other things you might not yet know regarding AOS's:
- Each active not-dedicated loadbalancing AOS instance requires one AOS license. You can install as many AOS instances as you wish, but the number of active instances is limited by the license. You cannot have more active AOS instances than your license allows on the same Ax environment.
- In the online users form you only get to see the SPID's of the users that are on the same AOS as you are
- You can prevent users to connect to an AOS using the 'reject new clients' button on the 'server instances' tab of the online users form. Hit 'accept new clients' to undo this.

Sunday, October 10, 2010

How to detect the cause of a deadlock

hi,

This is the situation I was in last week:

Four Ax processes (all batch threads) executing the same logic (executing the same class).
Different processes were blocking each other (deadlock situation).

All steps below were taken during the deadlock situation.

The quick and drastic solution would be to kill some processes in SQL server ... but it would be a matter of time before the same would happen again.
The better solution implied some processes were hanging in the production system during the research, but we did get the time to sort this out decently.

Overview of the SPID's, based on the online users form:
Ax session 14 -> SPID's 59, 66, 66, 6
Ax session 13 -> SPID's 67, 68, 69
Ax session 12 -> SPID's 57, 58, 54
Ax session 11 -> SPID's 56, 60, 65

A dump of the blocking sessions on SQL server (based on SP_WHO2, only the relevant SPID's - the ones blocked + the source SPID for the blocking - are shown here):


A quick look would point to SPID 57 as process causing the blocking.
I repeatedly ran 'SP_WHO2 57' over about 30 seconds, and found the CPUtime nor the diskIO changing over time.
On top SPID 57 has status 'sleeping'.

Another view on this situation, including the Ax sessions:

Ax session 14 is blocked by Ax session 12 (because SPID 66 is blocked by 57)
Ax sessoin 13 is blocked by Ax session 11 (because SPID 69 is blocked by 65)
Ax sessoin 12 is blocked by Ax session 11 (because SPID 54 is blocked by 65)
Ax sessoin 11 is blocked by Ax session 12 and 14 (because SPID 56 is locked by 57 and 65 is blocked by 66)

This very much looks like a deadlock situation : 12 blocked by 11, 11 blocked by 12
And even more: 11 is blocked by 14, and 14 is blocked by 12

Still: Ax session 12 with SPID 57 seems to be the cause of the problem
but SPID 57 is sleeping ... and not moving at all ... (remember CPUTime, DiskIO)

Since SPID 57 is sleeping, I'm guessing we need to look a the other SPID's from Ax session 12: SPID 54 is blocked by SPID 65, SPID 65 is blocked by SPID 66.
SPID 66 from Ax session 14 seems to be the one to focus on, because
- this is the one blocked by SPID 57 in Ax session 12.
- if we manage to unblock Ax session 14, Ax session 11 will most likely be able to continue (because Ax sessoin 11 is only blocked by SPID 66 from Ax session 14)
 ... which will resolve (or even better, if we manage to trace down the problem to it's origin: prevent the possibility on) the deadlock situation.

In the Queyr Analyser 'sp_lock 66' gives me:

So the one with status 'wait' is the interesting one.
This means SPID 66 wants to update (U) data in object 2062018477, but it's blocked by SPID 57 (SPID 66 blocked by SPID 57)

'sp_lock 57' in the Query Analyser gives us a whole lot more, but the only locks in common with SPID 66 are the ones on objectID 20062018477.

To find the locks that SPID's 57 and 66 have in common, I used the statement below :
(based on what's described at http://weblogs.sqlteam.com/mladenp/archive/2008/04/29/SQL-Server-2005-Get-full-information-about-transaction-locks.aspx)

SELECT locksForSPID57.*,
'<->' as '<->',
locksForSPID66.*
FROM
(SELECT  L.request_session_id AS SPID,
        DB_NAME(L.resource_database_id) AS DatabaseName,
        O.Name AS LockedObjectName,
        P.object_id AS LockedObjectId,
        L.resource_type AS LockedResource,
        L.request_mode AS LockType,
        L.request_type,
        L.request_status
FROM    sys.dm_tran_locks L
        JOIN sys.partitions P ON P.hobt_id = L.resource_associated_entity_id
        JOIN sys.objects O ON O.object_id = P.object_id
        JOIN sys.dm_exec_sessions ES ON ES.session_id = L.request_session_id
        JOIN sys.dm_tran_session_transactions TST ON ES.session_id = TST.session_id
        JOIN sys.dm_tran_active_transactions AT ON TST.transaction_id = AT.transaction_id
        JOIN sys.dm_exec_connections CN ON CN.session_id = ES.session_id
        CROSS APPLY sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
WHERE   resource_database_id = db_id()
AND L.request_session_id = '57') as locksForSPID57,
(SELECT  L.request_session_id AS SPID,
        DB_NAME(L.resource_database_id) AS DatabaseName,
        O.Name AS LockedObjectName,
        P.object_id AS LockedObjectId,
        L.resource_type AS LockedResource,
        L.request_mode AS LockType,
        L.request_type,
        L.request_status

FROM    sys.dm_tran_locks L
        JOIN sys.partitions P ON P.hobt_id = L.resource_associated_entity_id
        JOIN sys.objects O ON O.object_id = P.object_id
        JOIN sys.dm_exec_sessions ES ON ES.session_id = L.request_session_id
        JOIN sys.dm_tran_session_transactions TST ON ES.session_id = TST.session_id
        JOIN sys.dm_tran_active_transactions AT ON TST.transaction_id = AT.transaction_id
        JOIN sys.dm_exec_connections CN ON CN.session_id = ES.session_id
        CROSS APPLY sys.dm_exec_sql_text(CN.most_recent_sql_handle) AS ST
WHERE   resource_database_id = db_id()
AND L.request_session_id = '66') as locksForSPID66
WHERE locksForSPID57.DatabaseName = locksForSPID66.DatabaseName
AND locksForSPID57.LockedObjectId = locksForSPID66.LockedObjectId

I'll probably get jeered by the SQL experts for this kind of statement, but it does the job.

The next step then is to find out which table is behind object 2062018477.
'SELECT object_name(2062018477)' tells us this is the numberSequenceTTS table

This table is used by Ax for numbersequences defined as 'continuous'.

We would like to see the actual data that is being locked, but there seems to be no way to see the specific record(s) (KEY 6a018d6c5a90 I'd guess).
Therefor we're looking into the page (PAG 1:76393) that is not actually part of the problem (only KEY (ec0110e62862) with status WAIT is), but since SQL Server has this lock hierarchy (http://msdn.microsoft.com/en-us/library/ms189849.aspx) the page will include the data from KEY (ec0110e62862).
Again, if there's anyone out there who does know a way to identify wich specific data (resource type KEY or RID instead of complete pages) is held locked based on the sp_lock output ... please let me know!
For now, up to the data page:

DBCC traceon(3604) -- enables a specific trace flag to get the output in the console
DBCC PAGE(5, 1, 76393, 3) -- 5 = the ID of the DB we're working in, 1 is the datafile, 76393 the page, we get this from the SP_LOCK output
DBCC traceoff(3604) -- disables the specific trace flag (cover up your tracks)

A huge dump is created by 'DBCC PAGE' but we do recognize some things:
a compare between the 'TRANSID = ' lines from the dump and the records currently in the numberSequenceTTS table

SELECT *
FROM NumberSequenceTTS WITH (READUNCOMMITTED)

shows us 5652282410 and 5652282411 are the transID values to dig in to.
These are the onces tha appear in both the NumberSequenceTTS output and the page dump.
Or in other words: these are the records in use by SPID 57 while SPID 66 wants to use them as well.

This statement:
SELECT *
FROM NumberSequenceList WITH (READUNCOMMITTED)
WHERE EXISTS (SELECT RecId FROM NumberSequenceTTS WITH (READUNCOMMITTED) where NumberSequenceTTS.TRANSID = NumberSequenceList.TRANSID)
AND TRANSID != 0

tells us the lock is related to the TOB number sequence ... which is set to continuous .. but shouldn't be (or it is not strictly required anyway).

So the problem here is a number sequence that was set to be continuous and several processes fetching a new number from this number sequence simultaniously.
I adviced to make this specific number sequence not continuous and review all other continuous number sequences.

The solution is being applied ... fingers crossed ...