My first blog entry is going to be on an issue we have had with a customer a few times since early summer. Their CA Service Desk Manager installation is configured with Advanced Availability with BackGround and StandBy servers and 5 application servers.
Last week we had the 3rd occurrence of SDM becoming totally non-responsive. When checking the individual servers nothing seemed wrong, all processes running and nothing obvious in the logs either. Taking a closer look at the SQL Servers revealed that there was a lock on the call_req table. More than 180 queries waiting for the lock to be released. The query we executed to view locks, revealed that an ALTER INDEX [call_req_x1] ON [mdb].[dbo].[call_req] REBUILD WITH [MAXDOP=1] was the culprit.
This statement is part of a daily maintenance routine that has been set up by the customers SQL Team. It will check indexes for fragmentation and determine what to do with the index. Low fragmentation and nothing is done, medium fragmentation and the index is reorganized, high fragmentation and the index is rebuilt. The customer uses SQL Server 2012 Enterprise Edition because it has a feature to do index maintenance in the background with no need to stop the application(s) using the database being optimized. However, the optimization process does lock the object when it is replacing the fragmented index with the newly build index. The build process is happening in the background, but it is the actual replacing the index that puts on the lock. If the replacement coincides with one or more reads or writes of the object, the object becomes locked.
When the lock is resolved by terminating the locking process (killing the sql script), SDM springs into immediate action again and nothing more needs to be done. There might be a few things one wants to check such as Animator for any missed events etc., but nothing major.
So, the question now is how do we mitigate this moving forward? The options I see are few.
We can recommend the customer to shut down SDM while doing SQL maintenance. Not the best when the operation is 24/7 and Global.
Next option is to find out if there is a time of the day when there are fewer activities in SDM and thus minimizing the risk of collision. So, analyzing the activity level on the call_req table (the cr object in SDM) show that the Index optimization job runs in the early morning hours where there are a lot of users logging into SDM and still active Analysts in the Asia Pacific region. On the day of the last incident, there were close to 300 update/hour on the call_req table. Probably 150 or more Analysts logging into SDM during the same time. So, all in all early morning hours should be avoided for index optimization.
The analysis shows that around midnight the activity level is low, so this is where the index optimization is going to be rescheduled for.
Updates may follow id we see the issue again with this customer…