Resolve hang issue -- CA Disk DMSAR

Idea created by aw-bmw on Aug 8, 2016

    Due to an issue with high business impact, we raised a ticket with CA to get a bug fixed that we discovered during post-mortem analysis.


    Unfortunately, after long discussions, CA development stated that this is not a bug: the scenario is not supported by design.


    As we're convinced that no one should encounter issues that might lead to production problems, we're going to explain the situation and the solution below.


    If you're not familiar with the technical details, just quickly jump to "5. Real life analogue" which gives you an abstract description that is very easy to understand.


      1. Background and Impact Summary
      2. Management Summary
      3. Technical Summary
      4. Problem Solving
      5. Real life analogue -- read this for a quick understanding


    1. Background and Impact Summary


    One day we had system critical problems (enqueue, “master catalog”) that caused the whole Sysplex to get into trouble. Besides "ordinary jobs" lots of DMSARs (on demand restore jobs) were also affected.


    While one of the systems in the Sysplex became almost unresponsive (commands entered on the console in order to cancel jobs were not processed anymore), it was still possible to kill jobs from another Sysplex member via Mainview -- which we did.


    This saved us -- and the bad LPAR became responsive again. Without this approach we would have had to IPL at least one of the LPARs (which would have caused very high business impact).


    So far so good, but then the big trouble began:


    As DMSAR jobs had also been removed from the system, _all_ dependent jobs started hanging; users waiting for restores were blocked and lost their unsaved data.


    Lots of jobs were still waiting for a dataset restore, but their related DMSAR no longer existed. During the next batch run all new jobs blocked, as the old ones were still in the system, waiting forever and doing nothing…


    This caused a real business impact, as thousands of jobs needed to be investigated for orphaned DMSAR relations -- because the CA Disk client did not terminate.



    2. Management Summary


    CA Disk is a classic client-server architecture for managing disk/tape data. If a user wants to read a dataset that is archived, the "CA Disk client" starts a DMSAR for backend retrieval. So far, this works without hassle.


    But as soon as the server process (DMSAR) has any problem or you need to kill it, the dependent clients (thus any batch job and any user session) will start hanging forever! They don't realise the server has died. As a consequence (both happened to us) you'll either run into serious batch problems or users are going to lose data.


    If the "CA Disk client" talked(!) to the server process (and did not just "fire and forget") -- which is BTW common practice in every client-server constellation -- no business impact due to endless waiting would occur.


    Please keep in mind, we're not talking about a single system with 10 users. We have several Sysplex systems with DB2 (several thousand databases), IMS, CICS, millions of batch jobs etc. -- all business critical processes. It’s clear that restarting or shutting down such a system should be avoided as far as possible.


    Perhaps the "real life analogue" from section 5 helps to understand the problem in a more abstract way. Probably no one would accept this behaviour there.



    3. Technical Summary


    As described above, one day we ran into system problems -- please read that section first for the general background.


    We used the Kill command, which terminates address spaces through MEMTERM. As Sysplex activity was endangered and one system was not responding (and everybody will clearly understand that running an IPL on a production system during prime time should be avoided as far as possible), this was the only way to remove address spaces.


    The problem is that the "CA Disk client" does not notice when the corresponding DMSAR address space is removed. Every time DMSAR is not responding for any reason (removed, hanging etc.) you will end up with hanging clients. And as they do not check for the serving process, they will wait endlessly.


    They are not terminating! Never ever.


    This becomes problematic for users (as their session is lost, and all unsaved data with it) and for batch jobs (as new ones will not start -- the old job is still "running"). If you only have one or two jobs, you can handle them manually, but if you have a high-load batch system, you're in big trouble.


    CA stated that the current design does not support MEMTERM, only force/arm. Looking at the situation above, one will clearly see that this did not work.


    The "real life analogue" (see below) might give a good understanding of how unacceptable that is in the real world.



    4. Problem Solving


    As so often, it's unbelievable how simple a solution can be that prevents high-impact issues.


    The following is not just an idea of ours; it's a common solution implemented millions of times in program code across all platforms.


    After firing the "retrieve dataset from tape" request, and while waiting for the answer from the server, the “CA Disk client” periodically polls the DMSAR server process with "are you there?" -- a classic keep alive.


    If the keep alive is not answered, the “CA Disk client” treats the DMSAR server as dead and terminates with a corresponding error…


    Thus the client will terminate(!) and not hang forever if the server has any problem or simply does not exist anymore.
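

    The keep alive described above can be sketched in a few lines. This is only an illustration in Python, not CA Disk code: the names wait_for_restore, restore_done, server_alive and ServerGoneError are hypothetical, and how a real client would actually probe the DMSAR address space is left open.

```python
import time
from typing import Callable


class ServerGoneError(RuntimeError):
    """Raised when the (hypothetical) restore server stops answering keep alives."""


def wait_for_restore(
    restore_done: Callable[[], bool],
    server_alive: Callable[[], bool],
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait until the restore completes, or fail if the server is gone.

    restore_done and server_alive are placeholders for whatever checks a
    real client would perform; they are assumptions made for this sketch.
    """
    while not restore_done():
        if not server_alive():
            # Keep alive unanswered: treat the server as dead and stop
            # waiting, so the caller ends with an error instead of hanging.
            raise ServerGoneError("restore server no longer responds")
        sleep(poll_interval)
```

    The point of the sketch is only that the wait loop has a second exit: the client either sees the restore finish or ends with an error that operations can act on.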


    It's superfluous to mention that in this case

      • The operations team can handle all jobs in error according to the instructions
      • Users will not lose any data, as their session will not be blocked endlessly by the “CA Disk client”
      • Batch jobs will not hang, and new ones will not collide with old hanging ones


    The details of the polling algorithm are (of course) subject to discussion (frequency, timeout, combination etc.), but any implementation of a keep alive mechanism is better than what we have today.
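

    To make the open parameters concrete, here is one possible combination as a Python sketch. All names and default values (KeepAlivePolicy, interval_s, reply_timeout_s, max_missed) are assumptions chosen for illustration; nothing here reflects an existing CA Disk setting. The server is declared dead only after several consecutive unanswered probes, so a single slow reply does not abort a restore.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KeepAlivePolicy:
    interval_s: float = 10.0      # pause between two probes (assumed value)
    reply_timeout_s: float = 3.0  # how long to wait for one answer (assumed value)
    max_missed: int = 3           # consecutive misses before giving up (assumed value)

    def worst_case_detection_s(self) -> float:
        # Assumes each probe starts interval_s after the previous one ended,
        # and that every failed probe uses up its full reply timeout.
        return self.max_missed * (self.interval_s + self.reply_timeout_s)


def server_considered_dead(probe_results: Iterable[bool], policy: KeepAlivePolicy) -> bool:
    """Return True once max_missed consecutive probes went unanswered."""
    missed = 0
    for answered in probe_results:
        missed = 0 if answered else missed + 1
        if missed >= policy.max_missed:
            return True
    return False
```

    With these example values a dead server would be noticed after roughly 39 seconds; shorter intervals detect failures faster at the price of more polling.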


    Simple, clean, highly efficient.



    5. Real life analogue


    Imagine, you’re visiting a butcher shop.


    You enter the store and tell the sales person that you would like to buy some minced meat.


    The sales person tells you that she has nothing fresh on the counter and that it will need to be prepared. “Would you like to wait until it is ready?”, she asks.


    “Yes, of course”, you answer.


    The sales person at the counter calls the assistant in the background: “Please prepare 500 g of minced meat.”


    And then you wait for the meat to be prepared… meanwhile the sales person grabs your arms, holds them tight and does not move (waiting for the assistant’s response).


    In the meantime the assistant in the background has a problem. Maybe they have fainted, cut their finger, been kidnapped etc. Whatever the reason may be, they are not able to talk to the sales person anymore.


    And the sales person is waiting… (for a response from the assistant) … still holding your arms, still not moving.


    And you are waiting… (for the sales person)


    What would you do in real life? Would you wait? If so, for how long?


    Ah, of course, you would ask the sales person -- and the sales person would ask the assistant.


    ****, the sales person is not responding… (and still holding your arm so that you cannot leave the store without cutting it off!)



    CA is now saying that this butcher store "works as designed". And they do not support fainted assistants (or kidnapping etc.) but(!) friendly kidnappers who inform the sales person about the kidnapping.


    And the moral of the story:
    If the sales person were empowered to talk to the assistant in order to ask whether they're still busy, the customer would not need to cut off their arm.