Fail Over - Protect the MoM, the Collectors can deal

Idea created by bwcole on Feb 17, 2015
    Under review
    Score: 5
    • Hiko_Davis
    • sarbu02
    • Fred.K
    • jAmEs_shIn
    • bwcole

    Problem Space:

    With the agents set up to fail over to a new collector, with the help of the MoM, when a collector fails, the question becomes how to fail over the MoM itself.  Without the MoM, end users cannot get to the metrics that might help them determine how deep and how far-reaching an outage or critical failure has become.

     

    For those with CEM, fail-over for the APM database (PostgreSQL) would also be helpful.

     

    Problem Description:

    There are two base failure cases, internal and external.

    In the internal failure case, the MoM has stopped functioning and end users are unable to start their Workstations or log into WebView.  By the time they have called the CA APM admin, the admin has remoted into the network to find out that, why yes, the MoM is down, and has tried to do CPR on the MoM (Clue, Problem, Restart), the critical event the data was supposed to help with is either (hopefully) over or still preventing the MoM from functioning.  If there were a way to have an active, automatic failover MoM, hopefully that would cover 80% of the internal cases.

     

    In the external failure case, the hosting OS/server the MoM resides on decides that 3:00 am is a good time for everyone to wake up your CA APM admin, tell him/her how much they adore him/her, and ask whether he/she could please fix what is broken.  In this case, a fail-over MoM on a different host, and not just a watchdog process that restarts the MoM at the first sign of problems, might help prevent the adoring fans from calling your CA APM estate and waking up the missus.
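
    To make the difference between a watchdog restart and a fail-over MoM on a different host a little more concrete, here is a minimal Python sketch of the selection step only. It is not a CA APM feature or API. The host names are made up, port 5001 is an assumption about where the MoM listens in your environment, and a successful TCP connect only shows that something is listening, not that the MoM is healthy.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# (host, port) pair; all hosts and ports used with this sketch are hypothetical.
Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = endpoint
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable host, etc. all count as "down".
        return False


def pick_active_mom(
    candidates: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first candidate MoM that answers the probe, or None.

    Candidates are tried in order, so the primary goes first and the
    standby on a different host goes second.
    """
    for endpoint in candidates:
        if probe(endpoint):
            return endpoint
    return None
```

    Called with something like pick_active_mom([("mom-primary.example.com", 5001), ("mom-standby.example.com", 5001)]), it returns the primary while it answers, the standby once the primary stops answering, and None if both are down. The hard parts of the request (keeping the standby's configuration and data in step, and repointing the collectors, Workstations and WebView) are not covered here.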

     

    This topic was covered in the 11/2013 webcast, but I haven't seen much on it since then.

     

    References:

    November 2013 - Webcast Replay - High Availability