DX Application Performance Management

  • 1.  Restarting an APM Cluster

    Posted Oct 17, 2017 12:38 PM

    With our move from SLES to RHEL, we found that the server init scripts behave differently.  We don't have a way to correlate server starts and stops, so we depended on sleeps.

     

    5:00   MOM - EMCtrl.sh stop and WVCtrl.sh stop called

          Sleep timer for 30 minutes   

          EMCtrl.sh start; WVCtrl.sh start

    5:05    Collectors - EMCtrl.sh stop

          Sleep timer for 15 minutes

          EMCtrl.sh start

    5:10

          Postgres stop - pg_ctl stop

          Postgres start - pg_ctl start

     

    So at 5:30, assuming that the database and collectors have had 10 minutes each to start, the MOM processes start.

     

    Now with RHEL, the sleeps in the server init sequence will actually delay the server init.

     

    The question is: does anyone have a script, application, or other tool that will do an APM cluster restart in a controlled manner?



  • 2.  Re: Restarting an APM Cluster
    Best Answer

    Posted Oct 17, 2017 03:34 PM

    I am not very good with scripting, but when there was a requirement to automate this process I created the following scripts to stop and start the cluster.

    Best practice is to

    Stop: Stop the MOM first, then the Collectors.

    Start: Start the Collectors first, then the MOM.

     

     

    Stop EM script

     

    # Stop the MOM first, then each Collector.
    for i in MOM_Hostname Collector1_Hostname Collector2_Hostname Collector3_Hostname
    do
        echo "Stopping EM on $i"
        ssh "$i" "cd /opt/CA/APM/10.5.1.8/bin && ./EMCtrl.sh stop"
        sleep 5
    done


    Start EM Script


    # Start every Collector first.
    for i in Collector1_Hostname Collector2_Hostname Collector3_Hostname
    do
        echo "Starting EM on $i"
        ssh "$i" "cd /opt/CA/APM/10.5.1.8/bin && ./EMCtrl.sh start"
        sleep 5
    done
    # Give the Collectors time to come up before starting the MOM.
    sleep 100
    echo "Starting EM on MOM_Hostname"
    ssh MOM_Hostname "cd /opt/CA/APM/10.5.1.8/bin && ./EMCtrl.sh start"

     

    If you don't have rsync installed, you have to provide the password for every EM.



  • 3.  Re: Restarting an APM Cluster

    Broadcom Employee
    Posted Oct 17, 2017 03:50 PM

    Junaid,

     

    To add to your post: with the best practice of bringing up your Collectors first, it's best to wait for each one to log the message "Enterprise Manager has started" and then start the MOM.  This way, each Collector is ready to accept agents when the MOM starts dishing them out.  If only one or two are available and the others are not, the MOM will flood those few and cause chaos.

     

    Thanks,
    Matt
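For anyone who wants to script the wait described above, here is a minimal sketch rather than a tested production script. The function name `wait_for_log_message` is made up for illustration, the log file location is an assumption about a typical EM install, and the message text is simply the one quoted in the post, so check it against your own logs first. The optional skip count is there so that a "started" line left over from the previous run is not mistaken for the new one.

```shell
#!/bin/bash
# wait_for_log_message LOGFILE MESSAGE TIMEOUT_S [INTERVAL_S] [SKIP_LINES]
# Returns 0 once MESSAGE appears in LOGFILE after the first SKIP_LINES lines,
# or 1 if it has not appeared within TIMEOUT_S seconds.
wait_for_log_message() {
    local logfile="$1" message="$2" timeout_s="$3"
    local interval="${4:-5}" skip="${5:-0}" waited=0
    while [ "$waited" -lt "$timeout_s" ]; do
        # tail skips lines written before the restart; grep -F matches literally.
        # A missing log file simply counts as "not started yet".
        if tail -n +"$((skip + 1))" "$logfile" 2>/dev/null | grep -Fq -- "$message"; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Example on a Collector (the log path is an assumption, adjust to your install):
# LOG=/opt/CA/APM/10.5.1.8/logs/IntroscopeEnterpriseManager.log
# skip=$(wc -l < "$LOG")
# ./EMCtrl.sh start
# wait_for_log_message "$LOG" "Enterprise Manager has started" 600 5 "$skip"
```

If the log is rotated or truncated at startup, the skip count should be left at 0, otherwise the new message would be skipped as well.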



  • 4.  Re: Restarting an APM Cluster

    Posted Oct 17, 2017 04:00 PM

    Thanks Matt,

    If you look at my code, it starts all the Collectors first, one by one in a for loop, and only then starts the MOM. The script also has a delay before moving on to the MOM. Altogether, the MOM starts 5 minutes after all the Collectors have started.



  • 5.  Re: Restarting an APM Cluster

    Broadcom Employee
    Posted Oct 19, 2017 06:26 PM

    Also, before the MOM is actually started, any agents that are already running will use the last allowed Collector list they received and start connecting to Collectors as soon as those have started. So it is best to get all the Collectors up to the started state a.s.a.p. to minimise any potential overloading of individual Collectors if only a subset of them have started.



  • 6.  Re: Restarting an APM Cluster

    Posted Oct 19, 2017 09:56 AM

    Thank you junwah,

     

    We don't have rsync and are not able to ssh between servers under our APM admin account so we have to depend on the server's scheduling setup.

     

    We created adapter shell scripts (MOM-EMCtrl.sh, MOM-WVCtrl.sh, Collector-EMCtrl.sh) to contain our sleeps so we don't pause/stall the server init process.  The init script forks each adapter script as an independent process.
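As a rough illustration of the adapter idea (not our actual scripts), the sleep and the start command can be pushed into a background subshell so the caller returns immediately. The helper name `run_delayed` is hypothetical, and the EMCtrl.sh path in the example is borrowed from the scripts earlier in this thread, so treat it as an assumption. Whether the init system leaves the background process alone after the init script exits depends on your setup and should be verified.

```shell
#!/bin/bash
# run_delayed DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND after DELAY_SECONDS in the background and returns at once,
# so the calling init script is not held up by the sleep.
run_delayed() {
    local delay="$1"
    shift
    # Detach stdin/stdout/stderr so the caller is not kept waiting on open pipes.
    ( sleep "$delay" && "$@" ) </dev/null >/dev/null 2>&1 &
}

# Example for a Collector adapter: start the EM 15 minutes (900 s) from now.
# run_delayed 900 /opt/CA/APM/10.5.1.8/bin/EMCtrl.sh start
```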

     

    Stop

       00   MOM

       05   Collectors

       10   APM DB

    So we will be giving each 5 minutes to shut down.

     

    Start

    5:00    MOM - MOM-EMCtrl.sh ; MOM-WVCtrl.sh

          Sleep 35 minutes

          EMCtrl.sh start

          WVCtrl.sh start

    5:05    Collectors - Collector-EMCtrl.sh

          Sleep 15 minutes

          EMCtrl.sh start

    5:10    APM DB

          pg_ctl start . . .

    5:20    APM DB should have started

    5:30    Collectors should all have started

    5:35    MOM EM and WebView started

     

    I did look into checking the log files for the "started" message, or even pinging port 5001 on the collectors, so the MOM knows when to start, but ran into a question: if a collector fails, should the MOM still start?
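One way to frame that decision is a minimum count: start the MOM once at least N collectors answer on port 5001, and let alerting deal with the stragglers. The sketch below only illustrates the idea. The function names are hypothetical, the probe relies on bash's /dev/tcp redirection and the coreutils `timeout` command being available, and the threshold of 5 in the example comes from the "2 of 7" tolerance mentioned further down in this post.

```shell
#!/bin/bash
# port_open HOST PORT: succeeds if a TCP connection opens within 3 seconds.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# count_ready PORT HOST...: prints how many hosts accept connections on PORT.
count_ready() {
    local port="$1" ready=0 host
    shift
    for host in "$@"; do
        if port_open "$host" "$port"; then
            ready=$((ready + 1))
        fi
    done
    echo "$ready"
}

# enough_collectors MIN PORT HOST...: succeeds if at least MIN hosts are ready.
enough_collectors() {
    local min="$1"
    shift
    [ "$(count_ready "$@")" -ge "$min" ]
}

# Example with placeholder hostnames: require 5 of 7 collectors before the MOM.
# if enough_collectors 5 5001 coll1 coll2 coll3 coll4 coll5 coll6 coll7; then
#     ./EMCtrl.sh start
# fi
```

An open port only shows that something is listening, not that the EM has finished initializing, so this is a weaker signal than the "started" log message.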

     

    If the MOM starts, there is a management module with an alert on a calculator for the number of collectors, so if one didn't start we will get an error email every 6 minutes.  This message also goes to the NOC, so we will get a wake-up call to log in and diagnose what is going on.

     

    Then, what if the MOM does not start?  We created an HP SiteScope alert on the MOM EM port 5001 that checks every 10 minutes.  If there is an error, SiteScope checks back every two minutes.  After 3 total errors, an email alert is sent to our NOC and we get a wake-up call.

     

    I've got another discussion thread about the APM DB and the APM status console, and how to alert when the EMs are having issues with the APM DB.  It looks like I'll need to create HP SiteScope alerts for the Postgres database as well.

     

    Hopefully we won't have more than one or two collectors fail to start at the same time, but this occurs outside of our typical high-load times, so if we lost 2 of 7 collectors the APM cluster should be able to handle the load until we get the two failed collectors running.

     

    I wish APM would do more in terms of its own infrastructure alerting, but I guess APM isn't an infrastructure monitoring system.



  • 7.  Re: Restarting an APM Cluster

    Broadcom Employee
    Posted Oct 17, 2017 03:37 PM

    You could use the WebView script as a reference to have it test the port connection to the EM/MOM so that each collector will sleep until the MOM is available for connections.
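A port-wait loop along those lines might look like the sketch below. This is not a copy of the WebView script, just an illustration of the same idea. The function name `wait_for_port` and the hostname in the example are hypothetical, and the probe assumes bash's /dev/tcp redirection and the coreutils `timeout` command are available.

```shell
#!/bin/bash
# wait_for_port HOST PORT TIMEOUT_S [INTERVAL_S]
# Returns 0 as soon as HOST accepts a TCP connection on PORT,
# or 1 if it still refuses after roughly TIMEOUT_S seconds.
wait_for_port() {
    local host="$1" port="$2" timeout_s="$3" interval="${4:-10}" waited=0
    while ! timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example: wait up to 30 minutes for the EM port before continuing.
# wait_for_port MOM_Hostname 5001 1800 10 || echo "port 5001 never opened" >&2
```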