DX NetOps

  • 1.  Notifier in Fault Tolerance

    Posted Mar 11, 2015 04:58 AM

I have Spectrum 9.3 in a fault-tolerant, distributed environment.

The Notifier is running on one of the principal SpectroSERVERs and sends the alarms for all landscapes. I'm not sure what is the best way to configure the Notifier for the redundancy situation.

     

I have seen other people put an "if" condition in the scripts on the secondary SS: if the hostname in the alarm belongs to the first SS, the Notifier script says "nothing to do". I think this could still send the same mail from both the primary and the secondary SS for alarms in the other landscapes.

     

I have three SANM applications, with three SANM policies and three setscripts on the principal SS. On the secondary SS I copied the NOTIFIER directory from the primary SS. I did not start the three Notifier processes on the secondary, because I think that would send the alarm mails in duplicate, as I already said. I need to find a way for the secondary to send mails only when the principal SS is down, and for it to work for the other landscapes and with the three SANM applications...

     

    Any idea?

     

Regards,

    Susana



  • 2.  Re: Notifier in Fault Tolerance
    Best Answer

    Posted Mar 12, 2015 02:02 AM

    Hello Susana,

     

Maybe you can use the precedence attribute instead of the hostname:

     

1. The primary Spectrum DB will have precedence 10 (attribute 0x12c0a on every model).

2. Assume your secondary Spectrum has precedence 20.

3. On both servers, add this line to $SPECROOT/Notifier/.alarmrc:

    EXTRA_ATTRS_AS_ENVVARS=0X12C0A

     

4. In the setscript and clearscript, just after this block:

     

if [ "$SENDMAIL" = "True" ]
then
    RECIPIENTS=$VARFORMAIL
    .........
    RECIPIENTS="NotificationData/RepairPerson"
fi

     

On the primary, add this:

if [[ "$SANM_0X12C0A" = "20" ]]
then
    echo "SS Secondary is running"
    echo "Precedence = $SANM_0X12C0A"
    exit 0
fi

     

On the secondary, add this:

     

if [[ "$SANM_0X12C0A" = "10" ]]
then
    echo "SS Primary is running"
    echo "Precedence = $SANM_0X12C0A"
    exit 0
fi

     

Save the set scripts and recycle the Alarm Notifier.

     

    Outcome:

     

Whenever an alarm is generated, the model in the DB is checked and attribute 0x12c0a is read. If it is 10 (the primary server's precedence), the primary Alarm Notifier sends the email, and the secondary writes a line to its Notifier log file saying the primary is running.

     

If 0x12c0a is 20, the secondary server's Alarm Notifier sends the mail, and the primary writes to its Notifier log saying the secondary is running.
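As a variation, the two per-server snippets above can be collapsed into a single guard that compares the alarm's precedence (exported through EXTRA_ATTRS_AS_ENVVARS) to a locally configured value. This is only a sketch: the helper name `should_notify` and the `LOCAL_PRECEDENCE` variable are assumptions, not part of the product.

```shell
# should_notify ALARM_PRECEDENCE LOCAL_PRECEDENCE
# Returns 0 (true) when this server's configured precedence matches the
# 0x12c0a value AlarmNotifier exported for the alarm ($SANM_0X12C0A),
# i.e. when this copy of the Notifier should send the mail.
should_notify() {
    [ "$1" = "$2" ]
}

# Near the top of SetScript/ClearScript on either server:
SANM_0X12C0A="${SANM_0X12C0A:-10}"  # exported by AlarmNotifier at run time
LOCAL_PRECEDENCE=10                 # assumed: 10 on the primary's copy, 20 on the secondary's
if ! should_notify "$SANM_0X12C0A" "$LOCAL_PRECEDENCE"; then
    echo "Peer SpectroSERVER is active (precedence = $SANM_0X12C0A); nothing to do"
    exit 0
fi
echo "Local server owns the alarm (precedence = $SANM_0X12C0A); sending mail"
```

With this layout the scripts stay identical on both servers except for one variable, which makes it easier to keep the NOTIFIER directories in sync.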

     

    HTH

     

    Kalyan



  • 3.  Re: Notifier in Fault Tolerance

    Posted Mar 16, 2015 11:00 AM

Thank you, your idea has been very helpful to me!!

     

    Regards!

     

    Susana



  • 4.  Re: Notifier in Fault Tolerance

    Posted Dec 14, 2015 06:25 PM

We handled this a little differently. We wanted to account for the case where the AlarmNotifier process could die or fail even when the SpectroSERVER process was still running (a situation we've seen on numerous occasions, particularly with default logging when NOTIFIER.OUT exceeds 2 GB). Also note, we have a very large distributed Spectrum environment (over a dozen primary and over a dozen fault-tolerant SpectroSERVERs).

     

We configure the same custom Notifier scripts (SetScript, etc.) on the designated primary and secondary (fault-tolerant) Spectrum systems. Inside the scripts is a check that looks for the file "$HOME/Notifier/.SpectrumAlert.stop". If that file exists, the scripts log the alert but do not actually generate a ticket in our ticketing system.
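A minimal sketch of that guard, assuming a helper function named `ticketing_enabled` and a SetScript log path (both of which are my names, not from the product; only the stop-file path comes from the description above):

```shell
# ticketing_enabled [STOPFILE]
# Returns 0 when no stop file is present (a ticket should be generated),
# non-zero when the stop file exists (log the alert, open no ticket).
ticketing_enabled() {
    [ ! -f "${1:-$HOME/Notifier/.SpectrumAlert.stop}" ]
}

# Inside SetScript, before the ticket-generation step:
#   if ! ticketing_enabled; then
#       echo "$(date): stop file present, alarm logged only" >> "$HOME/Notifier/SetScript.log"
#       exit 0
#   fi
```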

     

The primary system should never have that file, unless we're manually placing it for testing. The secondary will always have it, unless there is a problem with the primary. We manage it with a cron job script that runs every 5 minutes and does the following:

     

    1. Secondary connects to the Primary via SSH and runs a "health check" script.
      1. If the SSH attempt fails, the .SpectrumAlert.stop file is rm'ed (enabling alerting on the secondary).
    2. The health check script verifies the following:
      1. The AlarmNotifier process is running
      2. That the AlarmNotifier process has generated an alarm within the last 180 seconds (we have a big Spectrum environment; we never go more than a minute or so without at least a minor alarm tripping somewhere)
  3. If the health check comes back with a success, the .SpectrumAlert.stop file is touch'ed (disabling alerting on the secondary)
      4. If the health check comes back with a fail, the .SpectrumAlert.stop file is rm'ed (enabling alerting on the secondary)
    3. A ticket is generated through a Spectrum process monitor on the AlarmNotifier process
  1. If the remote connection failed or the health check failed, a notification is sent via out-of-band e-mail to inform us that AlarmNotifier is down on the primary Spectrum server.  This is a fail-safe to ensure that we know about any alerting issues between Spectrum and the ticketing system
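The enable/disable step above can be sketched roughly as follows. The function name, cron entry, host name, and health-check script path are all assumptions for illustration; only the stop-file mechanics come from the description above.

```shell
# update_stopfile STATUS STOPFILE
# STATUS is the exit code of the SSH health check against the primary:
# 0 means the primary's AlarmNotifier is healthy, so the stop file is
# touch'ed (alerting suppressed on this secondary); anything else removes
# the stop file (alerting enabled on this secondary).
update_stopfile() {
    if [ "$1" -eq 0 ]; then
        touch "$2"
    else
        rm -f "$2"
    fi
}

# Assumed cron entry on the secondary (every 5 minutes):
#   */5 * * * * ssh -o ConnectTimeout=10 primary-host /path/to/health_check.sh; \
#               update_stopfile $? "$HOME/Notifier/.SpectrumAlert.stop"
```

Because the SSH command's exit status covers both "connection failed" and "health check failed", a single status value is enough to drive the stop file.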

     

Because it runs via cron every 5 minutes, it will automatically enable or disable alerting between the primary AlarmNotifier and the secondary AlarmNotifier without us having to intervene. It also covers every failure scenario we could think of, and helps ensure we never miss a notification.