DX Unified Infrastructure Management

  • 1.  Monitoring NMS

    Posted Sep 03, 2014 09:50 AM

    Hi!

     

    I would like to monitor whether NMS is still running. My first idea was just to write a custom probe that connects to the bus periodically; if that fails, it sends a mail (not via emailgtw, but 'hand-coded' via SMTP/sendmail etc.).

     

    My other idea was to simply run some Python that connects to port 48000 of the main hub; if that connects, all is (or should be) fine. Perhaps it could also check 48002. Does anyone know a simple handshake for these ports, so that I can close the connections correctly?
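
    A minimal sketch of that second idea, assuming a plain TCP connect is enough as a liveness signal. It does not speak any Nimsoft protocol: it only opens the connection and closes it again, so it can show that something is listening on the port, not that the hub is healthy. The function names and the default timeout are made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake; leaving the
        # with-block closes the socket again (ordinary TCP close).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def check_hub(host, ports=(48000, 48002), timeout=5.0):
    """Check each port and return a {port: reachable} mapping."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example call (hypothetical host name):
#   status = check_hub("main-hub.example.com")
#   if not all(status.values()):
#       ...send a mail outside of NMS...
```

    Run from cron on a machine other than the hub, this also covers the case where the hub host itself is down.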

     

    Has anyone implemented something like that?

     

    cheers

    Matthias



  • 2.  Re: Monitoring NMS

    Posted Sep 03, 2014 01:45 PM

    Hi,

     

    I have something close to this in place, though it doesn't monitor the BUS or ports. I have scheduled a script every 5-15 (can't remember) minutes that checks the size of some queue files. If those queue files exceed a certain threshold, it sends an e-mail and SMS independently of NMS.
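
    A rough sketch of such a queue-size check, not the actual script described above. The directory, file pattern, threshold, addresses and SMTP host are all placeholders that would have to be adapted; mail goes straight to an SMTP server so that it does not depend on NMS.

```python
import smtplib
from email.message import EmailMessage
from pathlib import Path


def oversized_files(queue_dir, max_bytes, pattern="*"):
    """Return (path, size) pairs for files in queue_dir above max_bytes."""
    hits = []
    for path in sorted(Path(queue_dir).glob(pattern)):
        if path.is_file():
            size = path.stat().st_size
            if size > max_bytes:
                hits.append((path, size))
    return hits


def build_alert(hits, sender, recipient):
    """Build a plain-text mail listing the oversized files."""
    msg = EmailMessage()
    msg["Subject"] = f"NMS queue files over threshold: {len(hits)}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{path}: {size} bytes" for path, size in hits))
    return msg


def send_alert(msg, smtp_host="localhost", smtp_port=25):
    """Send the mail directly via SMTP (host and port are assumptions)."""
    with smtplib.SMTP(smtp_host, smtp_port, timeout=30) as smtp:
        smtp.send_message(msg)


# Example call (all values hypothetical):
#   hits = oversized_files("/path/to/hub/queues", 50 * 1024 * 1024)
#   if hits:
#       send_alert(build_alert(hits, "monitor@example.com", "ops@example.com"))
```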

     

    -jon



  • 3.  Re: Monitoring NMS

    Posted Sep 03, 2014 02:10 PM

    What about the following:

     

    #1: A cron job that runs nimalarm with some magic subsysid

    #2: auto-operator rule that runs a script that touches a file

    #3: Another cron job that sends an email if the file is too old?

     

    Or something similar based on the same concept.
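
    Step #3 could be sketched roughly as below, assuming the auto-operator script from #2 touches a known file each time the heartbeat alarm arrives. The path and the maximum age are placeholders; a missing file is treated as stale, since that also means no heartbeat has made it through.

```python
import os
import time


def heartbeat_age(path, now=None):
    """Seconds since the file was last modified, or None if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    current = time.time() if now is None else now
    return current - mtime


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """True if the heartbeat file is missing or older than max_age_seconds."""
    age = heartbeat_age(path, now)
    return age is None or age > max_age_seconds


# Example call from the second cron job (hypothetical path and threshold):
#   if heartbeat_is_stale("/var/tmp/nms_heartbeat", 15 * 60):
#       ...send a mail outside of NMS...
```

    The threshold should be a few multiples of the interval used in #1, so that a single delayed alarm does not trigger a mail.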



  • 4.  Re: Monitoring NMS

    Posted Sep 03, 2014 10:55 PM

    Not only are there a lot of different ways NMS could fail (most of which are probably quite unlikely), but I think the "right" answer for how you keep tabs on it depends very much on how you interact with it.

     

    We currently have NMS integrated with our external service desk, and the service desk reports to us if NMS is down in such a way that it cannot send alarms over. We are in the midst of a project to convert to a new service desk, and the integration with that one is very different from our current one. Therefore the way it detects a failure is quite different.

     

    In both of those cases, the philosophy is the same. We try to make sure we can get the service desk to tell us when it detects a problem, which should cover issues specific to the integration or issues that are serious enough to bring down any of a few key components like the nas, emailgtw, and message bus on the core hub. We depend on those few key components to tell us if anything else is wrong with the wider environment of hubs, robots, and probes, in some cases with the help of scripted checks that we have put in place to generate alarms. I suspect there are still some types of failures that would be subtle enough to slip through the cracks.

     

    If you are looking at alarms directly in Nimsoft rather than an external service desk, that sounds simpler to me. But then again, it depends on how the console behaves when there are failures.

     

    I have a few different heartbeat checks in NMS, somewhat similar to the one described by Anders. They do not result in touching files, but the concept is the same. One is completely internal to Nimsoft, and two are from external systems integrated with Nimsoft. The integration with our new service desk will also rely on a heartbeat like that, but one going out of Nimsoft separately from alarms.



  • 5.  Re: Monitoring NMS

    Posted Sep 04, 2014 09:31 AM

    Hi!

     

    Thanks a lot for your ideas and descriptions. And Keith, yes, you are right that there are many ways. I completely ignored the service desk angle, even though we have a similar construct.

     

    O.K. I'll go back to the "thinkzone".

     

    Thanks

    Matthias