DX Unified Infrastructure Management

  • 1.  SDGTW Probe - Resolving ServiceNow tickets - Not Acknowledging Nim Alarms

    Posted Oct 21, 2018 12:53 PM

    Hi,

     

    We are in the process of going live with our new ticketing system, ServiceNow. Everything is working as expected except for one very important piece: when we Resolve a ticket in ServiceNow, the corresponding alert is not being Acknowledged in the Nim Alert Console.

     

    We have 2 environments, one for testing and one for production.  The alerts that we send to production ServiceNow from the test Nim environment are Acknowledged when the ticket is Resolved, so everything works as expected within Test.  But when the same alert is sent from the production Nim environment to the production instance of ServiceNow, alarms are NOT Acknowledged when the ticket is Resolved.

     

    After digging through the Trace logs of the sdgtw probe, I found the issue, or at least something pointing to what the issue may be, which is:

     

    Lab/Test environment log file for sdgtw probe, which Acknowledges the alert in Nim properly----------

     

    Oct 19 12:43:22:970 [ServiceNow, sdgtw] user idadministrator

    Oct 19 12:43:22:993 [ServiceNow, sdgtw] alarmlist[com.nimsoft.events.api.model.Alarm@7fd43795]

    Oct 19 12:43:23:117 [ServiceNow, sdgtw] [clearAlarmsForClosedIncidents ()]: Alarms acknowledgment successfully using AlarmService API aId[DV11899587-80892]

    Oct 19 12:43:23:117 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - DV11899587-80892 and incidentId 62d2aa74db55e780995a791c8c9619fd

     

    Production log file for sdgtw probe, which IS NOT Acknowledging the alert in Nim properly----------

     

    Oct 19 14:59:17:981 [ServiceNow, sdgtw] user idadministrator

    Oct 19 14:59:18:101 [ServiceNow, sdgtw] alarmlist[]

    Oct 19 14:59:18:101 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - SV16579396-19936 and incidentId 32537abcdb55e780995a791c8c9619ae

     

    As you can see, the alarmlist[] is empty, and we don't get the successful Acknowledgement message. After digging some more, I found the following Error in the log file for production Nim:

     

    Error while connecting AlarmService API Reason: (11) command not found, Received status (11) on response (for sendRcv) for cmd = 'dispatcher' ST
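
    A minimal Python sketch for spotting this pattern in an sdgtw trace log, assuming the two message formats shown in the excerpts above ("Alarms acknowledgment successfully ... aId[...]" and "Clearing the cache for alarm with Id - ..."). The function name find_unacknowledged and the regular expressions are illustrative and not part of the probe; other sdgtw versions may word these lines differently.

```python
import re

# Patterns copied from the log excerpts in this thread (assumed stable).
ACK_RE = re.compile(
    r"Alarms acknowledgment successfully using AlarmService API aId\[([^\]]+)\]"
)
CLEAR_RE = re.compile(
    r"Clearing the cache for alarm with Id - (\S+) and incidentId (\S+)"
)


def find_unacknowledged(log_lines):
    """Return (alarm_id, incident_id) pairs whose cache was cleared
    without a preceding successful-acknowledgment line for that alarm."""
    acked = set()
    missed = []
    for line in log_lines:
        ack = ACK_RE.search(line)
        if ack:
            acked.add(ack.group(1))
            continue
        clear = CLEAR_RE.search(line)
        if clear and clear.group(1) not in acked:
            missed.append((clear.group(1), clear.group(2)))
    return missed
```

    Feeding it the lines of the production excerpt should flag alarm SV16579396-19936, while the lab excerpt should produce an empty list.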

     

    It's worth noting that I am currently on our Secondary Hub within our Nim UIM environment. But the Secondary hub is acting as our primary, and it houses the following probes to support the sdgtw to ServiceNow integration:

     

    • NAS
    • PPM
    • TRELLIS

    When I noticed that AlarmService API error, I thought for sure I had found the issue. So I checked the Probe Utility for the trellis and it only HAD: alarm-routing-service / nas services / trellis container core services.  I then dropped the following probes:  nas-api-service / das access services

     

    Unfortunately, that did not resolve the issue.

     

    Perhaps there is another probe that is needed for handling dispatching (maybe the UDM_Manager)? 

     

    Additionally, I did open a case with CA Support and they said that, per the documentation, the Secondary hub does not support the sdgtw probe.  But being that our secondary is acting as the primary, I suspect we may just be missing a probe to facilitate the communications for Acknowledging an alert.

     

    Any assistance would be greatly appreciated!

     

    Thanks,

    Chris A.



  • 2.  Re: SDGTW Probe - Resolving ServiceNow tickets Not Acknowledging Nim Alarms

    Posted Oct 21, 2018 03:33 PM

    After a bit more digging, I noticed that the trellis probe is encountering issues upon startup:

     

    Oct 21 14:08:57:615 [main, trellis] Initiator 'com.ca.trellis.persist.relational.DataSourceInitiator' threw an exception during application.
    Oct 21 14:08:57:615 [main, trellis] Reason:
    Oct 21 14:08:57:616 [main, trellis] com.lift.SystemException: configuration

     

    Caused by: (4) not found, Received status (4) on response (for sendRcv) for cmd = 'nametoip' name = 'data_engine'

     

    Oct 21 14:08:59:004 [main, trellis] Initiator 'com.ca.trellis.persist.relational.PersistenceUnitInitiator' threw an exception during application.
    Oct 21 14:08:59:004 [main, trellis] Reason:
    Oct 21 14:08:59:004 [main, trellis] com.ca.trellis.spi.deployment.DeploymentException: Referenced object identified by 'tnt2-ds' did not existPlease fix your configuration

     

    Oct 21 14:08:59:980 [main, trellis] Caught exception while trying to start Trellis. The probe should be responsive, but Trellis isn't
    Oct 21 14:08:59:980 [main, trellis] java.lang.IllegalStateException: org.springframework.context.annotation.AnnotationConfigApplicationContext@e4f8592 has not been refreshed yet

     

    So, to test, I started up the data_engine probe on the Secondary, as it appears to be a requirement per the documentation. The trellis startup now looks better, with the exception of the messages about the ACE probe, which I am not sure is required for the sdgtw:

     

    Oct 21 14:51:17:308 [main, trellis] Creating Shift Context
    Oct 21 14:51:17:376 [main, trellis] Registering service: class com.nimsoft.events.nas.NasAlarmServiceImpl
    Oct 21 14:51:18:272 [main, trellis] Creating Shift Context
    Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.services.ugs.DefaultGroupService
    Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultLegacyGroupService
    Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.ugs.metadata.FlywayMigrationService
    Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultComputerSystemService
    Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultConfigurationItemService
    Oct 21 14:51:18:925 [taskScheduler-1, trellis] ACE could not be located. Not configuring
    Oct 21 14:51:19:120 [main, trellis] ****************[ Starting ]****************
    Oct 21 14:51:19:120 [main, trellis] 2.01
    Oct 21 14:51:23:748 [main, trellis] Failed to contact ACE. Configuration

     

    After 'Resolving' a ticket within ServiceNow, I am now getting a different alert in the trace log of the sdgtw...it appears to be ignoring it now:

     

    Oct 21 14:30:18:368 [ServiceNow, sdgtw] responseCode :: [200] response messege :: [OK]
    Oct 21 14:30:18:375 [ServiceNow, sdgtw] Incident found for closing [com.ca.integration.normalization.omodel.Incident@14982fb2]
    Oct 21 14:30:18:375 [ServiceNow, sdgtw] Completed executing the filter. Number of records returned - 1
    Oct 21 14:30:18:375 [ServiceNow, sdgtw] Ignoring the incidentId '198d782ddb992b80995a791c8c961905' as it is not associated with any Alarm.

     

    But the thing is, there is an Alarm, with that id, in the console.  Not sure why it is ignoring it.

     

    ...and now the trellis probe is kicking out some more interesting log messages; it's repeating this:

     

    Oct 21 14:58:57:579 [attach_socket, trellis] Dispatcher caught unchecked service exception.  This could be normal behavior, but you may want to examine it anyway

     

    Additionally...while comparing the production Trellis to the test Trellis...both of them receive the "ACE could not be located. Not configuring" message, but the prod Trellis additionally receives "Failed to contact ACE".   So I looked at the ACE logs for both prod and test, and they both have:

     

    Oct 21 15:11:24:679 ERROR [attach_socket, com.nimsoft.nimbus.NimServerSession] Exception in NimServerSessionThread.run. Closing session.
    Oct 21 15:11:24:680 ERROR [attach_socket, com.nimsoft.nimbus.NimServerSession] (2) communication error, Error when trying to send on session (S) com.nimsoft.nimbus.NimServerSession(Socket[addr=/10.240.135.14,port=56388,localport=48033]): Software caused connection abort: socket write error

     

    ...I decided to restart the ACE probe, cause why not...and only the production Trellis received the following:

     

    Oct 21 15:05:41:368 [attach_socket, trellis] An exception occurred while processing a message from Socket[addr=/10.240.135.14,port=56171,localport=48043].
    Oct 21 15:05:41:368 [attach_socket, trellis] (120) Callback error, Exception in callback for public void com.ca.trellis.shift.core.TrellisDispatchCoordinator.dispatch(com.nimsoft.nimbus.NimSession,com.nimsoft.nimbus.PDS) throws com.nimsoft.nimbus.NimException: No qualifying bean of type [com.ca.trellis.shift.core.ShiftDispatcher] is defined: No qualifying bean of type [com.ca.trellis.shift.core.ShiftDispatcher] is defined

     

    Looks like we have circled back around to this 'dispatcher'.  Does anyone have any insight on this one?



  • 3.  Re: SDGTW Probe - Resolving ServiceNow tickets - Not Acknowledging Nim Alarms

    Posted Oct 22, 2018 01:00 AM

    One more piece of info-

     

    I was just able to replicate the original issue in the lab by deactivating the Trellis probe. Here are the logs:

     

    Oct 21 23:52:10:141 [ServiceNow, sdgtw] responseCode :: [200] response messege :: [OK]
    Oct 21 23:52:10:145 [ServiceNow, sdgtw] Incident found for closing [com.ca.integration.normalization.omodel.Incident@4a1156ed]
    Oct 21 23:52:10:145 [ServiceNow, sdgtw] Completed executing the filter. Number of records returned - 1
    Oct 21 23:52:10:145 [ServiceNow, sdgtw] Closing the alarm with Id - CQ03113799-12121 associated with the incidentId 7fcecfe9db196f40be427b668c961900
    Oct 21 23:52:10:154 [ServiceNow, sdgtw] user idadministrator
    Oct 21 23:52:10:157 [ServiceNow, sdgtw] alarmlist[]
    Oct 21 23:52:10:157 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - CQ03113799-12121 and incidentId 7fcecfe9db196f40be427b668c961900

     

    The empty "alarmlist[]" is the same thing we were experiencing before moving the 'sdgtw' onto the same hub server as the data_engine.  So the Trellis probe definitely has something to do with this issue.

     

    Any assistance on this matter would be greatly appreciated!

     

    Thanks,

    Chris A.



  • 4.  Re: SDGTW Probe - Resolving ServiceNow tickets - Not Acknowledging Nim Alarms
    Best Answer

    Posted Oct 24, 2018 11:04 AM

    Alrighty...got her figured out-

     

    The short of it: we ended up having to relocate our NAS probe back to the primary.  As per the documentation, the SDGTW probe only works on the primary server.  In our environment, that is not so cut and dried, being that we have had to offload probes to other servers after breaching subscriber limits.

     

    The probes that are absolutely necessary for the SDGTW probe to operate are:

     

    • data_engine (and all of its dependent probes)
    • nas
    • udm_manager
    • trellis (and a couple of additional probes that latch themselves onto the trellis, which are: nas-api-service & das)
      • It's worth noting that in some cases, these 2 additional probes are not, by default, attached to the trellis, which will require that you drop them from the archive to the server that your sdgtw probe resides on.
      • To find out if you need to drop them:
        • Log in to your adminconsoleapp > navigate to the trellis probe
        • Click the 3 dots to the left of trellis > select View Probe Utility in New Window
        • In the left pane, select "list_deployments" > to the left, click the big green play button
        • In the Find field, type 'key' > this will highlight the keys
        • Look for 'das', which is the directory access services & look for nas-api-service
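
    The checklist above can be sketched as a small Python helper. This is only an illustration: it does not talk to UIM at all, and it assumes you have typed in (or exported) the probe names on the hub and the keys shown by list_deployments yourself. The names REQUIRED_PROBES, REQUIRED_TRELLIS_DEPLOYMENTS and missing_components are hypothetical, and the dependent probes of data_engine are not enumerated because this thread does not list them.

```python
# Components named in this post as required for sdgtw acknowledgment.
REQUIRED_PROBES = {"data_engine", "nas", "udm_manager", "trellis"}
REQUIRED_TRELLIS_DEPLOYMENTS = {"nas-api-service", "das"}


def missing_components(hub_probes, trellis_deployments):
    """Compare manually collected names against the required sets.

    hub_probes: probe names present on the hub that hosts sdgtw.
    trellis_deployments: keys seen in the trellis list_deployments output.
    """
    return {
        "probes": sorted(REQUIRED_PROBES - set(hub_probes)),
        "trellis_deployments": sorted(
            REQUIRED_TRELLIS_DEPLOYMENTS - set(trellis_deployments)
        ),
    }


if __name__ == "__main__":
    # Example input resembling a hub without data_engine or udm_manager,
    # and a trellis that only has alarm-routing-service deployed.
    print(missing_components(["nas", "ppm", "trellis"], ["alarm-routing-service"]))
```

    Anything reported under "trellis_deployments" would be a candidate to drop from the archive as described in the steps above.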

          

    CA Support is working on getting this info added to the sdgtw probe documentation, which will be helpful for non-typical environments.  They will also describe how these probes work together to facilitate communications.



  • 5.  Re: SDGTW Probe - Resolving ServiceNow tickets - Not Acknowledging Nim Alarms

    Posted Oct 24, 2018 11:17 AM

    Figured I should also share the following-

     

    One of the issues we ran into was Priority/Severity Mapping.  It turned out that ServiceNow and the SDGTW probe were configured properly.  The issue was actually our probes.  In our old ticketing system, we used prefixes to identify the Severity, for example:

     

    [P1:S1] Disk Space on device $servername has breached the 10% free space threshold

     

    ...where the [P1:S1] would serve as a High level ticket. P0:S0 would be critical, so on and so forth.  So it was never much of a concern for us to make sure our probe message pools had the correct severity configured for each message, being that we used that identifier.  

     

    But in the new system, this created a problem, being that the actual severity levels are dictated by values:

     

    0 (clear)
    1 (information)
    2 (warning)
    3 (minor)
    4 (major)
    5 (critical)

     

    To fix this, we used the NAS Preprocessing engine to change the severity.  For example:

     

    If an inbound alarm had a prefix of P1:S1, but the NIM Level was set to Informational or something other than 4-Major...the preprocessing rules we now have built force it to the correct NIM Level value.  Here is a really good link on how to accomplish that if severity levels are causing issues for you:

     

    How to change alarm severity using a NAS script - CA Knowledge
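
    To illustrate the logic only, here is a minimal Python sketch of the prefix-to-level correction described above. It is not the NAS preprocessing script itself (the linked article covers that); the names PREFIX_TO_LEVEL and corrected_level are hypothetical, and only the two prefixes actually mentioned in this post are mapped (P1:S1 to 4-Major, P0:S0 to 5-Critical). Any other prefix is left at its incoming level.

```python
import re

# Only the mappings stated in this post; extend for your own prefixes.
PREFIX_TO_LEVEL = {
    "P0:S0": 5,  # critical
    "P1:S1": 4,  # major
}

# Matches a leading prefix such as "[P1:S1]" at the start of the message.
PREFIX_RE = re.compile(r"^\[(P\d:S\d)\]")


def corrected_level(message, current_level):
    """Return the level implied by the message prefix, or the
    current level when there is no known prefix."""
    match = PREFIX_RE.match(message)
    if not match:
        return current_level
    return PREFIX_TO_LEVEL.get(match.group(1), current_level)
```

    For example, an alarm whose message starts with [P1:S1] but arrives at level 1 (information) would come out as level 4, which is the behaviour our preprocessing rules enforce.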