Why synchronize CMDB data to UIM?
The foremost reasons for synchronizing CMDB data with UIM device data are
- Enriching alarms with CMDB data. In my case, this means adding device ownership and CI id data so that tickets can be created more efficiently, and checking the device's contract for service hours when an alarm comes in. Checking service hours lets us tag messages for further processing, for example sending an SMS if an alarm needs to be reacted to outside office hours.
- SLA automation. SLA management can be a nightmare, and the interfaces provided for it are inadequate; managing SLAs for thousands of devices can be very frustrating. The CMDB integration's part in this is to bring device contract data into the UIM database: Operating Hours and SLA Compliance %.
- Validating CMDB data / verifying the correct devices are monitored. Checking for discrepancies between the two inventories greatly helps keep data up to date and correct.
How to choose what to synchronize?
In an alarm, as it stands, there are three obvious fields from which you can try to deduce the device being alerted on: Robot, Host Name and Source. The values of these fields can depend on a number of variables including, but probably not limited to, remote or local monitoring, name resolution and the developer of the probe. It seems that someone created this design a long time ago, and along the way the use of these fields has become quite inconsistent; hence "the developer of the probe" is on the list. In short, none of these fields provides reliable enough information about the actual device being monitored.
What might not be so obvious is that most alarms also carry dev_id and met_id data (the TNT data model). You will have a hard time finding these in alarm consoles. They are unique identifiers for devices and metrics, for which data is stored in the UIM database. The information about a device (pointed to by dev_id) can be found in the CM_DEVICE table, where each entry is identified by the unique dev_id key. This table links back to CM_COMPUTER_SYSTEM by the cs_id key and forward to CM_CONFIGURATION_ITEM (among others; I won't list all of them here). The CM_DEVICE table contains better information about the device being alerted on: dev_ip and dev_name. As these fields rely on the agent's ability to resolve the name, it is worth mentioning that you cannot rely on the data being just an IP or a name: the dev_ip field may well be null, and the dev_name field may just as well be an IP.
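To make the lookup concrete, here is a minimal sketch using an in-memory sqlite table as a stand-in for CM_DEVICE. Only the column names dev_id, cs_id, dev_ip and dev_name come from the text above; the types, sample rows and helper function are my own assumptions.

```python
import sqlite3

# Stand-in for the UIM CM_DEVICE table; schema details are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CM_DEVICE (
    dev_id   TEXT PRIMARY KEY,
    cs_id    INTEGER,  -- links back to CM_COMPUTER_SYSTEM
    dev_ip   TEXT,     -- may be NULL
    dev_name TEXT      -- may itself hold an IP if name resolution failed
);
INSERT INTO CM_DEVICE VALUES
    ('D1A2B3', 1, '10.0.0.5', 'server01.example.com'),
    ('D4E5F6', 2, NULL,       '10.0.0.6');
""")

def lookup_device(dev_id):
    """Return (dev_ip, dev_name) for a dev_id taken from an alarm, or None."""
    return conn.execute(
        "SELECT dev_ip, dev_name FROM CM_DEVICE WHERE dev_id = ?",
        (dev_id,)).fetchone()

# dev_ip can be NULL and dev_name can be a bare IP, so treat both defensively.
print(lookup_device('D4E5F6'))  # (None, '10.0.0.6')
```

The point of the helper is the defensive contract: callers must cope with a null dev_ip and with a dev_name that is really an IP.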
It is conceivable that you might get alarms that do not contain a dev_id. I researched a couple of UIM environments and concluded that the number of these alarms is acceptable and can be lessened further by making sure probes are up to date. As the device records contain much more accurate information about the target, I decided to base my CMDB synchronization on devices.
Storing CMDB data
To avoid repetitive calls to the authoritative CMDB (Service-Now in my case), I wanted to store the data I need in the UIM database. Here's a simplified representation of the tables used to store it:
Integration tables are custom tables; s_time_specification is a standard table used for SLAs.
I chose to store data partly in the existing SLM-related tables, and additionally created two tables of my own; I'll call them INTEGRATION_DEVICE_MAP and INTEGRATION_SERVICE_OFFERING here. The first contains a row for each dev_id with the necessary unique identifiers from the CMDB: Configuration Item id, owner id and service offering id. The second table stores information about Service Offerings: each row contains the SLA compliance % and an id that refers to an entry in S_TIME_SPECIFICATION, the table where part of the SLA operating period data is stored. This way I can use the data in SLAs, as well as when enriching alarms with operating hours data. Additionally, data from the CMDB is inserted into these tables used by SLM: S_OPERATING_PERIOD_NAME, S_OPERATING_PERIOD_DESCRIPTION, D_OPERATING_PERIOD and S_OPERATING_PERIOD. By populating these tables, I'm also preparing for automatic SLA creation. I'll write more about that in another post.
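A sketch of the two custom tables might look like this. The table names and the identifiers they hold (CI id, owner id, service offering id, SLA compliance %, time specification reference) come from the description above; every column name and type is my own assumption.

```python
import sqlite3

# Hypothetical DDL for the two custom integration tables, using sqlite
# purely for illustration; the real deployment targets the UIM database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INTEGRATION_SERVICE_OFFERING (
    offering_id    TEXT PRIMARY KEY,  -- Service Offering id from the CMDB
    sla_compliance REAL,              -- SLA compliance %
    time_spec_id   INTEGER            -- refers to S_TIME_SPECIFICATION
);
CREATE TABLE INTEGRATION_DEVICE_MAP (
    dev_id      TEXT PRIMARY KEY,     -- UIM device id
    ci_id       TEXT,                 -- Configuration Item id in the CMDB
    owner_id    TEXT,                 -- device owner id in the CMDB
    offering_id TEXT REFERENCES INTEGRATION_SERVICE_OFFERING(offering_id)
);
""")
```

Keeping the per-device row thin (ids only) and pushing the per-contract data into the offering table avoids repeating SLA data across thousands of devices.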
Adding devices to INTEGRATION_DEVICE_MAP
There are three ways that CMDB synchronization is triggered:
- A periodic check every x hours. This checks every device in the CM_DEVICE table that doesn't have an entry in INTEGRATION_DEVICE_MAP.
- Another integration component, Enricher, informs the CMDB probe that it doesn't have information for a device. This checks just one device at a time.
- Manually using a callback
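The periodic check boils down to one anti-join: devices present in CM_DEVICE with no row in INTEGRATION_DEVICE_MAP. A minimal sketch, again with sqlite as a stand-in and simplified, assumed table layouts:

```python
import sqlite3

# Simplified stand-ins for the real tables; layouts are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CM_DEVICE (dev_id TEXT PRIMARY KEY, dev_name TEXT);
CREATE TABLE INTEGRATION_DEVICE_MAP (dev_id TEXT PRIMARY KEY, ci_id TEXT);
INSERT INTO CM_DEVICE VALUES ('D1', 'server01'), ('D2', 'server02');
INSERT INTO INTEGRATION_DEVICE_MAP VALUES ('D1', 'CI-0001');
""")

def unmapped_devices():
    """Devices that still need a CMDB lookup."""
    return [r[0] for r in conn.execute("""
        SELECT d.dev_id
        FROM CM_DEVICE d
        LEFT JOIN INTEGRATION_DEVICE_MAP m ON m.dev_id = d.dev_id
        WHERE m.dev_id IS NULL""")]

print(unmapped_devices())  # ['D2']
```

The same query serves all three triggers; the periodic job runs it over the whole table, while the Enricher-initiated path restricts it to one dev_id.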
Removing devices - foreign key
INTEGRATION_DEVICE_MAP needs an entry corresponding to each entry in the CM_DEVICE table. Therefore it might make sense to create a foreign key constraint referring to CM_DEVICE.dev_id and have deletes cascade from there. However, in my experience the CM_DEVICE table and others may need to be emptied from time to time in order to rectify UIM discovery data. If the deletes cascaded down to the integration tables, they would be emptied too. As the CM_DEVICE table then gets repopulated with mostly the same information, the integration would need to query the CMDB again, possibly for tens of thousands of devices. For this reason, I decided to write my own reporting method to check data between CM_DEVICE and INTEGRATION_DEVICE_MAP. Moreover, this enables me to create a workflow that checks that removed devices have also been updated in the CMDB, and possibly alert on any mismatches.
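The reporting check is essentially two set differences. A sketch under assumed inputs: plain sets of dev_ids here, where in practice they would come from SELECTs against CM_DEVICE and INTEGRATION_DEVICE_MAP.

```python
# Compare the two tables and report rows that exist on one side only,
# instead of letting a cascading foreign key silently empty the map.
def device_map_report(cm_device_ids, integration_map_ids):
    return {
        # devices UIM knows about that were never synchronized
        "missing_from_map": sorted(cm_device_ids - integration_map_ids),
        # map entries whose device has been removed from CM_DEVICE;
        # candidates for verifying the CMDB was updated too
        "stale_map_entries": sorted(integration_map_ids - cm_device_ids),
    }

report = device_map_report({"D1", "D2", "D3"}, {"D2", "D3", "D4"})
print(report)  # {'missing_from_map': ['D1'], 'stale_map_entries': ['D4']}
```

The "stale" side is the interesting one for the workflow mentioned above: those entries can be cross-checked against the CMDB before anything is deleted, and mismatches alerted on.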
Keeping data up to date
There are several things to consider about keeping the integration data up to date. Devices are retired, and devices may change service offerings (contracts), IP addresses and so on. On the UIM side this is not much of a problem: you'll get new devices if certain parameters change, and you can remove the old ones based on discovery timestamps. The hard part is synchronizing changes in the CMDB data to the UIM side. The methodology also depends on the CMDB platform. The easiest way is probably to just scan the custom integration tables and check their status against the CMDB. However, that would be a rather brute-force solution, and as the number of devices might be quite large, it might also put significant load on the CMDB services.
As Service-Now supports outbound REST and SOAP messages, this is the approach I will be using. I have yet to implement this feature, as it's further down the roadmap, but the initial plan is something like this: Service-Now sends a REST message to a robot that's running the wasp probe with web services installed. I say "robot with the wasp probe" because it doesn't necessarily need to be a UMP instance; you can install the wasp probe and the web services package for it, or whichever components you need. From there I will likely do a callback to the CMDB probe, which then takes the appropriate action depending on the changes.
There is also the matter of validating the CMDB device list against that of UIM. This too will depend largely on how the CMDB is constructed; basically, you need a method of indicating whether a device should be monitored or not. That could be done in several ways: for example by implication from the contract, or simply by having a "monitored" field on the CMDB device structure. On the UIM end, you likewise need to be able to tell whether it is alright that a UIM device could not be matched to a CMDB device.
UIM alarm_enrichment vs custom enrichment
First I must say that my experience with alarm_enrichment is limited, especially with the more recent iterations, so I do not have thorough knowledge of its workings and current features.
Once you have the required data in a data source, you need to do something with it. The primary use is to enrich alarms with that data. CA's method of doing this is the alarm_enrichment probe. It listens to alarm messages and queries the data sources for each one. That means that if an alarm has a count of 15, the source will have been queried 15 times. After the alarm message has been processed, alarm_enrichment posts an alarm2 message, which is then processed by nas. This means that if alarm_enrichment can't process its queue for some reason, you'll not get any alarms.
As I've had some trouble with alarm_enrichment stability, have a philosophical problem with intercepting alarms before they're published, and dislike querying the CMDB more often than is really needed, I decided to build my own Enrichment probe. Additionally, I need to do some logic in the enriching component that alarm_enrichment can't do, as far as I know. Also, I can't query my CMDB at the database level; I must go through web services, which is one more reason I need the CMDB probe.
Enricher is the name of my probe that does alarm enrichment. It populates CMDB data, and information derived from it, into the custom_1-5 fields of an alarm. The data can then be used when inserting tickets into the ticketing system. It can also be used for message processing and, for example, to open a device's ticket in the ticketing system/CMDB through USM URL actions. The Enricher probe holds the integration tables in memory to process messages quicker.
The probe attaches to a custom queue on the hub that contains alarm_new messages and a custom cmdb_refresh message. The latter is posted by the CMDB probe to inform other probes that the integration tables have changed and need to be refreshed, as the Enricher probe holds these tables in memory.
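The in-memory cache plus refresh-on-message pattern can be sketched like this. Everything here is an assumption about shape: the subjects alarm_new and cmdb_refresh come from the text, while load_device_map is a hypothetical stand-in for the real query against the integration tables.

```python
# Sketch: keep the integration tables in memory and reload them only when
# a cmdb_refresh message says they have changed.
class EnricherCache:
    def __init__(self, load_device_map):
        self._load = load_device_map       # hypothetical DB loader
        self.device_map = self._load()     # dev_id -> CMDB data

    def handle_message(self, subject, body):
        if subject == "cmdb_refresh":
            # integration tables changed; drop the stale copy and reload
            self.device_map = self._load()
            return None
        if subject == "alarm_new":
            return self.device_map.get(body.get("dev_id"))
        return None

rows = {"D1": {"ci_id": "CI-0001"}}
cache = EnricherCache(lambda: dict(rows))
print(cache.handle_message("alarm_new", {"dev_id": "D1"}))  # {'ci_id': 'CI-0001'}
rows["D2"] = {"ci_id": "CI-0002"}         # simulate the CMDB probe adding a row
cache.handle_message("cmdb_refresh", {})  # cache reloads on the bus message
```

The trade-off is the usual cache one: lookups never hit the database on the alarm path, at the cost of serving slightly stale data until the next cmdb_refresh.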
When does enrichment occur?
Instead of listening to alarm messages like alarm_enrichment does, the probe listens to alarm_new messages, which are published by nas whenever an alarm occurs that doesn't fit the suppression criteria of an existing alarm: in short, when a new alarm comes in. This means that alarms can be processed normally whether the enriching component is working or not, but it also creates some other issues to consider. More about them in Issues to consider.
Enrichment can also occur on demand. For example, if the Enricher probe doesn't have CMDB data for the dev_id in the alarm, it notifies the CMDB probe by posting a message that requests data for the device. You could also use a callback to do this, but as messages from the queue are processed in a single thread, doing a callback would delay processing of the queue until the callback returns a reply (or times out). If you post a message instead, you can be done with that message and process the next one. Once the CMDB probe is done processing the device, it posts a cmdb_refresh message (if a device was found) that includes the nimid of the alarm, which the Enricher picks up so it can enrich the alarm normally.
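The deferred flow above can be sketched as follows. The message subjects and the nimid round-trip come from the text; the cmdb_request subject name, the post_message callable and the field names are illustrative assumptions.

```python
# Sketch: park alarms whose dev_id is unknown, post a request message
# instead of blocking on a callback, and finish the enrichment when the
# matching cmdb_refresh arrives.
class DeferredEnricher:
    def __init__(self, cache, post_message):
        self.cache = cache            # dev_id -> CMDB data
        self.post = post_message      # hypothetical bus-post function
        self.pending = {}             # nimid -> alarm waiting for CMDB data

    def on_alarm_new(self, alarm):
        data = self.cache.get(alarm["dev_id"])
        if data is None:
            self.pending[alarm["nimid"]] = alarm
            self.post("cmdb_request", {"dev_id": alarm["dev_id"]})
            return None               # done with this message; take the next
        return {**alarm, "custom_1": data["ci_id"]}

    def on_cmdb_refresh(self, msg):
        alarm = self.pending.pop(msg.get("nimid"), None)
        if alarm is not None:
            self.cache[alarm["dev_id"]] = msg["data"]
            return self.on_alarm_new(alarm)   # enrich the parked alarm now
        return None

posted = []
e = DeferredEnricher({}, lambda subj, body: posted.append((subj, body)))
e.on_alarm_new({"nimid": "N1", "dev_id": "D9"})   # unknown -> parked + request
out = e.on_cmdb_refresh({"nimid": "N1", "data": {"ci_id": "CI-0009"}})
print(out["custom_1"])  # CI-0009
```

The key property is that on_alarm_new always returns promptly, so one slow CMDB lookup never stalls the queue behind it.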
Updating data to alarm messages
There are two basic ways to add information to alarms:
- Nas provides a set_alarm callback. There are two issues with this approach:
- You can only set one field at a time, so if you need to put information into five fields, you need to do five callbacks
- The previously mentioned callback issue: doing callbacks is potentially slow and can have a significant impact on processing the queue, which might snowball into bigger issues
- Post a new message that matches the suppression criteria of an alarm. This approach is significantly faster as you, once again, only need to post a message and have nas take care of the rest.
I decided to go with the second option. The alarm_new message includes all the data you need to hit the suppression again and have the fields updated. Here's a list of fields that I pick up from the old message: source, origin, domain, robot, prid, suppression, supp_key, dev_id, met_id, message, sid, i18n_data (pds), i18n_token. Additionally, I add data to the custom_1-5 fields. You need to mind which part of the message you pick the data from. A message is roughly divided into two parts: header and data. For an alarm_new message, the header will have a nimid value that is unique to that message; the data part (specifically userdata within it) will contain another nimid, which is the nimid of the alarm as seen in alarm consoles.
Structure of an alarm message to update an existing alarm
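A sketch of that structure, assuming a dict-shaped representation of the message: the header carries a fresh nimid unique to the message, while userdata in the data part keeps the nimid of the alarm being updated. The field list comes from the text; the nesting key names and sample values are my assumptions.

```python
# Fields copied verbatim from the old alarm_new message, per the list above.
FIELDS_TO_COPY = ["source", "origin", "domain", "robot", "prid", "suppression",
                  "supp_key", "dev_id", "met_id", "message", "sid",
                  "i18n_data", "i18n_token"]

def build_update_message(old_userdata, new_header_nimid, custom):
    """Build a message that re-hits the suppression and updates custom fields."""
    userdata = {f: old_userdata[f] for f in FIELDS_TO_COPY if f in old_userdata}
    userdata["nimid"] = old_userdata["nimid"]   # the alarm's own nimid
    userdata.update(custom)                     # custom_1 .. custom_5
    return {
        "header": {"subject": "alarm", "nimid": new_header_nimid},
        "data": {"userdata": userdata},
    }

msg = build_update_message(
    {"nimid": "AL123", "supp_key": "disk/C", "message": "disk full",
     "dev_id": "D1"},
    new_header_nimid="HDR456",
    custom={"custom_1": "CI-0001", "custom_2": "owner-42"})
print(msg["data"]["userdata"]["nimid"], msg["header"]["nimid"])  # AL123 HDR456
```

Mixing the two nimids up is the classic mistake here: the alarm's identity lives in userdata, never in the header.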
After creating a message such as above, it is a simple thing to post it with spooler's post_raw callback and be done with it.
A word on alarm_update
Alarm_update is the message that nas publishes to tell consoles the updated status of an alarm. There are two cases in which the message is posted:
- Message count. This can be configured in nas: every Xth occurrence of an alarm, nas posts the alarm_update. The default value is 100.
- One of the predetermined fields in the alarm updates. A change in one of these fields triggers the update: message, subsystem id, severity, source, domain, robot, or suppkey. Obviously you don't want to mess with most of those fields, but I'd argue that it's rather safe to add a dot at the end of the message, or something of the sort. Naturally, the next alarm in the cycle will trigger an update again, as the message needs to be updated again.
In this context, I'm not currently modifying anything. But I do modify the message at a later stage, after the ticket is created and the custom fields have their final data.
Issues to consider
Here are some issues and considerations, some of which partly arise from the design I have chosen. I'll try to explain how I mitigate them.
Delayed reading of the Enrichment queue
Imagine that for some reason the probe hasn't been able to process its queue for a while. During that period, new alarms have occurred and already been closed. Now the probe is up and running again and starts to process these alarms from the queue. There's a risk that it will enrich and repost a message for an alarm that no longer exists. To avoid this situation, I have created a threshold for the age of an alarm; if the threshold is exceeded, the probe checks whether the alarm still exists. The probe maintains a list of alarms in memory, and this list also has an age threshold. If an alarm needs to be checked, the age of the list is first checked against that threshold and the list refreshed if necessary. If the alarm doesn't exist, it is discarded.
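The double-threshold logic can be sketched like this. The threshold values and fetch_open_alarms (a stand-in for the real query for currently open alarms) are assumptions; only the two-threshold structure comes from the text.

```python
import time

ALARM_AGE_LIMIT = 300   # seconds; assumed value
LIST_AGE_LIMIT = 60     # seconds; assumed value

class StaleAlarmFilter:
    def __init__(self, fetch_open_alarms, now=time.time):
        self.fetch = fetch_open_alarms   # hypothetical open-alarm query
        self.now = now
        self.open_alarms = set()
        self.fetched_at = float("-inf")

    def should_process(self, alarm):
        age = self.now() - alarm["arrival"]
        if age <= ALARM_AGE_LIMIT:
            return True                  # fresh enough; skip the extra check
        # old alarm: make sure our open-alarm list itself isn't stale
        if self.now() - self.fetched_at > LIST_AGE_LIMIT:
            self.open_alarms = set(self.fetch())
            self.fetched_at = self.now()
        return alarm["nimid"] in self.open_alarms  # discard if already closed

clock = [1000.0]
f = StaleAlarmFilter(lambda: ["A1"], now=lambda: clock[0])
print(f.should_process({"nimid": "A2", "arrival": 900.0}))  # True (fresh)
print(f.should_process({"nimid": "A2", "arrival": 100.0}))  # False (old, closed)
print(f.should_process({"nimid": "A1", "arrival": 100.0}))  # True (old, open)
```

The second threshold matters because a burst of old alarms would otherwise trigger one existence query each; refreshing the list at most once per LIST_AGE_LIMIT caps that cost.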
Getting messages from a queue must happen in a single thread, but nothing prevents multiple threads from accessing different queues. A more elegant and fail-safe solution to this problem might be to track alarm_close messages from another queue and keep them in memory. This would add some additional considerations, such as thread safety (the list of closed alarms is accessed by two threads) and queue monitoring (you don't want the closed-alarm list to grow without limits). I might go for this solution further down the road.
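Both considerations (thread safety and a bounded list) can be addressed with a lock-protected, size-capped set. A minimal sketch; the cap size is an assumed value.

```python
import threading
from collections import OrderedDict

# Sketch: one thread feeds alarm_close nimids in, the enrichment thread
# checks membership; a lock covers both, and old entries are evicted so
# the list cannot grow without limits.
class ClosedAlarms:
    def __init__(self, max_entries=10000):
        self._lock = threading.Lock()
        self._closed = OrderedDict()      # insertion order = eviction order
        self._max = max_entries

    def mark_closed(self, nimid):         # called from the close-queue thread
        with self._lock:
            self._closed[nimid] = True
            while len(self._closed) > self._max:
                self._closed.popitem(last=False)   # drop the oldest entry

    def is_closed(self, nimid):           # called from the enrichment thread
        with self._lock:
            return nimid in self._closed

c = ClosedAlarms(max_entries=2)
for n in ("A1", "A2", "A3"):
    c.mark_closed(n)
print(c.is_closed("A1"), c.is_closed("A3"))  # False True
```

Evicting by insertion order is a reasonable proxy for age here, since close messages arrive roughly in time order; the cap trades a small chance of re-checking a forgotten alarm for bounded memory.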
Alarms without dev_id
Not all alarms have dev_ids. Most do, but not all. If you use dev_id as the basis for all lookups, you need to acknowledge this and plan for it. Such alarms seemed to be rather rare in my environments, and in those cases it seemed I could work around them with a couple of AO profiles.
Using callbacks vs posting messages
When processing messages from a queue and acting on them, you want to be fast. If you do callbacks in the thread that processes messages, you risk delays because callbacks might be slow to reply. Building your probes to read messages from queues (or subscribe to them) is slightly more complicated, but in time-sensitive scenarios such as this, I believe it is preferable.
Examining message structures
The way a message is divided into header and data varies between Dr. Nimbus and the SDKs, so this is something to be aware of. Dr. Nimbus is a useful tool, but examine messages closely with the SDK to be sure.