We do several things - mostly using preprocessor scripts and AO profiles.
Our situation is a little different than most because we essentially manage an application on a specific set of hardware, as opposed to the traditional monitoring of a server. Sort of like if you ran a DBA team for a large organization and your SLA to your database users was availability of the database software - not the server itself.
And because we manage multiple products that can reside on a single robot, there is some housekeeping we need to do.
Consider what happens with CDM disk usage alarms in the scenario where you have a Sybase install on drive D:, Oracle on drive E:, and the OS on drive C:. CDM issues related to drive D: go to the Sybase team, E: goes to the Oracle team, C: goes to your server team. We place a product indicator in each message so that our ticketing system can route the message to the correct support desk. This is part of the message setup for the drive and works well. The problem is the close: since there's only a single default closure message, there's no way to modify it to include the product. So we use preprocessor scripts to look up the correct product for probes where there's not full control over the message and correct it there.
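A stripped-down sketch of that pre-processing script is below. The drive-to-product map and the [PRODUCT] token format are just illustrations (ours are more involved), and I'm assuming the usual nas pre-processing convention where alarm.get() hands you the event and returning the modified table lets it continue:

```lua
-- nas pre-processing sketch: re-tag CDM close messages with the right product.
-- Drive map and token format are examples only.
local drive_product = { ["D:"] = "SYBASE", ["E:"] = "ORACLE", ["C:"] = "OS" }

local a = alarm.get()
if a.prid == "cdm" and a.level == 0 then          -- level 0 = clear/close
   local drive = string.match(a.message, "(%a:)") -- first drive letter in the text
   local product = drive and drive_product[drive]
   if product then
      a.message = "[" .. product .. "] " .. a.message
   end
end
return a   -- pass the (possibly modified) event on
```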
We have almost 4000 monitored systems and dictate a uniform monitoring profile. The problem is that some customers disregard best practices. So, again with CDM, we dictate a 95% low threshold but we have customers who intentionally violate that. In these known cases, we also use preprocessor scripts to throw these alerts out when they arrive. There is no value in knowing that an error exists and will continue to exist indefinitely. Yes, we could manage this by disabling the alert on the specific CDM probe, but putting it in a preprocessor script lets us manage all the exceptions in one place.
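The exception filter is the same idea in reverse - returning nil from a pre-processing script (at least in the nas versions we run) discards the event. The robot and drive names here are made up:

```lua
-- nas pre-processing sketch: drop disk alarms for known, accepted violations.
local exceptions = {
   ["custA-db01:D:"] = true,   -- customer insists on running D: past 95%
   ["custB-app02:E:"] = true,
}

local a = alarm.get()
if a.prid == "cdm" then
   local drive = string.match(a.message, "(%a:)")
   if drive and exceptions[a.robot .. ":" .. drive] then
      return nil   -- throw the alert out on arrival
   end
end
return a
```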
Because customers come and go but the "going" isn't always amicable, it is often the case that we lose access to a customer hub system but they don't disable the Nimsoft software. So we also prevent the creation of cases for the known list of inactive customers even though those systems continue to send us data. Presumably maintenance mode could be used for this too, but I have never had any success getting that process to work as expected. Neither could CA, so I don't think it was just me.
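That one is an even simpler filter - something along these lines, with the hub names invented:

```lua
-- nas pre-processing sketch: silence everything from departed customers.
local inactive = { ["custX-hub"] = true, ["custY-hub"] = true }

local a = alarm.get()
if inactive[a.hub] or inactive[a.origin] then
   return nil   -- known-inactive customer, no case gets created
end
return a
```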
There's also a weird case where we might have something like payroll and inventory systems using the same instance of Oracle. In that case we'll take it one more step, figure out which product the Oracle message belongs to, and further adjust the message.
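Roughly like this - the SID-in-message pattern and the SID-to-product table are illustrative, since how you identify the instance depends on how your Oracle profiles write their messages:

```lua
-- nas pre-processing sketch: attribute a shared-Oracle message to a product.
local sid_product = { PAYROLL1 = "PAYROLL", INVPRD = "INVENTORY" }

local a = alarm.get()
if a.prid == "oracle" then
   local sid = string.match(a.message, "SID=(%w+)")  -- illustrative format
   local product = sid and sid_product[sid]
   if product then
      a.message = "[" .. product .. "] " .. a.message
   end
end
return a
```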
For probes like syslog and net_connect, the messaging reflects the robot name of the system running the probe, not the location where the actual event is occurring. We resolve the correct robot name in the preprocessing script and then let the event be stored so that it is associated with the correct system.
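Something like the following, assuming the target host is recoverable from the message text (the match pattern is illustrative) and that overwriting source/hostname is enough to re-home the event in your environment:

```lua
-- nas pre-processing sketch: re-home syslog/net_connect events on the
-- system they actually describe, not the robot running the probe.
local a = alarm.get()
if a.prid == "net_connect" or a.prid == "syslog" then
   local target = string.match(a.message, "host (%S+)")  -- illustrative format
   if target then
      a.source   = target
      a.hostname = target
   end
end
return a
```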
Because of the limitations of what you can do in preprocessing, there are a couple of things that are handled as AO profiles instead.
Our product relies on network attached storage and in some cases a single server might have 400+ filesystem mounts. If you have a switch failure you then get a CDM event for each unreachable filesystem and one from net_connect saying it can't connect to the storage server. So we have a fairly complicated Lua script that evaluates the outstanding CDM and net_connect alerts and figures out if there are "too many"; if so, it rolls them all into a single alert - mostly by setting the unnecessary alerts invisible and creating a meta alert that indicates that something major is going on. This avoids the situation of getting 401 email pages at 3:00am on a Sunday morning.
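The real script is much longer, but the skeleton looks like this. The threshold and the message match are examples, and the alarm.set() and nimbus.alarm() calls are from memory of the nas Lua API - double-check both against your release before trusting this:

```lua
-- AO script sketch: collapse a filesystem alarm storm into one meta alarm.
local STORM_THRESHOLD = 20

local storm = {}
for _, al in ipairs(alarm.list() or {}) do
   if al.visible == 1 and
      (al.prid == "cdm" or al.prid == "net_connect") and
      string.match(al.message, "filesystem") then      -- illustrative match
      table.insert(storm, al)
   end
end

if #storm >= STORM_THRESHOLD then
   for _, al in ipairs(storm) do
      al.visible = 0      -- hide the individual alarms
      alarm.set(al)       -- assumed update call; verify in your nas docs
   end
   -- raise one critical meta alarm in place of the 400
   nimbus.alarm(5, string.format(
      "STORAGE STORM: %d filesystem alarms rolled up - check switch/NAS head",
      #storm))
end
```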
We also introduce a problem or trouble id into the message - either in the probe config or in the preprocessor script if the probe config doesn't allow modifying the messages generated. We use that problem id to query the alarm history table and check whether this is a repeat event. There's a different level of attention to be placed on a net_connect failure that has never been recorded before versus one that has happened 20 times in the last 24 hours. So, if it meets the criteria of "too often", the priority of the event is artificially increased and some additional text is added to the message to indicate that this is a repeat offender and needs additional attention.
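In sketch form, with the PID: token, the table and column names, and the SQL Server date math all being assumptions about our particular setup rather than anything standard:

```lua
-- AO script sketch: flag repeat offenders by counting prior occurrences
-- of the same problem id in the alarm history.
local a = alarm.get()
local pid = string.match(a.message, "PID:(%w+)")  -- our embedded problem id
if pid then
   local rows = database.query(
      "SELECT nimid FROM NAS_TRANSACTION_SUMMARY " ..
      "WHERE message LIKE '%PID:" .. pid .. "%' " ..
      "AND created > DATEADD(hour, -24, GETDATE())")
   if rows and #rows >= 20 then                    -- "too often"
      a.level = math.min(a.level + 1, 5)           -- bump the severity
      a.message = a.message .. " [REPEAT: " .. #rows .. " hits in 24h]"
      alarm.set(a)                                 -- assumed update call
   end
end
```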
At this point we probably have an event that is usable. All of our events are sent to Salesforce because we use that CRM system for support ticketing. We use an AO with a Lua script to craft that message, and at the point of generation we look up any additional support information that might be relevant to the event based on the problem id. This might include a lookup of KB articles, suggested troubleshooting steps, some canned steps that are traditionally always done (like df -k on a Linux system), etc.
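The enrichment step boils down to something like this; the kb table stands in for our real lookup, and the actual hand-off to Salesforce is omitted because it's entirely site-specific:

```lua
-- AO script sketch: enrich the outbound ticket text before Salesforce sees it.
local kb = {
   NET001 = { article = "KB-1042", steps = "check switch port; ping storage head" },
   DSK095 = { article = "KB-2210", steps = "run df -k; review NAS exports" },
}

local a = alarm.get()
local pid = string.match(a.message, "PID:(%w+)") or "UNKNOWN"
local info = kb[pid]

local body = a.message
if info then
   body = body .. "\nKB: " .. info.article .. "\nSuggested: " .. info.steps
end
-- "body" would be handed to the Salesforce integration here.
```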
Bah - that's a lot of crafting....
-Garin.