We have a group in MCS that applies monitoring probes (CPU, Disk, Memory, and Services) to all servers in our enterprise. The probes were configured on a template server, where thresholds and alarm levels were set, and then copied to the group that contains all of our hosts. We need a method to create exceptions to this enterprise monitoring template for cases where a server cannot operate within the same parameters as all other servers.
As an example, we monitor our anti-virus services across all servers to ensure they are started. If a service is not in a started state, a critical alarm is created. There may be a situation where an application running on a server is not compatible with the anti-virus application, and that server will have the service disabled until a root cause for the incompatibility is established and the service can be re-enabled.
Putting the entire server into maintenance mode is not a viable option, as we still need to monitor the other services and CDM thresholds. Setting the anti-virus alerts to invisible also does not work in our situation: we send all of our critical alerts to a centrally monitored Spectrum console and cannot filter out invisible alerts there. If an alert is critical, it is registered in the Spectrum console.
Possible Solution
For each of the profiles within the enterprise monitoring template, we have created an "exception" group. The monitoring profile for the specific service is copied to that group and given a higher group profile priority, and its alarm severity is changed to information.
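To make the intended behavior concrete, here is a minimal sketch of how we expect the priority override to resolve. This is only a model of our expectation, not MCS code or its API: the group names, the priority values, the `GroupProfile` class, and the `effective_profile` function are all hypothetical, and it assumes that the larger priority number wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupProfile:
    group: str      # hypothetical group name
    service: str    # monitored service the profile applies to
    priority: int   # assumption: the larger number takes precedence
    severity: str   # alarm severity raised when the service is not started


def effective_profile(profiles, memberships, service):
    """Return the highest-priority profile for `service` among the
    groups the server currently belongs to, or None if there is none."""
    candidates = [
        p for p in profiles
        if p.service == service and p.group in memberships
    ]
    return max(candidates, key=lambda p: p.priority, default=None)


# Hypothetical profiles: the enterprise template and its exception copy.
profiles = [
    GroupProfile("Enterprise", "antivirus", priority=100, severity="critical"),
    GroupProfile("AV Exception", "antivirus", priority=200, severity="information"),
]

# Server is a member of both groups: the exception profile should win.
in_exception = effective_profile(profiles, {"Enterprise", "AV Exception"}, "antivirus")

# Server removed from the exception group: the enterprise profile should apply again.
after_removal = effective_profile(profiles, {"Enterprise"}, "antivirus")

print(in_exception.severity)
print(after_removal.severity)
```

Under this model, a server's effective severity depends only on its current group memberships, so removing it from the exception group should restore the enterprise profile's severity.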
Problem
In our testing, moving a server that is currently alarming for the anti-virus service into the new exception group behaves as expected: the critical service alarms are changed to informational alarms. The issue occurs when we move the server back out of the exception group, so that it should receive only the profile from the original, enterprise-wide group, which creates the alarms in a critical state. Instead, the alarms continue to be created as informational. Viewing the profile status on the server shows that the profile is derived from the group that has the alarm set to critical, yet MCS continues to create informational alarms even though the server is no longer attached to the exception profile.
Questions
How often are profiles within an MCS group reconciled so that members of the group will receive the profiles attached to the group? Can this be modified to a specific time interval? If so, what are the downsides of shortening the interval? Can a reconciliation be manually executed?
Is there a better procedure than the one I'm describing? I suspect this would be an issue in any enterprise. How are others handling it?