DX Unified Infrastructure Management

  • 1.  Global Monitoring Profile Exception Questions and Issues

    Posted Aug 24, 2018 09:12 AM

    We have a group in MCS that applies monitoring probes (CPU, Disk, Memory, and Services) to all servers in our enterprise. The probes were created on a template server, where we set thresholds and alarm levels, and then copied to the group that contains all of our hosts. We need a way to create exceptions to this enterprise monitoring template for servers that cannot operate within the same parameters as all the others.

    As an example, we monitor the anti-virus services on all servers to ensure they are started. If a service is not in a started state, a critical alarm is created. Sometimes an application running on a server is not compatible with the anti-virus application, and that server will have the service disabled until the root cause of the incompatibility is established and the service can be re-enabled. Putting the entire server into maintenance mode is not a viable option, because we still need to monitor the other services and the CDM thresholds. Setting the anti-virus alerts to invisible does not work for us either: we forward all critical alerts to a centrally monitored Spectrum console, where we cannot filter out invisible alerts. If an alert is critical, it is registered in the Spectrum console.

     

    Possible Solution

    For each of the profiles within the enterprise monitoring template, we have created an "exception" group. The monitoring profile for the specific service is copied to that group and given a higher group profile priority, and its alarm severity is changed to informational.
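    To make the intended behavior concrete, here is a minimal sketch of the precedence we expect. It is not MCS code: the class, the function, the group names, and the numeric priorities are all hypothetical, and it assumes that the profile with the highest priority among a server's current groups is the one that applies.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class GroupProfile:
    """Hypothetical stand-in for a monitoring profile attached to a group."""
    group: str
    priority: int   # assumption: a larger number wins
    severity: str


def effective_profile(memberships: Set[str],
                      profiles: Iterable[GroupProfile]) -> Optional[GroupProfile]:
    """Return the highest-priority profile among the groups the server is in."""
    candidates = [p for p in profiles if p.group in memberships]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.priority)


# Illustrative data only: one enterprise-wide profile and one exception profile.
PROFILES = [
    GroupProfile("enterprise", priority=100, severity="critical"),
    GroupProfile("av-exception", priority=200, severity="information"),
]

in_exception = effective_profile({"enterprise", "av-exception"}, PROFILES)
back_to_default = effective_profile({"enterprise"}, PROFILES)
```

    Under these assumptions, a server in both groups gets the informational profile, and once it leaves the exception group it should fall back to the critical one.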

     

    Problem

    In our testing, moving a server that is currently alarming for the anti-virus service into the new group behaves as expected: the critical service alarms change to informational alarms. The issue appears when we move the server back out of the group, so that it should receive only the profile from the original, enterprise-wide group, which creates the alarms as critical. The alerts continue to be created as informational. The profile status of the alarm on the server shows that it is derived from the group that has the alarm set to critical, yet MCS keeps creating informational alarms even though the server is no longer attached to the exception profile.

     

    Questions

    How often are profiles within an MCS group reconciled so that members of the group will receive the profiles attached to the group? Can this be modified to a specific time interval? If so, what are the downsides of shortening the interval? Can a reconciliation be manually executed?

     

    Is there a better procedure than the one I'm describing? I suspect this would be an issue in any enterprise. How are others handling it?



  • 2.  Re: Global Monitoring Profile Exception Questions and Issues

    Broadcom Employee
    Posted Aug 24, 2018 10:25 AM

    As I understand it, the lower priority profiles should get reapplied soon after the higher one is deleted.

     

    Maybe your MCS isn't performing very well.

     

    By default, MCS does not use multithreading, which hurts performance.

    Set these keys to turn it on:

    device_processing_threads = 10

    config_deployment_threads = 10

     

    Maybe that profile is in an error state and stuck (once it reaches 30 retries, MCS stops retrying it).

    You could reset the stuck profiles in the database:

    update ssrv2profile set status='new', retries=1 where status='error'
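    If you want to see what that statement changes before touching the real database, here is a small sketch that runs it against a toy in-memory SQLite table. Only the table name and the status and retries columns come from the statement above; the profileid column, the 'ok' status value, and the sample rows are made up for illustration, and this is not the real UIM schema.

```python
import sqlite3


def reset_stuck_profiles(conn: sqlite3.Connection) -> int:
    """Flip every 'error' row back to 'new' and return how many rows changed."""
    cur = conn.execute(
        "UPDATE ssrv2profile SET status = 'new', retries = 1 "
        "WHERE status = 'error'"
    )
    conn.commit()
    return cur.rowcount


# Toy table: profileid and the sample values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ssrv2profile ("
    "profileid INTEGER PRIMARY KEY, status TEXT, retries INTEGER)"
)
conn.executemany(
    "INSERT INTO ssrv2profile VALUES (?, ?, ?)",
    [(1, "error", 30), (2, "ok", 0), (3, "error", 12)],
)

changed = reset_stuck_profiles(conn)
rows = conn.execute(
    "SELECT profileid, status, retries FROM ssrv2profile ORDER BY profileid"
).fetchall()
```

    Note that the WHERE clause matches every profile in error, not just the one you are troubleshooting, so it may be worth counting the affected rows with a SELECT first.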

     

    Hope this helps



  • 3.  Re: Global Monitoring Profile Exception Questions and Issues