CA on CA Tech Tip - Monitoring CA Unified Infrastructure Management

Document created by Alquin Employee on May 2, 2016. Last modified by SamCreek on Dec 17, 2016.

Hi Infrastructure Management Community,

As promised in the CA on CA webcast featuring CA Unified Infrastructure Management, I am sharing how we monitor CA Unified Infrastructure Management. The webcast recap is here: https://communities.ca.com/community/ca-infrastructure-management/blog/2016/02/10/webcast-recap-ca-on-ca-featuring-ca-unified-infrastructure-management-uim

 

To ensure that Unified Infrastructure Management is operating at optimal levels, we monitor it using a variety of mechanisms, with the intent of detecting problems before they occur. This includes monitoring with independent solutions that can alert us to problems without relying on Unified Infrastructure Management itself. At a high level, CA Unified Infrastructure Management is monitored in the following areas:

  • Synthetic transaction monitoring
  • Unified Management Portal Java application performance management
  • Hub availability management
  • Probe availability
  • Server management
  • Database availability and performance
  • Miscellaneous options

 

Synthetic Transaction Monitoring

| Transaction Name | Description | Frequency (seconds) | Notification Method |
|---|---|---|---|
| UIM_Internal | Monitors the Unified Management Portal URL for availability within the CA network using CA on-premise monitoring stations. | 300 | xMatters (SMS); App Synthetic Monitor (email & phone) |
| UIM_External | Monitors the Unified Management Portal URL for availability from the Internet, using a scripted transaction run from the publicly available monitoring stations. The transaction tests (1) the ability to connect to UMP from the Internet, (2) the ability to log in to UMP, and (3) the ability to view the NOC site page. NOTE: This also validates that the internal site is working, as access from the Internet is via an Apache-based proxy. | 300 | xMatters (SMS); App Synthetic Monitor (email & phone) |
| UIM_certificate_check | Validates that the Unified Management Portal SSL certificate has not expired. NOTE: This also validates the internal site certificate, as the same certificate is used for both. | 300 | xMatters (SMS); App Synthetic Monitor (email & phone) |
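To illustrate the idea behind UIM_certificate_check, here is a minimal Python sketch of a certificate expiry check. It is not how App Synthetic Monitor implements the check. The function names and the 30-day warning window are assumptions made for this example; the transaction described above only validates that the certificate has not expired.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the notAfter field of the certificate presented by host:port.

    The default context verifies the chain, so an already expired certificate
    makes the handshake raise ssl.SSLCertVerificationError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def certificate_alarm(host, warn_days=30):
    """Return an alarm message, or None when the certificate looks healthy.

    warn_days is a hypothetical early-warning window, not a value from the post.
    """
    try:
        remaining = days_until_expiry(fetch_not_after(host))
    except ssl.SSLCertVerificationError as exc:
        return f"{host}: certificate failed verification: {exc}"
    if remaining < warn_days:
        return f"{host}: certificate expires in {remaining:.0f} days"
    return None
```

Calling certificate_alarm with the host name of your UMP site every 300 seconds would roughly mirror the frequency in the table above.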

 

Unified Management Portal Java Application Performance Management

Most users will interact with Unified Infrastructure Management through the Unified Management Portal, so it is critical that the portal is available and performing optimally. To ensure that this is the case, we have instrumented it with the CA Application Performance Management Java Agent.

| Metric Monitored | Description | Threshold |
|---|---|---|
| Backends average response time | Response time is the time it takes for a request to complete. It provides a basic measurement of the speed of the called backend, e.g. database response speed. | Caution alarm when value exceeds 2500 ms for 3 minutes; Danger alarm when value exceeds 4000 ms for 3 minutes |
| Backends errors per interval | Errors are the number of exceptions reported by calls to backend sources, e.g. failed SQL database calls. | Caution alarm when value exceeds 15 for 3 minutes; Danger alarm when value exceeds 25 for 3 minutes |
| Backends stall count | Stalled requests are those which have not completed within a specified time threshold. A request counted as stalled is not necessarily hung; its execution has simply exceeded the stall threshold. | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
| Frontends average response time | Response time is the time it takes for a request to complete. It provides a basic measurement of application response speed. | Caution alarm when value exceeds 10000 ms for 3 minutes; Danger alarm when value exceeds 12000 ms for 3 minutes |
| Frontends concurrent invocations | Invocations are requests handled by the application and its various parts. Concurrent invocations are the requests being handled at a given time. | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
| Frontends errors per interval | Errors are the number of exceptions reported by the JVM and HTTP error codes. | Caution alarm when value exceeds 8 for 3 minutes; Danger alarm when value exceeds 10 for 3 minutes |
| Frontends responses per interval | Responses per interval reflects the number of invocations finished in that interval. It is a measure of data throughput and thus of application performance. | Caution alarm when value exceeds 100 for 3 minutes; Danger alarm when value exceeds 250 for 3 minutes |
| Frontends stall count | Stalled requests are those which have not completed within a specified time threshold. A request counted as stalled is not necessarily hung; its execution has simply exceeded the stall threshold. | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
| Tomcat heap used (%) | The percentage of the available heap memory that is in use on the computer where the agent is deployed. | Caution alarm when value exceeds 95 for 3 minutes; Danger alarm when value exceeds 98 for 3 minutes |
| Tomcat Java process CPU utilization (%) | The percentage of CPU consumed by the Tomcat Java process. | Caution alarm when value exceeds 65 for 3 minutes; Danger alarm when value exceeds 70 for 3 minutes |

 

NOTE: Adjust the thresholds for your environment and expected levels of performance.
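The "exceeds X for 3 minutes" pattern used by the thresholds above can be sketched in a few lines of Python. This only illustrates the logic and is not CA APM's alerting engine. It assumes one reading per minute, so that "for 3 minutes" means three consecutive readings; the function name and parameters are made up for the example.

```python
def alarm_level(values, caution, danger, sustain=3):
    """Classify the most recent readings as "ok", "caution" or "danger".

    values  : readings ordered oldest first, assumed to be one per minute
    sustain : how many consecutive recent readings must exceed a threshold
    """
    recent = values[-sustain:]
    if len(recent) < sustain:
        # Not enough history yet to say the condition was sustained.
        return "ok"
    if all(v > danger for v in recent):
        return "danger"
    if all(v > caution for v in recent):
        return "caution"
    return "ok"


# Backends average response time thresholds from the post: 2500 and 4000.
print(alarm_level([2000, 2600, 2700, 3000], caution=2500, danger=4000))
```

A single spike does not raise an alarm here, which is the point of requiring the value to stay above the threshold for the whole period.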

Hub Availability Management

For Unified Infrastructure Management to function correctly, the hubs must be available and functional. To ensure that this is the case, we monitor the availability of each hub in the infrastructure using a number of methods, which are also covered in the CA UIM monitoring best practice guide.

| Method Description | Solution used | Frequency (seconds) |
|---|---|---|
| Making sure that the hub server is actually reachable | CA Spectrum | 300 |
| Making sure the hub.exe process is running | CA UIM processes probe | 120 |
| Making sure that from the primary hub we can connect to the port of the secondary hubs | UIM netconnect probe | 120 |
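The port check from the primary hub to the secondary hubs is essentially a TCP connection test. The Python sketch below shows that idea only; it is not the netconnect probe. The function names are made up for the example, and the default port of 48002 is an assumption, so substitute the port your hubs actually listen on.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable_hubs(hubs, port=48002):
    """Return the hubs whose port could not be reached.

    The default port is an assumption for this sketch, not a value from the post.
    """
    return [hub for hub in hubs if not port_open(hub, port)]
```

Running unreachable_hubs with your list of secondary hub host names every 120 seconds would roughly match the frequency in the table.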

One of the core features of the UIM architecture is its messaging capability. Messaging is built on top of queues. If the queues back up, UIM is not working optimally, because alarm and QoS messages are not reaching their destinations in a timely manner. This can result in increased mean time to repair. Each hub has a built-in mechanism for monitoring its queues and raises an alarm when a queue is backed up so that it can be addressed.

Probe Availability

CA Unified Infrastructure Management uses software probes to monitor the health of other applications and infrastructure. Many of these probes are built using Java. The first thought is to monitor them in the same way that we monitor the wasp probe, with the CA APM Java Agent. However, the overhead would be very high, since some systems run up to 10 Java-based probes (20+ on the primary hub). As an alternative, and since these probes are not interactive applications, we monitor them using the methods described below.

| Method description | Solution used | Frequency (seconds) |
|---|---|---|
| Monitor that the probe is running by checking the process name or, for java.exe processes, the process name together with the command-line arguments, so that we are uniquely watching the correct process. | CA UIM processes probe | 120 |
| Monitor the probe log to see that it has not aborted because it ran out of heap memory, by searching for the text "*java.lang.OutOfMemoryError*". NOTE: Some Java-based probes do not fail when they run out of memory, so this method of monitoring is critical. | CA UIM logmon probe | 120 |
| Monitor the CPU and memory utilization of the probes to ensure that they are operating within healthy limits. This is especially critical for the core probes on the primary hub, which can at times over-utilize resources and not leave enough for others to function. | CA UIM processes probe | 120 |

Monitoring the probe log for out of memory errors has been especially valuable for the health of the vmware and ibmvm probes which monitor the health of the VMware vSphere and IBM PowerVM environments respectively.
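The log check can be pictured as a simple substring search, which is what the "*java.lang.OutOfMemoryError*" wildcard pattern amounts to. The Python sketch below only illustrates that idea and is not the logmon probe. The function names are made up for the example, and any log path you pass in is your own.

```python
def find_out_of_memory(lines, needle="java.lang.OutOfMemoryError"):
    """Return (line_number, text) pairs for log lines that contain needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


def scan_log(path):
    """Scan one probe log file and return its out-of-memory lines."""
    # errors="replace" keeps the scan going if the log holds stray bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_out_of_memory(handle)
```

A real check would also remember how far it has already read, so the same error is not reported on every pass.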

Server Management

The core CA Unified Infrastructure Management solution is deployed on Windows and Red Hat Enterprise Linux. To ensure that these systems are performing adequately, they are monitored for core performance metrics using the CPU, Memory and Disk probe. The metrics collected for Windows Server and Red Hat Enterprise Linux are standard. In addition to the alarms that are sent when thresholds are violated, a daily email is sent with a report showing the performance of each server within the core infrastructure, so plans can be made ahead of time to right-size the infrastructure.

Database Management

The CA Unified Infrastructure Management solution relies on MySQL to store the performance metrics collected from the infrastructure and applications for display by the Unified Management Portal. Metrics are collected from the MySQL database to determine its health. Note also that the data_engine probe will trigger an alarm if it is not able to communicate with the MySQL database. Depending on your database type, use the appropriate database monitoring probe (mysql, sqlserver or oracle).

 

Disclaimer: I work for CA Technologies in the IT department in the Tools and Automation Group. I am not a product manager or product developer and as such cannot provide product insight beyond what is in the publicly available documentation.

 
