Hi Infrastructure Management Community,
As promised, I am sharing how we monitor CA Unified Infrastructure Management, as a follow-up to the CA on CA webcast featuring CA Unified Infrastructure Management: https://communities.ca.com/community/ca-infrastructure-management/blog/2016/02/10/webcast-recap-ca-on-ca-featuring-ca-unified-infrastructure-management-uim
To ensure that Unified Infrastructure Management is operating at optimal levels, we monitor it using a variety of mechanisms with the intent of detecting problems as early as possible. This includes monitoring with independent solutions that can alert us to problems even when Unified Infrastructure Management itself is affected. At a high level, the monitoring of CA Unified Infrastructure Management is done at the following levels:
- Synthetic transaction monitoring
- Unified Management Portal Java application performance management
- Hub availability management
- Probe availability
- Server management
- Database availability and performance
- Miscellaneous options
Synthetic Transaction Monitoring
Transaction Name | Description | Frequency (seconds) | Notification Method |
UIM_Internal | Monitors the Unified Management Portal URL for availability within the CA network using CA on-premises monitoring stations | 300 | xMatters (SMS), App Synthetic Monitor (email & phone) |
UIM_External | Monitors the Unified Management Portal URL for availability from the Internet, using a scripted transaction run from the publicly available monitoring stations. The transaction tests the ability to connect to UMP from the Internet, the ability to log in to UMP, and the ability to view the NOC site page. NOTE: This also validates that the internal site is working, as access from the Internet goes through an Apache-based proxy | 300 | xMatters (SMS), App Synthetic Monitor (email & phone) |
UIM_certificate_check | Validates that the Unified Management Portal SSL certificate has not expired. NOTE: This also validates the internal site certificate, as the same certificate is used for both | 300 | xMatters (SMS), App Synthetic Monitor (email & phone) |
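To make the certificate check concrete, here is a minimal Python sketch of the same idea. It is not how App Synthetic Monitor implements the check; the function names and the placeholder host name are assumptions made for illustration, and only the Python standard library is used.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's 'notAfter' timestamp is reached."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the server certificate's 'notAfter' field."""
    # The default context verifies the certificate, so an already expired
    # certificate fails the handshake with ssl.SSLCertVerificationError.
    # A caller can treat that exception as an alert in its own right.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example call (hypothetical host name, not the real UMP address):
#   left = days_until_expiry(fetch_not_after("ump.example.com"),
#                            datetime.now(timezone.utc))
#   if left < 30: raise an alert
```

Alerting some days before expiry, rather than only once the certificate has expired, leaves time to renew it; the 30-day figure in the comment is just an example value.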
Unified Management Portal Java Application Performance Management
Most users will interact with Unified Infrastructure Management using the Unified Management Portal so it is critical that this is available and performing optimally. To ensure that this is the case we have instrumented it with the CA Application Performance Management Java Agent.
Metric Monitored | Description | Threshold |
Backends average response time (ms) | Response time is the time it takes for a request to complete. This provides a basic measurement of the speed of the called backend, e.g. database response speed | Caution alarm when value exceeds 2500 for 3 minutes; Danger alarm when value exceeds 4000 for 3 minutes |
Backends errors per interval | Errors are the number of exceptions reported by calls to backend sources, e.g. failed SQL database calls | Caution alarm when value exceeds 15 for 3 minutes; Danger alarm when value exceeds 25 for 3 minutes |
Backends stall count | Stalled requests are those which have not completed within a specified time threshold. A request counted as stalled is not necessarily hung; it only means that its execution exceeded the stall threshold | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
Frontends average response time (ms) | Response time is the time it takes for a request to complete. This provides a basic measurement of application response speed | Caution alarm when value exceeds 10000 for 3 minutes; Danger alarm when value exceeds 12000 for 3 minutes |
Frontends concurrent invocations | Invocations are requests handled by the application and its various parts. Concurrent invocations are the requests being handled at a given time | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
Frontends errors per interval | Errors are the number of exceptions reported by the JVM and HTTP error codes | Caution alarm when value exceeds 8 for 3 minutes; Danger alarm when value exceeds 10 for 3 minutes |
Frontends responses per interval | Responses per interval reflects the number of invocations finished in that interval. It is a measure of data throughput and thus of application performance | Caution alarm when value exceeds 100 for 3 minutes; Danger alarm when value exceeds 250 for 3 minutes |
Frontends stall count | Stalled requests are those which have not completed within a specified time threshold. A request counted as stalled is not necessarily hung; it only means that its execution exceeded the stall threshold | Caution alarm when value exceeds 3 for 3 minutes; Danger alarm when value exceeds 5 for 3 minutes |
Tomcat heap used (%) | Identifies the percentage of the available heap memory that is used on the computer where the agent is deployed | Caution alarm when value exceeds 95 for 3 minutes; Danger alarm when value exceeds 98 for 3 minutes |
Tomcat Java process CPU utilization (%) | Identifies the percentage of CPU used by the Tomcat Java process on the computer where the agent is deployed | Caution alarm when value exceeds 65 for 3 minutes; Danger alarm when value exceeds 70 for 3 minutes |
NOTE: Adjust the thresholds for your environment and expected levels of performance.
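The "exceeds X for 3 minutes" rule in the table can be read as a sustained-threshold check: a single spike does not raise an alarm, only a value that stays above the limit for the whole window. The Python sketch below illustrates that logic. It is not CA APM's implementation; the function name and the assumption of one sample every 15 seconds (so that 3 minutes is 12 samples) are mine.

```python
from typing import Sequence


def alarm_state(samples: Sequence[float], caution: float, danger: float,
                required: int) -> str:
    """Classify the most recent `required` samples as normal, caution or danger.

    An alarm is raised only when every sample in the window exceeds the
    threshold, which mirrors "value exceeds X for 3 minutes".
    """
    recent = samples[-required:]
    if len(recent) < required:
        return "normal"  # not enough history to cover the full window yet
    if all(value > danger for value in recent):
        return "danger"
    if all(value > caution for value in recent):
        return "caution"
    return "normal"


# Assumption: one sample every 15 seconds, so 3 minutes = 12 samples.
SAMPLES_PER_WINDOW = 12

# Backend response times in ms, using the 2500 / 4000 thresholds from the table.
history = [3000.0] * SAMPLES_PER_WINDOW
print(alarm_state(history, caution=2500, danger=4000, required=SAMPLES_PER_WINDOW))
```

Requiring the whole window to be above the threshold trades a few minutes of detection delay for far fewer false alarms from short spikes.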
Hub Availability Management
For Unified Infrastructure Management to function correctly, the hubs must be available and functional. To ensure that this is the case, we monitor the availability of each hub in the infrastructure using the methods below, which are also covered in the CA UIM monitoring best practice guide.
Method description | Solution used | Frequency (seconds) |
Making sure that the hub server is actually reachable | CA Spectrum | 300 |
Making sure the hub.exe process is running | CA UIM processes probe | 120 |
Making sure that the primary hub can connect to the hub port on each of the secondary hubs | CA UIM netconnect probe | 120 |
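The port check in the last row amounts to attempting a TCP connection and reporting whether it succeeds. The Python sketch below shows that idea only; it is not the netconnect probe's implementation. The host names are placeholders, and port 48002 is used on the assumption that the hubs listen on the usual default hub port, so adjust it if your hubs are configured differently.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example use (hypothetical secondary hub names, assumed hub port):
#   for hub in ["hub-secondary-1.example.com", "hub-secondary-2.example.com"]:
#       if not port_is_open(hub, 48002):
#           print(f"cannot reach hub port on {hub}")
```

Running the check from the primary hub, as the table describes, tests the network path that hub-to-hub traffic actually takes rather than just whether the remote server is up.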
One of the core features of the UIM architecture is its messaging capability. Messaging is built on top of queues. If the queues back up, UIM is not working optimally, because alarm and QoS messages are not reaching their destinations in a timely manner, which can increase mean time to repair. Each hub has a built-in mechanism for monitoring its queues and raises an alarm when a queue is backed up so that it can be addressed.
Probe Availability
CA Unified Infrastructure Management uses software probes to monitor the health of other applications and infrastructure. Many of these probes are built using Java. The first thought is to monitor these probes with the CA APM Java Agent, in the same way that we monitor the wasp probe. However, the overhead would be very high, since some systems run up to 10 Java-based probes (20+ on the primary hub). As an alternative, and since these probes are not interactive applications, we monitor them using the methods described below:
Method description | Solution used | Frequency (seconds) |
Monitor that the probe is running by watching the process name or, in the case of a java.exe process, the process name together with the command-line arguments, so that we are uniquely watching the correct process | CA UIM processes probe | 120 |
Monitor the probe log to see that the probe has not aborted because it ran out of heap memory, by searching for the text "*java.lang.OutOfMemoryError*". NOTE: Some Java-based probes do not fail when they run out of memory, so this method of monitoring is critical | CA UIM logmon probe | 120 |
Monitor the CPU and memory utilization of the probes to ensure that they are operating within healthy limits. This is especially critical for the core probes on the primary hub, which can at times over-utilize resources and not leave enough for others to function | CA UIM processes probe | 120 |
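Because every Java-based probe shows up as java.exe, the process name alone cannot tell them apart; the command line has to be matched as well. The Python sketch below illustrates that matching on a snapshot of processes. It is not how the processes probe works internally, and the command lines shown are invented examples, not real probe start commands.

```python
from typing import Iterable, List, Tuple

# (pid, process name, full command line)
Process = Tuple[int, str, str]


def find_probe_processes(processes: Iterable[Process], name: str,
                         marker: str) -> List[int]:
    """Return the PIDs whose process name matches and whose command line contains `marker`."""
    return [
        pid
        for pid, process_name, command_line in processes
        if process_name.lower() == name.lower() and marker in command_line
    ]


# Hypothetical snapshot: two Java-based probes and one unrelated process.
snapshot = [
    (101, "java.exe", "java.exe -Xmx2g -jar vmware.jar"),
    (102, "java.exe", "java.exe -Xmx1g -jar ibmvm.jar"),
    (103, "hub.exe", "hub.exe"),
]

# An empty result would mean the vmware probe's process is not running.
print(find_probe_processes(snapshot, "java.exe", "vmware.jar"))
```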
Monitoring the probe log for out of memory errors has been especially valuable for the health of the vmware and ibmvm probes which monitor the health of the VMware vSphere and IBM PowerVM environments respectively.
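The log check itself is a plain text search. As a rough illustration only (this is not the logmon probe's implementation), the Python sketch below scans log lines for the same marker text. The function name, the sample log lines and the log path in the comment are all hypothetical.

```python
from typing import Iterable, List, Tuple

OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line that mentions an out-of-memory error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_MARKER in line
    ]


# Invented sample lines, standing in for a probe log.
sample_log = [
    "probe started\n",
    "collecting inventory\n",
    "Exception in thread worker: java.lang.OutOfMemoryError: Java heap space\n",
]
print(find_oom_lines(sample_log))

# Against a real file (hypothetical path):
#   with open("vmware.log", errors="replace") as log_file:
#       hits = find_oom_lines(log_file)
```

Searching the log catches the case called out in the table, where a probe keeps running after it has run out of memory, which a process check alone would miss.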
Server Management
The core CA Unified Infrastructure Management solution is deployed on Windows Server and Red Hat Enterprise Linux. To ensure that these systems are performing adequately, they are monitored for core performance metrics using the CPU, Memory and Disk probe. The metrics collected for Windows Server and Red Hat Enterprise Linux are the standard ones. In addition to the alarms that are sent when thresholds are violated, a daily email is sent with a report showing the performance of each server within the core infrastructure, so that plans can be made ahead of time to right-size the infrastructure.
Database Management
The CA Unified Infrastructure Management solution relies on MySQL to store the performance metrics collected from the infrastructure and applications for display by the Unified Management Portal. Metrics are collected to determine the health of the MySQL database. Note also that the data_engine probe triggers an alarm if it is not able to communicate with the MySQL database. Depending on your database type, use the matching database monitoring probe (mysql, sqlserver or oracle).
Disclaimer: I work for CA Technologies in the IT department in the Tools and Automation Group. I am not a product manager or product developer and as such cannot provide product insight beyond what is in the publicly available documentation.