Server CPU Monitoring using CDM – Best Practices
Monitoring needs differ from company to company. In some cases it is best to keep CPU usage low to leave room for expansion, while other organizations prefer to use their equipment efficiently and keep CPU usage above 70%, so lower is not always better.
This document gives some best practices for monitoring CPU utilization and for detecting CPU bottlenecks easily.
- While average CPU utilization for the device as a whole is important for seeing how busy the system is, it is also necessary to check CPU utilization for individual processors. A single-threaded application can take up to 100% of a single core, and this can be missed when looking only at total average CPU usage.
- A high CPU queue length (system load on Unix systems) indicates that processes are waiting for CPU time, which is a clear indicator of problems. Note that this queue can build up when utilization is well below 90%, so CPU queue length should be a mandatory part of CPU monitoring, as reported in several studies.
- A basic rule (valid for several OS flavors) for detecting a CPU bottleneck is to check whether the CPU queue length is at least twice the number of processors. The CDM probe can handle this condition: on a multi-CPU system, the number of queued processes is divided by the number of processors before it is compared with the threshold. For example, on a system with four processors and the default Max Queue Length value (4), alarm messages are generated if the number of queued processes exceeds 16.
- Use the built-in predictive alarms (TTT – Time To Threshold) to detect CPU bottlenecks proactively before they happen, and TOT (Time Over Threshold) to filter out short spikes and focus on sustained problems.
- Identify the top CPU-consuming processes on a server by configuring the CpuErrorProcesses and CpuWarningProcesses metrics in the cdm probe. This feature is essential for determining which applications have the greatest impact on performance.
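
The checks described above can be illustrated with a small Python sketch. This is not the CDM probe's implementation: the function names, the 90% per-core threshold, and the "N consecutive samples" reading of Time Over Threshold are assumptions made for illustration only. Only the default per-processor queue limit of 4 and the four-processor example come from the text.

```python
from typing import List, Sequence


def queue_alarm(queue_length: float, num_cpus: int,
                max_queue_per_cpu: float = 4) -> bool:
    """Return True when the queue exceeds the per-CPU limit scaled by CPU count.

    Equivalent to dividing the queue length by the number of processors
    and comparing the result with the per-CPU limit.
    """
    return queue_length > max_queue_per_cpu * num_cpus


def hot_cores(per_core_usage: Sequence[float],
              threshold: float = 90.0) -> List[int]:
    """Return the indexes of cores at or above the threshold (assumed 90%)."""
    return [i for i, usage in enumerate(per_core_usage) if usage >= threshold]


def time_over_threshold(samples: Sequence[float], threshold: float,
                        min_consecutive: int) -> bool:
    """Return True when min_consecutive samples in a row exceed the threshold.

    A simplified stand-in for spike filtering; the probe's own TOT logic
    may be defined differently.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    per_core = [12.0, 98.0, 15.0, 9.0]        # hypothetical sample, in percent
    print(sum(per_core) / len(per_core))      # average looks healthy: 33.5
    print(hot_cores(per_core))                # but core 1 is saturated: [1]
    print(queue_alarm(16, num_cpus=4))        # False: 16 does not exceed 4 * 4
    print(queue_alarm(17, num_cpus=4))        # True: 17 exceeds 16
```

The sample data shows why the per-core check matters: the average across four cores is 33.5% even though one core is at 98%. The queue check reproduces the four-processor example, where the alarm is raised only once more than 16 processes are queued.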