Apache Cassandra is a distributed, scalable and fault-tolerant database system. If you use Cassandra, you care about scalability, and monitoring is crucial to avoid bottlenecks.
This document outlines basic best practices for monitoring Cassandra nodes with the UIM Cassandra_monitor probe and highlights the most important of the 120+ metrics that the probe can collect. The probe uses a “pull” approach, retrieving metrics through OS commands, API calls and JMX calls.
The key areas to focus on when monitoring Cassandra nodes are:
1. Latency: Monitoring latency is critical because it directly impacts user experience. Latency indicates how fast nodes are responding.
These two metrics will give a high-level overview of overall latency:
NOTE: Read operations are usually slower than writes.
2. Throughput: Look for DiskReads and DiskWrites on the StorageVolumesNode. These metrics indicate how busy a particular node is.
3. Disk usage: Use these metrics to decide when to add more nodes before running out of storage.
NodeTotalDiskSpaceUsed: Total disk space that is used by the column families on the node (column families can be seen as SQL tables).
FileSystemFree: Total free kilobytes on the file system. This metric matters for individual drives (JBOD configuration) when the database holds large partitions, because some disks can run out of space even if the TotalDiskSpaceUsed metric indicates that there is still space left.
4. Saturation: Any kind of overhead or saturation in the cluster has a direct impact on database performance.
NodePendingTasksWrites and NodePendingTasksReads: Monitoring these metrics is critical for detecting potential problems. If incoming tasks cannot be allocated, the queue of pending tasks grows and Cassandra performance degrades.
Pending Compaction Tasks: Compaction performance is an important indicator of when to add capacity to your cluster. The "sweet spot" for this metric is zero; values above 20 start to degrade cluster performance.
Note that some metrics might not be relevant, depending on the compaction strategy used. For instance, SSTable count is not critical when using TimeWindowCompactionStrategy (TWCS), where all SSTables within a time window are compacted into a single SSTable.
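The disk-usage and saturation guidance above can be sketched as a few small checks. This is a minimal illustration and not part of the probe: the function names, the intermediate "watch" level, the three-sample growth window and the `min_free_kb` threshold are all assumptions chosen for the example. Only the "zero is ideal, above 20 degrades performance" rule comes from the text.

```python
from typing import Dict, List, Sequence

# From the text: zero pending compactions is ideal, above 20 degrades performance.
COMPACTION_DEGRADED_ABOVE = 20


def compaction_status(pending_compaction_tasks: int) -> str:
    """Classify one Pending Compaction Tasks sample."""
    if pending_compaction_tasks < 0:
        raise ValueError("pending task count cannot be negative")
    if pending_compaction_tasks == 0:
        return "ok"
    if pending_compaction_tasks <= COMPACTION_DEGRADED_ABOVE:
        return "watch"  # assumed intermediate level, not defined by the probe
    return "degraded"


def queue_is_growing(samples: Sequence[int], window: int = 3) -> bool:
    """Return True when the last `window` samples are strictly increasing.

    Intended for NodePendingTasksReads / NodePendingTasksWrites samples;
    the window size is a hypothetical heuristic.
    """
    if window < 2 or len(samples) < window:
        return False
    recent = samples[-window:]
    return all(a < b for a, b in zip(recent, recent[1:]))


def disks_low_on_space(free_kb_by_mount: Dict[str, int], min_free_kb: int) -> List[str]:
    """List mount points whose FileSystemFree value (in KB) is below `min_free_kb`."""
    return sorted(m for m, free in free_kb_by_mount.items() if free < min_free_kb)
```

Checking each mount point separately reflects the JBOD point above: a single full disk is flagged even when the node as a whole still reports free space.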
These four categories are a basic set for getting started with Cassandra monitoring and gaining visibility into cluster health. Additional metrics can be monitored based on specific needs. Also, aggregating some of these metrics can provide richer information when your data is distributed across many column families.
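As a rough illustration of the aggregation idea, the sketch below sums per-column-family samples into one node-level total per metric. The sample layout, the column family names and the metric labels are hypothetical placeholders, not names reported by the probe. Summation also only suits additive metrics such as disk space or pending tasks; latency would need a different aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (column_family, metric_name, value); a hypothetical layout for this example
Sample = Tuple[str, str, float]


def aggregate_by_metric(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum each metric across all column families."""
    totals: Dict[str, float] = defaultdict(float)
    for _column_family, metric, value in samples:
        totals[metric] += value
    return dict(totals)


# Example with made-up column families and metric labels
samples = [
    ("users", "disk_space_used_kb", 100.0),
    ("orders", "disk_space_used_kb", 250.0),
    ("users", "pending_tasks", 2.0),
    ("orders", "pending_tasks", 5.0),
]
totals = aggregate_by_metric(samples)
```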