We’re all familiar with these axioms:
- Familiarity breeds complacency
- Ignorance is bliss
- What you don’t know won’t hurt you
While this axiom is not as familiar as the ones above, it’s just as important for us techies:
- Alarm fatigue (also known as alert fatigue) results from exposure to frequent alarms (alerts) and leads to desensitization, which in turn causes longer response times and missed important alarms.
Alarm fatigue also occurs in many other industries, including construction and mining (where backup alarms sound so frequently that they become senseless background noise) and healthcare (where monitors tracking vital signs sound alarms so frequently and for such minor reasons that they lose the urgency and attention-grabbing power they ought to have).
In performance management, alarm/alert fatigue and desensitization present real dangers to people whose job is keeping apps running 24x7, meeting SLAs, and reaching financial targets. If monitoring teams get lazy or comfortable and start to ignore alarms, they could miss an important alert.
In one of my previous blogs, “APM Monitoring Governance: The Jurassic Park Conundrum,” I discuss the disadvantages of monitoring too much “stuff” in an application. Application performance can be affected by the monitoring tool itself, so we need to place limits on the number of monitors, using key performance indicators as a guide in selecting them. Monitoring governance and alarm/alert fatigue are different fields, but they’re more closely related than some may think. Too many metrics with no governance can also lead to an overabundance of alerts, many of which are ignored. And too many alerts can degrade the performance of both the application and the monitoring tool.
In my visits to customers using CA APM, I often notice that when many alerts are triggered, notifications are sent and promptly ignored, and no action is taken other than acknowledging the alarm. What are the reasons for that behavior? We could speculate that:
- The alerts were set up with good intentions, but their frequency desensitized the people monitoring them through alert fatigue
- Thresholds were set incorrectly, causing false positives
- The metric causing the alert is minor and doesn’t affect performance
- The monitored metrics are mandated by corporate policy
- The application team wants to see everything possible and the ops team has not worked with them to define and take actions relevant to a particular alert
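Several of these failure modes can be mitigated mechanically before an alert ever reaches a human. As a rough illustration (this is not CA APM’s actual API; the class, alert names, and cooldown value below are hypothetical), here is a sketch of suppressing repeat notifications for the same alert within a cooldown window, so a flapping metric fires once instead of dozens of times:

```python
import time


class AlertSuppressor:
    """Suppress repeat notifications for the same alert within a cooldown
    window. Hypothetical sketch, not CA APM's API."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self.last_fired = {}  # alert name -> timestamp of last notification

    def should_notify(self, alert_name, now=None):
        now = time.time() if now is None else now
        last = self.last_fired.get(alert_name)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window: suppress
        self.last_fired[alert_name] = now
        return True


suppressor = AlertSuppressor(cooldown_seconds=300)
suppressor.should_notify("db-latency", now=0)    # first occurrence: notify
suppressor.should_notify("db-latency", now=60)   # within cooldown: suppress
suppressor.should_notify("db-latency", now=400)  # cooldown expired: notify
```

Deduplication like this doesn’t fix a badly chosen threshold, but it does keep one misbehaving metric from burying the console in identical notifications.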
If these or other reasons result in ignored alerts, we must ask, “Why monitor that metric at all?” The danger of desensitization through alert fatigue is real when the people responsible for monitoring are swamped with frequent alerts that mean little or nothing to them. Repeatedly clicking to close or acknowledge an alert eventually breeds frustration, and perhaps even the disabling of the alert. That could spell disaster if a real alert is missed and a crash results.
CA APM provides early warnings on events that can cause crashes or application inconsistencies that affect user experience. To be meaningful, alerts should spur operations or support personnel into action. They may take action themselves based on runbook entries, or pass the work off to developers, DBAs, or the network team. The point is that alerts, to be useful, should result in some form of action. Without that basic premise, monitoring loses significance and value. Planning ahead, collaborating with application teams, and working with monitoring experts, such as those in CA Services, can lead to fewer and more meaningful alerts and to corrective actions that prevent catastrophic failures.
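The “every alert maps to an action” premise can be made concrete in a runbook lookup. In this sketch (the alert names, actions, and owning teams are invented for illustration), an alert with no runbook entry is flagged as a candidate for removal rather than generating yet another ignored notification:

```python
# Hypothetical runbook: each alert worth keeping maps to a documented
# action and an owning team. Invented entries for illustration only.
RUNBOOK = {
    "db-connection-pool-exhausted": ("Recycle the pool; page the DBA team", "dba"),
    "heap-usage-above-90pct":       ("Capture a heap dump; open a dev ticket", "dev"),
    "frontend-response-time-slo":   ("Check recent deploys; engage ops", "ops"),
}


def route_alert(alert_name):
    """Return the documented action and owner for an alert, or flag it
    for governance review when no runbook entry exists."""
    entry = RUNBOOK.get(alert_name)
    if entry is None:
        return ("No runbook entry: review whether this alert should exist", None)
    return entry


action, owner = route_alert("heap-usage-above-90pct")
```

Keeping the runbook as the source of truth makes the governance question self-enforcing: if nobody can write down the action an alert should trigger, that alert is noise.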
Now that’s an axiom we can all learn to love.