Does anyone automate the process of evaluating repeat alarms and generating new alarms off that information?
I'm hoping that someone has already built this wheel and I could just borrow it but if not, it seems like something that might be of general value.
The business case here stems from this example: One of the products we manage includes Oracle. Part of Oracle's nightly maintenance is a backup job. That backup job creates about a 400GB backup file in about a 3 minute span. Also, old backup files are deleted once they reach a particular age. The problem the particular server had was that the free space in the destination filesystem varied between 350GB and 450GB so sometimes the backup fit and sometimes it did not. We're checking disk space every 15 minutes and averaging 5 samples. We're also watching the Oracle logs for the success and failure messages.
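To make the averaging problem concrete, here's a quick illustration with made-up utilization numbers (the percentages are hypothetical, but the shape matches our situation: a ~3 minute spike against 15-minute samples means at most one of the 5 averaged samples lands near the peak):

```python
# Hypothetical disk-usage samples taken every 15 minutes (percent used).
# The backup writes ~400GB in ~3 minutes, so at most one sample in the
# averaging window catches the disk near its peak.
samples = [60, 60, 60, 60, 99]  # one sample happens to land near the peak

average = sum(samples) / len(samples)
print(average)  # 67.8 -- nowhere near a 95% "disk full" threshold
```

One near-full sample gets diluted by four normal ones, so the averaged value never crosses the alert threshold even though the disk genuinely filled.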
The hole in this process should be pretty obvious. We only very rarely throw the disk full alert because the averaging is across 5 samples and at most one sample would catch the disk usage close to the peak. Complicating the detection of the problem is that many times the disk fills to 99% with a successful backup, generates an alert, then the old backup files are purged and the alert goes away. Watching the Oracle log doesn't close the gap either: with a full disk you don't always get the "FAILURE" message in the log; sometimes, for whatever secret Oracle reason, the backup is still reported successful even if there's a failure mid-process.
So what we'd see on the support rep side of things is a case or two every couple of days for Oracle backup failures. Because the priority of backups in the whole scheme of things is low, it might be 24 to 48 hours until it is looked at, and by that time there's possibly been a successful backup. Similarly, with the disk space issue, because the backup process deletes the failed backup and old backups, there's almost always "free space" when the support agent looks at the system. Either way the case gets closed and nothing gets fixed, because there's no obvious problem happening at that instant. I know that fundamentally the issue here is workflow on the humans' part: they should be asking why the alert happened in the first place, not just whether the problem is still happening and stopping there if it's not. But that kind of curiosity is impossible to train into a person, and the typical support management metrics used to measure performance usually equate number of cases closed with success and fail to account for the cleverness someone might demonstrate in eliminating one repeat issue.
So, long story short: what I want to do is, when an alert comes in, evaluate how often the same alert has happened in the past, then adjust the alert message to indicate that this is a repeat of a recent issue, adjust the priority, and maybe notify a different set of people.
One thing we've done already for reporting purposes is to modify every alert message in our environment to include a unique problem ID. For example, all CDM "free disk space < 5%" messages carry the unique ID AB123456, so they can be easily identified.
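Pulling that ID back out of an incoming message is then a one-line regex. A minimal sketch, assuming the IDs all follow the two-letters-six-digits shape of AB123456 (if yours vary, adjust the pattern):

```python
import re

# Assumed ID shape: two uppercase letters followed by six digits, e.g. AB123456.
PROBLEM_ID = re.compile(r"\b[A-Z]{2}\d{6}\b")

def extract_problem_id(message):
    """Return the embedded problem ID, or None if the message has none."""
    match = PROBLEM_ID.search(message)
    return match.group(0) if match else None

print(extract_problem_id("AB123456 free disk space < 5% on /u01"))  # AB123456
```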
The approach I had been considering is to create an AO profile that runs 5 seconds after arrival (late enough that the alarm is stored successfully, but before any of the downstream automation fires), reads the triggering alarm, and pulls the problem ID out of the message along with the robot name. It would then query the alarm summary table and count the matching messages over the past configurable number of days. If that count is greater than a configurable threshold, it updates the current alarm's message and priority.
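In case it helps anyone picture it, here's a rough sketch of that logic. This is not the actual AO/nas scripting API — it's generic Python against an in-memory SQLite stand-in for the alarm summary table, and the table name, column names, and thresholds are all assumptions for illustration:

```python
import sqlite3
import time

# Stand-in for the alarm summary/history table. The real table and
# column names in your alarm database will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alarm_history (problem_id TEXT, robot TEXT, arrival REAL)"
)

LOOKBACK_DAYS = 7      # configurable window
REPEAT_THRESHOLD = 3   # configurable count before an alarm counts as a repeat

def is_repeat(problem_id, robot, now=None):
    """Count matching alarms (including the current one) in the lookback
    window and report whether the threshold has been exceeded."""
    now = now if now is not None else time.time()
    cutoff = now - LOOKBACK_DAYS * 86400
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM alarm_history "
        "WHERE problem_id = ? AND robot = ? AND arrival >= ?",
        (problem_id, robot, cutoff),
    ).fetchone()
    return count > REPEAT_THRESHOLD

def handle_alarm(problem_id, robot, message, priority):
    """On arrival: record the alarm, then escalate it if it's a repeat."""
    conn.execute(
        "INSERT INTO alarm_history VALUES (?, ?, ?)",
        (problem_id, robot, time.time()),
    )
    if is_repeat(problem_id, robot):
        message = "[REPEAT] " + message   # flag it for the support rep
        priority = "major"                # bump priority / reroute notification
    return message, priority
```

The fourth identical alarm from the same robot inside the window comes back tagged `[REPEAT]` with its priority bumped; the first three pass through unchanged. In the real thing the update step would go back through whatever alarm-update mechanism your automation engine exposes rather than just returning values.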
It seems fairly easy to do. I was wondering if anyone else had done this and if so, how did you do it and was it successful?