shakeel.sorathia

Determine alarms that have increased severity since first alert

Discussion created by shakeel.sorathia on Jul 23, 2009
Latest reply on Jul 25, 2009 by keith_k
Hi, we're writing a partly automated SOP management system for our alarms using Lua. (Auto-operators alone wouldn't achieve what we need, given many of the rules we are trying to process.) The problem I'm currently running into is determining when an alarm has increased its severity. Let me give an example.

Let's say we have two thresholds set up for a disk usage alarm: warning at 80% and major at 90%. When the disk hits 80%, a warning is issued. Our NAS, with the Lua scripts, determines what to do with the alarm and who to inform based on a subscription model. In this case, the result is a simple email to the team that manages the system. Now suppose it's the middle of the night: the warning is sent, but no one is here to see it, so no one does anything about it. A couple of hours later the disk fills to 90% and a major alarm is sent. Under our model, a major alarm should go immediately to our 24/7 monitoring team, who would then reach out to the appropriate on-call people to deal with it. The problem is that the first time we saw the alarm, we escalated it via email to the team and assigned it that way. When the alarm later changes severity, I don't see a way to identify which alarms have changed.
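To make the model concrete, here is a minimal sketch of the routing described above. It is written in Python purely for illustration (our real scripts are Lua), and the severity names, routing targets, and function names are all hypothetical, not actual NAS values.

```python
# Hypothetical severity names; not actual NAS values.
WARNING = "warning"
MAJOR = "major"


def classify_disk_usage(percent_used):
    """Return the severity for a disk usage reading, or None below 80%."""
    if percent_used >= 90:
        return MAJOR
    if percent_used >= 80:
        return WARNING
    return None


def route(severity):
    """Pick a (hypothetical) notification target for a severity."""
    if severity == MAJOR:
        return "page-24x7-monitoring"
    if severity == WARNING:
        return "email-owning-team"
    return None
```

The gap is that `route` is only applied when we first see an alarm, so a later jump from warning to major never reaches the 24/7 team.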

I've looked at the alarm structure and I see a column for prevlevel, but that seems to reflect only the most recently suppressed alarm. If we got two majors for the same alarm before our script could run, we wouldn't catch the change, because prevlevel would be set to the same value as the current level. I've also looked at the transaction history, but pulling up the transaction history for every single alarm and comparing the current severity with the initial severity seems like a ton of processing work given the number of alarms that are normally active in our environment.
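As a rough sketch of the check we are after, the following compares each alarm's current level against the level our script last acted on, rather than against prevlevel. It is Python for illustration only; the dictionary keys, the numeric levels (higher meaning more severe), and the `last_handled` store are assumptions, and keeping such a store outside the NAS is just one possible approach, not something the product is known to provide.

```python
def find_escalated(alarms, last_handled):
    """Return IDs of alarms whose level is above the level we last acted on.

    alarms: list of dicts with hypothetical keys "id", "level", "prevlevel".
    last_handled: dict mapping alarm id to the level recorded on the
        previous run; it is updated in place with the current levels.
    """
    escalated = []
    for alarm in alarms:
        seen = last_handled.get(alarm["id"])
        # A new alarm (seen is None) is not an escalation; it is routed normally.
        if seen is not None and alarm["level"] > seen:
            escalated.append(alarm["id"])
        # Remember the current level for the next run.
        last_handled[alarm["id"]] = alarm["level"]
    return escalated


# Alarm "a1" was handled as a warning (2); two majors (4) then arrived
# before the script ran, so prevlevel already equals level.
last_handled = {"a1": 2}
alarms = [
    {"id": "a1", "level": 4, "prevlevel": 4},
    {"id": "a2", "level": 2, "prevlevel": 0},
]
escalated_ids = find_escalated(alarms, last_handled)
```

Here a prevlevel comparison would miss "a1", while the comparison against the remembered level still flags it. The cost is maintaining that extra state for every active alarm.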

Is there another way to do this, or does anyone have other ideas? One thing to mention: creating an auto-operator that fires when the disk alarm is major is not an option. We simply have too many different types of alarms to create AOs that would cover the whole matrix. However, if there were a way to create an AO or trigger that fires on a state change for any alarm, that would be acceptable, as we could feed it into our scripts for auto-escalation.

Any help would be appreciated.

Shakeel
