Ever wanted to detect and act on an event storm from logmon, ntevl or syslog?

Discussion created by carstein.seeberg on Dec 11, 2009

I spoke to our support guys the other day.  They brought up an issue that many customers want to resolve.  In short and simple language:

      ".... send alarm or do an action if N number of alerts happened in M number of minutes."
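The rule quoted above amounts to counting events inside a moving time window. Here is a minimal sketch of that idea in plain Lua, independent of the NAS. The function name `isEventStorm` and its parameters are hypothetical and only illustrate the logic; they are not part of any NAS API.

```lua
-- Hypothetical helper (not a NAS function).
-- timestamps: list of event times in seconds, e.g. values from os.time()
-- n:          number of events that counts as a storm
-- m:          size of the time window in minutes
-- now:        optional reference time in seconds (defaults to os.time())
-- Returns true plus the count when at least n events fall inside the window.
function isEventStorm(timestamps, n, m, now)
   now = now or os.time()
   local cutoff = now - m * 60
   local count = 0
   for _, t in ipairs(timestamps) do
      if t >= cutoff and t <= now then
         count = count + 1
      end
   end
   return count >= n, count
end
```

For example, `isEventStorm(times, 5, 15)` would ask whether at least 5 events happened in the last 15 minutes.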

I solved this with a single AO profile and a single script.  Every 5 minutes, the AO profile scans all open alarms coming from the 'ntevl' probe (this can be extended to 'ntevl|logmon' etc.) that have suppcount > 3.

NAS 3.31 (coming soon) allows AO profiles to pass arguments to the script.  The solution also works in the current NAS, because the script falls back to a default window of "15 minutes" when no argument is given.

As an example, the script escalates the matching alarm to MAJOR severity and generates a secondary alarm when it finds more than 5 suppressions within the time window.


-- Function to scan the transaction-log for the number of suppressions in a moving time window
-- Examples:
--  local nid = "UN29351917-74961"
--  printf("num. of suppressed transactions last 15 minutes (defaults)   : %d", numSuppAlarmsLast(nid))
--  printf("num. of suppressed transactions last 5 minutes (default unit): %d", numSuppAlarmsLast(nid,5) )
--  printf("num. of suppressed transactions last 15 min   : %d", numSuppAlarmsLast(nid,15,"minute") )
--  printf("num. of suppressed transactions last hour     : %d", numSuppAlarmsLast(nid,1, "hour") )
--  printf("num. of suppressed transactions last day      : %d", numSuppAlarmsLast(nid,1, "day") )
function numSuppAlarmsLast(nimid,num,unit)
   if nimid==nil then error ("numSuppAlarmsLast: no nimid!") end
   if num==nil then num=15 end
   if unit==nil then unit="minutes" end

   if unit~="minute" and unit~="hour" and unit~="day" and unit~="minutes" and unit~="hours" and unit~="days" then
      error ("numSuppAlarmsLast: unit is one of minute(s), hour(s) or day(s)!")
   end
   -- Count the suppression transactions (type = 2) for this alarm within the time window
   local sql = "SELECT COUNT(type) as nsupp FROM NAS_TRANSACTION_LOG WHERE nimid='"..nimid.."' AND type = 2 AND time >= datetime('now','localtime','-"..num.." "..unit.."')"
   local al  = alarm.query (sql)
   -- Assumes alarm.query returns a list of rows; the count is in the first row
   return al[1].nsupp
end

-- Fall back to a default window when no argument is passed (NAS older than 3.31)
if SCRIPT_ARGUMENT == nil then
   SCRIPT_ARGUMENT = "15 minutes"
end

-- NAS 3.31 supports AO arguments; expect the argument in the form: num unit
-- e.g. 15 minutes
args = split(SCRIPT_ARGUMENT)

-- Get the current alarm-record
a = alarm.get()
if a == nil then error ("Missing current alarm-record!") end

-- args[1] is the number, args[2] is the unit
n = numSuppAlarmsLast(a.nimid,args[1],args[2])

printf("%s has %d suppressed transactions the last %s %s", a.nimid, n, args[1], args[2])
-- More than 5 suppressions in the window: escalate and raise a secondary alarm
if n>5 then
   action.escalate (NIML_MAJOR,a.nimid)
   nimbus.alarm    (NIML_MAJOR,"Check the logs at '"..a.hostname.."'",a.nimid)
end
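
The script above relies on the NAS `split` function to break the "num unit" argument apart. If you want to validate the argument before building the SQL, the parsing could look like the sketch below. It uses only the standard Lua `string.match`; the function name `parseWindow` and its defaults are my own assumptions for illustration, not something the NAS provides.

```lua
-- Hypothetical helper (not a NAS function): parse an argument such as
-- "15 minutes" into a number and a unit, falling back to defaults.
function parseWindow(arg, defaultNum, defaultUnit)
   defaultNum  = defaultNum  or 15
   defaultUnit = defaultUnit or "minutes"
   if arg == nil then
      return defaultNum, defaultUnit
   end
   -- (%d+) captures the count, (%a*) captures an optional unit word
   local num, unit = string.match(arg, "^%s*(%d+)%s*(%a*)%s*$")
   if num == nil then
      error("parseWindow: expected '<num> <unit>', got '" .. tostring(arg) .. "'")
   end
   if unit == "" then
      unit = defaultUnit
   end
   return tonumber(num), unit
end
```

With this in place, the call could become `numSuppAlarmsLast(a.nimid, parseWindow(SCRIPT_ARGUMENT))`, which also rejects anything that does not start with a number.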