It would be nice if 'samples' could be a configuration option.
Check interval: 300 seconds (this is in place now)
Samples option: 3
= If the process is down for 3 consecutive 5-minute checks, then issue the process down alarm.
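To make the requested behavior concrete, here is a minimal Python sketch of the "samples" logic. It is not the probe's code; the class name `ConsecutiveSampleAlarm` and its `record` method are hypothetical, and it assumes one call per check interval.

```python
class ConsecutiveSampleAlarm:
    """Hypothetical sketch: alarm only after N consecutive failed checks."""

    def __init__(self, samples: int = 3):
        self.samples = samples      # the proposed "samples" option
        self.down_count = 0         # consecutive checks that found the process down

    def record(self, process_up: bool) -> bool:
        """Record one check result; return True if the alarm should be raised."""
        if process_up:
            # Any successful check breaks the streak.
            self.down_count = 0
            return False
        self.down_count += 1
        return self.down_count >= self.samples


# One call per 300-second check; only the third straight failure alarms.
alarm = ConsecutiveSampleAlarm(samples=3)
results = [alarm.record(up) for up in [False, False, True, False, False, False]]
```

With a 300-second interval and samples set to 3, the alarm fires only once the process has been seen down on three checks in a row, and a single good check resets the count.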
I shouldn't lump this in here but ... One addition to this applies if you are checking processes on a schedule and have PID tracking on: the current probe will alarm if the process ID changes outside the checking schedule.
The use case here is, say, you have a nightly restart of SQL Server that happens at 00:30, a profile that tracks sqlserver.exe, and a schedule that runs the test from 01:00 to 23:59.
With the current probe you get an error at 01:00 that sqlserver.exe restarted. This is undesirable because it is reporting an event that happened outside the scheduled window.
There should be an option to indicate that the checked PID should be reset at the start of the schedule to avoid this behavior.
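A rough Python sketch of what such an option could look like follows. This is an illustration only, not how the probe works internally: `PidTracker`, `reset_on_window_start`, and `check` are hypothetical names, and the sketch assumes a schedule window that does not cross midnight.

```python
from datetime import datetime, time
from typing import Optional


class PidTracker:
    """Hypothetical sketch: PID-change alarm that can re-baseline per window."""

    def __init__(self, start: time, end: time, reset_on_window_start: bool = True):
        self.start = start
        self.end = end
        self.reset_on_window_start = reset_on_window_start  # the proposed option
        self.last_pid: Optional[int] = None
        self.last_check: Optional[datetime] = None

    def check(self, now: datetime, pid: int) -> bool:
        """Return True if a 'process restarted' alarm should be raised."""
        if not (self.start <= now.time() <= self.end):
            return False  # outside the schedule: do nothing
        window_open = datetime.combine(now.date(), self.start)
        # First check of this window if the previous check predates its start.
        first_in_window = self.last_check is None or self.last_check < window_open
        self.last_check = now
        if first_in_window and self.reset_on_window_start:
            self.last_pid = pid  # re-baseline; ignore restarts outside the window
            return False
        changed = self.last_pid is not None and pid != self.last_pid
        self.last_pid = pid
        return changed


# Nightly restart at 00:30 changes the PID from 100 to 200.
tracker = PidTracker(time(1, 0), time(23, 59))
before = tracker.check(datetime(2024, 1, 1, 23, 55), 100)
after_restart = tracker.check(datetime(2024, 1, 2, 1, 0), 200)
```

With the option on, the 01:00 check simply adopts the new PID as the baseline; with it off, the same check reports a restart, which matches the current behavior described above.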
This is tangential to your issue because it raises a question: what should the behavior be if a test finds the process down once right at the end of the schedule? If it is then down for two samples at the beginning of the next window, do you generate an alert or not? I could argue it both ways; I'm not sure which is more right.
For your use case, I'd suggest that you set up an on-arrival AO profile that sets the alarm invisible if the count is < 3. It's not perfect, because processing delays can cause the alarm to appear and then disappear again, and you don't get the rolling 900-second window, but it's close. You might also be able to do this with the SLA processor and the QOS generated on the process status. That will do the rolling average for you, but it comes at a cost, and the processing is periodic, not instant. Time over threshold might do it too, though that's difficult to distribute to many systems; it is feasible for one.