We measured the success, in terms of time savings, of monitoring. We had ten years' worth of case history from manual problem detection, so we picked about a hundred problem classifications and analyzed how long it took from the point a problem began to the point it was resolved. We then went through the same exercise after monitoring had been in place long enough to support analysis. Since one of our hurdles was acceptance of automated monitoring, we needed to be able to show value. From these numbers we had a baseline against which we could show a theoretical number of downtime minutes avoided because of the presence of, and attention to, monitoring.
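To make the arithmetic concrete, here is a minimal sketch of that before/after comparison in Python. The problem classes, incident counts, and minute values are all invented for illustration; they are not our actual figures.

# Compare mean time-to-resolution per problem class, before and
# after monitoring, and estimate downtime minutes avoided per year.
# All class names and numbers below are hypothetical.
problem_classes = {
    # class: (mean_minutes_before, mean_minutes_after, incidents_per_year)
    "disk-full":       (240, 30, 12),
    "runaway-process": (180, 25, 8),
    "cert-expiry":     (480, 45, 4),
}

total_avoided = 0
for name, (before, after, per_year) in problem_classes.items():
    avoided = (before - after) * per_year
    total_avoided += avoided
    print(f"{name}: ~{avoided} downtime minutes avoided per year")

print(f"total: ~{total_avoided} downtime minutes avoided per year")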
We found close to a factor-of-eight difference in downtime between the two analyses. That really supported the idea that it is far easier to fix something while it is starting to go bad than when it is a smoking wreck in a crater.
We also try to measure same-day closures and similar metrics to evaluate performance, but it is difficult to tease out the impact of outside influences. And ultimately what really matters is whether the customer is happy, not how fast you do things.
-Garin