DX Unified Infrastructure Management

  • 1.  Measuring the Success of your team/tools

    Posted Jan 17, 2017 05:09 PM

    I am interested in hearing how my peers measure and report on the success of the teams that operate their monitoring tools.  We talk about incident avoidance, but I have never had a good feel for how to measure and report on that.  I am starting to do some work with MTTR of alarms (time of closure - time of origination), but that feels like a more relevant measure of the team(s) responding to an event than of my team.  So I am interested in how others may be leveraging tools like UIM to report metrics that demonstrate the success of the team.
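
    For reference, the MTTR calculation mentioned above (time of closure minus time of origination, averaged over alarms) could be sketched roughly as follows. This is only an illustration: the alarm records, field names, and timestamp format are hypothetical and are not taken from UIM's actual alarm schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical alarm records; field names are illustrative, not a UIM schema.
alarms = [
    {"id": "a1", "opened": "2017-01-10 08:00", "closed": "2017-01-10 08:45"},
    {"id": "a2", "opened": "2017-01-11 13:30", "closed": "2017-01-11 15:00"},
    {"id": "a3", "opened": "2017-01-12 22:15", "closed": "2017-01-12 22:30"},
]

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format


def minutes_to_resolve(alarm):
    """Time of closure minus time of origination, in minutes."""
    opened = datetime.strptime(alarm["opened"], FMT)
    closed = datetime.strptime(alarm["closed"], FMT)
    return (closed - opened).total_seconds() / 60


def mttr_minutes(alarm_list):
    """Mean time to resolve across all alarms, in minutes."""
    return mean(minutes_to_resolve(a) for a in alarm_list)


print(mttr_minutes(alarms))
```

    Grouping the same calculation by responding team or by alarm source would show whose response time the number really reflects.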



  • 2.  Re: Measuring the Success of your team/tools

    Posted Jan 18, 2017 07:43 PM

    We measure the success of monitoring as time saved. We had ten years' worth of case history from manual problem detection, so we picked a hundred or so problem classifications and analyzed how long it took from the point the problem was determined to have begun to its resolution. We then went through the same exercise after monitoring had been in place long enough to support analysis. Because one of our hurdles is acceptance of automated monitoring, we needed to be able to show value. These numbers gave us a baseline from which we could show a theoretical number of downtime minutes avoided because of the presence of, and attention to, monitoring.

     

    We actually found close to a factor of eight difference between the downtimes experienced in one analysis. That really supported the idea that it is way easier to fix something while it is starting to go bad than when it is a smoking wreck in a crater.
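
    A rough sketch of that before/after comparison, for anyone who wants to try it on their own case history. Everything here is a made-up illustration: the problem classifications, the durations, and the incident counts are hypothetical and are not our actual figures.

```python
# Hypothetical mean minutes from problem onset to resolution, per classification.
baseline = {"disk_full": 240, "service_hung": 160}   # manual detection era
monitored = {"disk_full": 30, "service_hung": 20}    # with monitoring in place

# Hypothetical incident counts observed during the monitored period.
incidents = {"disk_full": 12, "service_hung": 5}


def downtime_avoided(baseline, monitored, incidents):
    """Theoretical downtime minutes avoided: per-incident saving times count."""
    return sum(
        (baseline[c] - monitored[c]) * incidents[c]
        for c in incidents
    )


def improvement_factor(baseline, monitored, classification):
    """Ratio of pre-monitoring to post-monitoring resolution time."""
    return baseline[classification] / monitored[classification]


print(downtime_avoided(baseline, monitored, incidents))
print(improvement_factor(baseline, monitored, "disk_full"))
```

    The result is only as good as the baseline: it assumes the same mix of problems would have occurred, and taken just as long, without monitoring.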

     

    We try to measure same-day closures and similar metrics to evaluate performance, but it is difficult to tease out the impact of outside influences. And ultimately what really matters is whether the customer is happy, not how fast you do things.

     

    -Garin