I've been trying for ages to write a blog post about something Keith Evans mentioned in a conversation about extreme application monitoring. I'm not quite there yet, so this is very much a draft and a collection of thoughts. Basically, it's about a top-down approach to extreme application and service monitoring, one that is more surgical and precise than deduction from a massive body of evidence. Anyway, here it is...
I’ve been thinking a lot about something a colleague said to me recently: that in the pursuit of knowledge and truth, deduction is good but being a witness is better. What does that really mean? Well, let’s take a look at the eminent, although fictional, detective Sherlock Holmes for a moment. His approach is to examine every aspect of a case in great detail, using a great mind and acquired knowledge. Only after all the evidence has been collected and processed does he deduce that ‘the butler did it’, although experts would argue this is rare and the young independent governess is more likely. Even so, whoever ‘dun it’, the process requires the same components: lots of data, a great mind and knowledge. Roll forward some 100 years to the case of Colin Pitchfork, the first person to be identified and convicted by a simple test. Motive, opportunity and all previous forensic evidence had failed to single him out, yet his DNA trail and a simple test offered him up as the perpetrator.
All very interesting, but what’s that got to do with IT and, in particular, Enterprise Management? Well, there are some very obvious parallels. The methods we use to localise problems and identify issues offer the same challenges. I often talk to my customers about only monitoring that which requires monitoring and only that which offers a useful outcome. In my experience, some 80% of all monitoring data collected is pointless ballast, and with the arrival of cheaper storage this is going up at an alarming rate. We often talk of big data as being the norm these days, but that should mean lots and lots of useful stuff that offers a valuable outcome. It should not be a ‘just store everything and hope we catch what we think we might be looking for’ solution. Yes, Sherlock Holmes does gather large amounts of data, but he does it in a very guided and forensic way, and he gets results because he knows how to process it. Can we honestly say, when we are gathering monitoring data, that we know what we are going to do with it all?
If we take this one step further, why not just use a simple test to identify the problem? In practical terms, where can we be more like Alec Jeffreys, with his simple but very clever test, and less like Sherlock, with his need for data, complex processing and intrinsic knowledge? For me, it’s twofold.
First, take a very close look and ask yourself what you are trying to monitor. Break down what you need to know, find out what’s going to be useful, then go and get it. Second, rather than trying to correlate ten or twenty events to deduce whether something is going wrong or broken, just devise a simple test to see if that’s the case. For instance, you can measure 100 different metrics relating to a platform and from them deduce that the customer experience is poor. It may take you many hours to devise and maintain the correlation, and this could be an ongoing and onerous task in a very dynamic environment: new challenges of the same ilk will come up every week or month as your business demands new features or capability. Or you could just measure the customer experience and see straight away that there is a problem. A simple test or measurement takes away the need to employ an expensive detective and replaces it with a low-cost and ultimately more accurate alternative.
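To make the 'simple test' idea a bit more concrete, here is a rough sketch in Python of what measuring the customer experience directly might look like. It is only an illustration under my own assumptions: the names check_customer_experience and CheckResult, the two-second threshold and the shop.example.com address are all made up for the example, and none of it reflects how CA APM or AXA actually work.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool         # True when the transaction succeeded within the threshold
    seconds: float   # measured duration of the transaction
    detail: str      # short explanation: "ok", "slow" or "failed: ..."


def check_customer_experience(
    transaction: Callable[[], None],
    threshold_seconds: float = 2.0,               # assumed threshold, tune to taste
    clock: Callable[[], float] = time.perf_counter,
) -> CheckResult:
    """Run one customer-style transaction and judge the outcome directly."""
    start = clock()
    try:
        transaction()
    except Exception as exc:
        # The customer could not complete the transaction at all.
        return CheckResult(False, clock() - start, f"failed: {exc}")
    elapsed = clock() - start
    if elapsed > threshold_seconds:
        # It worked, but too slowly to count as a good experience.
        return CheckResult(False, elapsed, "slow")
    return CheckResult(True, elapsed, "ok")


def load_home_page() -> None:
    """Hypothetical transaction: fetch a placeholder home page."""
    with urllib.request.urlopen("https://shop.example.com/", timeout=5) as response:
        response.read()
```

Calling check_customer_experience(load_home_page) on a schedule gives one direct answer (ok, slow or failed) in place of a correlation across a hundred platform metrics. The detail behind a failure can still be dug out afterwards, once you know there is a problem worth investigating.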
So where does this all take us? Well, for me, forensic examination of applications, and simple testing and measurement of the same, should be your focus. Let the motive, the opportunity and the detail come later; cut to the chase and work out 'whodunit' first. Make Application Performance Monitoring and analytics, with tools such as CA APM and AXA, your first port of call to catch the perp: start at the top and work your way down. Find out what’s hurting your business and who is affected first, and manage it. Don’t spend hours trying to deduce whether your key applications and services are down or impacted; test them and be sure.
Next week… Removing the beard of complexity with Occam’s Razor!