Thank you, Michael. I've tried to apply the concepts from your "APM Best Practices" book, but our APM implementation took a direction where it has been hammered into place, sort of a square peg into a round hole.
The perspective from your context and questions certainly helps bring this into focus and find a target.
Currently the majority of the alert structure within our APM is based on custom epagent plugin metrics (df/netstat/vmstat/lparstat/free/lsps/etc.), with maybe the average response time being alerted on. As a result, the APM isn't really used by the business folks, application development, or the product support folks, but by the system groups (WebSphere, Unix). That, in my opinion, is what caused the square-peg-in-a-round-hole issue.
The next issue with our APM implementation is the company's aversion to using anything that is not contractually supported. The field packs that provide transaction linkage between different points? Not allowed. Customizing an agent to tie the transaction ID to the next layer? Nope. So we typically get one, maybe two jumps within a JVM to the databases or a socket before the trace stops. Hopefully, with our pending upgrade to 10.5, I can get cross-JVM tracing with RMI calls to provide a couple more trace steps.
- What have been our first priorities that we employ APM visibility for?
Over 90% of our current APM position (alerts/dashboards) is for the middle tier: WebSphere, MQ, and a bit down to the databases. Over the last few years I have tried to provide more application and services (SOA/web services) metrics and alerts, but have gotten very little traction in that area.
- What visibility gaps remain?
Back in 9.6, when TIM was only supported on Red Hat, we shut down TIM/CEM because at that time we did not have a Red Hat support contract. Even once we did have one, we couldn't get enough business attention to identify, configure, and provide useful business metrics. So the first gap is from the end user to the front door; the browser agent or the mobile agent, neither of which we have deployed, would provide coverage of those points. The next gap is the distance between the application and the physical/virtual world of CPU/RAM/NIC/disk. We have been developing plugins for the epagent to provide that coverage. Through a long story we also have CA UIM, but we got blocked because the robots have to be installed as root. So with that, we keep expanding the epagents.
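To give a feel for what one of these epagent plugins looks like, here is a minimal sketch in Python. It is an illustration, not our production code. It assumes the EPAgent stateless-plugin convention of printing simple XML metric lines to stdout, and the metric path `Filesystem|<mount>:Used Percent` and the function names are hypothetical.

```python
import subprocess
from xml.sax.saxutils import quoteattr


def parse_df(df_output):
    """Parse `df -P` text into (mount_point, used_percent) pairs."""
    rows = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore malformed lines
        capacity = fields[4].rstrip("%")
        if not capacity.isdigit():
            continue  # e.g. "-" for pseudo filesystems
        mount = " ".join(fields[5:])  # tolerate spaces in the mount point
        rows.append((mount, int(capacity)))
    return rows


def to_metric_lines(rows):
    """Format pairs as EPAgent-style metric lines (format is an assumption)."""
    lines = []
    for mount, used in rows:
        # Hypothetical metric path; "|" separates nodes, ":" precedes the metric name.
        name = f"Filesystem|{mount}:Used Percent"
        lines.append(
            f'<metric type="IntCounter" name={quoteattr(name)} value="{used}" />'
        )
    return lines


def main():
    # -P asks df for the portable one-line-per-filesystem layout.
    result = subprocess.run(
        ["df", "-P"], capture_output=True, text=True, check=True
    )
    for line in to_metric_lines(parse_df(result.stdout)):
        print(line)
```

When deployed as a plugin, the script would end with a call to `main()` so that each scheduled run prints one batch of metric lines. The same parse-then-format split would apply to vmstat, netstat, or lparstat output, with only the parser changing.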
- What business value is enhanced by bringing APM visibility (all your tools) to more people?
Only in the last few months have I been able to really engage more than the mid-tier and database groups. So I am working toward providing focused dashboards and alerts to specific product owners. The three or four meetings I have had with one of the business product owners were fairly successful. The base dashboard has the mainframe CPU as its centerpiece, since when the CPU approaches high levels, the average response time of the primary service shoots up and end users start to complain. From there, it shows the service's five KPIs (average response time, concurrent invocations, responses, stalls, errors), with alerts on the average response time since that is typically the driver of the application's issues.
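The logic behind that dashboard can be sketched in a few lines of Python. This is only an illustration of the idea (alert on average response time, then check whether mainframe CPU is the likely cause); the threshold values, field names, and the `evaluate` function are all hypothetical and not taken from our actual alert definitions.

```python
from dataclasses import dataclass


@dataclass
class ServiceKpis:
    """The five per-service KPIs shown on the dashboard (names are illustrative)."""
    avg_response_ms: float
    concurrent_invocations: int
    responses_per_interval: int
    stall_count: int
    errors_per_interval: int


def evaluate(kpis, mainframe_cpu_pct,
             caution_ms=2000.0, danger_ms=5000.0, cpu_high_pct=90.0):
    """Return (status, cpu_suspected) for one reporting interval.

    All thresholds are made-up defaults for the sketch.
    """
    # The alert itself is driven only by average response time.
    if kpis.avg_response_ms >= danger_ms:
        status = "danger"
    elif kpis.avg_response_ms >= caution_ms:
        status = "caution"
    else:
        status = "normal"
    # Flag mainframe CPU as the suspected cause only when the alert has fired.
    cpu_suspected = status != "normal" and mainframe_cpu_pct >= cpu_high_pct
    return status, cpu_suspected
```

For example, `evaluate(ServiceKpis(6000, 40, 120, 2, 1), 95)` would report a danger state with CPU flagged as the suspect, which is the situation the product owner cares about most.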
Now the new kid on the block is coming onto the court: Splunk. I know very little about Splunk in the grand scheme of things, and it really doesn't help when I get cornered about it. Today it was Splunk Glass Tables, and I had to try to work out whether they just saw a bunch of pretty lights and don't really have a business need or case, and, if so, talk them back from the edge of buying a disco ball before they first learn to dance.