Thomas Johnston is exhausted after spending a good part of the weekend battling an APM outage. After some configuration adjustments and an EM/agent restart, peace is restored and he can get some overdue sleep. All is good until the next issue needs resolving. But a wise person, who did not know much about technology at the time, once told him, "Be grateful for the problem. You will learn something from it and will work hard to ensure that it never happens again."
In this blog, I discuss how APM problems and outages are really opportunities for improvement. This can take place in several phases.
1: Lessons Learned
Going through a "lessons learned" group exercise is key for ongoing success. This includes reviewing:
- What was supposed to happen?
- What actually did happen?
- What went well during the problem-solving phase?
- What did not go well?
- What can be done better or differently for the next problem?
If there are improvements to processes and resources as a result, then this is time well spent.
The same thing should be done at a personal level. Questions that can be asked are:
- How long did it take to determine what the problem was? What steps were important in achieving this?
- Were there questions that I could have asked or things that I could have done to find the issue more quickly?
- Are there subject areas that I need to brush up on?
- Are there third-party troubleshooting tools/field packs that I should consider using?
2: Short- and Mid-Term Fixes
Short-, mid-, and long-term stabilization measures should be investigated. These may include:
- Upgrading the EM, but not the agents, to interim releases.
- Adding new hardware to replace existing servers or to take on part of the current cluster load.
- Doing ongoing health checks, "oil changes," and architectural reviews to evaluate needed changes and optimize the environment.
- Creating or evaluating a personal APM training plan covering new features, performance, troubleshooting, and optimization.
- Cleaning up network traffic and private keys for the TIM.
3: Longer-Term Fixes
Most sites have a longer-term plan. It typically includes:
- Cluster upgrades, including agents, to major or more current releases.
- Ongoing cluster health trend analysis and capacity planning
- Having lifecycle APM cluster environments to act as a sandbox for suggested performance changes and to test future releases
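To make the trend analysis and capacity planning item a little more concrete, here is a minimal Python sketch of one way it could be approached: fit a straight line to a series of periodic samples and estimate how many more periods remain before a limit is reached. The function names, the weekly sample values, and the limit are all my own assumptions for illustration. None of them come from APM itself, and real growth is rarely this linear.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values against their index (0, 1, 2, ...).

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many periods after the last sample the trend hits limit.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the limit.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    last_index = len(values) - 1
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - last_index)


if __name__ == "__main__":
    # Hypothetical weekly samples of a cluster load figure (made-up numbers).
    weekly_load = [100, 110, 120, 130]
    # Hypothetical ceiling chosen purely for this example.
    print(periods_until_limit(weekly_load, limit=160))
```

A sketch like this only flags when a conversation about hardware or upgrades should start. The actual limits for a cluster should come from the vendor's sizing guidance and your own health checks.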
Hopefully, this has been a helpful discussion. I would love to hear your comments on how past problems became opportunities for improvement.