CA Tuesday Tip: Six Additional APM General Troubleshooting Approaches
In a prior Tuesday Tip, I reviewed two approaches that can be used to troubleshoot integration issues or any other type of APM issue: the Visual Inspection approach (i.e., see which metrics and functionality are showing up and which are not) and the Functional Workflow approach (i.e., which steps in a workflow completed, which ones failed, and which servers are involved between the last successful step and the next one). But in thinking it through, there are at least six other general approaches that can be used together or separately.
1. Recent Change/No Known Event approach
Knowing the date that an issue first happened can be a big clue as to the root cause. Once that date is determined, the configuration and other files created, changed, or deleted after it may need a review. Also note whether any logs stopped being created or updated after that date. Other things to check are third-party software (such as virus checkers), patches, and permission changes around that timeframe. If there is a log book of administrative changes for the server(s), that can be studied as well.
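As a rough illustration of the file review described above, here is a minimal Python sketch that lists files under a directory that were modified on or after a given date. The function name files_changed_since and the example path are hypothetical, and modification times only reveal created or changed files; deleted files would have to be found another way, such as a backup comparison or the administrative log book.

```python
from datetime import datetime
from pathlib import Path


def files_changed_since(root, since):
    """Return (path, modified_time) pairs for files under root
    that were modified at or after `since`, oldest first."""
    cutoff = since.timestamp()
    changed = []
    for path in Path(root).rglob("*"):
        try:
            if not path.is_file():
                continue
            mtime = path.stat().st_mtime
        except OSError:
            # Unreadable entries (e.g. permission problems) are skipped here;
            # they may deserve a separate look.
            continue
        if mtime >= cutoff:
            changed.append((path, datetime.fromtimestamp(mtime)))
    return sorted(changed, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical install directory and issue date; substitute your own.
    for path, when in files_changed_since("/opt/apm/config", datetime(2024, 1, 15)):
        print(when.isoformat(sep=" ", timespec="seconds"), path)
```

The output is sorted by time so the first changes after the issue date appear at the top.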
2. Other Server approach
Application, directory, database, and other servers can impact APM operations. If there is a known time period for the issue, the logs on these servers can be checked for that period to assess their impact on APM.
3. Event Correlation approach
This approach applies when an outage took place or unexpected behavior happened after an installation or upgrade. It somewhat combines the Recent Change and Other Server approaches but focuses on changes and logs for a specific time period. The focus is not necessarily on environmental changes but on what happened around the time of the event. Two commonly overlooked steps are checking whether the same behavior or set of errors occurred before the event, and checking what was going on with other servers at that time rather than ignoring it.
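The "did this error already occur before the event?" check can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes each log line starts with a timestamp in the form YYYY-MM-DD HH:MM:SS, which may not match your APM or server logs, and the function names are hypothetical. Lines without a parsable timestamp (such as stack trace continuations) are simply skipped.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each line; adjust to the real logs.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def split_around_event(lines, event_time, window):
    """Split messages into those logged within `window` before the event
    and those logged within `window` after it."""
    before, after = [], []
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue
        message = line[TIMESTAMP_LENGTH:].strip()
        if event_time - window <= stamp < event_time:
            before.append(message)
        elif event_time <= stamp <= event_time + window:
            after.append(message)
    return before, after


def new_after_event(lines, event_time, window):
    """Return distinct messages seen after the event but not before it."""
    before, after = split_around_event(lines, event_time, window)
    seen = set(before)
    return [m for m in dict.fromkeys(after) if m not in seen]
```

Running new_after_event over the logs of each involved server with the same event time and window gives a quick first pass at what actually changed around the event, as opposed to errors that were already present.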
4. New area/old area approach
A new type of application is monitored or functionality is implemented and is not working as expected or at all. Is there anything that can be learned from already monitored or existing functionality that can help? Do the differences between these applications/functionality give a clue on what might not be working?
5. Eliminate network/performance/architecture approach
The more complex the APM cluster(s), the more things need to be eliminated. If the entire cluster is impacted, then the cause is often a network data quality issue, a performance issue due to an unoptimized environment, or a non-scalable and limited environment. Quickly eliminating these three as a cause can save time and frustration in troubleshooting.
6. What's different approach
Customers have two or more different environments or agents. One or more are working, and one or more are not. Comparing the files and directories of a working environment against a non-working one helps in zeroing in on the root cause. The problem with this approach is that different environments have different characteristics, such as load and versions, that can result in misleading conclusions.
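A file and directory comparison like the one described above can be sketched with Python's standard filecmp module. The function name compare_trees and the labels "working" and "failing" are illustrative assumptions, not part of any APM tooling. Contents are compared byte by byte (shallow=False), so files that merely share a size and timestamp are not mistaken for identical ones.

```python
import filecmp
from pathlib import Path


def compare_trees(working, failing, rel=""):
    """Yield (kind, relative_path) for each difference between two
    directory trees, recursing into directories present on both sides."""
    cmp = filecmp.dircmp(working, failing)
    for name in cmp.left_only:
        yield ("only in working", (Path(rel) / name).as_posix())
    for name in cmp.right_only:
        yield ("only in failing", (Path(rel) / name).as_posix())
    # Byte-by-byte comparison of files that exist on both sides.
    _, mismatch, errors = filecmp.cmpfiles(
        working, failing, cmp.common_files, shallow=False
    )
    for name in mismatch:
        yield ("content differs", (Path(rel) / name).as_posix())
    for name in errors:
        yield ("could not compare", (Path(rel) / name).as_posix())
    for name in cmp.common_dirs:
        yield from compare_trees(
            Path(working) / name, Path(failing) / name, Path(rel) / name
        )
```

The result is only a list of candidates. Differences that are expected between environments, such as host names, ports, or version-specific files, still need to be filtered out by hand before drawing conclusions.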
1. Which of these six approaches are you using? In what types of situations do you use each of them?
2. Do you have your own APM troubleshooting approach that you wish to share?
3. Are there other topics that you wish to see covered?