
APM Tech Tip: Cascading Issues

Question asked by Hallett_German Employee on Dec 6, 2014

CA Tech Tip: Cascading APM Problems


Introduction
 
While current and forthcoming APM troubleshooting material covers common problems and their resolutions, it doesn't always cover the complex area of
cascading issues. I define the term as follows:

 

A chain of events, usually sequential, across multiple servers that results in various problems and symptoms. These may recur over time.


For APM CE (CEM), this is one typical sequence:

- A TIM Collector that is either underpowered or has Introscope Agents connecting to it becomes overloaded
- The TIM Collector stops communicating with the TIM, and a 4xx/5xx error is reported on the Monitors tab
- Defects, btstats (RTTM), stats and other files back up on the TIM in /etc/wily/cem/tim/data/out/...
- CEM Reports are not produced
- A call comes to APM Support
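
One way to catch the backlog step in this sequence before it reaches Support is to watch how many files are waiting in the TIM output directory and how old the oldest one is. The following is only a minimal sketch under stated assumptions: the function name backlog_summary is hypothetical, it is not part of any APM tooling, and the directory to check is passed in by the caller rather than assumed here.

```python
import os
import time


def backlog_summary(out_dir, now=None):
    """Count files under out_dir and report the age (seconds) of the oldest one.

    Hypothetical helper: out_dir is whatever output directory you want to watch.
    """
    now = time.time() if now is None else now
    count = 0
    oldest_mtime = None
    for root, _dirs, files in os.walk(out_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # The file was moved or deleted while we were scanning; skip it.
                continue
            count += 1
            if oldest_mtime is None or mtime < oldest_mtime:
                oldest_mtime = mtime
    oldest_age = 0.0 if oldest_mtime is None else max(0.0, now - oldest_mtime)
    return {"files": count, "oldest_age_seconds": oldest_age}
```

A file count that keeps growing between two runs, or an oldest file that keeps ageing, suggests the files are not being collected. What counts as "too many" or "too old" depends on the environment, so no threshold is built in.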

 

  For Introscope, there are similar sequences involving MOMs, Collectors, Agents, and other components (for example, load balancing or overloaded EM issues).

 


How to work this issue


- Although various approaches can be used, I like the functional-workflow approach that I have described in earlier tips.
   If I know the function of a server and the APM components it communicates with, then I can typically home in on an issue quickly.


- By gathering the logs from all the impacted servers, one can perform event correlation to determine which events were happening on each server.
- Break the problem into multiple issues and prioritize them. For this case, I would break it into two tickets/issues:

   * Get the files off the TIM by restarting the TIM Collector or disabling/re-enabling the TIM object
   * Clean up the Stats Aggregation issues
  
If relevant, I would include performance/architecture recommendations for the customer to address. These could include upgrades/hot fixes
that usually resolve the issue. Addressing these concerns should make the issue less likely to recur.
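
The event correlation step mentioned above can be approximated by merging the log lines from every impacted server into a single timeline. This is a rough sketch, not an APM utility: the function names are hypothetical, and it assumes each log line starts with a timestamp in the form "YYYY-MM-DD HH:MM:SS". Real logs may use a different layout, so TS_FORMAT and the slice length would need adjusting, and server clocks need to be reasonably in sync for the ordering to mean anything.

```python
from datetime import datetime

# Assumed timestamp layout at the start of each log line; adjust for your logs.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19  # number of characters in a timestamp of that layout


def parse_events(server, lines):
    """Return (timestamp, server, message) tuples for lines that start with a timestamp."""
    events = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            # Continuation lines, stack traces, etc. have no leading timestamp; skip them.
            continue
        events.append((ts, server, line[TS_LENGTH:].strip()))
    return events


def correlate(logs_by_server):
    """Merge events from all servers into one list ordered by timestamp."""
    merged = []
    for server, lines in logs_by_server.items():
        merged.extend(parse_events(server, lines))
    # sort() is stable, so events with identical timestamps keep their input order.
    merged.sort(key=lambda event: event[0])
    return merged
```

Reading the merged timeline from top to bottom makes it easier to see which server showed the first symptom and which ones followed, which in turn helps decide how to split the work into separate tickets.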

 

Questions for Discussion:
1) What cascading issues have you encountered, or are you encountering now?
2) Which overall approach did you use to resolve them?
3) What other troubleshooting topics would you like to see covered in Tech Tips?
