CA Tuesday Tip: Java Agent Hangs/Crashes, High Overhead/CPU - Checklist

Discussion created by SergioMorales Employee on Apr 12, 2011
Latest reply on Apr 12, 2011 by Chris_Hackett
CA Wily Tuesday Tip by Sergio Morales - Principal Support Engineer for 04/12/2011

This is another set of similar, common Agent problems reported to Support. In the majority of cases they are related to one of the following causes:

1. Unsupported configuration
2. Known agent memory overhead introduced by an architectural change in the 8.x Agent (fixed in 8.2 and later versions)
3. A JVM bug that is exposed when Wily is enabled but is NOT a Wily bug
4. Instrumentation/Metric explosion

How to troubleshoot these types of problems?

1. Make sure that the configuration is supported.

Officially, we only provide technical support for products that are listed in our compatibility guides. For these we have test environments and backing from Sustaining, QA and Development. Unofficially, if you are testing product functionality and submit an incident for an unsupported environment, we will take our best shot at it, and we often do get these solved. However, there is a limit to what we can do. If you need a specific configuration to be supported, please open an enhancement request with Wily Support. Also, please note that the compatibility guides are now available from CSO.

2. Upgrade to version 8.2 or higher and make sure you set introscope.agent.reduceAgentMemoryOverhead=true in the IntroscopeAgent.profile

3. Find out if the problem is related to the instrumentation or a JVM bug:

- Stop the Appserver
- Open the IntroscopeAgent profile and set introscope.autoprobe.enable=false
- Start the appserver again.
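
The profile edit in the steps above can be scripted when many agents need the same change. The following is a minimal Python sketch, not a CA-provided tool: `set_profile_property` is a hypothetical helper, and it assumes the profile is a plain key=value properties file in which commented-out lines start with `#`.

```python
import re


def set_profile_property(profile_text, key, value):
    """Return profile_text with every active `key=...` line set to `value`.

    Assumption: the profile uses a plain key=value layout. Commented-out
    lines are left untouched. If no active entry exists, one is appended.
    """
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    new_line = f"{key}={value}"
    lines = profile_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Illustrative profile content, not a real IntroscopeAgent.profile.
sample = "introscope.agent.jmx.enable=true\nintroscope.autoprobe.enable=true\n"
print(set_profile_property(sample, "introscope.autoprobe.enable", "false"))
```

Keep a backup of the original profile before writing the result back, and remember that the change only takes effect after the application server is restarted.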

If the problem persists, this confirms that it is NOT related to the Wily instrumentation. In that case:
- Try switching from -javaagent to -Xbootclasspath
- Upgrade to the latest JVM version or use an alternate JVM (for example, Sun Java 6)
- Open a support incident with the JVM vendor.

If you are using WebSphere with the IBM J9 JVM and Introscope 7.x or 8.x, please refer to KB#

4. Temporarily reduce the amount of instrumentation. This will help you identify the culprit.

- Stop the Appserver
- Open the IntroscopeAgent profile and set introscope.autoprobe.enable=true (this re-enables the instrumentation that was disabled in step 3)

a. Some applications use a very large number of unique SQL statement strings, especially if the SQL is constructed dynamically. This leads to an explosion in SQLAgent metrics.

Disable SQLAgent by moving AGENT_HOME/ext/SQLAgent.jar out of the Agent directory. If this is not possible, set introscope.agent.sqlagent.sql.maxlength=120 (default value 990). There is no limit on the length of SQL statements other than whatever limits the database itself imposes; maxlength truncates the SQL statement text in order to prevent a SQL metric explosion.
If this is not suitable for your case, you might want to use the new RegexSqlNormalizer feature, which uses regex patterns and replace formats to normalize the SQL in a customized way. For more information, please refer to the Agent User Guide.
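
To see why truncation alone may not be enough and why normalization helps, here is a small Python illustration. It is a sketch of the general idea only, not the agent's RegexSqlNormalizer implementation: the `truncate` and `normalize` helpers, the two regex patterns, and the sample statements are all made up for this example.

```python
import re

MAX_LENGTH = 120  # same value as the maxlength setting suggested above


def truncate(sql, max_length=MAX_LENGTH):
    """Cut the statement text at max_length characters."""
    return sql[:max_length]


def normalize(sql):
    """Replace literals with '?' so statements that differ only in values collapse."""
    sql = re.sub(r"'[^']*'", "?", sql)  # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)  # standalone numeric literals
    return sql


# Dynamically built SQL: the same query text with 1000 different ids.
statements = [f"SELECT name FROM users WHERE id = {i}" for i in range(1000)]

print("raw:", len(set(statements)))
print("truncated:", len({truncate(s) for s in statements}))
print("normalized:", len({normalize(s) for s in statements}))
```

In this example the varying value appears well before character 120, so truncation leaves every statement distinct, while replacing the literal collapses them all into a single statement string. Truncation only reduces the metric count when the varying part of the SQL lies beyond the cutoff.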

b. Disable the Platform Monitor: move the appropriate platform monitor files from the /wily/ext directory to another directory.

c. If applicable, disable JMX collection by setting introscope.agent.jmx.enable=false.
If this is not possible, test the issue using the default filters provided by the Agent. Remember that polling lots of JMX metrics is CPU intensive. Do not set the filter to <empty>; without a filter, an agent can produce thousands of JMX metrics.

d. Turn off the tracers for network, filesystem and System File Metrics in the toggles PBD.
These tracers are not recommended in Production. We have found that metric explosion in the SmartStor database most often comes from JMX, SQL or sockets:

#TurnOn: SocketTracing / #TurnOn: ManagedSocketTracing
#TurnOn: UDPTracing
#TurnOn: FileSystemTracing

e. Disable any additional Introscope Agent add-ons, such as ChangeDetector, LeakHunter and PowerPacks.
Add-ons and PowerPacks provide valuable metrics, but they generate a large number of them, which adds overhead.

f. Disable any custom PBD/instrumentation you have recently added to your configuration.
Avoid the directives TraceAllMethodsOfClass and TraceComplexMethodsOfClass; choose carefully which methods to monitor.

Remember that these are temporary changes. If the problem does not occur with the base agent, reintroduce each component one by one until you reproduce the problem. If, however, the problem persists even with the base agent, contact Support and provide the following information:

- Zipped content of AGENT_HOME/logs
- IntroscopeAgent.profile
- A series of 5 thread dumps from the application server, spaced 5-10 seconds apart, taken while the overhead/OOM/hang/high CPU occurs.
- In case of memory overhead, a heap dump. For example, on a Sun JVM, add the following JVM switch: -XX:+HeapDumpOnOutOfMemoryError
- GC log. For example, add the following JVM switches: -Xloggc:<filename>.log -XX:+PrintGCDetails
- Appserver logs.
- A listing of the contents of the AGENT_HOME directory
- Full core dump, if applicable.
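
Collecting the series of thread dumps by hand while the problem is happening is easy to get wrong, so it can help to script it. The sketch below is one possible approach, not an official procedure: it assumes a Sun JDK whose `jstack` tool is on the PATH and a known process id for the application server. `collect_thread_dumps` and its parameters are hypothetical names chosen for this example. On other JVMs, a different mechanism (for example `kill -3 <pid>` on UNIX) is typically used instead.

```python
import os
import subprocess
import time
from datetime import datetime


def collect_thread_dumps(pid, count=5, interval_seconds=5, out_dir=".",
                         run=subprocess.run):
    """Write `count` thread dumps for `pid`, `interval_seconds` apart.

    Assumption: `jstack <pid>` is available (Sun JDK). The `run` parameter
    exists only so the external command can be replaced when testing.
    Returns the list of files written.
    """
    paths = []
    for n in range(1, count + 1):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        path = os.path.join(out_dir, f"threaddump-{pid}-{n}-{stamp}.txt")
        with open(path, "w") as out:
            # Send both stdout and stderr to the file so errors are kept too.
            run(["jstack", str(pid)], stdout=out, stderr=subprocess.STDOUT,
                check=False)
        paths.append(path)
        if n < count:
            time.sleep(interval_seconds)
    return paths


# Example call (12345 is a placeholder process id):
# collect_thread_dumps(12345, count=5, interval_seconds=5)
```

Run it only while the hang, high CPU or overhead is actually occurring, and attach the resulting files to the support incident along with the other items listed above.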

Please make sure to move all existing Introscope log files to another location before starting the tests, so that the logs you send reflect only your latest tests.

Thank you,