Hi Muru,
Without looking at the logs, the database, and some supportability metrics, it is impossible to know the root cause.
Here is my checklist of suggestions:
1. Ensure you have allocated enough memory. Look at EM_HOME/logs/perflog.txt: rename it to .csv, open it in Excel, and check the Performance.JVM.FreeMemory column. See https://support.ca.com/us/knowledge-base-articles.tec1230732.html
Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in Introscope_Enterprise_Manager.lax or EMService.conf. Because no heap expansion or contraction occurs, this can yield significant performance gains in some situations.
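To make the perflog check above concrete, here is a minimal Python sketch. It assumes the perflog is comma-separated with a header row containing the Performance.JVM.FreeMemory column (which is why renaming it to .csv works); the sample data below is made up for illustration, not real EM output.

```python
import csv
import io

# Hypothetical perflog sample in CSV form; the real file has more columns,
# but Performance.JVM.FreeMemory is the one to watch.
sample = """Timestamp,Performance.JVM.FreeMemory,Performance.JVM.TotalMemory
2017-01-01 10:00:00,512000000,2048000000
2017-01-01 10:00:15,128000000,2048000000
2017-01-01 10:00:30,64000000,2048000000
"""

def min_free_memory(csv_text):
    """Return the lowest Performance.JVM.FreeMemory value (bytes) seen in the log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return min(int(row["Performance.JVM.FreeMemory"]) for row in reader)

lowest = min_free_memory(sample)
print(lowest)                      # lowest free heap observed, in bytes
print(lowest < 256 * 1024 * 1024)  # True if the EM dipped below 256 MB free
```

If free memory keeps trending toward zero across timeslices, that is a strong hint the heap settings in the .lax/.conf file are too small.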
2. Management modules should only be deployed on the MOM. Make sure the collectors do not start with any MMs, to prevent unnecessary extra load on the cluster.
3. Improve GC by switching from the Concurrent Mark Sweep (CMS) collector to G1. If you are using JVM 1.8, try the G1 GC option; other customers have reported that it greatly improves performance.
The G1 collector is recommended for applications requiring large heaps with limited GC latency requirements. In the lax.nl.java.option.additional property of Introscope_Enterprise_Manager.lax, replace the settings below:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50
With just this:
-XX:+UseG1GC
4. If the MOM is running on UNIX, ensure lax.stdin.redirect in the Introscope_Enterprise_Manager.lax is unset.
From https://docops.ca.com/ca-apm/10-5/en/administrating/configure-enterprise-manager/start-or-stop-the-enterprise-manager#StartorStoptheEnterpriseManager-RuntheEnterpriseManagerinnohupModeonUNIX
“Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources.”
5. Make sure DEBUG logging is disabled in IntroscopeEnterpriseManager.properties; depending on the user queries, it can affect EM performance.
6. Use the “Status console” to quickly check the cluster’s health. It will help you identify:
- connectivity problems between the MOM and collectors
- EM clamps being reached, which can prevent you from viewing metrics in the Investigator
Also revisit any changes you have made to apm-events-thresholds-config.xml, for example any increase to the clamps for historical and live metrics.
7. In a clustered environment, collector system clocks must be within 3 seconds of the MOM clock. Ensure the MOM and collectors synchronize their system clocks with a time server such as an NTP server; otherwise, the EMs will disconnect.
8. Too many calculators: calculators are both CPU- and memory-intensive. Try connecting to the collector directly.
If possible, try disabling all the management modules; this will help you identify whether the issue is related to the MMs.
9. Find out if the problem occurs when running CLW, JDBC, and Top-N queries. If this is the case, try clamping the historical queries, for example to 100k data points, to prevent huge queries from increasing the memory footprint of the collectors and MOM. By default it is unlimited; I suggest starting with:
introscope.enterprisemanager.query.datapointlimit=100000
introscope.enterprisemanager.query.returneddatapointlimit=100000
10. I don’t think the problem is related to traces, but just in case, open the collector’s perflog and check Performance.Transactions.Num.Traces. If it is higher than 800K or increasing rapidly, I suggest limiting the amount of incoming traces:
- Lower introscope.enterprisemanager.transactionevents.storage.max.data.age.
- Clamp the transaction traces sent by agents to the EM collectors via the introscope.enterprisemanager.agent.trace.limit clamp in apm-events-thresholds-config.xml, reduced from 1000 to 50.
- Increase the auto-tracing trigger threshold: set introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30. This is a hot property and will reduce the auto-generated traces from DA.
- If you are using CEM, reduce the length of the transaction trace session from the CEM UI > Setup > Introscope Settings page.
11. Check the logs for known error messages; here are my keywords and some examples:
- capacity: [WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods.
- too many: [WARN] [Harvest Engine Pooled Worker] [Manager.Agent] The EM has too many historical metrics reporting
- reached: [WARN] [Raw Data Stash] [Manager] Timed out adding to outgoing message queue. Limit of 8000 reached
- slowly: [VERBOSE] [PO:main Mailman 6] [Manager.Cluster] Outgoing message queue is moving slowly
- outofmemory: java.lang.OutOfMemoryError: GC overhead limit exceeded
- skewed: Collector clock is skewed from MOM clock by
- cannot keep: [WARN] [PO:WatchedAgentPO Mailman 1] [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data
- difference: Caught exception trying to get the difference between MOM and this Collector's harvest time
- StackOverflowError
- corrupted
- Outgoing message queue is not moving
- No space left on device
- reported Metric clamp hit
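The keyword list above lends itself to a quick automated scan. Here is a minimal Python sketch; the signature strings come from the examples above, and the sample log lines are made up for illustration, not real EM output:

```python
# Known EM error signatures, taken from the keyword list above.
SIGNATURES = [
    "EM load exceeds hardware capacity",
    "too many historical metrics",
    "Timed out adding to outgoing message queue",
    "Outgoing message queue is moving slowly",
    "OutOfMemoryError",
    "Collector clock is skewed from MOM clock",
    "cannot keep up with incoming event data",
    "StackOverflowError",
    "No space left on device",
    "Metric clamp hit",
]

def scan(log_text):
    """Return the known signatures that appear in the log text."""
    return [s for s in SIGNATURES if s in log_text]

# Hypothetical excerpt of an EM log file.
sample_log = (
    "[WARN] [master clock] [Manager] EM load exceeds hardware capacity. "
    "Timeslice data is being aggregated into longer periods.\n"
    "java.lang.OutOfMemoryError: GC overhead limit exceeded\n"
)

print(scan(sample_log))
```

In practice you would read IntroscopeEnterpriseManager.log from each EM and run the same scan over it; any hit tells you which checklist item above to investigate first.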
12. Read https://docops.ca.com/ca-apm/10-5/en/troubleshooting
I hope this helps. You can also open a support case.
Thanks,
Sergio