CA Tuesday Tip: MARCH 2013 - Top 20 common causes for EM performance issues

Discussion created by SergioMorales Employee on Mar 29, 2013
Latest reply on Sep 27, 2013 by dorol01
CA Wily Tuesday Tip by Sergio Morales, Principal Support Engineer for 3/29/2013

Hi Everyone,
Here is an update of my previous post from 2011. Below is a checklist of the points you must review whenever you see performance issues, missing data points in graphs and dashboards, frequent MOM/Collector/Agent/Workstation disconnections, OutOfMemory errors, logging that takes a long time, or clock-sync issues:

Checklist:

1.
Outgoing message delivery queue/thread pool size needs to be increased:
Make sure the following settings are set in ALL EMs (MOM and Collectors) properties files:
transport.outgoingMessageQueueSize=6000
transport.override.isengard.high.concurrency.pool.min.size=10
transport.override.isengard.high.concurrency.pool.max.size=10

A restart of the EMs is required for the changes to take effect.
Increasing the outgoing message queue gives you a bigger buffer. Increasing the thread pool size gives you more worker threads to send outgoing messages. These adjustments are required when sending messages, usually between Collector and MOM, becomes a performance bottleneck.
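
If you have many EMs to check, a small script can save some time. The sketch below is only an illustration, not a CA tool: it parses the text of a properties file and reports which of the three settings above are missing or set to a different value. The function names are made up for this example, and the parser only handles plain key=value lines.

```python
# Settings recommended in item 1 of this checklist.
REQUIRED = {
    "transport.outgoingMessageQueueSize": "6000",
    "transport.override.isengard.high.concurrency.pool.min.size": "10",
    "transport.override.isengard.high.concurrency.pool.max.size": "10",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_settings(props, required=REQUIRED):
    """Return {key: (found, expected)} for every setting that is missing or different."""
    return {
        key: (props.get(key), expected)
        for key, expected in required.items()
        if props.get(key) != expected
    }
```

To use it, read the EM properties file into a string, pass it to parse_properties, and pass the result to check_settings. An empty result means all three settings match the values above.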

2.
Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in the Introscope Enterprise Manager.lax or EMService.conf. Since no heap expansion or contraction occurs, this can result in significant performance gains in some situations.
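
A quick way to verify this is to compare the two flags in the JVM options string copied from the .lax or .conf file. The sketch below is an assumption-laden illustration: the function names are invented, it only understands the k/m/g suffixes, and when a flag appears more than once it takes the last occurrence (my assumption about how duplicates should be treated).

```python
import re

# Multipliers for the size suffixes accepted by -Xms / -Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_flag_bytes(jvm_args, flag):
    """Return the size in bytes of the last -<flag><size> option, or None if absent."""
    pattern = r"(?<!\S)-" + flag + r"(\d+)([kKmMgG]?)(?!\S)"
    matches = re.findall(pattern, jvm_args)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]


def heap_is_fixed(jvm_args):
    """True only when both -Xms and -Xmx are present and resolve to the same size."""
    xms = heap_flag_bytes(jvm_args, "Xms")
    xmx = heap_flag_bytes(jvm_args, "Xmx")
    return xms is not None and xms == xmx
```

For example, heap_is_fixed("-Xms1024m -Xmx1g") is true because both values resolve to the same number of bytes, while "-Xms512m -Xmx1024m" is reported as not fixed.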

3.
EM heap Sizing:
a) Not enough heap allocated to the Collector, especially when serving CEM services
b) Not enough heap allocated to the MOM, especially when there is a huge number of MMs, calculators, and alerts.

4.
If EM is running on UNIX: Make sure nohup mode has been configured correctly. The property "lax.stdin.redirect" in Enterprise Manager.lax file should be empty. From ConfigAdminGuide.pdf:

" Do not run the Enterprise Manager in nohup mode without performing the configuration described above. Otherwise, the Enterprise Manager might not start, or might start and consume excessive system resources."

5.
Make sure DEBUG logging is disabled in IntroscopeEnterpriseManager.properties. Depending on the queries you perform, it can cause serious performance issues for the Introscope EM.

6.
Make sure the SmartStor database is on a dedicated hard disk/disk controller. Once SmartStor is reconfigured to have its own disk, set the EM property introscope.enterprisemanager.smartstor.dedicatedcontroller=true, which allows the EM to take full advantage of the dedicated disk. From SizingGuide.pdf:

“When the dedicated controller property is set to false, the Collector assumes that there is one disk for all Enterprise Manager operations, and therefore uses one disk-writing lock. This means that only one area at a time is written. For example, the Collector will write only to SmartStor or only to the heuristics database that supports the Investigator Overview dashboard.
Performance disadvantages to having the dedicated controller property set to false are:
a. Only one I/O task can be running at a time.
b. SmartStor writes are in shorter segments.
c. The disk's seek pointer is invalidated after each context switch.
If there is a second disk for SmartStor, but the property is set to false, there is no performance gain by having a second disk for SmartStor.
d. Collector sizing recommendations are reduced by 50%.”

7.
Huge metadata can cause EM performance problems or prevent new metrics from showing up. Check the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric and verify whether it is higher than 300K for v8.x or 600K for v9.x. Solutions:
a) Increase the historical metric count limit on the EM, or
b) Prune the SmartStor data: use the SmartStor Tool utility to reduce the historical metric count.

8.
Are you running multiple collectors on the same server? From SizingGuide:

“a) Run the OS in 64-bit mode to take advantage of a large file cache.
The file cache is important for the Collectors when doing SmartStor maintenance, for example spooling and reperiodization. File cache resides in the physical RAM, and is dynamically adjusted by the OS during runtime based on the available physical RAM. CA Wily recommends having 3 to 4 GB RAM per Collector.
b) There should not be any disk contention for SmartStor, meaning you use a separate physical disk for each SmartStor instance. If there is contention for SmartStor write operations, the whole system can start to fall behind, which can result in poor performance such as combined time slices and dropped agent connections.

c) The Baseline.db and traces.db files from up to four Collectors can reside on a separate single disk. In other words, up to four Collectors can share the same physical disk to store all of their baseline.db and traces.db files.”

9.
Check whether virtual agents have been defined. If so, disable them in EM\config\agentdomains.xml.

10.
Are the Collectors and MOM on the same subnet? From SizingGuide:

“Whenever possible, a MOM and its Collectors should be in the same data centre; preferably in the same subnet. Even when crossing through a firewall or passing through any kind of router, the optimal response time is difficult to maintain. If the MOM and Collector are across a router or, worse yet, a packet-sniffing firewall protection router, response time can slow dramatically.”

For transatlantic Agent->EM connections or any frequently interrupted network, HTTP works better. In those cases, configure Agent->EM communications to use HTTP tunnelling instead.

11.
If you use a SAN for SmartStor storage, each logical unit number (LUN) requires a dedicated physical disk. If you have configured two or more LUNs to represent partitions or subsets of the same physical disk, this does not meet the requirements for a SmartStor dedicated disk.

12.
Check how big the traces database is. Rename perflog.txt to change its extension from txt to csv, open it in Excel, and review the "Performance.Transactions.Num.Traces" column. If the value is higher than 500K and increasing rapidly, this could be the cause of the problem. If possible, start the EM with a fresh traces database to isolate the problem, disable transaction sampling on the EM side by setting introscope.agent.transactiontracer.sampling.perinterval.count=0, and set introscope.enterprisemanager.transactionevents.storage.max.data.age=1.
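
If you prefer a script to Excel, the same check can be sketched in a few lines. This is a rough illustration under some assumptions: I am assuming perflog.txt is comma-separated with a header row containing the column name exactly as written above, the function names are made up, and "latest value higher than the first value" is only a crude stand-in for "increasing rapidly".

```python
import csv

COLUMN = "Performance.Transactions.Num.Traces"
LIMIT = 500_000  # the 500K threshold mentioned above


def trace_counts(lines, column=COLUMN):
    """Return the integer values found in `column`, skipping blank or non-numeric cells."""
    values = []
    for row in csv.DictReader(lines):
        cell = (row.get(column) or "").strip()
        try:
            values.append(int(float(cell)))
        except (ValueError, OverflowError):
            continue
    return values


def traces_look_suspicious(values, limit=LIMIT):
    """True if the latest value is above the limit and higher than the first value."""
    return bool(values) and values[-1] > limit and values[-1] > values[0]
```

trace_counts accepts any iterable of text lines, so you can pass it an open file object for perflog.txt directly and then hand the result to traces_look_suspicious.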

13.
Incorrect or bad Management Module definition:
For testing purposes, start the EM without any Management Modules (MMs): rename EM_HOME\config\modules to modules-original and restart the EM.
This will allow us to confirm whether the problem is related to an incorrect design of one of the MMs.
If the problem goes away, re-introduce the Management Modules one by one until you identify the problematic one(s).

14.

Is the Introscope EM configured with a different JVM version?
Run the EM with a supported JVM version: for v8, JVM 1.6u15; for v9, 1.6u34 or later is recommended.

15.
If the problem only occurs when connecting to the MOM and not to the Collector, it is most likely caused by a feature that is specific to the Workstation and to the MOM. Try disabling the new v9 AppMap feature by adding introscope.apm.feature.enabled=false to IntroscopeEnterpriseManager.properties and restarting the EM.

16.
If SOA Performance Management is enabled:

a) SOA Deviation Calculator needs to be turned off to prevent hourly harvest duration spikes:
Set com.wily.introscope.soa.deviation.enable=false in all the EMs (Collectors and MOM). If this change resolves the issue, the problem could be related to bug# 76056. We have partially fixed this issue in the latest 9.1 releases and are planning to address it in the next major release. For now, you can also try lowering the refresh rate and mean days:
com.wily.introscope.soa.deviation.dependency.refreshrate=24
com.wily.introscope.soa.deviation.mean.days=1
com.wily.introscope.soa.deviation.datapoints.cached.mean=240

b) EM/WS OOM caused by 8.x - 9.x Agent compatibility: see Bug# 74797, fixed in 9.1.2. To enable the fix, the new SOA caller name normalization property needs to be turned on in all Collectors: com.wily.introscope.soa.dependencymap.normalizecallername.enable=true.
c) SOA boundary tracing can be turned off (too many traces, or very large traces, sent to the EM can sometimes cause Agent OOM and EM crash/OOM): com.wily.introscope.agent.transactiontrace.boundaryTracing.enable=false

17.
The number of data points retrieved and returned by a query can be clamped (a very large historical query can cause EM OOM):
If you notice the “memory in use” starting to increase and the Collector becoming unusable, you can try setting the clamps for historical queries to 100K to prevent huge queries from increasing the memory footprint of the Collector and MOM:
introscope.enterprisemanager.query.datapointlimit=100000
introscope.enterprisemanager.query.returneddatapointlimit=100000

18.
Poor network performance:
In a cluster, the “Ping Time” on the MOM is an indicator of either:
a) Poor network times between the MOM and collectors, or
b) Overloaded collectors unable to respond to the ping request.
To view the ping metric, use the Search tab to view the metric named "ping" in the supportability metric section of the Investigator tree. You will find a ping metric reported for each Collector. If the ping time exceeds 60 seconds, the MOM disconnects from the Collector. This is normal and prevents the entire cluster from hanging but indicates a network issue.

19.
Collector disconnects from MOM throwing java.nio.channels.CancelledKeyException: This problem is related to a random communication problem on the EM without obvious causes. NIO transport can be turned off on the EM by adding transport.enable.nio=false to the EM properties file. A restart of the EM is required.

20.
“Collector clock is skewed from MOM clock by” messages in the EM logs:
a) Set up the clustered systems so that machines running Enterprise Managers synchronize their system clocks with a time server such as an NTP server
b) VMware should be tuned to avoid clock skew: if you are running in a virtual environment, please note that there are some known clock-sync issues with VMware, especially with Linux. The documents below from the VMware site describe the issues:
http://www.vmware.com/pdf/vmware_timekeeping.pdf
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1318
c) This could be due to a Sun JVM bug. Add the following JVM flag: -XX:+ForceTimeHighResolution
Refer to the link below for more information: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6464007


What to do if the problem persists:

Collect the following information from ALL Introscope EMs (MOM and collectors) and open an incident with CA Support.
1.
Zipped content of EM_HOME\logs
2.
EM_HOME\config\agentdomains.xml – will help us confirm if there are virtual agents defined.
3.
Hardware specs of the servers and a general overview of the implementation indicating where the collectors and MOM are
4.
Screenshot of the "Custom Metric Host (Virtual) | Custom Metric Process (Virtual) | Custom Metric Agent (Virtual) | Enterprise Manager | Data Store | Smartstor | Metadata | Metrics with Data" supportability metric from all Collectors.
5.
From the Investigator, use the Search tab to view the metric named “ping” in the supportability metrics section of the Investigator tree. You will find a ping metric reported for each Collector; take a screenshot.

Make sure to move all existing Introscope log files to another location before starting the tests.

Regards,
Sergio
