SergioMorales

CA Tuesday Tip: Top 10 common causes for EM performance issues - Checklist

Discussion created by SergioMorales Employee on Mar 22, 2011
Latest reply on May 30, 2013 by SergioMorales
CA Wily Tuesday Tip by Sergio Morales, Principal Support Engineer for 3/22/2011

Hi Everyone,

This week I am sharing a list of the top 10 common causes of EM performance issues. These issues can show up in different ways in your environment:
- Missing data points in graphs and dashboards.
- Frequent agent disconnections.
- In a cluster environment, frequent Collector disconnections.
- Slowness when connecting to the Introscope EM using Workstation or when running CLW.
- Out-of-memory messages.

The most important point is to make sure you follow all of the recommendations below. If the problem persists after applying all of them, please collect the information listed at the bottom of this post and contact us.

Support receives a high volume of performance issues, and following the recommendations below has solved most of them.

1. Make sure the following settings are set correctly in ALL EMs (MOM and Collectors) properties files:

- transport.outgoingMessageQueueSize=6000
- transport.override.isengard.high.concurrency.pool.min.size=10
- transport.override.isengard.high.concurrency.pool.max.size=10

A restart of the EMs is required for the changes to take effect.

NOTE:
- Increasing the outgoing message queue size gives you a bigger buffer.
- Increasing the thread pool size gives the EM more worker threads to send outgoing messages.
These adjustments are required when sending messages, usually between a Collector and the MOM, becomes a performance bottleneck.
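
As a quick way to verify item 1 across several EMs, here is a minimal Python sketch that compares the text of a properties file against the three values above. The function names are hypothetical, and the parser only handles plain key=value lines, which is an assumption about how your file is written.

```python
# Recommended values from item 1 of this checklist.
RECOMMENDED = {
    "transport.outgoingMessageQueueSize": "6000",
    "transport.override.isengard.high.concurrency.pool.min.size": "10",
    "transport.override.isengard.high.concurrency.pool.max.size": "10",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_transport_settings(text):
    """Return {property: (found_value_or_None, recommended)} for each mismatch."""
    props = parse_properties(text)
    return {
        key: (props.get(key), want)
        for key, want in RECOMMENDED.items()
        if props.get(key) != want
    }
```

Read the properties file of each MOM and Collector into a string and pass it to check_transport_settings; an empty result means all three settings match.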


2. Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in the Introscope Enterprise Manager.lax or EMService.conf file.
Because the heap then never has to expand or contract, this can result in significant performance gains in some situations.

Set the heap size to at least 1.5 GB on each Collector. For the MOM to support 1,000,000 metrics, set its heap to 12 GB and run it on a 64-bit JVM.
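
To double-check the -Xms/-Xmx rule, the following Python sketch parses a JVM option string and reports whether both flags are present and equal. It assumes you have already copied the JVM options out of your .lax or EMService.conf file into a single string; the helper names are hypothetical.

```python
import re

# Multipliers for the size suffixes the JVM accepts on -Xms / -Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(options, flag):
    """Return the size in bytes of the last -<flag> option, or None if absent."""
    pattern = r"(?<!\S)-%s(\d+)([kKmMgG]?)(?!\S)" % flag
    matches = re.findall(pattern, options)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]


def heap_is_fixed(options):
    """True only when both -Xms and -Xmx are set and describe the same size."""
    xms = heap_bytes(options, "Xms")
    xmx = heap_bytes(options, "Xmx")
    return xms is not None and xms == xmx
```

For example, heap_is_fixed("-Xms512m -Xmx1024m") is False, while heap_is_fixed("-Xms1536m -Xmx1536m") is True.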

For v8.x you need to use JRE 1.5. Our EM has been tested and is officially supported with update 15. You can download a copy from http://java.sun.com/products/archive/
From v9.x onwards, JRE 1.6 is supported.


3. If EMs are running on UNIX: Make sure nohup mode has been configured correctly:

The property "lax.stdin.redirect" in the Introscope Enterprise Manager.lax file should be left blank (no value).

From ConfigAdminGuide.pdf:
"Note Do not run the Enterprise Manager in nohup mode without performing the configuration described above. Otherwise, the Enterprise Manager might not start, or might start and consume excessive system resources."


4. Make sure the SmartStor database is on a dedicated hard disk / disk controller.

Once SmartStor has been reconfigured to use its own disk, set the EM property introscope.enterprisemanager.smartstor.dedicatedcontroller=true so that the EM can take full advantage of the dedicated disk.

When the dedicated controller property is set to false, the Collector assumes that there is one disk for all Enterprise Manager operations, and therefore uses a single disk-writing lock. This means that only one area is written at a time. For example, the Collector will write only to SmartStor or only to the heuristics database that supports the Investigator Overview dashboard.
The performance disadvantages of having the dedicated controller property set to false are:
- Only one I/O task can be running at a time.
- SmartStor writes are in shorter segments.
- The disk's seek pointer is invalidated after each context switch.
- Collector sizing recommendations are reduced by 50%.
If there is a second disk for SmartStor but the property is set to false, the second disk brings no performance gain.


5. Check the "Enterprise Manager | Smartstor | Metadata | Metrics with Data" supportability metric and verify whether it is higher than 300K.

If so, there is probably a historical metric explosion in your EM/Collector. You need to use the SmartStor Tool utility to fix this problem. Once the historical metric count is back below 300K, the Enterprise Manager should run smoothly again.

Metric explosions can be caused by a number of factors, such as a large number of unique SQL statements, JMX metrics, sockets being opened on random ports, etc.


6. Are you running multiple Collectors on the same server?

- Run the OS in 64-bit mode to take advantage of a large file cache.
The file cache is important for the Collectors when doing SmartStor maintenance, for example spooling and reperiodization. The file cache resides in physical RAM and is adjusted dynamically by the OS at runtime, based on the available physical RAM. CA Wily recommends having 3 to 4 GB of RAM per Collector.

- There should not be any disk contention for SmartStor, meaning that you use a separate physical disk for each SmartStor instance.
If there is contention for SmartStor write operations, the whole system can start to fall behind, which can result in poor performance such as combined time slices and dropped agent connections.

- The baseline.db and traces.db files from up to four Collectors can reside on a single separate disk. In other words, up to four Collectors can share the same physical disk to store all of their baseline.db and traces.db files.


7. Find out how the agents are balanced

Are the agents configured to connect to the MOM, to the Collectors, or to a combination of both? If you are using a combination, this could be one of the reasons for cluster instability.


8. Are the Collectors and MOM on the same subnet?

Whenever possible, a MOM and its Collectors should be in the same data centre; preferably in the same subnet. Even when crossing through a firewall or passing through any kind of router, the optimal response time is difficult to maintain. If the MOM and Collector are across a router or, worse yet, a packet-sniffing firewall protection router, response time can slow dramatically.


9. Is a SAN involved?

If you plan to use SAN for SmartStor storage, then each logical unit number (LUN) requires a dedicated physical disk. If you have configured two or more LUNs to represent partitions or subsets of the same physical disk, this does not meet the requirements needed for SmartStor dedicated disk.

However, we have identified an EM performance issue (Bug#59755) for customers using Linux/Solaris with a SAN configuration.
If perflog.txt shows "harvest duration" spikes every 5 minutes with no other obvious cause, such as a script or calculator that runs every 5 minutes, then you are probably hitting this bug. The spikes are very obvious and can be severe under high load. With this problem, EM performance can become very unstable, which in turn causes other problems such as reaching the queue limit. If this problem applies to you, upgrading to 8.2.3 and adding the property below should resolve the spikes:

introscope.enterprisemanager.supportability.volumespace.enable=false

When this property is set to false, the EM does not poll the volume space, which avoids the harvest duration spikes seen every 5 minutes on some systems.
The side effect is that the volume space supportability metrics will not be available:
Enterprise Manager|Data Store|Volume Space Free:xxxx
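
If you want to check for the 5-minute pattern without eyeballing the log, here is a rough Python sketch. It assumes you have already extracted (timestamp in seconds, harvest duration) pairs from perflog.txt; the parsing is left out because the column layout is not shown in this post. The function names, the 5x-median spike threshold and the 30-second tolerance are all illustrative assumptions, not product values.

```python
from statistics import median


def find_spikes(samples, factor=5.0, min_gap=60):
    """Return timestamps whose duration exceeds factor x the median duration.

    samples: chronological list of (epoch_seconds, harvest_duration).
    Spikes closer together than min_gap seconds are counted once.
    """
    if not samples:
        return []
    baseline = median(duration for _, duration in samples)
    threshold = max(baseline, 1) * factor
    spikes = []
    for timestamp, duration in samples:
        if duration > threshold and (not spikes or timestamp - spikes[-1] >= min_gap):
            spikes.append(timestamp)
    return spikes


def looks_periodic(spike_times, period=300, tolerance=30, min_spikes=3):
    """True when consecutive spikes are all roughly `period` seconds apart."""
    if len(spike_times) < min_spikes:
        return False
    gaps = [later - earlier for earlier, later in zip(spike_times, spike_times[1:])]
    return all(abs(gap - period) <= tolerance for gap in gaps)
```

A True result from looks_periodic(find_spikes(samples)) only suggests the pattern described above; you still need to rule out scripts or calculators that run on the same schedule.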


10. Check how big your traces database is.

From the Collector's perflog.txt, review the "Performance.Transactions.Num.Traces" column.
If the value is higher than 500K and increasing rapidly, or is approaching 1 million, then this could be the cause of the problem.
If possible, start the EM with a fresh traces database to isolate the problem.

If MQ PP is involved in the implementation, there are 2 active bugs affecting the traces database in v8.2: EM (Bug#61432) and Agent (Bug#61363).
Contact CA Support to request a fix. If an upgrade is not possible, disable transaction sampling on the EM side by setting introscope.agent.transactiontracer.sampling.perinterval.count=0 and, for now, set introscope.enterprisemanager.transactionevents.storage.max.data.age=1 (by default, trace information is stored for 14 days).
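
The 500K / 1 million rule of thumb from item 10 can be expressed as a small Python check. This sketch assumes you have collected the "Performance.Transactions.Num.Traces" values in chronological order into a list. The function name is hypothetical, and the 10% growth cutoff is my own stand-in for "increasing rapidly".

```python
def traces_db_at_risk(counts, warn=500_000, limit=1_000_000, growth=0.10):
    """Flag a trace count above `warn` that is growing fast, or one at `limit`.

    counts: chronological list of trace counts taken from perflog.txt.
    growth: relative increase over the window treated as "rapid" (assumption).
    """
    if not counts:
        return False
    latest = counts[-1]
    if latest >= limit:
        return True
    if latest <= warn:
        return False
    earliest = counts[0]
    return earliest > 0 and (latest - earliest) / earliest >= growth
```

For example, traces_db_at_risk([520000, 600000]) is True, while traces_db_at_risk([600000, 601000]) is False because the count is high but stable.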


If the problem persists after applying all the above changes, you should contact support and make sure to provide the following information:
- Zipped content of EM_HOME/logs
- EM_HOME/config/IntroscopeEnterpriseManager.properties
- EM_HOME/EMService.conf
- Hardware specs and a general overview of the implementation, indicating where the Collectors and MOM are located
- Screenshots of the following supportability metrics:
1) EM | Smartstor | Metadata | Metrics with Data
2) EM | Internal | Number of Connection Tickets
3) EM | Internal | Number of Virtual Metrics
4) EM | Tasks | Harvest Duration
5) EM | Tasks | Smartstor Duration
6) EM | GC Heap | Bytes in Use
7) EM | GC Heap | GC Duration

Before starting your tests, please move all existing Introscope log files to another location so that the logs you send give us a clear picture of your latest tests only.
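
To gather the files listed above in one step, something like the following Python sketch could be used. The function name is hypothetical, the relative paths are the ones from the list, and any file that does not exist on your system is simply skipped. Write the zip outside EM_HOME/logs so that it does not try to include itself.

```python
import zipfile
from pathlib import Path


def collect_support_bundle(em_home, output_zip, extra_files=()):
    """Zip EM_HOME/logs plus any existing extra files; return the archived names.

    em_home: path to the Enterprise Manager installation (EM_HOME).
    extra_files: paths relative to EM_HOME, e.g. the properties file.
    """
    em_home = Path(em_home)
    logs_dir = em_home / "logs"
    candidates = []
    if logs_dir.is_dir():
        candidates += [p for p in logs_dir.rglob("*") if p.is_file()]
    candidates += [em_home / rel for rel in extra_files]

    added = []
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in candidates:
            if path.is_file():  # skip files that are missing on this system
                arcname = path.relative_to(em_home).as_posix()
                bundle.write(path, arcname)
                added.append(arcname)
    return added
```

A call might pass extra_files=["config/IntroscopeEnterpriseManager.properties", "EMService.conf"]; the hardware overview and the metric screenshots still have to be added by hand.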

Regards,

Sergio
