DX Application Performance Management

  • 1.  While trying to integrate collector to MOM, MOM experiences 100% CPU

    Broadcom Employee
    Posted Sep 20, 2017 10:06 PM

    Hi

     

    For one of our customers, we are trying to integrate a collector into the MOM, and the MOM goes to 100% CPU (continuously). This collector had been running for more than 6 months without being connected to a MOM. Once we try to integrate it, the MOM has to do all of that indexing, and because of the high volume of data (about 550 MB) it cannot cope and stays at 100% CPU.

     

    We tried to replicate this in UAT. Some facts we know:

    1) In UAT, they disconnected the collector from the MOM.

    2) One agent sent data to the collector for 3 days. This generated a file of about 150 MB.

    3) The collector was then connected to the UAT MOM; CPU spiked to 42% for a while and then went back down.

     

    Later, we took the following steps:

    1) We created more agents in UAT to generate more data in a short time. Data collection was done with the collector disconnected from the MOM.

    2) The collector's network card was limited to 10 Mbps instead of 100 Mbps (to simulate network latency).

    3) Unfortunately, we could not reproduce the issue in UAT.

     

    So we have run out of ideas about why this is happening in PROD. Have you seen or experienced a similar problem?

     

    Regards

    Muru



  • 2.  Re: While trying to integrate collector to MOM, MOM experiences 100% CPU
    Best Answer

    Posted Sep 21, 2017 12:33 AM

    I'll let someone else help you to realize how bad of an idea this was - and no, it's not a bug!

    You can get back to a healthy status pretty easily; a rough command-level sketch follows the steps below.

    "Drop the SmartStor"

    1. stop the Collector

    2. rename the SmartStor /data directory to /data_old_stuff_20170921 (or equivalent)

    3. immediately restart the collector - and it will reattach without any high CPU.

       - completing the change should not take more than 5 minutes
    OPTIONAL

    4. install an EM collector on the laptop of whoever is anxious for this data

    5. copy the /data_old_stuff_20170921 directory to that laptop

    6. change the em.properties file to point to the /data_old_stuff_20170921 directory

    7. set the SmartStor tier-3 age to five nines == 99999 (more is better!)

       - this will prevent any of the data from getting aged out for another 270 years...
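
    For reference, here is a rough command-level sketch of steps 1-3 and 6-7, assuming a Linux collector with EM_HOME=/opt/introscope and the stock EMCtrl.sh control script; the paths, script name, and property names are my assumptions, so adjust them to your install:

       cd /opt/introscope
       ./bin/EMCtrl.sh stop                         # 1. stop the Collector
       mv data data_old_stuff_20170921              # 2. rename the SmartStor data directory
       ./bin/EMCtrl.sh start                        # 3. restart immediately; a fresh /data is created and it reattaches

       # on the side EM that hosts the old data (steps 6-7), in its properties file:
       introscope.enterprisemanager.smartstor.directory=/path/to/data_old_stuff_20170921
       introscope.enterprisemanager.smartstor.tier3.age=99999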

    Keeping SmartStor data online in this fashion is a minor inconvenience, but it has the benefit of 'always being there', it takes no CPU overhead (until someone queries against it, and even then the overhead is minor), and with no agents attached the data will never grow in size.

    And there is that priceless, added benefit of NOT bringing down your production monitoring...



  • 3.  Re: While trying to integrate collector to MOM, MOM experiences 100% CPU

    Broadcom Employee
    Posted Sep 22, 2017 04:26 AM

    Hi Muru,

    Without looking at the logs, the database, and some supportability metrics, it is impossible to know what the root cause is.

    Here is my list of suggestions / checklist:

     

    1. Ensure you have allocated enough memory. Look at EM_HOME/logs/perflog.txt, rename it to .csv, open it in Excel, and check the Performance.JVM.FreeMemory column; see https://support.ca.com/us/knowledge-base-articles.tec1230732.html
    Make sure to set the initial heap size (-Xms) equal to the maximum heap size (-Xmx) in Introscope_Enterprise_Manager.lax or EMService.conf. Since no heap expansion or contraction occurs, this can result in significant performance gains in some situations.
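
    For example, the check and the heap flags might look like this; the 4 GB figure is only a placeholder, so size the heap to your environment:

       cp EM_HOME/logs/perflog.txt perflog.csv      # open in Excel and watch Performance.JVM.FreeMemory over time
       # in Introscope_Enterprise_Manager.lax, keep -Xms equal to -Xmx, e.g.:
       lax.nl.java.option.additional=-Xms4096m -Xmx4096m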

     

    2. Management modules should only be deployed in the MOM. Make sure the collectors do not start with any MM to prevent any unnecessary extra load in the cluster.
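
    One way to verify this (the directory layout is the standard one, but confirm it for your install): management modules are .jar files under EM_HOME/config/modules, so on a collector you can move any custom modules out of that directory before starting it. MyCustomMM.jar below is a made-up name used only for illustration:

       ls EM_HOME/config/modules
       mv EM_HOME/config/modules/MyCustomMM.jar /path/to/backup/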

     

    3. Improve GC by switching from the Concurrent (CMS) collector to G1GC. If you are using JVM 1.8, try the G1 GC option; other customers have reported that it greatly improves performance.
    The G1 collector is recommended for applications requiring large heaps with limited GC latency requirements. Replace the settings below in the Introscope_Enterprise_Manager.lax > lax.nl.java.option.additional property:


    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50
    with just this:
    -XX:+UseG1GC
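
    After the swap, the edited property line might look something like this; the heap sizes are illustrative, and any other options already on your line should be kept:

       lax.nl.java.option.additional=-Xms4096m -Xmx4096m -XX:+UseG1GC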

     

    4. If the MOM is running on UNIX, ensure lax.stdin.redirect in the Introscope_Enterprise_Manager.lax is unset.  
    From https://docops.ca.com/ca-apm/10-5/en/administrating/configure-enterprise-manager/start-or-stop-the-enterprise-manager#StartorStoptheEnterpriseManager-RuntheEnterpriseManagerinnohupModeonUNIX

     

    “Note: Only run the Enterprise Manager in nohup mode after you configure it. Otherwise, the Enterprise Manager might not start, or can start and consume excessive system resources.”
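
    In practice, my reading of "unset" is that the entry in Introscope_Enterprise_Manager.lax is left with an empty value before starting the EM with nohup; confirm against the linked documentation:

       lax.stdin.redirect=
       nohup ./Introscope_Enterprise_Manager &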

     

    5. Make sure DEBUG logging is disabled in IntroscopeEnterpriseManager.properties; depending on the user queries, it can noticeably affect EM performance.
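
    For example, the main logger line in that file should read INFO rather than DEBUG (the appender names may differ in your file):

       log4j.logger.Manager=INFO, console, logfile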

     

    6. Use the “Status Console” to quickly check the cluster’s health; it will help you identify:

    - connectivity problems between the MOM and collectors
    - EM clamps being reached, which can prevent you from viewing metrics in the Investigator.

     

    Also revisit any changes you have made to apm-events-thresholds-config.xml, for example any increases to the clamps for historical and live metrics.
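
    For reference, the clamps in apm-events-thresholds-config.xml look roughly like this; the clamp id and value here are only an illustration, so compare your file against the shipped defaults:

       <clamp id="introscope.enterprisemanager.metrics.historical.limit">
           <threshold value="1200000"/>
       </clamp>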

     

    7. In a clustered environment, collector system clocks must be within 3 seconds of the MOM clock. Ensure the MOM and collectors synchronize their system clocks with a time server such as an NTP server; otherwise the EMs will disconnect.
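
    A quick way to spot-check the skew, assuming Linux hosts and SSH access (the host names are placeholders):

       for h in mom-host collector1 collector2; do echo -n "$h: "; ssh $h date +%s; done    # epoch seconds should agree within ~3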

     

    8. Too many calculators: calculators are both CPU and memory intensive. Try connecting to the collector directly.
    If possible, try disabling all the management modules; this will help you identify whether the issue is related to the MMs.

     

    9. Find out if the problem occurs when running CLW, JDBC, and Top-N queries. If this is the case, try clamping the historical queries, to 100k for example, to prevent huge queries from increasing the memory footprint of the collector and MOM. By default this is unlimited; you could start with:
    introscope.enterprisemanager.query.datapointlimit=100000
    introscope.enterprisemanager.query.returneddatapointlimit=100000

     

    10. I don’t think the problem is related to traces, but just in case, open the collector’s perflog and check “Performance.Transactions.Num.Traces”. If it is higher than 800K or increasing rapidly, I suggest limiting the amount of incoming traces (a combined sketch of these settings follows):
    - lower “introscope.enterprisemanager.transactionevents.storage.max.data.age”.
    - clamp the Transaction Traces sent by agents to the EM collectors using the EM file apm-events-thresholds-config.xml and the clamp “introscope.enterprisemanager.agent.trace.limit”, reduced from 1000 to 50.
    - increase the auto-tracing trigger threshold: set introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30. This is a hot property and will reduce auto-generated traces from DA.
    - if you are using CEM, reduce the length of the transaction trace session from the CEM UI > Setup > Introscope Settings page.
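
    Putting those knobs together, a hedged example (the max.data.age value is just a placeholder for “lower than your current setting”; the other values follow the suggestions above):

       In apm-events-thresholds-config.xml:
           <clamp id="introscope.enterprisemanager.agent.trace.limit">
               <threshold value="50"/>
           </clamp>
       In IntroscopeEnterpriseManager.properties:
           introscope.enterprisemanager.transactionevents.storage.max.data.age=7
           introscope.enterprisemanager.baseline.tracetrigger.variance.intensity.trigger=30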

     

    11. Check the logs for known error messages; here are my keywords and some examples (a grep one-liner follows the list).

     

    -capacity: [WARN] [master clock] [Manager] EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods.
    -too many: [WARN] [Harvest Engine Pooled Worker] [Manager.Agent]  [The EM has too many historical metrics reporting
    -reached : [WARN] [Raw Data Stash] [Manager] Timed out adding to outgoing message queue. Limit of 8000 reached
    -slowly: [VERBOSE] [PO:main Mailman 6] [Manager.Cluster] Outgoing message queue is moving slowly
    -outofmemory: java.lang.OutOfMemoryError: GC overhead limit exceeded
    -skewed : Collector clock is skewed from MOM clock by
    -cannot keep : [WARN] [PO:WatchedAgentPO Mailman 1] [Manager.TransactionTracer] The Enterprise Manager cannot keep up with incoming event data
    -difference : Caught exception trying to get the difference between MOM and this Collector's harvest time
    -StackOverflowError
    -corrupted
    -Outgoing message queue is not moving
    -No space left on device
    -reported Metric clamp hit
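
    A quick way to scan for these, assuming a Linux EM and the default log location (adjust the path and keyword list as needed):

       grep -iE "hardware capacity|too many historical|Limit of 8000|moving slowly|OutOfMemoryError|skewed|cannot keep up|difference between MOM|StackOverflowError|corrupted|not moving|No space left|clamp hit" EM_HOME/logs/IntroscopeEnterpriseManager*.log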

     

    12. Read https://docops.ca.com/ca-apm/10-5/en/troubleshooting

     

    I hope this helps. You can also open a support case.

     

    Thanks,
    Sergio