DX Application Performance Management

CA APM Slow Performance

  • 1.  CA APM Slow Performance

    Posted Oct 25, 2017 08:25 PM

    Hi,

     

    We are experiencing sluggish performance in the APM Workstation. When we navigate through the metrics, it takes 30 to 45 seconds to display the data, and the dashboards are also slow to load. I see active clamps on both the MOM and the Collector:

     

    introscope.enterprisemanager.transactionevents.storage.max.disk.usage at 1097

    introscope.enterprisemanager.agent.error.limit  10 (Collector Only)

     

    I am not sure whether the clamps are causing the slowness. Any help would be greatly appreciated. We are running version 10.5.1.

     

    Thanks,

     

    Arnab

     

     

     



  • 2.  Re: CA APM Slow Performance

    Broadcom Employee
    Posted Oct 25, 2017 09:07 PM

    Looks like you're being overwhelmed by traces. Start by looking at your agent metrics to see which agents are generating the most.

    Check to see if those particular agents are running into a lot of error events.



  • 3.  Re: CA APM Slow Performance

    Posted Oct 26, 2017 09:21 AM

    Hi Davis,

     

    Thank you for your reply. Is there a way to limit the traces being reported in APM? Is there a setting on the MOM or the Collector that keeps traces for only a specific time period (for example, the last 24 hours)?

     

    I appreciate your help with this.

     

    Thanks,

     

    Arnab



  • 4.  Re: CA APM Slow Performance
    Best Answer

    Posted Oct 26, 2017 02:16 PM

    There are four elements within APM that can slow down the APM cluster:

    1. Number of metrics (live and historic)

    2. Number of agents and applications

    3. Number of traces

    4. EM resources (number of EMs, CPU, memory, disk access, network)

     

    You mentioned that you have clamps; what is being clamped?

    Since you included the transaction trace settings, let us assume that the traces are causing the issue. Traces are captured at random and on error. Errors can be an exception or a stalled transaction (one that runs longer than 30 seconds). Look at the traces from the agents affected by the transaction trace clamp and see whether you can find a pattern or hints as to why the traces are occurring.

     

    On our system we found transactions taking more than 10 minutes each, because the service was used for large batch processing. We changed our stall threshold from 30 seconds to 4200 seconds, which cut down the transaction traces on those agents.
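    To see why raising the stall threshold cuts the number of traces, here is a minimal Python sketch that counts how many transactions a given threshold would flag as stalls. The durations are invented for illustration, and "duration greater than threshold" is a simplification: the agent detects stalls on in-flight transactions, not on a list of finished ones.

```python
from typing import Iterable


def count_stalls(durations_s: Iterable[float], threshold_s: float) -> int:
    """Count transactions whose duration exceeds the stall threshold.

    Simplification: the agent flags stalls on in-flight transactions;
    here we just compare finished durations against the threshold.
    """
    return sum(1 for d in durations_s if d > threshold_s)


# Hypothetical durations in seconds: quick web requests, one slow request,
# and batch jobs that run from about 10 minutes up to an hour.
durations = [0.4, 1.2, 2.5, 35.0, 650.0, 900.0, 3600.0]

for threshold in (30, 4200):
    stalls = count_stalls(durations, threshold)
    # prints 4 stalled transactions for 30 s and 0 for 4200 s
    print(f"stall threshold {threshold} s -> {stalls} stalled transaction(s)")
```

    With 4200 s nothing in this sample counts as a stall. The trade-off is that a genuinely hung transaction would also go unflagged for 70 minutes, which is why the threshold change is best limited to agents that really run long batch work.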

     

    Now, on the resource side, we cut the number of days traces are kept from the default of 14 to 7 and increased the trace disk space from 1 GB to 4 GB. But we only did that because we saw "Out of trace space" messages in the APM Status console.
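    A minimal sketch of how the two retention knobs might interact, assuming (1) daily trace stores at or beyond the retention age are dropped, and (2) the size clamp then deletes the oldest historical stores first, never deletes today's store, and ignores the trace index, as the description of introscope.enterprisemanager.transactionevents.storage.max.disk.usage in apm-events-thresholds-config.xml states. The DailyStore type and the prune function are hypothetical illustrations, not Enterprise Manager code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyStore:
    day: int        # days ago; 0 = today's store, which is actively receiving traces
    size_mb: float


def prune(stores: list[DailyStore], max_age_days: int, max_disk_mb: float) -> list[DailyStore]:
    """Return the daily trace stores a cleanup pass would keep (illustrative model).

    Assumptions, not EM internals:
      * stores that are max_age_days or more days old are dropped first;
      * while the total still exceeds max_disk_mb, the oldest historical store
        is deleted; today's store (day 0) is never deleted;
      * the trace index is not counted, matching the clamp description.
    """
    kept = [s for s in stores if s.day == 0 or s.day < max_age_days]
    kept.sort(key=lambda s: s.day, reverse=True)  # oldest first
    total = sum(s.size_mb for s in kept)
    while total > max_disk_mb and kept and kept[0].day != 0:
        total -= kept.pop(0).size_mb
    return sorted(kept, key=lambda s: s.day)


# Hypothetical: 14 daily stores of 150 MB each.
stores = [DailyStore(day, 150.0) for day in range(14)]
for age, disk in ((14, 1024), (7, 4096)):
    survivors = prune(stores, age, disk)
    # keeps 6 days (900 MB) for 14/1024, then 7 days (1050 MB) for 7/4096
    print(f"max.data.age={age}, max.disk.usage={disk} MB -> "
          f"keeps {len(survivors)} day(s), {sum(s.size_mb for s in survivors):.0f} MB")
```

    In this made-up case the 1 GB disk limit, not the 14-day age, is what actually limits retention, which is consistent with hitting the disk-usage clamp.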

     

    As for the number of metrics and the number of agents, an increase in either usually shows up as longer SmartStor duration or harvest duration.

     

    So, instead of jumping around guessing what your issue might be, could you provide more details about your environment?

     

    Number of collectors

    Hosts/servers: physical or virtual

    Collector resources (CPU cores, RAM, OS, disk type: NAS, DASD, SAN, physical RAID)

    Number of agents (application agents: Java and .NET)

    Total number of metrics

    Number of applications

     

    And then a history lesson: when did this slowness start? Has it always been this way?

    If your APM just started behaving poorly one day, we need to trace back to see whether there was a dramatic increase in metrics, agents, applications, or traces, or whether shared resources (CPU, memory, disk, network, virtual hosts) are being overtaxed.

     

    You can also pull the perf log from the Enterprise Managers and check whether the harvest duration or the metric counts have increased; that might help narrow down the issue.
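    If you copy the relevant columns out of the perf log into CSV form, a quick comparison of early versus recent samples shows whether harvest duration is creeping up. This is a sketch only: the column names (Timestamp, HarvestDuration) and the sample values are hypothetical, so substitute whatever headers your perf log actually uses.

```python
import csv
import io
from statistics import mean


def duration_trend(csv_text: str, column: str, window: int = 3) -> tuple[float, float]:
    """Return (mean of the first `window` samples, mean of the last `window` samples)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the column is missing or blank.
    values = [float(r[column]) for r in rows if (r.get(column) or "").strip()]
    if len(values) < window:
        raise ValueError(f"need at least {window} samples of {column!r}, got {len(values)}")
    return mean(values[:window]), mean(values[-window:])


# Hypothetical samples; the headers are placeholders, not the real perf log layout.
SAMPLE = """Timestamp,HarvestDuration
10:00,1200
10:15,1350
10:30,1250
10:45,4800
11:00,5200
11:15,5000
"""

before, after = duration_trend(SAMPLE, "HarvestDuration")
print(f"early avg {before:.0f}, recent avg {after:.0f} ({after / before:.1f}x)")
```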

     

    Also, in <EM_HOME>/example there are two management modules that I highly suggest you deploy and customize for your APM environment:

       MOM_Infra_Monitoring_MM.jar

       Collector_1.jar  (copy this jar for each of your collectors and customize the MOM dashboards/alerts to include all of your collectors)

     

    There is also a "Supportability.jar" that is useful, since it gives you more of an overview of the APM cluster.



  • 5.  Re: CA APM Slow Performance

    Posted Oct 26, 2017 06:56 PM

    Hi bwcole,

     

    Thank you for your response. We recently upgraded our environment to 10.5.1, and the issues seem to have started since then.

     

    Here is the detail about our environment.

     

    MOM - RHEL7 (18 GB memory)

    I have the memory allocated as below:

    lax.nl.java.option.additional=-Xms8192m -Xmx8192m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 -Dorg.owasp.esapi.resources=./config/esapi -XX:+UseConcMarkSweepGC -XX:+UseParNewGC  -Xss512k

     

     

    Collector - Windows 2012 (we have one collector in our environment)

                   - Dedicated SmartStor DB

    lax.nl.java.option.additional=-Xms1024m -Xmx1024m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 -Dorg.owasp.esapi.resources=./config/esapi -XX:+UseConcMarkSweepGC -XX:+UseParNewGC  -Xss512k

     

    From Supportability Management Module Editor.

           Number of Metrics - 72.1K

           Number of MOM Metrics - 65.8K

           Number of Agents - 28

           Disk Usage - 16K MB

           Enterprise Manager Overall Capacity - 13%

     

    These are the clamp messages I am seeing.

     

    introscope.enterprisemanager.transactionevents.storage.max.disk.usage at 1097

    introscope.enterprisemanager.agent.error.limit  10 (Collector Only)

     

    Any help will be greatly appreciated.

     

    Thanks and Regards,

     

    Arnab



  • 6.  Re: CA APM Slow Performance

    Posted Oct 26, 2017 11:47 PM

    Arnab

     

    First, increase the collector heap from 1 GB to at least 6 GB. If you have 16 GB of physical memory, increase it to 12 GB.

    Second, reduce the number of days transaction traces are kept from 14 to 7:

    introscope.enterprisemanager.transactionevents.storage.max.data.age=14 --> 7

    Finally, increase the transaction trace disk space from 1 GB to at least 4 GB.

    I hope this fixes the problem.
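    The retention change is a one-line properties edit. Below is a small Python sketch that sets the key in the text of a properties file, assuming the property lives in your EM's properties file (typically IntroscopeEnterpriseManager.properties under <EM_HOME>/config; check your install) as a plain key=value line. set_property is a hypothetical helper; back up the file before writing to it.

```python
import re

KEY = "introscope.enterprisemanager.transactionevents.storage.max.data.age"


def set_property(text: str, key: str, value: str) -> str:
    """Set key=value in .properties text.

    Replaces every uncommented `key=...` (or `key: ...`) line; if none exists,
    appends the entry. Comment lines and all other keys are left untouched.
    """
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*[=:].*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{key}={value}", text)
    if count == 0:
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f"{key}={value}\n"
    return new_text


sample = f"# Trace retention\n{KEY}=14\nother.key=1\n"
print(set_property(sample, KEY, "7"))

# On a real install (after a backup), something like:
#   path.write_text(set_property(path.read_text(), KEY, "7"))
# where `path` is a pathlib.Path pointing at your EM's properties file.
```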



  • 7.  Re: CA APM Slow Performance

    Broadcom Employee
    Posted Oct 27, 2017 01:46 AM

    If you're planning on using heap sizes > 8GB, you'll want to consider switching to G1GC for better performance.
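    A minimal sketch of what that looks like for the lax.nl.java.option.additional value quoted earlier in the thread: pin -Xms/-Xmx to the new size and, optionally, replace the CMS/ParNew flags with -XX:+UseG1GC. The tune_jvm_options helper is hypothetical and works only on the option string (the part after "="). On a Windows service install the heap comes from EMService.conf instead, as Arnab found later in the thread.

```python
CMS_FLAGS = {"-XX:+UseConcMarkSweepGC", "-XX:+UseParNewGC"}


def tune_jvm_options(options: str, heap_mb: int, use_g1: bool = False) -> str:
    """Return the option string with -Xms/-Xmx set to heap_mb.

    If use_g1 is True, the CMS/ParNew flags are removed and -XX:+UseG1GC is added.
    All other options are kept in their original order.
    """
    kept = []
    for tok in options.split():
        if tok.startswith(("-Xms", "-Xmx")):
            continue  # replaced below with the new size
        if use_g1 and tok in CMS_FLAGS:
            continue  # G1 takes the place of the CMS collector
        kept.append(tok)
    front = [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]
    if use_g1 and "-XX:+UseG1GC" not in kept:
        front.append("-XX:+UseG1GC")
    return " ".join(front + kept)


collector_opts = ("-Xms1024m -Xmx1024m -Djava.awt.headless=true -Dmail.mime.charset=UTF-8 "
                  "-Dorg.owasp.esapi.resources=./config/esapi -XX:+UseConcMarkSweepGC "
                  "-XX:+UseParNewGC  -Xss512k")
print("lax.nl.java.option.additional=" + tune_jvm_options(collector_opts, 12288, use_g1=True))
```

    Keeping -Xms equal to -Xmx matches the style of the existing lines, so the heap is allocated at full size from startup.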



  • 8.  Re: CA APM Slow Performance

    Posted Oct 27, 2017 11:51 PM

    Hi junwah,

     

    Thank you for your reply.

     

    I see that the Collector (Windows 2012) was already configured to use 12 GB in EMService.conf:

    # Initial Java Heap Size (in MB)
    wrapper.java.initmemory=12288

    # Maximum Java Heap Size (in MB)
    wrapper.java.maxmemory=12288
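    Because the Windows service reads its heap from EMService.conf rather than the lax file, it is worth checking that file directly. A minimal sketch that pulls the two wrapper.java.*memory values (in MB) out of the file text; read_wrapper_heap is a hypothetical helper, and it assumes simple key=value lines with # comments, as in the snippet above.

```python
WANTED = {"wrapper.java.initmemory", "wrapper.java.maxmemory"}


def read_wrapper_heap(conf_text: str) -> dict[str, int]:
    """Extract the initial/maximum heap settings (MB) from wrapper-style conf text."""
    found: dict[str, int] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and non-assignment lines
        key, _, value = line.partition("=")
        key = key.strip()
        if key in WANTED:
            found[key] = int(value.strip())
    return found


conf = """# Initial Java Heap Size (in MB)
wrapper.java.initmemory=12288

# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=12288
"""
heap = read_wrapper_heap(conf)
print({k: f"{v / 1024:.1f} GB" for k, v in heap.items()})
```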

     

    Also, I updated this property on both the MOM and the Collector:

    introscope.enterprisemanager.transactionevents.storage.max.data.age=14 --> 7

     

    Can you provide the property for the setting below?

     

    finally increase transaction trace disk space from 1 GB to at least 4GB.

    Thanks and Regards,

     

    Arnab



  • 9.  Re: CA APM Slow Performance

    Posted Oct 29, 2017 12:46 PM

    Go to <APM_HOME>/config and edit apm-events-thresholds-config.xml (for example, with vi).

    The following property controls transaction trace storage:

     

    <clamp id="introscope.enterprisemanager.transactionevents.storage.max.disk.usage">
        <description>
            The maximum desired disk usage in MB for the trace storage. If this maximum is exceeded,
            then the daily historical trace stores will be deleted starting with the oldest first
            until the total historical trace storage size is below this value.
            The current days trace store actively storing traces will not be deleted even if
            its size exceeds this property value. The size of the trace index is not considered
            when determining what historical trace stores to delete.
        </description>
        <threshold value="1024"/>
    </clamp>

     

    Change

    <threshold value="1024"/>

    to

    <threshold value="4096"/>
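    The same edit can be scripted. A minimal sketch using Python's standard xml.etree.ElementTree that finds the clamp by its id and rewrites the threshold value; the root element name in the sample is hypothetical. ElementTree discards XML comments and the XML declaration when it re-serializes, so for the real file a careful manual edit (with a backup first) may still be preferable.

```python
import xml.etree.ElementTree as ET

CLAMP_ID = "introscope.enterprisemanager.transactionevents.storage.max.disk.usage"


def set_clamp_threshold(xml_text: str, clamp_id: str, new_value: int) -> str:
    """Return xml_text with the <threshold value=...> of the given clamp replaced."""
    root = ET.fromstring(xml_text)
    for clamp in root.iter("clamp"):
        if clamp.get("id") == clamp_id:
            threshold = clamp.find("threshold")
            if threshold is None:
                raise ValueError(f"clamp {clamp_id!r} has no <threshold> element")
            threshold.set("value", str(new_value))
            return ET.tostring(root, encoding="unicode")
    raise KeyError(clamp_id)


# Trimmed, hypothetical stand-in for the real file (root element name assumed).
SAMPLE = f"""<thresholds>
    <clamp id="{CLAMP_ID}">
        <description>Max trace storage in MB.</description>
        <threshold value="1024"/>
    </clamp>
</thresholds>"""

print(set_clamp_threshold(SAMPLE, CLAMP_ID, 4096))
```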



  • 10.  Re: CA APM Slow Performance

    Posted Oct 30, 2017 03:09 PM

    Thank you Junwah, after making these changes I see a big difference in performance. I have also confirmed the same with the other users. I appreciate all your help, support and time with this issue.



  • 11.  Re: CA APM Slow Performance

    Posted Oct 27, 2017 07:18 AM

    Arnab,

     

    The suggestions from junwah are very good for getting you through this issue, but you also need to investigate the transaction traces.

     

     - Dedicated smartstore db

    I'm guessing this means that the Windows server hosting your collector has a single physical drive that houses both the SmartStor data (<em_home>/data) and the traces (<em_home>/traces).


    Addressing your Transaction Trace Issue:

       Do you use the transaction traces?

          There are two types of traces stored in the <em_home>/traces directory: random traces and error traces. The number of random traces depends on the number of applications being captured. Error traces may be more numerous, since they are triggered by errors or transaction stalls within your environment.

          In both cases, I highly suggest you review your traces, especially error traces, and get the errors corrected to keep the traces to a supportable number. If the traces are mostly random application traces, then you need to increase your trace space and lower the number of days traces are kept.

       What version did you upgrade from?

       How many applications are being captured?

       

     

    Addressing your capacity issue

    Please review the "Hardware Sizing and Performance" section in the documentation:

    https://docops.ca.com/ca-apm/10/en/ca-apm-sizing-and-performance/hardware-sizing-and-performance

     

    Your collector is badly under-resourced. You may want to build another RHEL7 server and run the collector from that.

    The MOM is basically your GUI and front end (if you are running WebView from the MOM); the collector is the EM that actually does the hard work of gathering metrics, storing them, and then tracing through the metrics to create the transaction traces.

     

       Do you have CEM?

     

       What does your loadbalancing.xml look like? Does your collector only host the CEM processes, or is it an agent collector?

          If you are running CEM, I highly suggest two collectors, one for agent metrics and the second for the CEM processes, both on RHEL with 6 GB RAM and 4 GB of that allocated to the EM.

     

       Is your MOM doing double duty as an agent collector and interface host?

     

    In our APM v10.0 environment we run a MOM with seven collectors, 746 agents, and 894,400 metrics. Our MOM has about 17,900 metrics reporting to it. All of our EMs run RHEL 7.3. The MOM has 16 GB RAM (10 GB for the EM, 3 GB for WebView) and 4 CPU cores; each collector has 8 GB RAM (6 GB for the EM) and 2 CPU cores.
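    To put the two environments side by side, here is the arithmetic on the figures posted in this thread (894,400 metrics, 746 agents, and 7 collectors for Billy; 72.1K metrics, 28 agents, and 1 collector for Arnab). These ratios are only a rough comparison, not a sizing rule; the Hardware Sizing and Performance guide linked above remains the authority.

```python
def load_ratios(metrics: int, agents: int, collectors: int) -> dict[str, float]:
    """Simple per-collector and per-agent ratios for comparing two clusters."""
    return {
        "metrics_per_collector": metrics / collectors,
        "agents_per_collector": agents / collectors,
        "metrics_per_agent": metrics / agents,
    }


environments = {
    "Billy (v10.0)": load_ratios(894_400, 746, 7),
    "Arnab (10.5.1)": load_ratios(72_100, 28, 1),
}
for name, r in environments.items():
    print(f"{name}: {r['metrics_per_collector']:,.0f} metrics/collector, "
          f"{r['agents_per_collector']:.1f} agents/collector, "
          f"{r['metrics_per_agent']:,.0f} metrics/agent")
```

    The single collector carries fewer total metrics than each of Billy's collectors, but each of its agents reports roughly twice as many metrics on average, so per-agent metric counts are worth a look alongside the heap settings.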

     

    Hope this helps,

     

    Billy



  • 12.  Re: CA APM Slow Performance

    Posted Oct 28, 2017 12:38 AM

    Hi bwcole,

     

    Thank you for your detailed reply. We have CEM installed, but it is not being used; we don't have any business transactions defined.

     

    Also, my SmartStor data is stored on the E drive, whereas the application and traces data are on the D drive. Should I move my traces to the E drive as well? Nothing else is stored or installed on the E drive.

     

    Thanks,

     

    Arnab



  • 13.  Re: CA APM Slow Performance

    Posted Oct 30, 2017 06:58 AM

    Arnab,

     

    https://docops.ca.com/ca-apm/10/en/ca-apm-sizing-and-performance/hardware-sizing-and-performance

    Disk I/O Subsystem
    The disk I/O subsystem restrictions apply to all available storage choices, such as local disks and external storage solutions such as SAN.
      - The OS resides on a separate physical disk from CA APM data.
      - Each SmartStor database resides on a dedicated physical disk.
      - The Enterprise Manager heuristics database (variance.db) and Transaction Event database (traces.db) files can reside on the same physical disk. However, these databases cannot be on the same disk as SmartStor, to avoid I/O conflicts with SmartStor.
      - CA Technologies recommends disk drive speeds of 10,000 RPM or faster.
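    A minimal sketch that checks a planned layout against those rules. It maps each role to a disk label; the labels are hypothetical, and drive letters are only a proxy, since two Windows drive letters can be partitions of the same physical disk, which is exactly what the rules are meant to prevent. The example uses Arnab's description (SmartStor on E, traces on D), with the OS drive and the variance.db location assumed.

```python
def layout_problems(layout: dict[str, str]) -> list[str]:
    """Check a role -> physical-disk mapping against the disk I/O rules quoted above.

    Roles: "os", "smartstor", "traces" (traces.db), "variance" (variance.db).
    """
    problems = []
    # APM data must not share the OS disk.
    for role in ("smartstor", "traces", "variance"):
        if layout[role] == layout["os"]:
            problems.append(f"{role} shares the OS disk {layout['os']}")
    # traces.db and variance.db may share a disk, but not SmartStor's.
    for role in ("traces", "variance"):
        if layout[role] == layout["smartstor"]:
            problems.append(f"{role} shares the SmartStor disk {layout['smartstor']}")
    return problems


# Arnab's layout, with the OS on C and variance.db alongside the traces assumed.
current = {"os": "C", "smartstor": "E", "traces": "D", "variance": "D"}
moved = {**current, "traces": "E"}  # the proposed move: traces onto the SmartStor drive
print(layout_problems(current) or "no conflicts")
print(layout_problems(moved) or "no conflicts")
```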

     

    If you are not having SmartStor duration issues and you move variance.db or the traces to the E drive, there is a really good chance you will start having them.

     

    junwah provided the setting in the apm-events-thresholds-config.xml. 

     

    Once that is all done and you have restarted your EMs, it is time to figure out what is clogging up your transaction traces.

     

    1. Open a new "Historical Query Viewer"

    2. In the Query field, type in "type traces"

    3. Click on the "Go" button

    This will give you the traces that ran in the last hour. The leftmost column shows the type: "R" for regular, "E" for error.

     

    Now sort through them and look for patterns, such as specific hosts, specific types of agent, specific durations, or traces grouped within a time range.
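    If you transcribe the query results into a list, a tally by type and by host/agent makes those patterns easier to spot. A minimal sketch; the TraceRow fields and the sample rows are hypothetical, with "R" and "E" as the type codes shown in the viewer's leftmost column.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRow:
    trace_type: str   # "R" = regular, "E" = error (the viewer's leftmost column)
    host: str
    agent: str
    duration_ms: int


def summarize(rows: list[TraceRow], top: int = 3):
    """Count traces by type and list the host/agent pairs with the most error traces."""
    by_type = Counter(r.trace_type for r in rows)
    errors = Counter((r.host, r.agent) for r in rows if r.trace_type == "E")
    return by_type, errors.most_common(top)


# Hypothetical rows transcribed from a query result.
rows = [
    TraceRow("R", "app01", "Tomcat", 120),
    TraceRow("E", "app02", "BatchAgent", 650_000),
    TraceRow("E", "app02", "BatchAgent", 720_000),
    TraceRow("E", "app01", "Tomcat", 45_000),
    TraceRow("R", "app02", "BatchAgent", 300),
]
by_type, worst = summarize(rows)
print("by type:", dict(by_type))
for (host, agent), n in worst:
    print(f"{n} error trace(s) from {agent} on {host}")
```

    If the top pair's error traces all run for minutes, like the hypothetical BatchAgent rows here, a long-running job tripping the stall threshold is a likely suspect.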

     

    If you see only regular traces, then changing the trace storage to 7 days and 4 GB should help with your issues. If 7 days' worth of traces is more than 4 GB, then you need to decide how useful the traces are and whether you should increase the space or decrease the number of days.
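    Whether 7 days fits in 4 GB is simple arithmetic: the retention window times the average daily trace volume must stay under the disk limit. A sketch with a hypothetical 300 MB/day average; measure your own from the daily store sizes under <em_home>/traces.

```python
def fits(avg_daily_mb: float, disk_mb: float, days: int) -> bool:
    """True if `days` of traces at the average daily volume stay within disk_mb."""
    return avg_daily_mb * days <= disk_mb


def max_days(avg_daily_mb: float, disk_mb: float) -> int:
    """Largest whole number of days that fits in disk_mb."""
    return int(disk_mb // avg_daily_mb)


avg = 300.0  # hypothetical average MB of traces per day
for days, disk in ((14, 1024), (7, 4096)):
    print(f"{days} days in {disk} MB: needs {avg * days:.0f} MB, "
          f"fits={fits(avg, disk, days)}, budget={disk / days:.1f} MB/day")
print("days that fit in 4096 MB:", max_days(avg, 4096))
```

    The per-day budget is the number to compare against your measured volume: 1024 MB over 14 days allows only about 73 MB/day, while 4096 MB over 7 days allows about 585 MB/day.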

     

    If you see error traces, then you need to triage those errors and try to clear them. One thing we did was to increase the stall time from 30 seconds to hours for a batch process that regularly took an hour to an hour and a half. But we did that only after triaging the process to determine what impact the transaction had on the whole system and whether it was a sign of a resource or performance issue.



  • 14.  Re: CA APM Slow Performance

    Posted Oct 30, 2017 03:17 PM

    Hi bwcole,

     

    Thank you for your reply and I appreciate all the suggestions you have provided me. I see a difference in performance after making some of these changes.

     

    I tried to run the query, but it does not return anything. I have attached a screenshot. Could you please let me know what I might be missing?

     

    Thanks,

     

    Arnab



  • 15.  Re: CA APM Slow Performance

    Posted Oct 30, 2017 03:18 PM

    Historical Query Viewer



  • 16.  Re: CA APM Slow Performance

    Posted Oct 31, 2017 10:40 AM

    Take the quotes out, and then change the time range to a day or so to get more traces.